Fast Similarity Metric Based Data Mining
Techniques Using P-trees:
k-Nearest Neighbor Classification
Distance metric based computation using P-trees
A new distance metric, called HOBbit distance
Some useful properties of P-trees
New P-tree nearest neighbor classification method, called Closed-KNN

These notes contain NDSU confidential & proprietary material. Patents pending on bSQ and P-tree technology.
Data Mining
extracting knowledge from a large amount of data
Functionalities: feature selection, association rule mining, classification & prediction, cluster analysis, outlier analysis
Information Pyramid
Raw data at the base; useful information (sometimes 1 bit: Y/N) at the top
Data mining moves up the pyramid: more data volume = less information
Classification
Predicting the class of a data object

Training data (class labels are known and supervise the learning):

Feature1   Feature2   Feature3   Class
a1         b1         c1         A
a2         b2         c2         A
a3         b3         c3         B

A sample <a, b, c> with unknown class is fed to the classifier, which predicts the class of the sample.
also called Supervised learning
Eager classifier: Builds a classifier model in advance
e.g. decision tree induction, neural network
Lazy classifier: Uses the raw training data
e.g. k-nearest neighbor
Clustering (unsupervised learning – chapter 8)
The process of grouping objects into classes, with the objective that the data objects are
• similar to the objects in the same cluster
• dissimilar to the objects in other clusters.
[Figure: a two-dimensional space showing 3 clusters]
Clustering is often called unsupervised learning or unsupervised classification: the class labels of the data objects are unknown.
Distance Metric (used in both classification and clustering)
Measures the dissimilarity between two data points.
A metric is a function, d, of two n-dimensional points X and Y, such that
  d(X, Y) is positive definite: if X ≠ Y, then d(X, Y) > 0; if X = Y, then d(X, Y) = 0
  d(X, Y) is symmetric: d(X, Y) = d(Y, X)
  d(X, Y) satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z)
Various Distance Metrics
Minkowski distance or Lp distance:  dp(X, Y) = (Σi=1..n |xi - yi|^p)^(1/p)
Manhattan distance (p = 1):  d1(X, Y) = Σi=1..n |xi - yi|
Euclidean distance (p = 2):  d2(X, Y) = (Σi=1..n (xi - yi)^2)^(1/2)
Max distance (p = ∞):  d∞(X, Y) = max i=1..n |xi - yi|
An Example
A two-dimensional space with X = (2, 1), Y = (6, 4), and Z = (6, 1):
Manhattan:  d1(X, Y) = |XZ| + |ZY| = 4 + 3 = 7
Euclidean:  d2(X, Y) = |XY| = 5
Max:        d∞(X, Y) = max(|XZ|, |ZY|) = |XZ| = 4
So d1 ≥ d2 ≥ d∞; in general, for any positive integer p, dp ≥ dp+1.
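As a quick illustration, here is a minimal Python sketch of these four metrics, checked against the worked example above:

```python
# Minimal sketch of the Lp-family metrics defined above.
def minkowski(x, y, p):
    """L_p distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):          # p = 1
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):          # p = 2
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def max_distance(x, y):       # p = infinity
    return max(abs(a - b) for a, b in zip(x, y))

X, Y = (2, 1), (6, 4)         # the example above
print(manhattan(X, Y))        # 7
print(euclidean(X, Y))        # 5.0
print(max_distance(X, Y))     # 4
```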
Some Other Distances
Canberra distance:  dc(X, Y) = Σi=1..n |xi - yi| / (xi + yi)
Squared cord distance:  dsc(X, Y) = Σi=1..n (√xi - √yi)^2
Squared chi-squared distance:  dchi(X, Y) = Σi=1..n (xi - yi)^2 / (xi + yi)
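A matching Python sketch of these three dissimilarities, assuming non-negative coordinates (Canberra and squared chi-squared divide by xi + yi):

```python
from math import sqrt

def canberra(x, y):
    # Skip dimensions where both coordinates are 0 to avoid 0/0.
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b != 0)

def squared_chord(x, y):
    return sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(x, y))

def squared_chi_squared(x, y):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b != 0)
```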
HOBbit Similarity
Higher Order Bit (HOBbit) similarity:
HOBbitS(A, B) = max {s : 0 ≤ s ≤ m and ai = bi for all i ≤ s}
where A and B are two scalars (integers), ai and bi are the ith bits of A and B (left to right), and m is the number of bits.

Bit position:  1 2 3 4 5 6 7 8         1 2 3 4 5 6 7 8
x1:            0 1 1 0 1 0 0 1    x2:  0 1 0 1 1 1 0 1
y1:            0 1 1 1 1 1 0 1    y2:  0 1 0 1 0 0 0 0
HOBbitS(x1, y1) = 3               HOBbitS(x2, y2) = 4
HOBbit Distance (related to Hamming distance)
The HOBbit distance between two scalar values A and B:
dv(A, B) = m - HOBbitS(A, B)

For the previous example (x1 = 01101001, y1 = 01111101; x2 = 01011101, y2 = 01010000):
dv(x1, y1) = 8 - 3 = 5        dv(x2, y2) = 8 - 4 = 4

The HOBbit distance between two points X and Y:
dH(X, Y) = max i=1..n dv(xi, yi) = max i=1..n {m - HOBbitS(xi, yi)}

In our example (considering 2-dimensional data):
dH(X, Y) = max(5, 4) = 5
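A minimal Python sketch of both definitions, with bits numbered from the most significant end as above; it reproduces the example values:

```python
M = 8  # number of bits per value

def hobbit_sim(a, b, m=M):
    """Number of consecutive matching bits, starting at the MSB."""
    s = 0
    for i in range(m - 1, -1, -1):          # i = m-1 selects the MSB
        if (a >> i) & 1 == (b >> i) & 1:
            s += 1
        else:
            break
    return s

def hobbit_dist(x, y, m=M):
    """d_H(X, Y) = max_i (m - HOBbitS(x_i, y_i))."""
    return max(m - hobbit_sim(a, b, m) for a, b in zip(x, y))

x = (0b01101001, 0b01011101)   # x1, x2 from the example
y = (0b01111101, 0b01010000)   # y1, y2 from the example
print(hobbit_sim(x[0], y[0]))  # 3
print(hobbit_sim(x[1], y[1]))  # 4
print(hobbit_dist(x, y))       # max(8 - 3, 8 - 4) = 5
```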
HOBbit Distance Is a Metric
HOBbit distance is positive definite:
  if X = Y, then dH(X, Y) = 0
  if X ≠ Y, then dH(X, Y) > 0
HOBbit distance is symmetric:  dH(X, Y) = dH(Y, X)
HOBbit distance satisfies the triangle inequality:  dH(X, Y) + dH(Y, Z) ≥ dH(X, Z)
Neighborhood of a Point
The neighborhood of a target point, T, is a set of points, S, such that X ∈ S if and only if d(T, X) ≤ r.
[Figure: the neighborhood of T of diameter 2r, with a boundary point X, under the Manhattan, Euclidean, Max, and HOBbit metrics]
If X is a point on the boundary, d(T, X) = r.
Decision Boundary
The decision boundary between points A and B is the locus of points X satisfying d(A, X) = d(B, X); a point falls in region R1 if d(A, X) < d(B, X) and in region R2 otherwise.
[Figure: decision boundaries between A and B for the Euclidean, Max, and Manhattan distances, shown for the line AB at angles greater than and less than 45°]
The decision boundary for the HOBbit distance is perpendicular to the axis that gives the max distance.
Minkowski Metrics
Lp-metrics (aka Minkowski metrics):  dp(X, Y) = (Σi=1..n wi |xi - yi|^p)^(1/p)   (weights wi assumed = 1)
[Figure: unit disks and dividing lines for p = 1 (Manhattan), p = 2 (Euclidean), p = 3, 4, …, p = max (chessboard), and p = ½, ⅓, ¼, …]
dmax ≡ max i |xi - yi| = d∞ ≡ lim p→∞ dp(X, Y).
Proof (sketch): lim p→∞ (Σi=1..n ai^p)^(1/p) = max(ai) ≡ b. For p large enough, the other terms satisfy ai^p << b^p (since (ai/b)^p → 0 for ai < b), so Σi=1..n ai^p ≈ k·b^p, where k is the multiplicity of b in the sum. Hence (Σi=1..n ai^p)^(1/p) ≈ k^(1/p)·b, and k^(1/p) → 1.
P > 1 Minkowski Metrics
(Each row lists dp(X, Y) for growing p; the values converge down to dmax = max i |xi - yi|.)

X = (0.5, 0.5), Y = (0, 0):       d2 = 0.7071, d4 = 0.5946, d9 = 0.5400, d100 = 0.5035;    dmax = 0.5
X = (0.7071, 0.7071), Y = (0, 0): d2 = 1, d3 = 0.8909, d7 = 0.7807, d100 = 0.7120;         dmax = 0.7071
X = (0.99, 0.99), Y = (0, 0):     d2 = 1.4001, d8 = 1.0796, d100 = 0.9969, d1000 = 0.9907; dmax = 0.99
X = (1, 1), Y = (0, 0):           d2 = 1.4142, d9 = 1.0801, d100 = 1.0070, d1000 = 1.0007; dmax = 1
X = (0.9, 0.1), Y = (0, 0):       d2 = 0.9055, d9 = 0.9000, d100 = 0.9, d1000 = 0.9;       dmax = 0.9
X = (3, 3), Y = (0, 0):           d2 = 4.2426, d3 = 3.7798, d8 = 3.2715, d100 = 3.0209;    dmax = 3
X = (90, 45), Y = (0, 0):         d6 = 90.2329, d9 = 90.0195, d100 = 90;                   dmax = 90
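A few lines of Python reproduce the first row and show the convergence to dmax:

```python
def d_p(x, y, p):
    """Minkowski distance with unit weights."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

X, Y = (0.5, 0.5), (0.0, 0.0)
for p in (2, 4, 9, 100):
    print(p, d_p(X, Y, p))      # 0.7071..., 0.5946..., 0.5400..., 0.5034...
print(max(abs(a - b) for a, b in zip(X, Y)))  # dmax = 0.5
```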
P < 1 Minkowski Metrics
For p < 1, dp(X, Y) = (Σi=1..n |xi - yi|^p)^(1/p) is still defined, but for p = 0 (the limit as p → 0) it does not exist (does not converge).

(Each row lists dp(X, Y) for shrinking p < 1; the values blow up as p → 0. The p = 2 value is shown for comparison.)

X = (0.1, 0.1), Y = (0, 0): d1 = 0.2, d0.8 = 0.2378, d0.4 = 0.5657, d0.2 = 3.2, d0.1 = 102.4, d0.04 ≈ 3.36E+6, d0.02 ≈ 1.13E+14, d0.01 ≈ 1.27E+29;  d2 = 0.1414
X = (0.5, 0.5), Y = (0, 0): d1 = 1, d0.8 = 1.1892, d0.4 = 2.8284, d0.2 = 16, d0.1 = 512, d0.04 ≈ 1.68E+7, d0.02 ≈ 5.63E+14, d0.01 ≈ 6.34E+29;  d2 = 0.7071
X = (0.9, 0.1), Y = (0, 0): d1 = 1, d0.8 = 1.0980, d0.4 = 2.1445, d0.2 = 10.8211, d0.1 = 326.2701, d0.04 ≈ 1.03E+7, d0.02 ≈ 3.42E+14, d0.01 ≈ 3.83E+29;  d2 = 0.9055
Min Dissimilarity Function
The dmin function, dmin(X, Y) = min i=1..n |xi - yi|, is strange: it is not even a pseudo-metric.
[Figure: the unit disk of dmin, and the neighborhood of the blue point relative to the red point (the dividing neighborhood: those points closer to the blue than to the red): major bifurcations!]
http://www.cs.ndsu.nodak.edu/~serazi/research/Distance.html
Other Interesting Metrics
Canberra metric:  dc(X, Y) = Σi=1..n |xi - yi| / (xi + yi)   (a normalized Manhattan distance)
Square cord metric:  dsc(X, Y) = Σi=1..n (√xi - √yi)^2   (already discussed, as related to Lp with p = 1/2)
Squared chi-squared metric:  dchi(X, Y) = Σi=1..n (xi - yi)^2 / (xi + yi)
HOBbit metric (Higher Order Binary bit):  dH(X, Y) = max i=1..n {m - HOB(xi, yi)}, where for m-bit integers A = a1…am and B = b1…bm, HOB(A, B) = max {s : ai = bi for all 1 ≤ i ≤ s}   (related to Hamming distance in coding theory)
Scalar product metric:  ddot(X, Y) = X • Y = Σi=1..n xi · yi
Hyperbolic metrics (which map infinite space 1-1 onto a sphere)
Which are rotationally invariant? Translationally invariant? Other?
Notations
P1 & P2 : P1 AND P2 (also written P1 ^ P2)
P1 | P2 : P1 OR P2
P′ : complement P-tree of P
Pi,j : basic P-tree for band i, bit j
Pi(v) : value P-tree for value v of band i
Pi([v1, v2]) : interval P-tree for interval [v1, v2] of band i
P0 : pure0-tree, a P-tree whose root node is pure 0
P1 : pure1-tree, a P-tree whose root node is pure 1
rc(P) : root count of P-tree P
N : number of pixels
n : number of bands
m : number of bits
Properties of P-trees
1. a) rc(P) = 0 iff P = P0      b) rc(P) = N iff P = P1
2. a) P & P0 = P0      b) P & P1 = P      c) P & P = P      d) P & P′ = P0
3. a) P | P0 = P       b) P | P1 = P1     c) P | P = P      d) P | P′ = P1
4. rc(P1 | P2) = 0 iff rc(P1) = 0 and rc(P2) = 0
5. v1 ≠ v2 ⇒ rc{Pi(v1) & Pi(v2)} = 0
6. rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)
7. rc{Pi(v1) | Pi(v2)} = rc{Pi(v1)} + rc{Pi(v2)}, where v1 ≠ v2
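To make the algebra concrete, here is an illustrative stand-in (not the real compressed quadrant tree) that models each P-tree as a plain Python integer used as a bitmap over N pixels; properties 1-7 hold the same way:

```python
N = 16                       # number of pixels (assumption for this demo)
MASK = (1 << N) - 1

def rc(p):                   # root count = number of 1 bits
    return bin(p).count("1")

def comp(p):                 # complement P-tree P'
    return ~p & MASK

Pa = 0b0010110100110010      # two arbitrary "P-trees"
Pb = 0b1110000000110011

# Property 6: rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)
assert rc(Pa | Pb) == rc(Pa) + rc(Pb) - rc(Pa & Pb)
# Properties 2d and 3d: P & P' = P0 (all zeros), P | P' = P1 (all ones)
assert (Pa & comp(Pa)) == 0 and (Pa | comp(Pa)) == MASK
```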
k-Nearest Neighbor Classification and Closed-KNN
1) Select a suitable value for k.
2) Determine a suitable distance metric.
3) Find the k nearest neighbors of the sample using the selected metric.
4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs.
5) Assign the plurality class to the sample to be classified.

T is the target pixel. With k = 3, to find the third nearest neighbor, KNN arbitrarily selects one point from the boundary line of the neighborhood; Closed-KNN instead includes all points on the boundary (see the sketch below).
Closed-KNN yields higher classification accuracy than traditional KNN.
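A minimal closed-KNN sketch over raw points (not the P-tree implementation); the only change from plain KNN is that every point tied at the closing boundary distance is kept:

```python
from collections import Counter

def closed_knn(train, labels, target, k, dist):
    """Classify target; train is a list of points, labels their classes."""
    r = sorted(dist(x, target) for x in train)[k - 1]   # radius capturing k points
    nbrs = [c for x, c in zip(train, labels) if dist(x, target) <= r]
    return Counter(nbrs).most_common(1)[0][0]           # plurality class
```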
Searching Nearest Neighbors
We begin the search by finding the exact matches.
Let the target sample be T = <v1, v2, v3, …, vn>.
The initial neighborhood is the point T itself.
We expand the neighborhood along each dimension: along dim-i, [vi] is expanded to the interval [vi - ai, vi + bi], for some positive integers ai and bi.
Expansion continues until there are at least k points in the neighborhood.
HOBbit Similarity Method for KNN
In this method, we match bits of the target to the training data.
First, find the pixels matching in all 8 bits of each band (the exact matches).
Let bi,j = the jth bit of the ith band of the target pixel.
Define the target P-tree Pt:  Pti,j = Pi,j if bi,j = 1;  Pti,j = P′i,j otherwise.
And the precision-value P-tree:  Pvi,j = Pti,1 & Pti,2 & Pti,3 & … & Pti,j
An Analysis of the HOBbit Method
Let the ith band value of the target T be vi = 105 = 01101001b.
Exact match:    [01101001] = [105, 105]
1st expansion:  [0110100-] = [01101000, 01101001] = [104, 105]
2nd expansion:  [011010--] = [01101000, 01101011] = [104, 107]
The neighborhood does not expand evenly on both sides: the target is 105, but the center of [104, 107] is (104 + 107) / 2 = 105.5. And it expands by powers of 2.
Computationally very cheap; the expansion is just low-order bit masking, as the sketch below shows.
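A small illustration (hobbit_interval is a hypothetical helper, not from the notes):

```python
def hobbit_interval(v, j, m=8):
    """Interval of all m-bit values agreeing with v on the top m - j bits."""
    lo = v & ~((1 << j) - 1)          # clear the j low-order bits
    hi = lo | ((1 << j) - 1)          # set the j low-order bits
    return lo, hi

v = 0b01101001                        # 105, the example above
print(hobbit_interval(v, 0))          # (105, 105)  exact match
print(hobbit_interval(v, 1))          # (104, 105)  1st expansion
print(hobbit_interval(v, 2))          # (104, 107)  2nd expansion
print(hobbit_interval(v, 3))          # (104, 111)  3rd expansion
```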
Perfect Centering Method
The max distance metric provides a better neighborhood by keeping the target in the center and expanding by 1 on both sides.
Initial neighborhood P-tree (exact matching):
  Pnn = P1(v1) & P2(v2) & P3(v3) & … & Pn(vn)
If rc(Pnn) < k:  Pnn = P1([v1-1, v1+1]) & P2([v2-1, v2+1]) & … & Pn([vn-1, vn+1])
If rc(Pnn) < k:  Pnn = P1([v1-2, v1+2]) & P2([v2-2, v2+2]) & … & Pn([vn-2, vn+2])
and so on, until rc(Pnn) ≥ k.
Computationally costlier than the HOBbit similarity method, but gives a little better classification accuracy.
Let Pc(i) be the value P-tree for class i. Then:
  Plurality class = argmax_i rc{Pc(i) & Pnn}
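A sketch of the expansion loop and the plurality vote, reusing the integer-bitmap stand-in from the P-tree example above; band_interval(i, lo, hi) and class_ptrees are hypothetical helpers standing in for Pi([lo, hi]) and Pc(i):

```python
def perfect_centering(band_interval, target, n_bands, k):
    """Expand the neighborhood P-tree by 1 on both sides until rc >= k."""
    r = 0
    while True:
        pnn = MASK                               # start from the pure1 bitmap
        for i in range(n_bands):
            pnn &= band_interval(i, target[i] - r, target[i] + r)
        if rc(pnn) >= k:
            return pnn                           # neighborhood with >= k points
        r += 1                                   # keep the target centered

def plurality_class(pnn, class_ptrees):
    """Plurality class = argmax over i of rc(Pc(i) & Pnn)."""
    return max(class_ptrees, key=lambda i: rc(class_ptrees[i] & pnn))
```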
Performance
Experiments were run on two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND.
The data contains 6 bands: Red, Green, and Blue reflectance values, Soil Moisture, Nitrate, and Yield (the class label).
Band values range from 0 to 255 (8 bits).
We consider 8 classes or levels of yield values: 0 to 7.
Performance – Accuracy
1997 Dataset:
[Figure: accuracy (%) from 40 to 80 vs. training set size (256 to 262144 pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-HOBS, P-tree Perfect Centering (closed-KNN), and P-tree HOBS (closed-KNN)]
Performance – Accuracy (cont.)
1998 Dataset:
[Figure: accuracy (%) from 20 to 65 vs. training set size (256 to 262144 pixels) for the same six methods]
Performance – Time
1997 Dataset (both axes in logarithmic scale):
[Figure: per-sample classification time (sec) from 0.00001 to 1 vs. training set size (256 to 262144 pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-HOBS, P-tree Perfect Centering (closed-KNN), and P-tree HOBS (closed-KNN)]
Performance – Time (cont.)
1998 Dataset (both axes in logarithmic scale):
[Figure: per-sample classification time (sec) from 0.00001 to 1 vs. training set size (256 to 262144 pixels) for the same six methods]