A Plot for Visualizing Multivariate Data
Rida E. A. Moustafa1
Email: rmoustaf@galaxy.gmu.edu, rmustafa@aalcpas.com
April 8, 2003
Abstract
Due to rapid growth in data set sizes, both in dimensions and in observations, there is an urgent need for sophisticated techniques that allow the analyst to visually discover hidden patterns that might exist in a given data set. In this paper, we introduce a new technique for visualizing large multivariate data by means of dimension reduction. This technique allows the user to visually detect hidden structures in any given large, multivariate and complex data set in an efficient and interactive way, compared to widely used dimension-reduction techniques.
Keywords: Visual data mining, Pattern recognition, Nonlinear transformation.
1 Introduction
Visualizing multivariate data is an indirect approach in data mining and knowledge discovery processes. It is considered one of the most important steps in exploring hidden structures and attaining a better understanding of the relationships in a data set. Large dimensions are hard to visualize. Moreover, most conventional multivariate visualization techniques do not scale well to large numbers of observations; hence, it is difficult to interactively visualize large and complex data sets. Therefore, more attention must be devoted to developing new techniques and software for visual data mining that can deal with complex, large-sized multivariate data.
It is worth mentioning some of the dimension-reduction techniques that are widely used to visualize multivariate data: multidimensional scaling (MDS), principal component analysis (PCA), Fisher discriminant analysis (FDA), projection pursuit, and self-organizing maps (SOM).
In this paper, we introduce an efficient algorithm for visualizing large-sized multivariate data sets. It has strong mathematical and statistical foundations that assist the analyst not only in visually distinguishing classes in the data set, but also in understanding the underlying geometric structures of the data set.
Our algorithm is constructed of two consecutive projections. The first is a linear projection (m), produced by computing the distance of each observation from the origin. The second projection (v) is produced by constructing a nonlinear function of the data and the first projection; it is computed by the following steps:

1Dr. Rida Moustafa is the Director of the Data Mining and Knowledge Discovery Group at AAL.
• find the deviations of the observations from the first projection
• compute the distance of these deviated observations from the origin
• find the relationship between the two projections in a scatter-plot view
We demonstrate the efficiency of this mv-method in visualizing multivariate data and its application in several areas of statistical data mining and pattern recognition.
The rest of the article consists of the following main sections: Section 2 discusses the theory of the introduced technique with illustrations, using simulated data. In Section 3, we demonstrate the power of our method on three real data sets and compare our results with well-known existing methods for visualizing multivariate data by means of dimension reduction.
2 The theory of the mv-algorithm
The mv-plot is a 2-dimensional plot of a multivariate, large-sized data set. It is designed to help the analyst visually explore the underlying structure(s) of a given data set, and it can be considered a technique for visual clustering and classification. Furthermore, it allows us to recognize geometric objects such as hyper-planes, hyper-spheres, etc. The designed algorithm for the mv-plot consists of linear and nonlinear invariant transformation processes, each of which maps points in $\mathbb{R}^d$ into points in $\mathbb{R}^1$.
The mapping processes can be described as follows. Given an observation $x \in \mathbb{R}^d$, $x = (x_1, x_2, \cdots, x_d)$, then
\[
m = f(x) = \frac{1}{d}\sum_{j=1}^{d}|x_j|, \qquad
v = \sqrt{g(x, f(x))} = \sqrt{\frac{1}{d}\sum_{j=1}^{d}|x_j - f(x)|^2}.
\]
In general, for a multivariate data set of $n$ observations in $d$-dimensional space, say $X = \{x_{ij} \mid i = 1, 2, \cdots, n;\ j = 1, 2, \cdots, d\}$, the mapping process produces the vectors $m$ and $v$, both of length $n$. This is represented by the following equations:
\[
m_i = \frac{1}{d}\sum_{j=1}^{d}|x_{ij}|, \qquad
v_i = \left(\frac{1}{d}\sum_{j=1}^{d}(x_{ij} - m_i)^2\right)^{1/2}.
\]
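As a concrete illustration, the two projections can be sketched in a few lines of NumPy (a sketch of our own; the function name `mv_projection` is not from the paper):

```python
import numpy as np

def mv_projection(X):
    """Map an (n, d) data matrix into the 2-d mv-space.

    m[i] is the mean absolute value of observation i (the linear projection);
    v[i] is the root-mean-square deviation of observation i from m[i]
    (the nonlinear projection).
    """
    X = np.asarray(X, dtype=float)
    m = np.mean(np.abs(X), axis=1)
    v = np.sqrt(np.mean((X - m[:, None]) ** 2, axis=1))
    return m, v

# one observation x = (3, 4) in R^2: m = (3 + 4)/2 = 3.5, v = |4 - 3|/2 = 0.5
m, v = mv_projection([[3.0, 4.0]])
```

Plotting $v$ against $m$ for all $n$ observations produces the mv-plot.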
Understanding the mathematical and statistical relationship between $m$ and $v$ is essential in order to discover patterns and to explain the results. Therefore, we investigate the relationship from several mathematical and statistical perspectives on simulated data in two, three and hyper-dimensional space. Furthermore, we consider several real data sets that have been frequently studied for the same purposes of classification, clustering and visual data mining. Finally, we compare the mv-plot results with other effective techniques that have been used for the same purposes, such as principal component analysis, multidimensional scaling and Fisher discriminant analysis.
2.1 The mv plot in two dimensions
In this part, we discuss the situation where the data $X \in \mathbb{R}^{n \times 2}$. This allows us to show and understand the relationships between structures in the 2-dimensional original space and their projections into the 2-dimensional mv-space. We scale the data to be positive real numbers, and the same holds for all the cases we consider throughout this study. In this case, the computation of the $m$ and $v$ values is as follows:
\[
m_i = \frac{1}{2}(x_{i1} + x_{i2}),
\]
\[
v_i^2 = \frac{1}{2}\left(|x_{i1} - m_i|^2 + |x_{i2} - m_i|^2\right)
      = \frac{1}{4}\,|x_{i1} - x_{i2}|^2
\;\Rightarrow\; v_i = \frac{1}{2}|x_{i2} - x_{i1}|.
\]
This is exactly like computing the principal component in two dimensions. Assume our data are drawn from a linear model of the form $x_{i2} = w_1 x_{i1} + w_0$; then
\[
m_i = \frac{1}{2}\left((w_1 + 1)x_{i1} + w_0\right), \qquad
v_i = \frac{1}{2}\left|(w_1 - 1)x_{i1} + w_0\right|.
\]
Suppose the weights $w$ are large enough that $w_1 - 1 \approx w_1 + 1 = a_1$, and set $w_0 = a_2$, both positive reals. Then the relationship between $m$ and $v$ is linear. Therefore, a linear pattern in the original space is mapped into a linear pattern in the mv-space by scaling the data set.
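This is easy to check numerically. In the sketch below (our own; the weights $w_1 = 50$ and $w_0 = 30$ are arbitrary), positive data drawn from a single line yield $m$ and $v$ that are both affine in $x_{i1}$, so the mv-plot is again a line:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1.0, 10.0, size=200)     # positive abscissas
w1, w0 = 50.0, 30.0                       # large slope, positive intercept
x2 = w1 * x1 + w0                         # points on a single line
X = np.column_stack([x1, x2])

m = np.mean(np.abs(X), axis=1)                        # (x1 + x2)/2 for positive data
v = np.sqrt(np.mean((X - m[:, None]) ** 2, axis=1))   # |x2 - x1|/2

# both projections are affine functions of x1, hence perfectly correlated
corr = np.corrcoef(m, v)[0, 1]
```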
As an example, we consider three cases: a single line, intersecting lines and parallel lines, shown at the top of Figure 1. Their mv-plots are shown at the bottom of the same figure. One can notice that the structure is preserved under the projection.
2.2 The mv plot in three dimensions
In this subsection, we study the situation where the data $X \in \mathbb{R}^{n \times 3}$. This helps in the generalization process and shows the relationship between three-dimensional structures in the data set and their two-dimensional counterparts in the mv-space. Computing the $m$ and $v$ values for the data in $X$, we get
\[
m_i = \frac{1}{3}(x_{i1} + x_{i2} + x_{i3}),
\]
Figure 1: mv-plot of Line(s).
\[
v_i^2 = \frac{1}{3}\left(|x_{i1} - m_i|^2 + |x_{i2} - m_i|^2 + |x_{i3} - m_i|^2\right)
\;\Rightarrow\;
v_i^2 = \frac{1}{27}\left(|2x_{i1} - (x_{i2} + x_{i3})|^2 + |2x_{i2} - (x_{i3} + x_{i1})|^2 + |2x_{i3} - (x_{i2} + x_{i1})|^2\right).
\]
Let us have a linear model of the form $x_{i3} = w_1 x_{i1} + w_2 x_{i2} + w_0$; then
\[
m_i = \frac{1}{3}\left((w_1 + 1)x_{i1} + (w_2 + 1)x_{i2} + w_0\right),
\]
\[
v_i^2 = \frac{1}{27}\Big(|(w_1 - 2)x_{i1} + (w_2 + 1)x_{i2} + w_0|^2
      + |(w_2 - 2)x_{i2} + (w_1 + 1)x_{i1} + w_0|^2
      + |(2w_1 - 1)x_{i1} + (2w_2 - 1)x_{i2} + 2w_0|^2\Big).
\]
Suppose the weights $w$ are large enough that $w_1 + 1 \approx w_1 - 2 \approx w_1 - 0.5 = a_1$ and $w_2 + 1 \approx w_2 - 2 \approx w_2 - 0.5 = a_2$, and set $w_0 = a_3$. Then, up to constant factors,
\[
m_i \approx a_1 x_{i1} + a_2 x_{i2} + a_3, \qquad
v_i \approx a_1 x_{i1} + a_2 x_{i2} + a_3.
\]
From these two cases, we expect the relationship between $m_i$ and $v_i$ to be nearly linear. We illustrate this with an example: we consider random data that fit plane(s) in 3-d. In Figure 2, at top-left and bottom-left respectively, we show a plane and two planes in the original space. The corresponding line and two lines in the mv-space are shown, respectively, at top-right and bottom-right of the figure.
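The near-linearity can also be checked numerically; the sketch below is our own, with arbitrary large weights:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)
w1, w2, w0 = 40.0, 40.0, 5.0              # large weights, as in the derivation
x3 = w1 * x1 + w2 * x2 + w0               # random points on a plane in R^3
X = np.column_stack([x1, x2, x3])

m = np.mean(np.abs(X), axis=1)
v = np.sqrt(np.mean((X - m[:, None]) ** 2, axis=1))

# for large weights both m and v track w1*x1 + w2*x2, so the plane
# collapses to a nearly straight line in the mv-plane
corr = np.corrcoef(m, v)[0, 1]
```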
Figure 2: mv-plot of 3-d plane(s).
2.3 The mv plot in d-dimensions
In this part we consider data $X \in \mathbb{R}^{n \times d}$ drawn from a linear model given by the relationship $x_{id} = w_0 + \sum_{j=1}^{d-1} w_j x_{ij}$. After some restriction on the weights, we can express $m$ and $v$ as
\[
m_i = \frac{1}{d}\left[\sum_{j=1}^{d-1}(w_j + 1)x_{ij} + w_0\right],
\]
\[
v_i = \frac{d-1}{d^2}\left[\sum_{j=1}^{d-1}\left((d-1)w_j - 1\right)x_{ij} + (d-1)w_0\right].
\]
Of course, as in the 3-d situation, we can take the weights high enough that we obtain a linear relationship between $m$ and $v$: $m_i = \sum_{j=1}^{d-1} a_j x_{ij} + a_d$ and $v_i = \sum_{j=1}^{d-1} a_j x_{ij} + a_d$. Therefore, the linear relationship can be detected, even for very large dimensions, by scaling the weights and, on real data, by scaling the variables.
2.4 Detecting nonlinear data with mv-plot
In this subsection, we show that it is possible to detect nonlinear patterns using the mv-technique. Let us consider nonlinear data sampled from the sine and cosine functions on $[0, 2\pi]$. The plots in the original space are shown at the top (left and right) of Figure 3, and the plots in the mv-space are shown at the bottom (left and right). Notice that, in the mv-space, for $(x, \cos(x))$ we have $m = \frac{1}{2}(x + \cos(x))$ and $v = \frac{1}{2}|x - \cos(x)|$, and for $(x, \sin(x))$ we have $m = \frac{1}{2}(x + \sin(x))$ and $v = \frac{1}{2}|x - \sin(x)|$ (ignoring the absolute value in $m$ where the function is negative). This explains why the produced plots are rotated sine and cosine curves in the mv-space.
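From the $d = 2$ case, $m = (x + \sin(x))/2$ and $v = |x - \sin(x)|/2$ wherever both coordinates are nonnegative; the sketch below (our own) verifies these closed forms numerically:

```python
import numpy as np

x = np.linspace(0.0, np.pi, 100)          # region where x and sin(x) are both >= 0
X = np.column_stack([x, np.sin(x)])

m = np.mean(np.abs(X), axis=1)
v = np.sqrt(np.mean((X - m[:, None]) ** 2, axis=1))

# compare against the closed forms for d = 2 with nonnegative coordinates
ok_m = np.allclose(m, (x + np.sin(x)) / 2)
ok_v = np.allclose(v, np.abs(x - np.sin(x)) / 2)
```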
Figure 3: mv-plot of sin and cos functions.
In fact, we can employ the mv-technique to detect sphere(s) in hyper-dimensions. Let the given data be distributed on a sphere with radius $r$, centered at zero. Computing the $v$-values based on the $L_2$ norm, we get:
\[
v_i^2 = \frac{1}{d}\sum_{j=1}^{d}(x_{ij} - m_i)^2
      = \frac{1}{d}\sum_{j=1}^{d}x_{ij}^2 - m_i^2
\;\Rightarrow\;
v_i^2 + m_i^2 = \frac{r^2}{d},
\]
since $\sum_{j=1}^{d} x_{ij}^2 = r^2$ and, for positive data, $m_i = \frac{1}{d}\sum_{j=1}^{d} x_{ij}$.
Figure 4: mv-plot of 3-d sphere(s).
In general, for spheres centered at a non-zero point $(x_c, \ldots, x_c)$, we have
\[
v_i^2 = \frac{1}{d}\sum_{j=1}^{d}(x_{ij} - x_c + x_c - m_i)^2
      = \frac{1}{d}\sum_{j=1}^{d}(x_{ij} - x_c)^2 - (x_c - m_i)^2
\;\Rightarrow\;
v_i^2 + (m_i - x_c)^2 = \frac{r^2}{d}.
\]
Therefore, sphere(s) in the hyper-dimensional original space become circle(s) in the mv-space. Let us consider data points randomly distributed on a sphere, shown in 3 dimensions at the top-left of Figure 4; its plot in the mv-space is the circle shown at top-right of the figure. At bottom-right, we have two separated spheres with different radii; the corresponding circles in the mv-space are shown at the bottom-left of the figure.
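For the non-centered case, a similar sketch (ours; the center value c = 6 is chosen large enough to keep all coordinates positive) confirms that the circle in the mv-plane is merely shifted along the m-axis:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, r, c = 300, 4, 1.5, 6.0
U = rng.standard_normal((n, d))
U = r * U / np.linalg.norm(U, axis=1, keepdims=True)   # ||u_i|| = r
X = c + U                                              # sphere of radius r centered at (c, ..., c)

m = np.mean(X, axis=1)                                 # all coordinates stay positive since c > r
v = np.sqrt(np.mean((X - m[:, None]) ** 2, axis=1))

# circle of radius r/sqrt(d) centered at (c, 0) in the mv-plane
resid = v**2 + (m - c)**2 - r**2 / d
```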
3 Results and Discussions
3.1 The iris data set
In this subsection, we present our results on the well-known Fisher's Iris data, which can be obtained from URL 2. It consists of 150 observations containing four measurements based on the petals and sepals of three species of Iris: Iris setosa, Iris virginica, and Iris versicolor. This data set is an appropriate choice, as it is frequently used to illustrate classification, clustering or visualization techniques (Venables and Ripley 1998; Martinez and Martinez 2002).
The visualization shown in Figure 5 demonstrates that the mv-plot achieves as satisfactory a result as the three well-known plots, namely the principal component plot, the multidimensional scaling plot, and Fisher's discriminant plot. All the plots, including ours, show that the data consist of three groups, and that one of the three (Iris setosa) is highly separated from the other two. Moreover, according to the mv-plot theory, we can confirm that the three groups lie on three hyper-planes in $\mathbb{R}^4$.
Figure 5: mv-plot of the Iris data: comparison.
3.2 The process control data set
The second data set is the Synthetic Control Chart Time Series. It contains 600 examples of control charts synthetically generated by the process in Alcock and Manolopoulos (1999). The data set lies in a 60-dimensional space. There are six different classes of control charts: 1. normal, 2. cyclic, 3. increasing trend, 4. decreasing trend, 5. upward shift, 6. downward shift. This time-series data is listed in the Knowledge Discovery Cup 1999 as a challenging problem for the data-mining community, and it can be obtained from URL 3. The data set is generally used for testing classification, clustering and visualization techniques. Several clustering techniques have been tested on this particular data set and did not reveal the existing clusters (Pham and Chan 1998). Our goal here is to visually explore the existing six classes using the mv-algorithm and to compare our results with other techniques. In Figure 6, we can see that the PC and FLDA methods are able to distinguish the first and second classes properly, but there is no clear discrimination between classes (three, five) and (four, six). The MDS method did not show groups at all. Comparing this with the mv-plot, there is no doubt that one can see six classes in this particular data set. Moreover, the computational effort is negligible compared with the other methods studied here.

2http://www.cmu.edu
Figure 6: mv-plot of the process control time series data: comparison.
3.3 The Pollen data set
This data set consists of 3,848 observations in a 5-dimensional space. It is a challenging data set for statisticians, offered by the American Statistical Association. It is worth mentioning here that most of the existing techniques, including the ones we report on, need more computer power (storage and speed) to complete the process; therefore, we report only on our plot. In Figure 7, we see a filled circular shape and a line coming out of it. According to the mv-plot theory, one can tell that the circular shape is a hyper-sphere in 5 dimensions and that the linear shape is a hyper-plane in 5 dimensions. We extract the linear pattern from the data set and visualize it in Figure 8, which spells the word EUREKA. In fact, our discoveries with the mv-plot agree with those found in (Wegman 2002) and with others who have successfully analyzed this data set.

3http://kdd.ics.uci.edu
Figure 7: mv-plot of the Pollen data: detecting both linear and nonlinear structure.
4 Summary and Future work
We introduced an efficient algorithm for visualizing hyper-dimensional data sets. We demonstrated the efficiency of the algorithm on several complex simulated and real data sets in very high-dimensional spaces and were able to detect the existing patterns visually. We compared the algorithm on the same data sets with principal component analysis, Fisher discriminant analysis, and multidimensional scaling. In most cases, our algorithm achieved better results than the aforementioned algorithms; in some cases, those algorithms did not work at all because of computer speed and storage limitations.
Figure 8: mv-plot of the hidden linear structure in the Pollen data.
5 Acknowledgements
The author would like to thank the computational statistics group at George Mason University for their help and support, especially Dr. John Miller and Dr. Edward Wegman, and AAL's Advanced Data Mining Group.
This work was funded in part by the Air Force Office of Scientific Research under contract F49620-01-1-0274.
6 References
Pham, D. and Chan, A. (1998) "Control Chart Pattern Recognition using a New Type of Self-Organizing Neural Network," Proc. Instn Mech. Engrs, Vol. 212, No. 1, pp. 115-127.

Michie, Spiegelhalter and Taylor (1994) Machine Learning, Neural and Statistical Classification (STATLOG project).

Martinez, W. and Martinez, A. (2002) Computational Statistics Handbook with MATLAB, CRC Press.

Venables, W. and Ripley, B. (1998) Modern Applied Statistics with S-PLUS, Second Edition, Springer.

Everitt, B. and Dunn, G. (2001) Applied Multivariate Data Analysis, Second Edition.

Everitt, B. and Dunn, G. (1983) Advanced Methods of Data Exploration and Modeling, Heinemann, London.

Everitt, B. and Dunn, G. (1991) Applied Multivariate Data Analysis, John Wiley & Sons, New York.

Fienberg, S. (1979) "Graphical methods in statistics," The American Statistician, 33, 165-178.

Fishback, W. T. (1962) Projective and Euclidean Geometry, John Wiley & Sons, New York.

Kaufman, L. and Rousseeuw, P. (1990) Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York.

Mortenson, M. E. (1995) Geometric Transformations, Industrial Press Inc., New York.

Moustafa, R. E. (2001) Fast Conceptual Clustering Algorithm for Data Mining and Visualization, Ph.D. Dissertation, George Mason University, Fairfax, VA.

Wegman, E., Carr, D. and Luo, Q. (1993) "Visualizing Multivariate Data," in Multivariate Analysis: Future Directions (C. R. Rao, ed.), Amsterdam: North-Holland, 423-466.

Wegman, E. and Luo, Q. (1997) "High dimensional clustering using parallel coordinates and the grand tour," Computing Science and Statistics, 28, 352-360.

Wegman, E. and Solka, J. (2002) "On some mathematics for visualizing high dimensional data," Sankhya, 64(2), 429-452.

Wegman, E. (2003) "Visual data mining" (in press), Statistics in Medicine.