Visual Data Mining
-
Upload
shyam-kumar -
Category
Documents
-
view
69 -
download
3
Transcript of Visual Data Mining
IEEE TRANSACTIONS ON VISUALIZATION AND
COMPUTER GRAPHICS, VOL. 7 , NO. 1 , JANUARY-
MARCH 2002 100
Information Visual izat ion and Visual Data Mining
Danie l A. Keim
Abstract—Never before in h istory data has been
generated
at such h igh volumes as i t i s today. Explor ing and
analyz ing
the vast vo lumes of data becomes increas ingly d i f f icu l t .
In format ion v isual izat ion and v isual data mining can help
to
deal wi th the f lood of in format ion. The advantage of
v isual
data explorat ion is that the user is d i rect ly involved in
the
data mining process. There is a large number of
in format ion
visual izat ion techniques which have been developed over
the
last decade to support the explorat ion of large data sets .
In
th is paper , we propose a c lass i f icat ion of in format ion
v isual izat ion
and v isual data mining techniques which is based on the
data type to be v isual ized , the v isual izat ion technique
and the interact ion and d istort ion technique . We
exempl i fy the c lass i f icat ion us ing a few examples, most
of them referr ing to
techniques and systems presented in th is specia l issue.
Keywords— Informat ion Visual izat ion, V isual Data
Min ing,
Visual Data Explorat ion, C lass i f icat iona I . Introduct ion
The progress made in hardware technology a l lows
today’s
computer systems to store very large amounts of data.
Researchers f rom the Univers i ty of Berkeley est imate
that
every year about 1 Exabyte (= 1 Mi l l ion Terabyte) of
data
are generated, of which a large port ion is avai lable in
d ig i ta l
form. This means that in the next three years more
data wi l l be generated than in a l l o f human history
before.
The data is of ten automat ica l ly recorded v ia sensors and
monitor ing systems. Even s imple t ransact ions of every
day
l i fe , such as paying by credi t card or us ing the
te lephone,
are typica l ly recorded by computers . Usual ly , many
parameters
are recorded, resul t ing in mult id imensional data
with a h igh d imensional i ty . The data of a l l ment ioned
areas
is co l lected because people bel ieve that i t i s a potent ia l
source of va luable informat ion, provid ing a compet i t ive
advantage(at some point) . F inding the valuable
informat ion
hidden in them, however, is a d i f f icu l t task. With today’s
data management systems, i t i s only poss ib le to v iew
qui te
smal l port ions of the data. I f the data is presented
textual ly ,
the amount of data which can be d isp layed is in the
range of some one hundred data i tems, but th is is l ike a
drop in the ocean when deal ing with data sets conta in ing
mi l l ions of data i tems. Having no poss ib i l i ty to
adequately
explore the large amounts of data which have been
col lected
because of thei r potent ia l usefu lness, the data becomes
useless and the databases become data ‘dumps’ .
Danie l A. Keim is current ly with AT&T Shannon Research
Labs,
F lorham Park, NJ , USA and the Univers i ty of Constance,
Germany
E-mai l : ke [email protected] .com.
This is an extended vers ion of [6] , port ions of which are
copyr ighted
by ACM.
Benef i ts of V isual Data Explorat ion
For data mining to be ef fect ive, i t i s important to inc lude
the human in the data explorat ion process and combine
the
f lex ib i l i ty , creat iv i ty , and general knowledge of the
human
with the enormous storage capaci ty and the
computat ional
power of today’s computers . V isual data explorat ion
a ims
at integrat ing the human in the data explorat ion
process,
apply ing i ts perceptual abi l i t ies to the large data sets
avai lable
in today’s computer systems. The bas ic idea of v isual
data explorat ion is to present the data in some v isual
form,
al lowing the human to get ins ight into the data, draw
conclus ions,
and d i rect ly interact wi th the data. V isual data
mining techniques have proven to be of h igh value in
exploratory
data analys is and they a lso have a h igh potent ia l
for explor ing large databases. V isual data explorat ion is
especia l ly usefu l when l i t t le is known about the data and
the explorat ion goals are vague. S ince the user is
d i rect ly
involved in the explorat ion process, sh i f t ing and
adjust ing
the explorat ion goals is automat ica l ly done i f necessary.
The v isual data explorat ion process can be seen a
hypothes is
generat ion process: The v isual izat ions of the data
al low the user to gain ins ight into the data and come up
with new hypotheses. The ver i f icat ion of the hypotheses
can a lso be done v ia v isual data explorat ion but i t may
a lso
be accompl ished by automat ic techniques f rom stat ist ics
or
machine learning. In addi t ion to the d i rect involvement
of
the user , the main advantages of v isual data explorat ion
over automat ic data mining techniques f rom stat ist ics or
machine learning are:
• v isual data explorat ion can eas i ly deal wi th h ighly
inhomogeneous
and noisy data
• v isual data explorat ion is intu i t ive and requires no
understanding
of complex mathemat ica l or stat ist ica l a lgor i thms
or parameters .
As a resul t , v isual data explorat ion usual ly a l lows a
faster data explorat ion and often provides better resul ts ,
especia l ly in cases where automat ic a lgor i thms fa i l . In
addi t ion,
v isual data explorat ion techniques provide a much
higher degree of conf idence in the f indings of the
explorat ion.
This fact leads to a h igh demand for v isual explorat ion
techniques and makes them indispensable in conjunct ion
with automat ic explorat ion techniques.
Visual Explorat ion Paradigm
Visual Data Explorat ion usual ly fo l lows a three step
process:
Overv iew f i rst , zoom and f i l ter , and then deta i ls -
ondemand
(which has been cal led the Informat ion Seeking
Mantra [1]) . F i rst , the user needs to get an overv iew of
the data. In the overv iew, the user ident i f ies interest ing
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER
GRAPHICS, VOL. 7, NO. 1, JANUARY-MARCH 2002 101
patterns and focuses on one or more of them. For
analyz ing
the patterns, the user needs to dr i l l -down and access
deta i ls of the data. V isual izat ion technology may be
used
for a l l three steps of the data explorat ion process:
V isual izat ion
techniques are usefu l for showing an overv iew of the
data, a l lowing the user to ident i fy interest ing subsets . In
th is step, i t i s important to keep the overv iew
v isual izat ion
whi le focus ing on the subset us ing an other v isual izat ion
technique. An a l ternat ive is to d istort the overv iew
v isual izat ion
in order to focus on the interest ing subsets . To
further explore the interest ing subsets , the user needs a
dr i l l -down capabi l i ty in order to get the deta i ls about the
data. Note that v isual izat ion technology does not only
provide
the base v isual izat ion techniques for a l l three steps but
a lso br idges the gaps between the steps.
I I . C lass i f icat ion of V isual Data Min ing
Techniques
Informat ion v isual izat ion focuses on data sets lack ing
inherent
2D or 3D semant ics and therefore a lso lack ing a
standard mapping of the abstract data onto the phys ica l
screen space. There are a number of wel l known
techniques
for v isual iz ing such data sets such as x-y p lots ,
l ine p lots , and h istograms. These techniques are usefu l
for data explorat ion but are l imited to re lat ive ly smal l
and
low-dimensional data sets . In the last decade, a large
number
of novel in format ion v isual izat ion techniques have been
developed, a l lowing v isual izat ions of mult id imensional
data
sets without inherent two- or three-d imensional
semant ics .
Nice overv iews of the approaches can be found in a
number
of recent books [2] [3] [4] [5] . The techniques can be
c lass i f ied
based on three cr i ter ia (see f igure 1) [6] : The data to be
visual ized, the v isual izat ion technique, and the
interact ion
and d istort ion technique used.
The data type to be v isual ized [1] may be
• One-dimensional data, such as temporal data as used
in
ThemeRiver (see f igure 2 in [7])
• Two-dimensional data, such as geographica l maps as
used in Polar is (see f igure 3(c) in [8]) and MGV (see
f igure
9 in [9])
• Mult id imensional data, such as re lat ional tables as
used
in Polar is (see f igure 6 in [8]) and the Scalable
Framework
(see f igure 1 in [10])
• Text and hypertext , such as news art ic les and Web
documents
as used in ThemeRiver (see f igure 2 in [7])
• Hierarchies and graphs, such as te lephone cal ls and
Web
documents as used in MGV (see f igure 13 in [9]) and the
Scalable Framework (see f igure 7 in [10])
• A lgor i thms and software, such as debugging operat ions
as used in Polar is (see f igure 7 in [8])
The v isual izat ion technique used may be c lass i f ied into
• Standard 2D/3D disp lays, such as bar charts and x-y
plots as used in Polar is (see f igure 1 in [8])
• Geometr ica l ly t ransformed disp lays, such as
landscapes
and para l le l coordinates as used in Scalable Framework
(see
f igures 2 and 12 in [10])
F ig . 1 . C lass i f icat ion of Informat ion Visual izat ion
Techniques
• Icon-based d isp lays, such as needle icons and star
icons
as used in MGV (see f igures 5 and 6 in [9])
• Dense p ixel d isp lays, such as the recurs ive pattern and
circ le segments techniques (see f igures 3 and 4) [11]
and the
graph scetches as used in MGV (see f igure 4 in [9])
• Stacked d isp lays, such as t reemaps [12] [13] or
d imensional
stacking [14]
The th i rd d imension of the c lass i f icat ion is the
interact ion
and d istort ion technique used. Interact ion and
distort ion techniques a l low users to d i rect ly interact
wi th
the v isual izat ions. They may be c lass i f ied into
• Interact ive Pro ject ion as used in the GrandTour system
[15]
• Interact ive F i l ter ing as used in Polar is (see f igure 6 in
[8])
• Interact ive Zooming as used in MGV and the Scalable
Framework (see f igure 8 in [10])
• Interact ive Distort ion as used in the Scalable
Framework
(see f igure 7 in [10])
• Interact ive L ink ing and Brushing as used in Polar is
(see
f igure 7 in [8]) and the Scalable Framework (see f igures
12
and 14 in [10])
Note that the three d imensions of our c lass i f icat ion -
data type to be v isual ized, v isual izat ion technique, and
interact ion
& distort ion technique - can be assumed to be
orthogonal . Orthogonal i ty means that any of the
v isual izat ion
techniques may be used in conjunct ion with any of
the interact ion techniques as wel l as any of the
d istort ion
techniques for any data type. Note a lso that a speci f ic
system
may be des igned to support d i f ferent data types and
that i t may use a combinat ion of mult ip le v isual izat ion
and
interact ion techniques.
I I I . Data Type to be Visual ized
In informat ion v isual izat ion, the data usual ly cons ists
of a large number of records each consist ing of a number
of var iables or d imensions. Each record corresponds
to an observat ion, measurement, t ransact ion, etc .
Examples
are customer propert ies , e-commerce transact ions, and
physica l exper iments. The number of attr ibutes can
d i f fer
f rom data set to data set : One part icu lar phys ica l
exper iment,
for example, can be descr ibed by f ive var iables,
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER
GRAPHICS, VOL. 7, NO. 1, JANUARY-MARCH 2002 102
whi le an other may need hundreds of var iables. We cal l
the number of var iables the d imensional i ty of the data
set .
Data sets may be one-dimensional , two-dimensional ,
mult id imensional
or may have more complex data types such
as text /hypertext or h ierarchies/graphs. Somet imes, a
d ist ict ion
is made between dense (or gr id) d imensions and
the d imensions which may have arb i t rary values.
Depending
on the number of d imensions with arb i t rary values the
data is somet imes a lso ca l led univar iate, b ivar iate, etc .
One-dimensional data
One-dimensional data usual ly has one dense d imension.
A typica l example of one-d imensional data is temporal
data. Note that with each point of t ime, one or mult ip le
data values may be associated. An example are t ime
ser ies of stock pr ices (see f igure 3 and f igure 4 for an
example)
or the t ime ser ies of news data used in the ThemeRiver
examples (see f igures 2-5 in [7]) .
Two-dimensional data
Two-dimensional data has two d ist inct d imensions. A
typica l example is geographica l data where the two
d ist inct
d imensions are longi tude and lat i tude. X-Y-p lots are a
typica l
method for showing two-dimensional data and maps
are a specia l type of x-y-p lots for showing two-
dimensional
geographica l data. Examples are the geographica l maps
used in Polar is (see f igure 3(c) in [8]) and in MGV (see
f igure
9 in [9]) . A l though i t seems easy to deal wi th temporal
or geographic data, caut ion is advised. I f the number of
records to be v isual ized is large, temporal axes and
maps
get quick ly g lutted - and may not help to understand the
data.
Mult i -d imensional data
Many data sets cons ists of more than three attr ibutes
and therefore, they do not a l low a s imple v isual izat ion
as
2-d imensional or 3-d imensional p lots . Examples of
mult id imensional
(or mult ivar iate) data are tables f rom re lat ional
databases, which often have tens to hundreds of
co lumns
(or attr ibutes) . S ince there is no s imple mapping of the
attr ibutes
to the two d imensions of the screen, more sophist icated
visual izat ion techniques are needed. An example of
a technique which a l lows the v isual izat ion of
mult id imensional
data is the Para l le l Coordinate Technique [16] (see
f igure 2, which is a lso used in the Scalable Framework
(see
f igure 12 in [10]) . Para l le l Coordinates d isp lay each
mult id imensional
data i tem as a polygonal l ine which intersects
the hor izonta l d imension axes at the pos i t ion
corresponding
to the data value for the corresponding d imension.
Text & Hypertext
Not a l l data types can be descr ibed in terms of
d imensional i ty .
In the age of the wor ld wide web, one important
data type is text and hypertext as wel l as mult imedia
web
page contents . These data types d i f fer in that they can
not
be eas i ly descr ibed by numbers and therefore, most of
the
standard v isual izat ion techniques can not be appl ied. In
F ig. 2 . Para l le l Coordinate Visual izat ion c IEEE
most cases, f i rst a t ransformat ion of the data into
descr ipt ion
vectors is necessary before v isual izat ion techniques can
be used. An example for a s imple t ransformat ion is word
count ing (see ThemeRiver [7]) which is of ten combined
with a pr inc ipal component analys is or mult id imensional
scal ing ( for example, see [17]) .
Hierarchies & Graphs
Data records often have some re lat ionship to other
p ieces
of informat ion. Graphs are widely used to represent such
interdependencies. A graph consists of set of objects ,
ca l led
nodes, and connect ions between these objects , ca l led
edges.
Examples are the e-mai l interre lat ionships among
people,
their shopping behavior , the f i le structure of the hard
d isk
or the hyper l inks in the wor ld wide web. There are a
number
of speci f ic v isual izat ion techniques that deal wi th
h ierarchica l
and graphica l data. A n ice overv iew of h ierachica l
informat ion v isual izat ion techniques can be found in
[18] ,
an overv iew of web v isual izat ion techniques at [19] and
an
overv iew book on a l l aspects re lated to graph drawing is
[20] .
A lgor i thms & Software
Another c lass of data are a lgor i thms & software. Coping
with large software pro jects is a chal lenge. The goal of
v isual izat ion
is to support software development by help ing
to understand a lgor i thms, e .g. by showing the f low of
informat ion
in a program, to enhance the understanding of
wr i t ten code, e .g. by represent ing the structure of
thousands
of source code l ines as graphs, and to support the
programmer in debugging the code, i .e . by v isual iz ing
errors .
There are a large number of too ls and systems which
support these tasks. An n ice overv iew can be found at
[21] .
IV. V isual izat ion Techniques
There is a large number of v isual izat ion techniques
which
can be used for v isual iz ing the data. In addi t ion to
standard 2D/3D-techniques such as x-y (x-y-z) p lots , bar
charts , l ine graphs, etc . , there are a number of more
sophist icated
visual izat ion techniques. The c lasses correspond to
basic v isual izat ion pr inc ip les which may be combined in
order to implement a speci f ic v isual izat ion system.
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER
GRAPHICS, VOL. 7, NO. 1, JANUARY-MARCH 2002 103
Fig. 3 . Dense P ixel Disp lays: Recurs ive Pattern
Technique c IEEE
Geometr ica l ly -Transformed Displays
Geometr ica l ly t ransformed disp lay techniques a im at
f inding “ interest ing” t ransformat ions of mult id imensional
data sets . The c lass of geometr ic d isp lay techniques
inc ludes
techniques f rom exploratory stat ist ics such as
scatterp lot
matr ices [22] [23] and techniques which can be
subsumed
under the term “pro ject ion pursui t” [24] . Other
geometr ic pro ject ion techniques inc lude Prosect ion
Views
[25] [26] , Hypers l ice [27] , and the wel l -known Para l le l
Coordinates
visual izat ion technique [16] . The para l le l coordinate
technique maps the k-d imensional space onto the two
display d imensions by us ing k equid istant axes which are
paral le l to one of the d isp lay axes. The axes corespond
to
the d imensions and are l inear ly scaled f rom the
minimum to
the maximum value of the corresponding d imension.
Each
data i tem is presented as a polygonal l ine, intersect ing
each
of the axes at that point which corresponds to the value
of
the considered d imensions (see f igure 2) .
Iconic Disp lays
Another c lass of v isual data explorat ion techniques are
the iconic d isp lay techniques. The idea is to map the
attr ibute
values of a mult i -d imensional data i tem to the features
of an icon. Icons can be arb i t ra i ly def ined: They may
be l i t t le faces [28] , needle icons as used in MGV (see
f igure
5 in [9]) , s tar icons [14] , s t ick f igure icons [29] , co lor
icons
[30] , [31] , and T i leBars [32] . The v isual izat ion is
generated
by mapping the attr ibute values of each data record to
the
features of the icons. In case of the st ick f igure
technique,
for example, two d imensions are mapped to the d isp lay
dimensions and the remain ing d imensions are mapped to
the angles and/or l imb length of the st ick f igure icon. I f
the data i tems are re lat ive ly dense with respect to the
two
display d imensions, the resul t ing v isual izat ion presents
texture
patterns that vary according to the character ist ics of
the data and are therefore detectable by preattent ive
percept ion.
F ig . 4 . Dense P ixel Disp lays: C i rc le Segments Technique
c IEEE
Dense P ixel Disp lays
The bas ic idea of dense p ixel techniques is to map each
dimension value to a co lored p ixel and group the p ixels
belonging
to each d imension into adjacent areas [11] . S ince
in general dense p ixel d isp lays use one p ixel per data
value,
the techniques a l low the v isual izat ion of the largest
amount
of data poss ib le on current d isp lays (up to about
1.000.000
data values) . I f each data value is represented by one
pixel , the main quest ion is how to arrange the p ixels on
the screen. Dense p ixel techniques use d i f ferent
arrangments
for d i f ferent purposes. By arranging the p ixels in an
appropr iate way, the resul t ing v isual izat ion provides
deta i led
informat ion on local corre lat ions, dependencies, and
hot spots .
Wel l -known examples are the recurs ive pattern
technique
[33] und the c i rc le segments technique [34] . The
recurs ive
pattern technique is based on a gener ic recurs ive back-
andforth
arrangement of the p ixels and is part icu lar a imed at
represent ing datasets with a natura l order according to
one
attr ibute (e.g. t ime ser ies data) . The user may speci fy
parameters
for each recurs ion level , and thereby contro ls the
arrangement of the p ixels to form semant ica l ly
meaningfu l
substructures. The base e lement on each recurs ion level
is a pattern of height h i und width wi as speci f ied by the
user . F i rst , the e lements correspond to s ingle p ixels
which
are arranged with in a rectangle of height h1 and width
w1
from lef t to r ight , then below backwards f rom r ight to
lef t ,
then again forward f rom lef t to r ight , and so on. The
same
basic arrangement is done on a l l recurs ion levels with
the
only d i f ference that the bas ic e lements which are
arranged
on level i are the pattern resul t ing f rom the level ( i − 1)
arrangements. In f igure 3, an example recurs ive pattern
visual izat ion of f inancia l data is shown. The v isual izat ion
shows twenty years ( January 1974 - Apr i l 1995) of dai ly
pr ices of the 100 stocks conta ined in the Frankfurt Stock
Index (FAZ). The idea of the c i rc le segments technique
[34]
is to represent the data in a c i rc le which is d iv ided into
segments,
one for each attr ibute. With in the segments each
attr ibute value is again v isual ized by a s ingle co lored
p ixel .
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER
GRAPHICS, VOL. 7, NO. 1, JANUARY-MARCH 2002 104
Fig. 5 . Dimensional Stacking Visual izat ion of Oi l Min ing
Data
(used by permiss ion of M. Ward, Worchester Po lytechnic
c IEEE)
The arrangment of the p ixels starts at the center of the
c i rc le and cont inues to the outs ide by p lott ing on a l ine
orthogonal to the segment halv ing l ine in a back and
forth
manner. The rat ional of th is approach is that c lose to the
center a l l at t r ibutes are c lose to each other enhancing
the
visual compar ison of thei r va lues. F igure 4 shows an
example
c i rc le segment v isual izat ion of the same data (50 stocks)
as shown in f igure 3.
Stacked Displays
Stacked d isp lay techniques are ta i lored to present data
part i t ioned in a h ierarchica l fashion. In case of
mult id imensional
data, the data d imensions to be used for part i t ion ing
the data and bui ld ing the h ierarchy have to be
selected appropr iate ly . An example of a stacked d isp lay
technique is Dimensional Stacking [35] . The bas ic idea is
to embed one coordinate systems ins ide an other
coordinate
system, i .e . two attr ibutes form the outer coordinate
system, two other attr ibutes are embedded into the
outer
coordinate system, and so on. The d isp lay is generated
by d iv id ing the outmost level coordinate systems into
rectangular
cel ls and with in the cel ls the next two attr ibutes
are used to span the second level coordinate system.
This
process may be repeated one more t ime. The usefu lness
of the resul t ing v isual izat ion largely depends on the
data
distr ibut ion of the outer coordinates and therefore the
d imensions
which are used for def in ing the outer coordinate
system have to be se lected carefu l ly . A ru le of thumb is
to
choose the most important d imensions f i rst . A
d imensional
stacking v isual izat ion of o i l min ing data with longi tude
and
lat i tude mapped to the outer x and y axes, as wel l as ore
grade and depth mapped to the inner x and y axes is
shown
in f igure 5. Other examples of stacked d isp lay
techniques
inc lude Wor lds-with in-Wor lds [36] , Treemap [12] [13] ,
and
Cone Trees [37] .
V. Interact ion and Distort ion Techniques
In addi t ion to the v isual izat ion technique, for an
ef fect ive
data explorat ion i t i s necessary to use some interact ion
and d istort ion techniques. Interact ion techniques a l low
the
data analyst to d i rect ly interact wi th the v isual izat ions
and
dynamical ly change the v isual izat ions according to the
explorat ion
object ives, and they a lso make i t poss ib le to re late
and combine mult ip le independent v isual izat ions.
Distort ion
techniques help in the data explorat ion process by
provid ing means for focus ing on deta i ls whi le preserv ing
an overv iew of the data. The bas ic idea of d istort ion
techniques
i s to show port ions of the data with a h igh level of
deta i l whi le others are shown with a lower level of
deta i l .
We d ist inguish between the terms dynamic and
interact ive
depending on whether the changes to the v isual izat ions
are
made automat ica l ly or manual ly (by d i rect user
interact ion) .
Dynamic Pro ject ions
The bas ic idea of dynamic pro ject ions is to dynamical ly
change the pro ject ions in order to explore a
mult id imensional
data set . A c lass ic example is the Grand-
Tour system [15] which tr ies to show al l interest ing
twodimensional
pro ject ions of a mult i -d imensional data set as
a ser ies of scatter p lots . Note that the number of
poss ib le
pro ject ions is exponent ia l in the number of d imensions,
i .e .
i t i s intractable for a large d imensional i ty . The sequence
of
pro ject ions shown can be random, manual , precomputed,
or data dr iven. Systems support ing dynamic pro ject ion
techniques are XGobi [38] [39] , XL ispStat [40] , and
ExplorN
[41] .
Interact ive F i l ter ing
In explor ing large data sets , i t i s important to
interact ive ly
part i t ion the data set into segements and focus on
interest ing subsets . This can be done by a d i rect
se lect ion
of the des i red subset (browsing) or by a speci f icat ion
of propert ies of the des i red subset (query ing) . Browsing
is
very d i f f icu l t for very large data sets and query ing often
does not produce the des i red resul ts . Therefore a
number
of interact ion techniques have been developed to
improve
interact ive f i l ter ing in data explorat ion. An example of
an
interact ive tool which can be used for an interact ive
f i l ter ing
are Magic Lenses [42] [43] . The bas ic idea of Magic
Lenses is to use a tool l ike a magni fy ing g lasses to
support
f i l ter ing the data d i rect ly in the v isual izat ion. The data
under the magni fy ing g lass is processed by the f i l ter ,
and
the resul t is d isp layed d i f ferent ly than the remain ing
data
set . Magic Lenses show a modi f ied v iew of the se lected
region,
whi le the rest of the v isual izat ion remains unaf fected.
Note that severa l lenses with d i f ferent f i l ters may be
used; i f
the f i l ter over lap, a l l f i l ters are combined. Other
examples
of interact ive f i l ter ing techniques and tools are
InfoCrysta l
[44] , Dynamic Quer ies [45] [46] [47] , and Polar is [8]
(see
f igure 6 in [8] for an example) .
Interact ive Zooming
Zooming is a wel l -known technique which is widely used
in a number of appl icat ions. In deal ing with large
amounts
of data, i t i s important to present the data in a h ighly
compressed
form to provide an overv iew of the data but at the
same t ime a l low a var iable d isp lay of the data on
d i f ferent
resolut ions. Zooming does not only mean to d isp lay the
data objects larger but i t a lso means that the data
representat ion
automat ica l ly changes to present more deta i ls on
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER
GRAPHICS, VOL. 7, NO. 1, JANUARY-MARCH 2002 105
Fig. 6 . Table Lenses (used by permiss ion of R. Rao, Xerox
PARC
c ACM)
higher zoom levels . The objects may, for example, be
represented
as s ingle p ixels on a low zoom level , as icons on an
intermediate zoom level , and as labeled objects on a
h igh
resolut ion. An interest ing example apply ing the zooming
idea to large tabular data sets is the TableLens approach
[48] . Gett ing an overv iew of large tabular data sets is
d i f f icu l t
i f the data is d isp layed in textual form. The bas ic
idea of TableLens is to represent each numerica l va lue
by a
smal l bar . A l l bars have a one-p ixel height and the
lengths
are determined by the attr ibute values. This means that
the number of rows on the d isp lay can be near ly as h igh
as
the vert ica l resolut ion and the number of co lumns
depends
on the maximum width of the bars for each attr ibute.
The
in i t ia l v iew a l lows the user to detect patterns,
corre lat ions,
and out l iers in the data set . In order to explore a region
of interest the user can zoom in, wi th the resul t that the
af fected rows (or co lumns) are d isp layed in more deta i l ,
poss ib ly even in textual form. F igure 6 shows an
example
of a basebal l database with a few rows being selected
in fu l l deta i l . Other examples of techniques and systems
which use interact ive zooming inc lude PAD++ [49] [50]
[51] , IVEE/Spotf i re [52] , and DataSpace [53] . A
compar ison
of f isheye and zooming techniques can be found in [54] .
Interact ive Distort ion
Interact ive d istort ion techniques support the data
explorat ion
process by preserv ing an overv iew of the data dur ing
dr i l l -down operat ions. The bas ic idea is to show port ions
of
the data with a h igh level of deta i l whi le others are
shown
with a lower level of deta i l . Popular d istort ion
techniques
are hyperbol ic and spher ica l d istort ions which are often
used on h ierarchies or graphs but may be a lso appl ied to
any other v isual izat ion technique. An example of
spher ica l
d istort ions is provided in the Scalable Framework paper
(see f igure 5 in [10]) . An overv iew of d istort ion
techniques
is provided in [55] and [56] . Examples of d istort ion
techniques
inc lude Bi focal Disp lays [57] , Perspect ive Wal l [58] ,
Graphica l F isheye Views [59] [60] , Hyperbol ic
V isual izat ion
[61] [62] , and Hyperbox [63] .
Interact ive L ink ing and Brushing
There are many poss ib i l i t ies to v isual ize
mult id imensional
data but a l l o f them have some strength and
some weaknesses. The idea of l ink ing and brushing is to
combine d i f ferent v isual izat ion methods to overcome the
shortcomings of s ingle techniques. Scatterp lots of
d i f ferent
project ions, for example, may be combined by co lor ing
and
l ink ing subsets of points in a l l pro ject ions. In a s imi lar
fashion,
l ink ing and brushing can be appl ied to v isual izat ions
generated by a l l v isual izat ion techniques descr ibed
above.
As a resul t , the brushed points are h ighl ighted in a l l
v isual izat ions,
making i t poss ib le to detect dependencies and
corre lat ions. Interact ive changes made in one
v isual izat ion
are automat ica l ly ref lected in the other v isual izat ions.
Note that connect ing mult ip le v isual izat ions through
interact ive
l ink ing and brushing provides more informat ion than
consider ing the component v isual izat ions independent ly .
Typica l examples of v isual izat ion techniques which are
combined by l ink ing and brushing are mult ip le
scatterp lots ,
bar charts , para l le l coordinates, p ixel d isp lays, and
maps.
Most interact ive data explorat ion systems a l low some
form
of l ink ing and brushing. Examples are Polar is (see f igure
7 in [8]) and the Scalable Framework (see f igures 12 and
14 in [10]) . Other tools and systems inc lude S P lus [64] ,
XGobi [38] [65] , Xmdv [14] , and DataDesk [66] [67] .
VI . Conclus ion
The explorat ion of large data sets is an important but
d i f f icu l t
problem. Informat ion v isual izat ion techniques may
help to so lve the problem. Visual data explorat ion has
a h igh potent ia l and many appl icat ions such as f raud
detect ion
and data mining wi l l use informat ion v isual izat ion
technology for an improved data analys is .
Future work wi l l involve the t ight integrat ion of
v isual izat ion
techniques with t radi t ional techniques f rom such
disc ip l ines as stat ist ics , maschine learning, operat ions
research,
and s imulat ion. Integrat ion of v isual izat ion techniques
and these more establ ished methods would combine
fast automat ic data mining a lgor i thms with the intu i t ive
power of the human mind, improving the qual i ty
and speed of the v isual data mining process. V iusal data
mining techniques a lso need to be t ight ly integrated
with
the systems used to manage the vast amounts of
re lat ional
and semistructured informat ion, inc luding database
management
and data warehouse systems. The u l t imate goal
is to br ing the power of v isual izat ion technology to every
desktop to a l low a better , faster and more intu i t ive
explorat ion
of very large data resources. This wi l l not only be
valuable in an economic sense but wi l l a lso st imulate
and
del ight the user .
References
[1] B. Shneiderman, “The eye have i t : A task by data
type taxonomy
for informat ion v isual izat ions,” in V isual Languages ,
1996.
[2] S . Card, J . Mackin lay, and B. Shneiderman, Readings
in Informat ion
Visual izat ion , Morgan Kaufmann, 1999.
[3] C. Ware, Informat ion Visual izat ion: Percept ion for
Design,
Morgen Kaufman, 2000.
[4] B. Spence, Informat ion Visual izat ion , Pearson
Educat ion
Higher Educat ion publ ishers , UK, 2000.
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER
GRAPHICS, VOL. 7, NO. 1, JANUARY-MARCH 2002 106
[5] H. Schumann and W. M¨ul ler , V isual is ierung:
Grundlagen und
al lgemeine Methoden , Spr inger , 2000.
[6] D. Keim, “Visual explorat ion of large databases,”
Communicat ions
of the ACM, vol . 44, no. 8 , pp. 38–44, 2001.
[7] L . Nowel l S . Havre, B. Hetz ler and P. Whitney,
“Themeriver :
V isual iz ing themat ic changes in large document
co l lect ions,”
Transact ions on Visual izat ion and Computer Graphics ,
2001.
[8] D. Tang C. Sto l te and P. Hanrahan, “Polar is : A system
for query, analys is and v isual izat ion of mult i -d imensional
re lat ional
databases,” Transact ions on Visual izat ion and Computer
Graphics , 2001.
[9] J . Abel lo and J . Korn, “Mgv: A system for v isual iz ing
mass ive
mult i -d igraphs,” Transact ions on Visual izat ion and
Computer
Graphics , 2001.
[10] N. Lopez M. Kreuseler and H. Schumann, “A scalable
f ramework
for informat ion v isual izat ion,” Transact ions on
Visual izat ion
and Computer Graphics , 2001.
[11] D. Keim, “Designing p ixel -or iented v isual izat ion
techniques:
Theory and appl icat ions,” Transact ions on Visual izat ion
and
Computer Graphics , vo l . 6 , no. 1 , pp. 59–78, Jan–Mar
2000.
[12] B. Shneiderman, “Tree v isual izat ion with t reemaps:
A 2D spacef i l l ing
approach,” ACM Transact ions on Graphics , vo l . 11, no.
1, pp. 92–99, 1992.
[13] B. Johnson and B. Shneiderman, “Treemaps: A
space- f i l l ing
approach to the v isual izat ion of h ierarchica l
in format ion,” in
Proc. V isual izat ion ’91 Conf , 1991, pp. 284–291.
[14] M. O. Ward, “Xmdvtool : Integrat ing mult ip le
methods for v isual iz ing
mult ivar iate data,” in Proc. V isual izat ion 94,
Washington,
DC, 1994, pp. 326–336.
[15] D. As imov, “The grand tour: A tool for v iewing
mult id imensional
data,” S IAM Journal of Sc ience & Stat . Comp. , vo l . 6 ,
pp. 128–143, 1985.
[16] A. Inselberg and B. Dimsdale, “Para l le l coordinates:
A tool for
v isual iz ing mult i -d imensional geometry,” in Proc.
V isual izat ion
90, San Francisco, CA , 1990, pp. 361–370.
[17] J . A . Wise, J . J . Thomas, K. Pennock, D. Lantr ip , M.
Pott ier ,
Schur A. , and V. Crow, “Visual iz ing the non-v isual :
Spat ia l
analys is and interact ion with informat ion f rom text
documents,”
in Proc. Symp. on Informat ion Visual izat ion, At lanta,
GA, 1995, pp. 51–58.
[18] C. Chen, Informat ion Visual isat ion and Vir tual
Environments ,
Spr inger-Ver lag, London, 1999.
[19] M. Dodge, “Web v isual izat ion,”
http: / /www.geog.uc l .ac.uk/
casa/mart in/geography of cyberspace.html , oct 2001.
[20] G. D. Batt is ta , P . Eades, R. Tamassia , and I . G.
To l l i s , Graph
Drawing, Prent ice Hal l , 1999.
[21] J . Tr i lk , “Software v isual izat ion,” http: / /wwwbroy.
informat ik . tu-muenchen.de/˜tr i lk /sv.html , Oct 2001.
[22] D. F . Andrews, “P lots of h igh-d imensional data,”
B iometr ics ,
vo l . 29, pp. 125–136, 1972.
[23] W. S. C leveland, V isual iz ing Data , AT&T Bel l
Laborator ies ,
Murray Hi l l , N J , Hobart Press, Summit NJ , 1993.
[24] P . J . Huber, “The annals of stat ist ics ,” Pro ject ion
Pursui t , vo l .
13, no. 2 , pp. 435–474, 1985.
[25] G. W. Furnas and A. Buja, “Prosect ions v iews:
Dimensional
inference through sect ions and pro ject ions,” Journal of
Computat ional
and Graphica l Stat ist ics , vo l . 3 , no. 4 , pp. 323–353,
1994.
[26] R. Spence, L . Tweedie, H. Dawkes, and H. Su,
“Visual izat ion
for funct ional des ign,” in Proc. Int . Symp. on Informat ion
Visual izat ion
( InfoVis ’95) , 1995, pp. 4–10.
[27] J . J . van Wi jk and R. . D. van L iere, “Hypers l ice,” in
Proc.
Visual izat ion ’93, San Jose, CA , 1993, pp. 119–125.
[28] H. Chernof f , “The use of faces to represent points in
kdimensional
space graphica l ly ,” Journal Amer. Stat ist ica l Associat ion ,
vol . 68, pp. 361–368, 1973.
[29] R. M. P ickett and G. G. Gr inste in, “ Iconographic
d isp lays for
v isual iz ing mult id imensional data,” in Proc. IEEE Conf .
on
Systems, Man and Cybernet ics , IEEE Press, P iscataway,
NJ ,
1988, pp. 514–519.
[30] H. Levkowitz , “Color icons: Merging co lor and
texture percept ion
for integrated v isual izat ion of mult ip le parameters ,” in
Proc. V isual izat ion 91, San Diego, CA , 1991, pp. 22–25.
[31] D. A. Keim and H. -P . Kr iegel , “Visdb: Database
explorat ion
using mult id imensional v isual izat ion,” Computer
Graphics &
Appl icat ions , vo l . 6 , pp. 40–49, Sept . 1994.
[32] M. Hearst , “T i lebars: V isual izat ion of term
distr ibut ion informat ion
in fu l l text informat ion access,” in Proc. of ACM Human
Factors in Comput ing Systems Conf . (CHI ’95) , 1995, pp.
59–66.
[33] D. A. Keim, H. -P . Kr iegel , and M. Ankerst ,
“Recurs ive pattern:
A technique for v isual iz ing very large amounts of data,”
in Proc.
Visual izat ion 95, At lanta, GA , 1995, pp. 279–286.
[34] M. Ankerst , D. A. Keim, and H. -P . Kr iegel , “Ci rc le
segments:
A technique for v isual ly explor ing large mult id imensional
data
sets ,” in Proc. V isual izat ion 96, Hot Topic Sess ion, San
Francisco,
CA, 1996.
[35] J . LeBlanc, M. O. Ward, and N. Witte ls , “Explor ing
ndimensional
databases,” in Proc. V isual izat ion ’90, San Francisco,
CA, 1990, pp. 230–239.
[36] S. Fe iner and C. Beshers , “Visual iz ing n-d imensional
v i r tual
wor lds with n-v is ion,” Computer Graphics , vo l . 24, no. 2 ,
pp.
37–38, 1990.
[37] G. G. Robertson, J . D. Mackin lay, and S. K . Card,
“Cone
t rees: Animated 3D v isual izat ions of h ierarchica l
in format ion,”
in Proc. Human Factors in Comput ing Systems CHI 91
Conf . ,
New Or leans, LA , 1991, pp. 189–194.
[38] D. F . Swayne, D. Cook, and A. Buja, User ’s Manual
for XGobi :
A Dynamic Graphics Program for Data Analys is , Bel lcore
Technica l
Memorandum, 1992.
[39] A. Buja, D. F . Swayne, and D. Cook, “ Interact ive
h ighdimensional
data v isual izat ion,” Journal of Computat ional and
Graphica l Stat ist ics , vo l . 5 , no. 1 , pp. 78–99, 1996.
[40] L . T ierney, “L ispstat : An object -or ientated
environment for
stat ist ica l comput ing and dynamic graphics ,” in Wi ley,
New
York, NY, 1991.
[41] D. B. Carr , E . J . Wegman, and Q. Luo, “Explorn:
Design considerat ions
past and present ,” in Technica l Report , No. 129,
Center for Computat ional Stat ist ics , George Mason
Univers i ty ,
1996.
[42] E . A. B ier , M. C. Stone, K. P ier , W. Buxton, and T.
DeRose,
“Toolg lass and magic lenses: The see-through inter face,”
in
Proc. S IGGRAPH ’93, Anaheim, CA , 1993, pp. 73–80.
[43] K. F ishkin and M. C. Stone, “Enhanced dynamic
quer ies v ia
movable f i l ters ,” in Proc. Human Factors in Comput ing
Systems
CHI ’95 Conf . , Denver , CO , 1995, pp. 415–420.
[44] A. Spoerr i , “ Infocrysta l : A v isual too l for informat ion
retr ieval ,”
in Proc. V isual izat ion ’93, San Jose, CA , 1993, pp. 150–
157.
[45] C. Ahlberg and B. Shneiderman, “Visual in format ion
seeking:
T ight coupl ing of dynamic query f i l ters with star f ie ld
d isp lays,”
in Proc. Human Factors in Comput ing Systems CHI ’94
Conf . ,
Boston, MA, 1994, pp. 313–317.
[46] S. G. E ick, “Data v isual izat ion s l iders ,” in Proc. ACM
UIST,
1994, pp. 119–120.
[47] J . Goldste in and S. F . Roth, “Us ing aggregat ion and
dynamic
quer ies for explor ing large data sets ,” in Proc. Human
Factors
in Comput ing Systems CHI ’94 Conf . , Boston, MA , 1994,
pp.
23–29.
[48] R. Rao and S. K . Card, “The table lens: Merging
graphica l
and symbol ic representat ion in an interact ive
focus+context v isual izat ion
for tabular informat ion,” in Proc. Human Factors
in Comput ing Systems CHI 94 Conf . , Boston, MA , 1994,
pp.
318–322.
[49] K. Per l in and D. Fox, “Pad: An a l ternat ive approach
to the
computer inter face,” in Proc. S IGGRAPH, Anaheim, CA ,
1993,
pp. 57–64.
[50] B. Bederson, “Pad++: Advances in mult isca le
inter faces,” in
Proc. Human Factors in Comput ing Systems CHI ’94
Conf . ,
Boston, MA, 1994, p . 315.
[51] B. B. Bederson and J . D. Hol lan, “Pad++: A zooming
graphica l
inter face for explor ing a l ternate inter face phys ics ,” in
Proc.
UIST, 1994, pp. 17–26.
[52] C. Ahlberg and E. Wistrand, “ Ivee: An informat ion
v isual izat ion
and explorat ion environment,” in Proc. Int . Symp. on
Informat ion Visual izat ion, At lanta, GA , 1995, pp. 66–73.
[53] V. Anupam, S. Dar , T . Le ibfr ied, and E. Peta jan,
“Dataspace:
3D v isual izat ion of large databases,” in Proc. Int . Symp.
on
Informat ion Visual izat ion, At lanta, GA , 1995, pp. 82–88.
[54] Schaffer , Doug, Zuo, Zhengping, Bartram, Lyn, Di l l ,
John,
Dubs, Shel l i , Greenberg, Saul , and Roseman, “Compar ing
f isheye
and fu l l -zoom techniques for navigat ion of h ierarchica l ly
c lustered networks,” in Proc. Graphics Inter face (GI ’93) ,
Toronto, Ontar io , 1993, in : Canadian Informat ion
Process ing
Soc. , Toronto, Ontar io , Graphics Press, Cheshire, CT ,
1993,
pp. 87–96.
[55] Y . Leung and M. Apper ley, “A rev iew and taxonomy
of
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER
GRAPHICS, VOL. 7, NO. 1, JANUARY-MARCH 2002 107
distort ion-or iented presentat ion techniques,” in Proc.
Human
Factors in Comput ing Systems CHI ’94 Conf . , Boston, MA ,
1994, pp. 126–160.
[56] M. S . T . Carpendale, D. J . Cowperthwaite, and F. D.
Fracchia,
“ Ieee computer graphics and appl icat ions, specia l issue
on informat ion
visual izat ion,” IEEE Journal Press , vo l . 17, no. 4 , pp.
42–51, Ju ly 1997.
[57] R. Spence and M. Apper ley, “Data base navigat ion:
An of f ice
environment for the profess ional ,” Behaviour and
Informat ion
Technology, vo l . 1 , no. 1 , pp. 43–54, 1982.
[58] J . D. Mackin lay, G. G. Robertson, and S. K . Card,
“The perspect ive
wal l : Deta i l and context smoothly integrated,” in Proc.
Human Factors in Comput ing Systems CHI ’91 Conf . , New
Or leans,
LA, 1991, pp. 173–179.
[59] G. Furnas, “General ized f isheye v iews,” in Proc.
Human Factors
in Comput ing Systems CHI 86 Conf . , Boston, MA , 1986,
pp.
18–23.
[60] M. Sarkar and M. Brown, “Graphica l f isheye v iews,”
Communicat ions
of the ACM, vol . 37, no. 12, pp. 73–84, 1994.
[61] J . Lamping, Rao R. , and P. P i ro l l i , “A focus + context
technique
based on hyperbol ic geometry for v isual iz ing large
h ierarchies,”
in Proc. Human Factors in Comput ing Systems CHI 95
Conf . ,
1995, pp. 401–408.
[62] T . Munzner and P. Burchard, “Visual iz ing the
structure of the
wor ld wide web in 3D hyperbol ic space,” in Proc. VRML
’95
Symp, San Diego, CA , 1995, pp. 33–38.
[63] B. A lpern and L . Carter , “Hyperbox,” in Proc.
V isual izat ion
’91, San Diego, CA , 1991, pp. 133–139.
[64] R. Becker , J . M. Chambers, and A. R. Wi lks , “The
new s language,
wadsworth & brooks/co le advanced books and software,”
Paci f ic Grove, CA, 1988.
[65] R. A. Becker , W. S. C leveland, and M.- J . Shyu, “The
v isual
des ign and contro l of t re l l i s d isp lay,” Journal of
Computat ional
and Graphica l Stat ist ics , vo l . 5 , no. 2 , pp. 123–155,
1996.
[66] P . F Vel leman, Data Desk 4.2: Data Descr ipt ion ,
Data Desk,
I thaca, NY, 1992, 1992.
[67] A. Wi lhelm, A.R. Unwin, and M. Theus, “Software for
interact ive
stat ist ica l graphics - a rev iew,” in Proc. Int . Softstat 95
Conf . , Heidelberg, Germany , 1995.
Biography
DANIEL A. KEIM is work ing in the area of in format ion
visual izat ion and data mining. In the f ie ld of in format ion
visual izat ion, he developed several novel techniques
which
use v isual izat ion technology for the purpose of explor ing
large databases. He has publ ished extens ively on
informat ion
visual izat ion and data mining; he has g iven tutor ia ls
on re lated issues at severa l large conferences inc luding
Visual izat ion, S IGMOD, VLDB, and KDD; he has been
program co-chair of the IEEE Informat ion Visual izat ion
Symposia in 1999 and 2000; he is program co-chair of
the
ACM SIGKDD conference in 2002; and he is an edi tor of
TVCG and the Informat ion Visual izat ion Journal .
Danie l Keim received h is d ip loma (equivalent to an MS
degree) in Computer Sc ience f rom the Univers i ty of
Dortmund
in 1990 and h is Ph.D. in Computer Sc ience f rom the
Univers i ty of Munich in 1994. He has been ass istant
professor
at the CS department of the Univers i ty of Munich,
associate professor at the CS department of the Mart in-
Luther-Univers i ty Hal le , and fu l l professor at the CS
department
of the Univers i ty of Constance. Current ly , he
is on leave f rom the Univers i ty of Constance, work ing at
AT&T Shannon Research Labs, F lorham Park, NJ , USA