Information Visualization in Data Mining S.T. Balke Department of Chemical Engineering and Applied...


Page 1: Information Visualization in Data Mining S.T. Balke Department of Chemical Engineering and Applied Chemistry University of Toronto.

Information Visualization in Data Mining

S.T. Balke
Department of Chemical Engineering and Applied Chemistry
University of Toronto

Page 2:

Motivation

Data visualization
– relies primarily on human cognition for value discovery;
– permits direct incorporation of human ingenuity and analytic capabilities into data mining;
– can deal very effectively with very large quantities of data;
– combines powerfully with machine-based discovery techniques.

Page 3:

Uses

Explorative Analysis
– Data cleaning
– Provide hypotheses

Confirmative Analysis
– Confirm or reject hypotheses

Presentation
– Communicate your work

Page 4:

http://www.alz.washington.edu/DATA2001/GERALD1/sld011.htm

Page 5:

Calculated Properties of the Anscombe Data Sets

mean of the x values = 9.0

mean of the y values = 7.5

equation of the least-squares regression line: y = 3 + 0.5x

sum of squared deviations of x about its mean = 110.0

Page 6:

Calculated Properties of the Anscombe Data Sets

regression sum of squares (variation in y accounted for by x) = 27.5

residual sum of squares (about the regression line) = 13.75

correlation coefficient = 0.82

coefficient of determination = 0.67
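The shared summary statistics can be checked directly. The following is a minimal sketch, assuming the standard published values of Anscombe's four data sets (Anscombe, 1973); the function name `summarize` and the dictionary keys are illustrative choices, not part of the original slides.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics listed on the slides for one data set."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)   # sum of squares of x about its mean
    ss_y = sum((yi - mean_y) ** 2 for yi in y)   # total sum of squares of y
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / ss_x                          # least-squares slope
    intercept = mean_y - slope * mean_x          # least-squares intercept
    ss_reg = slope * s_xy                        # regression sum of squares
    ss_res = ss_y - ss_reg                       # residual sum of squares
    r = s_xy / sqrt(ss_x * ss_y)                 # correlation coefficient
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "slope": slope, "intercept": intercept,
        "ss_x": ss_x, "ss_reg": ss_reg, "ss_res": ss_res,
        "r": r, "r2": r * r,
    }


summaries = {name: summarize(x, y) for name, (x, y) in ANSCOMBE.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four sets give essentially the same numbers, yet their scatterplots look completely different, which is the reason for plotting the data before trusting the summary.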

Page 7:

The Anscombe Data

Page 8:

Marey, 1885

Page 9:

Snow’s Cholera Map, 1855

Page 10:

http://pupgg.princeton.edu/disk20/anonymous/groth/lick/licknorth.gif

Page 11:

Graphical Excellence

Graphical displays should:
– show the data
– induce the viewer to think about the substance, not the methodology
– avoid distorting what the data says
– present many numbers in a small space
– make large data sets coherent
– encourage the eye to compare different pieces of data
– reveal the data at several levels of detail (broad overview to fine structure)
– serve a reasonably clear purpose: description, exploration, tabulation, or decoration
– be closely integrated with the statistical and verbal descriptions of the data set.

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

Page 12:

Graphical Excellence

– Gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
– Nearly always multivariate.
– Requires telling the truth about the data.

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

Page 13:

Lie Factor = 14.8

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

Page 14:

Lie Factor

\[
\text{Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}
\]

\[
\text{Lie Factor} = \frac{\dfrac{(5.3 - 0.6)}{0.6} \times 100}{\dfrac{(27.5 - 18.0)}{18} \times 100} = 14.8
\]

Require: 0.95 < Lie Factor < 1.05

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
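The Lie Factor arithmetic can be reproduced in a few lines. This is a small sketch using the numbers from the slide (graphic size growing from 0.6 to 5.3, data growing from 18.0 to 27.5); the function names `percent_change` and `lie_factor` are hypothetical helpers introduced only for illustration.

```python
def percent_change(first, second):
    """Percentage change from the first value to the second."""
    return (second - first) / first * 100.0


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Size of effect shown in the graphic divided by size of effect in the data."""
    shown = percent_change(graphic_first, graphic_second)
    actual = percent_change(data_first, data_second)
    return shown / actual


# Values from the slide: the graphic grows from 0.6 to 5.3,
# while the underlying data grow from 18.0 to 27.5.
factor = lie_factor(0.6, 5.3, 18.0, 27.5)
acceptable = 0.95 < factor < 1.05   # the slide's stated requirement

print(round(factor, 1), acceptable)
```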

Page 15:

Using Area for One Dimensional Data

Lie Factor = 2.8

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

Page 16:

More guidelines:

– The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.
– No legends: use labels on the graph.
– Graphics must not quote data out of context.

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

Page 17:

Data Ink Ratio

\[
\text{Data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
\]

Data-ink ratio = proportion of a graphic’s ink devoted to the non-redundant display of data-information.

Data-ink ratio = 1.0 − (proportion of a graphic that can be erased without loss of data-information)

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
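The two forms of the data-ink ratio are equivalent, as this minimal sketch shows. The ink quantities are made-up illustrative numbers in arbitrary units (ink is rarely measured in practice), and both function names are hypothetical.

```python
def data_ink_ratio(data_ink, total_ink):
    """Share of the graphic's ink that displays non-redundant data-information."""
    if total_ink <= 0 or not 0 <= data_ink <= total_ink:
        raise ValueError("need 0 <= data_ink <= total_ink and total_ink > 0")
    return data_ink / total_ink


def data_ink_ratio_from_erasable(erasable_fraction):
    """Same ratio, from the fraction of ink erasable without losing data-information."""
    if not 0.0 <= erasable_fraction <= 1.0:
        raise ValueError("erasable_fraction must lie in [0, 1]")
    return 1.0 - erasable_fraction


# Hypothetical graphic: 120 units of ink in total, 30 of which carry data.
total_ink = 120.0
data_ink = 30.0
ratio = data_ink_ratio(data_ink, total_ink)
same_ratio = data_ink_ratio_from_erasable((total_ink - data_ink) / total_ink)

print(ratio, same_ratio)
```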

Page 18:

Maximize Data Density

\[
\text{data density of a graphic} = \frac{\text{number of entries in the data matrix}}{\text{area of data graphic}}
\]

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
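As a rough illustration of the data density formula, the sketch below counts the entries of a data matrix and divides by the plotted area. The plot dimensions (4 by 3 inches) are an assumption chosen for the example, and `data_density` is a hypothetical helper name.

```python
def data_density(n_rows, n_cols, width, height):
    """Entries in the data matrix per unit area of the data graphic."""
    if width <= 0 or height <= 0:
        raise ValueError("graphic dimensions must be positive")
    return (n_rows * n_cols) / (width * height)


# One Anscombe data set has 11 observations of 2 variables (22 entries).
# Assume, for illustration only, that it is drawn on a 4 in x 3 in plot.
density = data_density(n_rows=11, n_cols=2, width=4.0, height=3.0)

print(round(density, 2), "entries per square inch")
```

Halving the plot's width and height while showing the same data would quadruple the density, which is the sense in which the slide asks for it to be maximized.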

Page 19:

Beware Chartjunk

NO:

“Isn’t it remarkable that the computer can be programmed to draw like that.”

YES:

“My, what interesting data!”

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

Page 20:

How to Say Nothing with Information Visualization

http://www.crs4.it/~zip/13ways.html

– Never include a color legend.
– Avoid annotation.
– Never mention error characteristics of the visualization method.
– When in doubt, smooth.
– Don’t say how long it required to plot.
– Never compare your results with other data visualization techniques.
– Never cite references for the data.
– Claim generality but show results from a single data set.
– Use viewing angle to hide blemishes in 3D objects.

Page 21:

An Overview of Information Visualization Methods

http://www.informatik.uni-halle.de/~keim/tutorials.html

Page 22:

Methods of Interest

– Scatterplot Matrices
– Parallel Coordinates
– Pixel Oriented Methods
– Icon based Methods
– Dimensional Stacking
– Treemap

Page 23:

Assignment 1: see handout

Page 24:

Some websites of interest:

– http://dmoz.org/Computers/Software/Databases/Data_Mining/Public_Domain_Software/
– http://www.cs.man.ac.uk/~ngg/InfoViz/Projects_and_Products/Visualization/

Try a search at google.com using the following key words together:

name_of_method download software