Data analysis04 morethantwovariables

21
http://publicationslist.org/junio Data Analysis More Than Two Variables: Graphical Multivariate Analysis Prof. Dr. Jose Fernando Rodrigues Junior ICMC-USP

description

Revised version - spell checked.

Transcript of Data analysis04 morethantwovariables

Page 1: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Data AnalysisMore Than Two Variables: Graphical Multivariate Analysis

Prof. Dr. Jose Fernando Rodrigues JuniorICMC-USP

Page 2: Data analysis04 morethantwovariables

http://publicationslist.org/junio

What is it about?More than two variables determine a tough analytical problem

Although, there are multivariate graphical techniques, they cannot deal too many variables (less than 15 – 25)

Page 3: Data analysis04 morethantwovariables

http://publicationslist.org/junio

What is it about?More than two variables determine a tough analytical

problem

Although, there are multivariate graphical techniques, theycan not deal too many variables (less than 15 – 25)

Page 4: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Three variablesFor example, consider the data defined by function:

푦 = 푓 푥, 푎 = 푥2 + 푎푥 −

푥2 +

푎4

that corresponds to the three-variable setting y, x, and a

One way to analyze this is by means of a surface plot:

Page 5: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Three variablesHowever, the surface plot does not permit an intuitive

reading of quantitative informationAnother way is to use a two-dimensional xy plot with

multiple curves, one for each value of interest of one of thevariables; in the example, variable a is considered for values 2,1, 0, -1, -2

Page 6: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Three variablesSurface plots and multiple-curve xy plots can be in

combination, one providing an aesthetically appealingoverview, the other providing fine details for values ofinterest

It is interesting to note that surface plots go against thecommonsense that 3D plots should be moreinformative than 2D plots

Yet another possibility is to project the function into thebase plane bellow the surface, using either:contour plots false-color plots

Page 7: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Three variablesThe false-color plot is achieved by mapping all values of the

dependent variable y following a palette of colorsThe false-color plot for function f(x, a)

Page 8: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Three variablesThe false-color plot is achieved by mapping all values of the

dependent variable y following a palette of colorsThe false-color plot for function f(x, a)

False-color plots are very effective for presenting quantitative information

It is important to note, however, that its efficiency depends heavily on the color mapping, which must be

intuitive according to the task at hand

For an overview of color-mapping guidelines, see the textbook on page 104

Page 9: Data analysis04 morethantwovariables

http://publicationslist.org/junio

MoreThere are basically two ways to get more information on a

plotPut similar graphs next to each other and vary the

variables in a systematic fashion from one subgraph to the nextmultiplots

Make the graph elements themselves richer with color, shape,and interaction

MultiplotsThe most common forms of multiplots are the scatter-plot

matrix, and the co-plot,

Page 10: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Scatter-plot matrixThe scatter-plot matrix is constructed considering all the

possible two-variable combinations achieved from theset of variables

For each combination, a sub-region of the space isreserved and all the combinations are put together accordingto a straight layout

The more variables, the bigger must be the screen,limits start to manifest around 10 variables, the same for thenumber of data points, limited around 100

For example, consider a 250 wines data set consisting ofseven different properties: acidity, sugar, chlorides, sulfurdioxide, density, alcohol, and quality

The data can be found in the “Wine Quality” data set, available at the UCI MachineLearning repository - http://archive.ics.uci.edu/ml/.

Page 11: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Scatter-plot matrix

Page 12: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Scatter-plot matrix

The scatter plot shows:

• sugar content and density are positively correlated• as the alcohol content goes up, density goes down, inverse

correlation• wine quality seems to increase with increasing alcohol content:

apparently, more potent wines are considered to be better

Page 13: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Co-plotsCo-plots work by partitioning the data according to one of

the variables and plotting each partition in a different plot

In this example, the upperfigure shows how one ofthe variables was used topartition the data

It is possible to notice thatthe intervals overlap and havedifferent sizes so that eachplot has the same numberof points

Page 14: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Co-plotsCo-plots work by partitioning the data according to one of

the variables and plotting each partition in a different plot

In this example, the upperfigure shows how one ofthe variables was used topartition the data

It is possible to notice thatthe intervals overlap and havedifferent sizes so that eachplot has the same numberof points

The partitioning can be determined in many differentways according to the goals of the analysis

Page 15: Data analysis04 morethantwovariables

http://publicationslist.org/junio

CompositionAnother way to visualize more than two variables is to

compose multiple plots according to some of the variables

For example: imagine a company that makes five productslabeled A, B, C, D, and E, and two questions:how many items of each kind are produced overallhow the item mix is changing over time

A simple solution, but not quiteeffective is to plot (days x quantity)the different curves all together

Page 16: Data analysis04 morethantwovariables

http://publicationslist.org/junio

CompositionAnother solution is to use the same plot but stacking the

information, so as to have a notion of total In absolute numbers (left), or In relative contributions (percentage)

Page 17: Data analysis04 morethantwovariables

http://publicationslist.org/junio

CompositionAnother solution is to use the same plot but stacking the

information, so as to have a notion of total In absolute numbers (left), or In relative contributions (percentage)

Surprisingly, the total number ofitems does not change appreciablyover the long run

Here, one can see the contribution ofeach product, an information thatwas lost in the left plot

Page 18: Data analysis04 morethantwovariables

http://publicationslist.org/junio

CompositionAnother solution is to use the same plot but stacking the

information, so as to have a notion of total In absolute numbers (left), or In relative contributions (percentage)

Surprisingly, the total number ofitems does not change appreciablyover the long run

Here, one can see the contribution ofeach product, an information thatwas lost in the left plot

It is desirable to use multiple kinds of plots combinedin order to verify the many aspects of data

Page 19: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Parallel CoordinatesIn a parallel coordinate plot, the coordinate axes are parallel

to each otherFor every data point, its value for each of the variables is

marked on the corresponding axis, and then all these pointsare connected with lines

Page 20: Data analysis04 morethantwovariables

http://publicationslist.org/junio

Information VisualizationMany other techniques are presented according the findings of the field

known as InformationVisualization:GlyphsChernoff FacesTree-maps Star coordinatesTable LensMultidimensional Projection

And many others, all improved by means of interaction techniques:Querying and zooming Linking and BrushingCombined projections, and so forth

Page 21: Data analysis04 morethantwovariables

http://publicationslist.org/junio

References Philipp K. Janert, Data Analysis with Open Source Tools,

O’Reilly, 2010. Wikipedia, http://en.wikipedia.org Wolfram MathWorld, http://mathworld.wolfram.com/