(A bit about) Data Visualization

ICPSR Biennial Meeting

October 2, 2015

Ryan Womack (rwomack@rutgers.edu)

Data Librarian, Rutgers University

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Ryan Womack (rwomack@rutgers.edu) Data Librarian, Rutgers University(A bit about) Data Visualization 1 / 52

Introduction

What this talk IS:

Discusses standard techniques of data visualization, the day-to-day power tools for understanding data

Reviews various graphical techniques, from early to recent, from simple to advanced

Presents principles of good data presentation, and shows the R implementation of many functions

Introduction

What this talk is NOT:

It is not about “infographics”, the beautiful, heavily customized products of expert graphic designers. [See 1 and 2 for more discussion]

It is not about the cognitive science aspects of data perception [wish I knew more about this!]

It is not about how to use R or other software [although code is provided for those who are interested]

It is not necessarily a balanced survey of all data visualization. In particular, it is light on graph networks, clustering, and trees [not my expertise]

Very little mapping, too [Others are better at this]

Setup

Most of the graphics examples that are not web accessible are run in R.

R is open source software available at http://r-project.org

RStudio is a useful freely available editor available at http://rstudio.com

Workshop materials, including R scripts, supplemental images and data, are available for download from http://ryanwomack.com/ICPSR2015

The R script file contains working demonstrations of many of the concepts mentioned here for you to try on your own.

Outline

Why?

Whirlwind tour of historical data viz

Standard visualization vs. some less commonly used examples

3-D and Animation

Interactivity, data exploration

A little bit of big data

Why Data Visualization?

Data visualization can:

provide clear understanding of patterns in data

detect hidden structures in data

condense information

Anscombe’s Quartet

For example, see Anscombe’s quartet (image source: http://commons.wikimedia.org/wiki/File:Anscombe%27s_quartet_3.svg):
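
The point of the quartet can be checked directly, since base R ships the data as the `anscombe` data frame. The sketch below is only an illustration (the helper name `quartet_stats` is made up here): it prints near-identical summary statistics for the four sets, then plots them to show how different they really are.

```r
# Anscombe's quartet is built into R as `anscombe` (columns x1..x4, y1..y4).
data(anscombe)

# Mean of x, mean of y, and correlation for each of the four sets.
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
})
colnames(quartet_stats) <- paste0("set", 1:4)
print(round(quartet_stats, 2))

# The statistics barely differ, but the four scatterplots clearly do.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Set", i), xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x), col = "red")  # same fitted line in every panel
}
par(op)
```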

Links to DataViz sites

Some examples of good data visualization (and fancy infographics) can be found at:

Information Aesthetics

Chart Porn

Eagereyes

DataVis.ca

VizWiz

US Census Data Visualization Gallery

Bad Graphs

Pie Charts are known to be problematic

Clutter and other issues can ruin graphics

Novel or nonsensical?

For more bad ideas, try:

Junk Charts

Ten Worst Graphs

WTFviz

Pie Chart Examples

image source: http://peltiertech.com/WordPress/3d-pie-charts/

Pie Chart Examples

image source: http://ndevisual.wordpress.com/tag/uses-of-pie-charts/

Pie Chart Examples

image source: http://www.nbcchicago.com/news/local/FOX-News-Chart-Fails-Math-73711092.html

Pie Chart Examples

image source: http://tips.vovici.com/content/111031 swb

Clutter Example

image source: http://junkcharts.typepad.com/junk_charts/2013/03/which-software-is-responsible-for-this.html

Playfair

Astronomical observations, charts, and maps led in graphical innovation prior to 1800. See also Classic Data Visualizations

William Playfair is the pioneer of the line chart, bar chart, time series plots, and pie chart.

Playfair, W. (1786). Commercial and Political Atlas: Representing, by Copper-Plate Charts, the Progress of the Commerce, Revenues, Expenditure, and Debts of England, during the Whole of the Eighteenth Century.

Playfair, W. (1801). Statistical Breviary.

Both republished in The Commercial and Political Atlas and Statistical Breviary, 2005, Cambridge University Press.

Playfair Examples

Minard

Charles Joseph Minard was the next influential data graphic creator after Playfair.

Minard’s flow map of Napoleon’s Russian campaign is celebrated by Tufte and others as one of the greatest information graphics.

It embodies an ideal of highly compressed informative elements, presented with style

Six variables: size, location in 2 dimensions, the direction of the army, temperature, date [and group]

This is a one-off design that crosses into infographics, but it can be reproduced in R and other software.

Minard Examples

Fisher and Tukey

In the 20th century, statisticians such as Ronald Fisher and John Tukey continued to advance graphical methods for the analysis of data.

Fisher emphasized plotting the data to understand relationships.

Tukey’s Exploratory Data Analysis emphasized the use of graphics to understand the data during analysis, rather than the final presentation to an outside audience.

Tukey created the box and whiskers plot and the stem and leaf plot.

Tufte

Edward R. Tufte’s books, beginning with The Visual Display of Quantitative Information, have become the most widely known works on data visualization.

There is considerable overlap between the various publications

Tufte’s ideal is highly compressed, elegant, and informative data, as expressed in dense printed graphics

Tufte sometimes emphasizes beauty and design to the detriment of simplicity and clarity [e.g., train schedules]

“Graphical elegance is often found in simplicity of design and complexity of data.”

“Beautiful graphics do not traffic with the trivial.”

Train Schedule from Marey

Tufte’s principles

Tufte has developed and popularized numerous principles and terminology:

Graphics reveal data - show the data without distorting it - “above all else show the data”

Small multiple - understanding one slice makes understanding others easier

Lie factor - effect shown/effect in reality

Graphical Integrity - no lies, let data vary, not design

Data density - maximize data/ink ratio

Sparklines - seems they haven’t caught on

chartjunk - self-explanatory

Powerpoint is responsible for most of the world’s sorrows [The Cognitive Style of Powerpoint]
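
The lie factor ratio listed above (effect shown / effect in reality) is easy to compute. The sketch below assumes that "effect" means relative change from a first value to a second; the function names are made up for illustration, and the numbers follow the fuel-economy chart commonly quoted from Tufte (18 to 27.5 mpg drawn as lines of 0.6 and 5.3 inches), so treat them as illustrative.

```r
# Relative change from a first value to a second value.
effect_size <- function(first, second) abs(second - first) / abs(first)

# Lie factor: size of the effect shown in the graphic divided by the
# size of the effect in the data. A value near 1 means no distortion.
lie_factor <- function(data_first, data_second, shown_first, shown_second) {
  effect_size(shown_first, shown_second) / effect_size(data_first, data_second)
}

# Data rise from 18 to 27.5; the graphic's lines grow from 0.6 to 5.3 inches.
lf <- lie_factor(18, 27.5, 0.6, 5.3)
print(round(lf, 1))

# An honest graphic scales the drawing in proportion to the data.
honest <- lie_factor(10, 20, 1, 2)
print(honest)
```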

Lie Factor

image source: http://www.datavis.ca/gallery/lie-factor.php

Cleveland

William Cleveland’s Elements of Graphing Data and Visualizing Data pioneered systematic considerations of data legibility

Cleveland is particularly known for promoting the dot plot as an alternative to bars and pies.

The dot plot provides clarity and easy comparison of data.

Cleveland also pioneered Trellis graphics

Trellis graphics emphasizes comparison of multiple panels of data

The lattice package implements Trellis graphics in R

See Cleveland.pdf for a summary of Cleveland’s recommendations
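
As a rough illustration of the multi-panel idea, the sketch below uses the lattice package with the built-in `mtcars` data (the choice of data set and variables is an assumption made here, not taken from the talk). The `|` in the formula conditions on a third variable and yields one panel per level.

```r
library(lattice)

# One scatterplot panel per cylinder count; all panels share the same axes,
# which is what makes comparison across panels easy.
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1),
            xlab = "Weight (1000 lbs)",
            ylab = "Miles per gallon",
            main = "Fuel economy by weight, one panel per cylinder count")
print(p)
```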

Scatterplot matrix

The Grammar of Graphics

The Grammar of Graphics, by Leland Wilkinson, was extremely influential in thinking about graphics

Grammar means “rules for art and science”

The Grammar of Graphics specifies rules both mathematical and aesthetic

Earlier graph producers focused on aesthetics of static content

Dynamic graphics and scientific visualization, by contrast, require sophisticated designs to enable brushing, drill-down, zooming, linking

The Grammar of Graphics is easily adapted to this approach

ggplot2 was developed by Hadley Wickham as an implementation of the Grammar of Graphics
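
A small sketch of what "grammar" means in practice, assuming ggplot2 is installed and using the built-in `mtcars` data (an arbitrary choice for illustration): the plot is assembled from separate pieces (data, aesthetic mappings, geometric layers, facets, labels) joined with `+`.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +              # data + aesthetic mapping
  geom_point() +                                         # layer 1: raw observations
  geom_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                              # layer 2: fitted line
  facet_wrap(~ cyl) +                                    # small multiples by cylinders
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")  # labels
print(p)
```

Swapping one component, for example replacing `geom_point()` with another geom, changes the graphic without touching the rest of the specification.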

From Barchart to Dot Plot

The Cleveland dot plot

use to compare labeled quantities, ordered lists
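
A minimal base R comparison, assuming the built-in `precip` data (annual precipitation for US cities) as a stand-in for any labeled, ordered list: the same twelve values are drawn once as bars and once as a Cleveland dot plot.

```r
# Named numeric vector, sorted ascending; keep the twelve wettest cities.
rain <- tail(sort(precip), 12)

op <- par(mfrow = c(1, 2), mar = c(4, 7, 2, 1))
barplot(rain, horiz = TRUE, las = 1,
        main = "Bar chart", xlab = "Inches per year")
dotchart(rain, pch = 19,
         main = "Cleveland dot plot", xlab = "Inches per year")
par(op)
```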

Figure: Bar chart v. Dot Plot

Visualizing Distributions of Data

Box and Whiskers Plot

Illustrates quantiles and outliers. There is also a Tufte version.

Violin plot

Blends density information with box and whiskers style (in an artistic manner)
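
Both displays can be sketched with the lattice package, which provides `bwplot()` for box and whiskers plots and a `panel.violin` panel function. The built-in `ToothGrowth` data are used here purely as an assumed example.

```r
library(lattice)

# Treat dose as a grouping factor so each dose gets its own box / violin.
tg <- transform(ToothGrowth, dose = factor(dose))

box <- bwplot(len ~ dose, data = tg,
              xlab = "Dose", ylab = "Tooth length",
              main = "Box and whiskers")
violin <- bwplot(len ~ dose, data = tg, panel = panel.violin,
                 xlab = "Dose", ylab = "Tooth length",
                 main = "Violin")

# Draw the two plots side by side on one page.
print(box, split = c(1, 1, 2, 1), more = TRUE)
print(violin, split = c(2, 1, 2, 1))

# The five numbers behind the first box: lower whisker, lower hinge,
# median, upper hinge, upper whisker.
five <- boxplot.stats(tg$len[tg$dose == "0.5"])$stats
print(five)
```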

Figure: Box Plot v. Violin Plot

Visualizing Categorical Data

Beyond the pie chart

The mosaic plot allows multiple categories to be displayed on the same graph, but can be complicated to interpret.

The spineplot is a variant of the mosaic plot, plotting proportions in 2 dimensions.
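
The three displays can be compared in base R. The sketch below assumes the built-in `Titanic` table as example data and collapses it to a two-way table of class by survival.

```r
# Titanic is a 4-way table (Class, Sex, Age, Survived);
# keep only the Class and Survived margins.
tab <- margin.table(Titanic, c(1, 4))
print(tab)

op <- par(mfrow = c(1, 3))
pie(margin.table(Titanic, 1), main = "Pie: passengers by class")
mosaicplot(tab, color = TRUE, main = "Mosaic: class by survival")
spineplot(tab, main = "Spineplot: survival share by class")
par(op)
```

The pie shows only one variable; the mosaic and spineplot show how survival varies within each class, with tile width reflecting class size.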

Figure: Pie Chart v. Mosaic Plot

Maps and Glyphs

Maps are obviously an important and widespread way of presenting data.

We examine a few examples of choropleth maps, in which shading indicates data levels

See also Interactive Maps in R and 5 kinds of Interactive maps in Plot.ly for further exploration

Glyphs present iconic representations of data elements.

Weather maps often use glyphs.

A more dynamic example is here.

As an R example, consider Chernoff faces and the aplpack package. Also, Smiley faces [and many more graph variants in this chapter].

Figure: Choropleth Map v. Chernoff Faces

3-D

3-D scatterplots

cloud (lattice)

contour plots

to plot standardized levels of data

wireframe plots

to present a 3-D surface representation of data

rgl (a separate package containing several 3d plotting functions and animation)

mosaic3d extends the mosaic paradigm to three dimensions
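
The lattice functions named above can be tried with built-in data. This sketch assumes `iris` for the 3-D scatterplot and the `volcano` elevation matrix for the surface plots; rgl and mosaic3d are left out because they need extra packages and an interactive graphics device.

```r
library(lattice)

# 3-D scatterplot: three measurements, points grouped by species.
p_cloud <- cloud(Sepal.Length ~ Petal.Length * Petal.Width,
                 data = iris, groups = Species,
                 main = "cloud")

# Contour plot: lines of equal elevation over the volcano grid.
p_contour <- contourplot(volcano, cuts = 10, main = "contourplot",
                         xlab = "row", ylab = "column")

# Wireframe: the same matrix drawn as a shaded 3-D surface.
p_wire <- wireframe(volcano, shade = TRUE, main = "wireframe",
                    xlab = "row", ylab = "column", zlab = "height")

print(p_cloud,   split = c(1, 1, 3, 1), more = TRUE)
print(p_contour, split = c(2, 1, 3, 1), more = TRUE)
print(p_wire,    split = c(3, 1, 3, 1))
```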

Animation

Animation is an easy way to step through data over time

or to provide comparisons of different views of data

R makes animation easy with the animation package

Just enclose a sequence of graphics in the animation command to generate interactive HTML (or GIF, SWF, LaTeX, Video).
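
A hedged sketch of that pattern, assuming the animation package is installed: each pass through the loop draws one frame, and `saveHTML()` collects the frames into an HTML page with playback controls. The data, file name, and frame sizes are invented for illustration, and the output files are written to the working directory.

```r
set.seed(42)
x <- rnorm(500)
frame_sizes <- seq(50, 500, by = 50)  # sample size shown in each frame

# Draw one histogram per frame, with a growing sample.
draw_frames <- function() {
  for (n in frame_sizes) {
    hist(x[1:n], breaks = seq(-5, 5, by = 0.5), xlim = c(-5, 5),
         main = paste("First", n, "observations"), xlab = "x")
  }
}

if (requireNamespace("animation", quietly = TRUE)) {
  library(animation)
  ani.options(interval = 0.5)  # seconds between frames
  saveHTML(draw_frames(),
           img.name = "hist_frame",
           htmlfile = "hist_animation.html",
           autobrowse = FALSE, verbose = FALSE)
}
```

`saveGIF()` follows the same shape but relies on external software such as ImageMagick being available.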

Interactive DataViz - Principles

Why aren’t all of our graphs interactive?

Brushing is used to select data points and track them through various analyses.

Drilling down, zooming, and subsetting are also interactive techniques.

Data displays can be linked so that a selection in one panel modifies the output displayed in another panel.

Interactivity is especially useful for data exploration, studying multidimensional relationships.

Interactive Data in Practice

There are many R packages that allow for interactive data work in a graphical user interface, including:

playwith - versatile package that works with any graphics function. Graphics can be explored, edited, and exported.

requires separate installation of GTK+ on your computer [method varies by OS]

googleVis

In many contexts, visualizing the relationships between data elements is made easier by viewing related data interactively.

Making this easy are googleVis and other “Vis” packages, e.g. bdvis for biodiversity or rainfreq.

A Library example - comparing selected ARL Statistics for public CIC universities

Interactive Data on the Web - rCharts

rCharts is a package that uses javascript to create interactive visualizations.

Lattice-style commands are used.

The package can output javascript for use in an HTML page.

Some commands depend on supplemental javascript libraries that must be installed, such as NVD3

Can embed in documents too, with slidify

Interactive Data on the Web - shiny

The shiny package is developed by the RStudio folks

You can learn shiny in half a day via the online tutorial

More custom control of the design is possible with shiny, in comparison to other do-it-all packages

Graphics use familiar R syntax (including ggplot2), with wrappers to implement web functionality

Every shiny app has the same structure: two R scripts saved together in a directory [ui and server files]

You must install the shiny server to deliver pages via the web
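
A minimal sketch of the ui/server structure, assuming shiny is installed. For brevity both parts are kept in one script as the objects `ui` and `server` and combined with `shinyApp()`; in the two-file layout described above, the same pieces would live in ui.R and server.R. The input name `bins` and the use of the built-in `faithful` data are choices made for this example.

```r
library(shiny)

# ui: what the page looks like (a slider and a plot area).
ui <- fluidPage(
  titlePanel("Old Faithful eruptions"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Approximate number of bins:",
                  min = 5, max = 50, value = 20)
    ),
    mainPanel(plotOutput("hist"))
  )
)

# server: ordinary R plotting code, re-run whenever the slider changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Eruption duration", xlab = "Minutes")
  })
}

app <- shinyApp(ui = ui, server = server)
# To launch locally in a browser: runApp(app)
```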

Interactive Data on the Web - shiny, cont.

There are samples built into the shiny package.

You can build a Census Explorer of your own with these instructions from Ari Lamstein.

You can see more in the shiny gallery

rCharts works with shiny too.

Interactive Data on the Web - ggvis

The ggvis package is ALSO developed by the RStudio folks

Think ggplot meets shiny

Similar syntax to ggplot

Some ability to add interactive controls

Can embed in shiny for web access

Interactive Data on the Web - radiant

Radiant is another new R interface built with shiny

The following links demonstrate capabilities:

vnijs.shinyapps.io/base

vnijs.shinyapps.io/quant

vnijs.shinyapps.io/marketing

By automating the mechanics of interacting with data, we can focus on exploring and understanding.

Other (non-R) options for Web visualization

D3.js, free at http://d3js.org/

Inkscape, free at https://inkscape.org/

Tableau, free 1-year student license at http://www.tableau.com/academic/students

Plot.ly environment at http://plot.ly

Interactive Power

Population pyramids are one example where interactivity + animation = insight.

Populationpyramid.net - for all countries, basic animation

The German Population Pyramid from Destatis is even more interactive

Doing it in R is possible with these instructions (Part 1) and (Part 2)

Big Data

Big data presents special issues for data visualization

While many techniques and graphics are the same, exploration and plotting must be optimized for the size of the data set

Representation of the complexity of the data may require special techniques

hexbin

bigvis

bigvis

bigvis was an experimental package by Hadley Wickham to deal with the issues of Big Data

There is a Preprint and R Meetup presentation by Hadley Wickham

Complete code is available at https://github.com/hadley/bigvis-infovis

Target: process 100 million observations in under 5 seconds.

Fundamental principle: No need for more data points than there are pixels on the screen.

“ggstat” package has been mentioned as a future project that will incorporate these ideas.

bigvis steps

Condense (bin, condense)

Smooth (smooth, best_h, peel)

Visualize (autoplot plus standard methods)
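
The three steps can be imitated in base R to show the idea. This is not the bigvis API: the sketch below swaps in `tabulate()` for binning and `ksmooth()` for smoothing, and the bin width and bandwidth are arbitrary assumptions. The point is that a million observations are reduced to a few hundred bin counts before anything is drawn.

```r
set.seed(1)
n <- 1e6
x <- rnorm(n)  # stand-in for a large data set

# Condense: assign each value to a fixed-width bin and keep only the counts.
width  <- 0.05
origin <- floor(min(x) / width) * width
bin_id <- floor((x - origin) / width) + 1
counts  <- tabulate(bin_id)
centers <- origin + (seq_along(counts) - 0.5) * width

# Smooth: kernel-smooth the binned counts instead of the raw data.
sm <- ksmooth(centers, counts, kernel = "normal", bandwidth = 0.3)

# Visualize: a few hundred bins drawn in place of a million points.
plot(centers, counts, type = "h", col = "grey60",
     xlab = "x", ylab = "Count per bin",
     main = paste(length(counts), "bins summarizing", n, "observations"))
lines(sm, col = "red", lwd = 2)
```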

Trelliscope (Tessera)

Tessera is developed by Purdue, Pacific Northwest National Laboratory, and Mozilla. Launched in November 2014, this project holds a lot of promise.

Running in the R environment, Tessera provides its own commands that execute across a cluster, easing the burden of analysis in this environment.

The datadr package “divides and recombines” in a manner similar to MapReduce, providing a simplified interface to Hadoop.

Tessera has its own visualization interface, Trelliscope, that can handle views across many variables and observations. Described in this paper.

Tessera’s Bootcamp is a good introduction, or try the quickstart.

Live demo is here.

Infographics links

Although not covered here, the following links are a sampling of infographics sites for your later enjoyment:

Data Storytelling in Video

Art of Data Visualization - in spite of its title, more on the infographics side

Parisian Subway Traffic and New York Subway Inequality

Tulp Interactive

Mapping London and London Riots + Twitter

YouTube Trends Map

Global Burden of Disease Visualizations

and the Tree of Life

Keep Exploring

Data Visualization represents a nearly infinite world of possibility for exploration:

plunge into programming

deep dives into data

indulge in interactivity

...have fun and keep learning! [e.g., R-bloggers.com]

References

There is also an online bibliography of references to accompany this presentation on my home page.
