Page 2: Lies, damned lies & dataviz

Lies, Damned Lies & Dataviz

Bad visualization, and how to avoid it

Dr. Andrew Clegg

Director, Learner Analytics & Data Science

Pearson

@andrew_clegg

Page 3: Lies, damned lies & dataviz

Part I — Why Visualize?

What are the benefits — when it’s done right?

Part II — Bad Dataviz

How to spot the failures — and how to avoid them yourself

Warning: Contains Opinion!

Introduction

Page 4: Lies, damned lies & dataviz

Part I — Why Visualize?

Page 5: Lies, damned lies & dataviz

What is the goal?

● Summarizing and communicating numbers

● Drawing attention to trends and patterns

● Exploring data interactively

● Capturing attention

● Telling stories

Page 6: Lies, damned lies & dataviz

How does visualization help?

Playing to your neural hardware’s strengths

Your visual system excels at pattern detection & parallel processing.

Representing data graphically means you can leverage this “for free”.

Page 7: Lies, damned lies & dataviz

Challenge: estimate x when y = 0

x y x y x y

27.38 24.05 32.31 31.61 75.67 14.83

62.64 7.31 51.84 28.61 34.23 31.65

50.76 16.30 59.04 18.29 51.21 7.69

42.94 26.78 74.63 1.15 47.26 22.90

8.72 42.35 56.15 11.37 66.60 3.21

30.62 30.87 47.23 19.49 17.46 40.31

62.63 9.14 59.36 8.82 65.70 12.79

63.21 18.66 44.58 19.12 52.24 12.92

40.49 23.29 47.85 20.55 62.56 14.17

22.07 41.46 68.21 11.99 40.43 19.77
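For comparison, here is a rough sketch of the same challenge done numerically rather than by eye. It fits an ordinary least-squares line to the 30 points in the table and solves for where that line crosses y = 0. The straight-line model and the helper names (`fit_line`, `x_intercept`) are assumptions made for illustration, not something the slides specify.

```python
# The 30 (x, y) pairs from the table above, read row by row.
POINTS = [
    (27.38, 24.05), (32.31, 31.61), (75.67, 14.83),
    (62.64, 7.31), (51.84, 28.61), (34.23, 31.65),
    (50.76, 16.30), (59.04, 18.29), (51.21, 7.69),
    (42.94, 26.78), (74.63, 1.15), (47.26, 22.90),
    (8.72, 42.35), (56.15, 11.37), (66.60, 3.21),
    (30.62, 30.87), (47.23, 19.49), (17.46, 40.31),
    (62.63, 9.14), (59.36, 8.82), (65.70, 12.79),
    (63.21, 18.66), (44.58, 19.12), (52.24, 12.92),
    (40.49, 23.29), (47.85, 20.55), (62.56, 14.17),
    (22.07, 41.46), (68.21, 11.99), (40.43, 19.77),
]


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def x_intercept(points):
    """x value at which the fitted line crosses y = 0."""
    slope, intercept = fit_line(points)
    return -intercept / slope


if __name__ == "__main__":
    print(f"Estimated x at y = 0: {x_intercept(POINTS):.1f}")
```

Getting this answer takes a model and some code; a scatter plot lets most people extrapolate the trend at a glance.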

Page 8: Lies, damned lies & dataviz

Challenge: estimate x when y = 0

Page 9: Lies, damned lies & dataviz

Challenge: estimate x when y = 0

Page 10: Lies, damned lies & dataviz

Challenge: find most similar data point

x y x y x y

54.88 71.52 97.86 79.92 35.95 43.70

60.28 54.49 46.15 78.05 69.76 6.02

42.37 64.59 11.83 63.99 66.68 67.06

43.76 89.18 14.34 94.47 21.04 12.89

96.37 38.34 52.18 41.47 31.54 36.37

79.17 52.89 26.46 77.42 57.02 43.86

56.80 92.56 45.62 56.84 98.84 10.20

7.10 8.71 1.88 61.76 20.89 16.13

2.02 83.26 61.21 61.69 65.31 25.33

77.82 87.00 94.37 68.18 46.63 24.44
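As a point of comparison, a brute-force search can answer this from the table directly. The sketch below reads “most similar” as the two points with the smallest Euclidean distance between them, which is an assumption about what the challenge intends; `closest_pair` is a hypothetical helper name.

```python
from itertools import combinations
from math import dist

# The 30 (x, y) pairs from the table above, read row by row.
POINTS = [
    (54.88, 71.52), (97.86, 79.92), (35.95, 43.70),
    (60.28, 54.49), (46.15, 78.05), (69.76, 6.02),
    (42.37, 64.59), (11.83, 63.99), (66.68, 67.06),
    (43.76, 89.18), (14.34, 94.47), (21.04, 12.89),
    (96.37, 38.34), (52.18, 41.47), (31.54, 36.37),
    (79.17, 52.89), (26.46, 77.42), (57.02, 43.86),
    (56.80, 92.56), (45.62, 56.84), (98.84, 10.20),
    (7.10, 8.71), (1.88, 61.76), (20.89, 16.13),
    (2.02, 83.26), (61.21, 61.69), (65.31, 25.33),
    (77.82, 87.00), (94.37, 68.18), (46.63, 24.44),
]


def closest_pair(points):
    """Check every pair of points; fine for a few dozen points."""
    return min(combinations(points, 2), key=lambda pair: dist(*pair))


if __name__ == "__main__":
    a, b = closest_pair(POINTS)
    print(f"Closest pair: {a} and {b}, distance {dist(a, b):.2f}")
```

The search compares all 435 pairs; on a scatter plot, the eye finds tight clusters without any explicit comparison.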

Page 11: Lies, damned lies & dataviz

Challenge: find most similar data point

Page 12: Lies, damned lies & dataviz

Challenge: find the outlier

x y x y x y

54.88 71.52 97.86 79.92 35.95 43.70

60.28 54.49 46.15 78.05 69.76 6.02

42.37 64.59 11.83 63.99 66.68 67.06

43.76 89.18 14.34 94.47 21.04 12.89

96.37 38.34 52.18 41.47 31.54 36.37

79.17 52.89 26.46 77.42 57.02 43.86

56.80 92.56 45.62 56.84 98.84 10.20

7.10 8.71 1.88 61.76 20.89 16.13

2.02 83.26 61.21 61.69 65.31 25.33

77.82 87.00 94.37 68.18 46.63 24.44

Page 13: Lies, damned lies & dataviz

Challenge: find the outlier

Page 14: Lies, damned lies & dataviz

How does visualization help?

Avoiding limitations of statistics

Showing patterns in large data sets with minimal information loss.

Revealing structure of “tricky” data sets where typical summary statistics do a poor job.

Page 15: Lies, damned lies & dataviz

Showing patterns in large data sets

https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

Page 16: Lies, damned lies & dataviz

Describing statistically tricky data

http://www.stanford.edu/~mwaskom/software/seaborn/examples/anscombes_quartet.html

All four have the same:

● mean(x)

● variance(x)

● mean(y)

● variance(y)

● correlation coefficient

● regression coefficients

Anscombe’s Quartet

(Francis Anscombe, 1973)
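The claim is easy to check. The sketch below uses the quartet's values as commonly published and computes each statistic with plain Python; the `summarize` helper is a hypothetical name. Note that “the same” holds to roughly two decimal places rather than exactly.

```python
from statistics import mean, variance

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I to III
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Summary statistics listed on the slide, for one data set."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(xs),  # sample variance
        "mean_y": my,
        "var_y": variance(ys),
        "corr": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


if __name__ == "__main__":
    for name, (xs, ys) in QUARTET.items():
        print(name, {k: round(v, 2) for k, v in summarize(xs, ys).items()})
```

The printed rows are near-identical, yet the four scatter plots look nothing alike. That is the point of the quartet.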

Page 17: Lies, damned lies & dataviz

Describing statistically tricky data

Much web data, especially involving human preferences or choices, looks like this.

There is no meaningful “central tendency”, so typical descriptive statistics such as the mean tell you very little.

Zipfian distribution, an example of a power law.
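A small sketch of why the usual summaries mislead here. It builds an idealized Zipfian data set in which the item at rank r gets a count proportional to 1/r; the exponent of 1, the 1,000 items and the top count of 100,000 are all assumptions chosen for illustration.

```python
from statistics import mean, median


def zipf_counts(n_items, top_count):
    """Idealized Zipfian counts: the item at rank r gets top_count / r."""
    return [top_count / rank for rank in range(1, n_items + 1)]


counts = zipf_counts(n_items=1000, top_count=100_000)

mean_count = mean(counts)
median_count = median(counts)
# Fraction of all activity accounted for by the ten most popular items.
top10_share = sum(counts[:10]) / sum(counts)

if __name__ == "__main__":
    print(f"mean   = {mean_count:.1f}")
    print(f"median = {median_count:.1f}")
    print(f"top 10 items hold {top10_share:.0%} of the total")
```

The mean sits several times above the median, and a handful of items account for a large share of the total, so neither number describes a “typical” item.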

Page 18: Lies, damned lies & dataviz

How does visualization help?

Illustrating a story

Visualizations are often used simply to clarify or reinforce the main points of a story, narrative or message.

This process fails when the conclusions suggested by the graphic are irrelevant to the narrative, or even contradict it.

It can also fail when the graphic has no clear message or multiple conflicting interpretations, or is largely incomprehensible.

Many of the following examples illustrate these mistakes.

Page 19: Lies, damned lies & dataviz

Part II — Bad Dataviz

Page 21: Lies, damned lies & dataviz

http://www.google.co.uk/trends/explore#q=%22data science%22

Unlabelled axes

Page 23: Lies, damned lies & dataviz

https://twitter.com/jk_keller/status/410498080765919232/photo/1

Axis scale manipulation (totally shameless version)

Version published by Reuters

Version “fixed” by @jk_keller

Page 24: Lies, damned lies & dataviz

Example from Stephen Few (PDF)

Dual axes: caution

Natural interpretation:

Units sold “dipped below” revenue (A) and is now “catching up” (B).

But these impressions are meaningless.

They are just artefacts of the chosen axis scales.

A

B
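To see why such crossings are artefacts, the sketch below places two series on a shared plot height using their own axis ranges, then checks where the left-axis line falls below the right-axis line. The data and axis ranges are hypothetical, and the helper names are made up for this example.

```python
def to_axis_fraction(values, axis_min, axis_max):
    """Vertical position of each value as a fraction of the axis height."""
    return [(v - axis_min) / (axis_max - axis_min) for v in values]


def appears_below(left_series, left_axis, right_series, right_axis):
    """Per period: is the left-axis line drawn below the right-axis line?"""
    left = to_axis_fraction(left_series, *left_axis)
    right = to_axis_fraction(right_series, *right_axis)
    return [a < b for a, b in zip(left, right)]


# Hypothetical data: units sold (left axis), revenue in thousands (right axis).
units = [110, 90, 80, 95, 115]
revenue = [50, 52, 54, 55, 55]

# Same data, same left axis; only the right axis range changes.
tight = appears_below(units, (0, 120), revenue, (0, 60))
loose = appears_below(units, (0, 120), revenue, (0, 100))

if __name__ == "__main__":
    print("right axis 0-60: ", tight)
    print("right axis 0-100:", loose)
```

With one right-axis range the units line dips below revenue and climbs back; with the other it never crosses at all. Nothing in the data changed.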

Page 25: Lies, damned lies & dataviz

Proportionality errors

From an Australian document found at The Guardian

1 row of people = roughly 43,000 nurses.

10 rows = roughly 48,000 nurses.

?!?
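The size of the distortion can be put into numbers. This sketch compares the change in what is drawn with the change in the underlying figures, using the row counts and nurse numbers quoted above; the `distortion` helper is a hypothetical name.

```python
def distortion(value_a, size_a, value_b, size_b):
    """Ratio of the visual change to the change in the underlying numbers.

    A result of 1.0 means the graphic is proportional to the data.
    """
    visual_ratio = size_b / size_a
    data_ratio = value_b / value_a
    return visual_ratio / data_ratio


# Figures from the slide: 1 row for ~43,000 nurses, 10 rows for ~48,000.
factor = distortion(43_000, 1, 48_000, 10)

# Rows the second pictogram should have at the first pictogram's scale.
honest_rows = 48_000 / 43_000

if __name__ == "__main__":
    print(f"visual change overstated by a factor of {factor:.1f}")
    print(f"proportional pictogram: {honest_rows:.2f} rows")
```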

Page 27: Lies, damned lies & dataviz

Quick quiz: what happened in 2005?

Page 28: Lies, damned lies & dataviz

Axis inversion: when “down” means “up”?!?

From Thomson Reuters via Business Insider

Version published by Reuters

Version “fixed” by @PFedewa

Page 29: Lies, damned lies & dataviz

Bad dataviz

2. Distance vs. area vs. volume

http://muhammadfamizwanabdullah.blogspot.co.uk/2010/11/10-introduction-of-teaching-volume-of.html

Page 30: Lies, damned lies & dataviz

Pie charts: avoid

Bad

Colours used for separating slices, so can’t easily be put to another use.

No way to show time dimension statically.

Comparing relative sizes of slices is hard.

Doing it in 3D is harder. Perspective inflates nearer slices, and the similar volume of the objects is a red herring.

Doing it with deep, discontinuous 3D objects is even harder.

Worse

Worst

Page 31: Lies, damned lies & dataviz

Pie charts: avoid

Perhaps justifiable (in 2D) if numbers are sufficiently different.

Otherwise, use a much simpler design and avoid all those problems.

Page 32: Lies, damned lies & dataviz

Pie chart horrors

http://junkcharts.typepad.com/junk_charts/2014/03/two-charts-that-fail-self-sufficiency.html

Pie charts are supposed to show proportions of a whole.

People expect the %s to add up to 100%.

This one shows proportions of separate quantities.
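A quick sanity check along these lines can be automated before drawing a pie. The sketch below is a minimal example with hypothetical numbers and helper names: one function turns counts into shares of a single whole, the other flags percentages that cannot be parts of one whole.

```python
def pie_shares(counts):
    """Convert category counts into shares of one whole, summing to 100%."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("a pie chart needs a positive total")
    return {label: 100 * count / total for label, count in counts.items()}


def is_valid_pie(percentages, tolerance=0.5):
    """True if the percentages plausibly describe parts of one whole."""
    return abs(sum(percentages) - 100) <= tolerance


if __name__ == "__main__":
    # Hypothetical survey where each percentage is of a separate quantity.
    print(is_valid_pie([70, 60, 45]))
    # Hypothetical counts that do form one whole.
    print(pie_shares({"A": 1, "B": 3}))
```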

Page 33: Lies, damned lies & dataviz

Pie chart horrors

From a World Bank report (PDF) found at The Guardian

These ones show 96% and 40% as full circles.

This one is falling apart.

This one thinks 76% is less than three quarters.

Page 34: Lies, damned lies & dataviz

Even worse uses of 3D

https://www.tableausoftware.com/public/blog/2011/01/viz-wiz-1-11 and http://www.simplexnumerica.com/Gallery/gallery_pyramid.html

Cones, pyramids, spheres etc…

Are we comparing width, height, area or volume? Nobody knows!

26.76% = tiny peak

23.32% = massive slab

?!?
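The underlying arithmetic: scale a shape's width by the data value and its area grows with the square (and its volume with the cube). If area is meant to carry the value, the linear size has to grow with the square root instead. A minimal sketch for circles, with a hypothetical helper name and a reference circle of radius 1:

```python
from math import pi, sqrt


def radius_for_area(value, reference_value, reference_radius):
    """Radius that makes a circle's area proportional to its value."""
    return reference_radius * sqrt(value / reference_value)


reference_area = pi * 1.0 ** 2

# Naive: the value doubles, so the radius is doubled.
naive_radius = 2 * 1.0
naive_area_ratio = (pi * naive_radius ** 2) / reference_area  # 4x the area

# Area-true: the radius grows with the square root of the value.
correct_radius = radius_for_area(2, 1, 1.0)
correct_area_ratio = (pi * correct_radius ** 2) / reference_area  # 2x the area

if __name__ == "__main__":
    print(f"naive scaling:     area x{naive_area_ratio:.1f}")
    print(f"area-true scaling: area x{correct_area_ratio:.1f}")
```

Even with correct scaling, readers still have to guess which dimension encodes the value, which is the slide's real complaint.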

Page 35: Lies, damned lies & dataviz

Stacked charts: caution

Stacked charts show how a data series breaks down by another attribute of the data.

But people often misread these as two distinct data series, reading off a separate y-axis value for each one.
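The misreading is easy to reproduce in code. In a stacked chart the top edge of each layer sits at the running total, not at that layer's own value. The sketch below uses hypothetical sales figures and a hypothetical helper, `stacked_tops`.

```python
from itertools import accumulate

# Hypothetical monthly sales by region, bottom layer first.
layers = {
    "North": [10, 12, 11],
    "South": [5, 9, 14],
}


def stacked_tops(layers):
    """Height of the top edge of each layer, as drawn against the y-axis."""
    columns = zip(*layers.values())  # one tuple of layer values per month
    running = [list(accumulate(col)) for col in columns]
    return {name: [col[i] for col in running] for i, name in enumerate(layers)}


tops = stacked_tops(layers)

if __name__ == "__main__":
    print("South, actual values:     ", layers["South"])
    print("South, top edge on y-axis:", tops["South"])
```

Someone reading the South line straight off the y-axis gets North plus South, not South on its own.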

Page 36: Lies, damned lies & dataviz

Bubble charts: avoid

http://commons.wikimedia.org/wiki/File:Bubble_Chart_Chicago_Deposit_Market.jpg

Page 37: Lies, damned lies & dataviz

Bubble charts: avoid

http://bit.ly/1okS3nE and http://bit.ly/1hdZQtO

Page 38: Lies, damned lies & dataviz

Bad dataviz

3. Bad maps

http://xkcd.com/1138/

Page 39: Lies, damned lies & dataviz

Non-normalized quantities are useless

http://personal.frostburg.edu/jibandy0/starbucks%20map.jpg

Don’t use absolute values without a very good reason.

Normalize appropriately:

per capita, per adult, per student, per household, per square km, per journey, per voter …
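A minimal sketch of per-capita normalization. The regions, store counts and populations are hypothetical, as is the `per_capita` helper.

```python
def per_capita(counts, populations, per=100_000):
    """Rate per `per` people for each region."""
    return {
        region: per * counts[region] / populations[region]
        for region in counts
    }


# Hypothetical data: the big city has far more stores in absolute terms.
stores = {"Metro": 500, "Smalltown": 20}
people = {"Metro": 5_000_000, "Smalltown": 100_000}

rates = per_capita(stores, people)

if __name__ == "__main__":
    for region, rate in rates.items():
        print(f"{region}: {rate:.0f} stores per 100,000 people")
```

Mapped as raw counts, Metro dominates; mapped per capita, Smalltown has twice the density. A map of absolute values would mostly be a map of where people live.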

Page 40: Lies, damned lies & dataviz

Remember: geopolitical boundaries are artificial

This map shows all the countries I’ve visited.

The relative size of USA makes me seem much more widely travelled than I really am.

Is “country” the right level of aggregation?

Page 41: Lies, damned lies & dataviz

Remember: map projections lie

http://en.wikipedia.org/wiki/File:Hobo%E2%80%93Dyer_projection_SW.jpg

Page 43: Lies, damned lies & dataviz

Consider using fixed-size bins

http://bit.ly/O9EPta
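One simple way to get fixed-size bins is to snap each point to a regular grid and count per cell. This sketch uses square cells in coordinate space, which is a simplification (the linked example may well use a different bin shape, and cells measured in degrees are not equal-area on the ground). The points and the `bin_points` helper are hypothetical.

```python
from collections import Counter
from math import floor


def bin_points(points, cell_size):
    """Count points per square grid cell, keyed by (column, row) index."""
    return Counter(
        (floor(x / cell_size), floor(y / cell_size)) for x, y in points
    )


# Hypothetical (longitude, latitude) pairs in degrees.
events = [
    (-0.12, 51.50), (-0.09, 51.52),
    (2.35, 48.85), (2.29, 48.86),
    (13.40, 52.52),
]

counts = bin_points(events, cell_size=1.0)

if __name__ == "__main__":
    for cell, n in sorted(counts.items()):
        print(cell, n)
```

Because every cell has the same size, a large administrative region no longer looks important just for being large.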

Page 44: Lies, damned lies & dataviz

Drawbacks of maps

● Can’t easily show a time dimension without animation

● Hard to show multiple attributes of data at once

● Physical proximity can obscure demographic/cultural differences, and vice versa

Just because you can map the data doesn’t mean you should.

Save maps for when geographical trends are the key focus.

Page 45: Lies, damned lies & dataviz

Bad dataviz

4. Colour choice

Good colour palettes from RColorBrewer

Page 46: Lies, damned lies & dataviz

Sequential data

http://mapsdeguap.blogspot.co.uk/2012/04/choropleth-map.html

Use a smooth transition from min to max.

Don’t “cycle” more than once.

This map goes purple-green twice.

A better choice would be:

Page 47: Lies, damned lies & dataviz

Diverging data

http://www-03.ibm.com/press/us/en/pressrelease/35359.wss

Here the yellow section indicates the median. Red/green = above/below median.

However, the red and green ranges are not scaled well. 75 (close to median) is almost the same colour as 108 (max).

Sequential data, but with a well-defined midpoint.

Two directions from this midpoint -- two poles:

above/below average, positive/negative, female/male, Democrat/Republican etc.
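One way to keep the two colour ranges well scaled is to pin the midpoint to the centre of the palette and scale each side independently. The sketch below maps a value to a position between 0 and 1 on a diverging palette. The 75 and 108 come from the example above; the minimum of 40 and median of 70 are hypothetical, and `diverging_position` is a made-up helper name.

```python
def diverging_position(value, low, midpoint, high):
    """Map value to [0, 1] with the midpoint pinned at 0.5.

    Each side of the midpoint is scaled on its own, so a value halfway
    between the midpoint and an extreme lands halfway along that ramp.
    """
    if not low < midpoint < high:
        raise ValueError("expected low < midpoint < high")
    value = min(max(value, low), high)  # clamp out-of-range values
    if value <= midpoint:
        return 0.5 * (value - low) / (midpoint - low)
    return 0.5 + 0.5 * (value - midpoint) / (high - midpoint)


if __name__ == "__main__":
    # Hypothetical scale: minimum 40, median 70, maximum 108.
    for v in (40, 70, 75, 108):
        print(v, round(diverging_position(v, 40, 70, 108), 3))
```

On this scale 75 stays close to the neutral midpoint colour, while 108 sits at the far end of the ramp.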

Page 48: Lies, damned lies & dataviz

Categorical data

Also known as nominal or qualitative.

Colours should not form a pattern, as this can imply a false relationship.

The ethnicity colours here are reasonable, although quite close in colour space.

The location colours are badly chosen. They suggest a linear progression, which is meaningless.

http://www.visualizing.org/full-screen/10886

Page 49: Lies, damned lies & dataviz

Consistency

Don’t do this.

http://www.raterush.com/pages/digg-reddit

Page 50: Lies, damned lies & dataviz

Other considerations

● Colour blindness (nearly 10% of men, rare in women)

● Print and photocopy friendliness

● Characteristics of different screens, esp. projectors

ColorBrewer is a great help:

http://colorbrewer2.org/

See also…

● brewer2mpl (Python)

● RColorBrewer (R)

● ColorBrewer (Matlab)

Page 51: Lies, damned lies & dataviz

Bad dataviz

5. Correlation vs. causation

http://xkcd.com/552/

Page 52: Lies, damned lies & dataviz

Beware of bogus correlations

http://gizmodo.com/5977989/internet-explorer-vs-murder-rate-will-be-your-favorite-chart-today/ and http://pubs.acs.org/doi/abs/10.1021/ci700332k

Correlation does not prove causation, even with a good R² score.

Page 53: Lies, damned lies & dataviz

Beware of bogus correlations

Even respectable journals sometimes get carried away.

Ask yourself:

Are these both effects of a common cause?

Or just sheer chance? (Multiple comparisons)

http://www.nejm.org/doi/full/10.1056/NEJMon1211064
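The multiple-comparisons effect is easy to simulate. The sketch below compares one random series against many unrelated random series and reports the strongest correlation found by chance alone. The series length, the number of candidates and the seed are arbitrary choices for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def best_chance_correlation(n_candidates, n_points, seed=0):
    """Largest |r| between one random target and many unrelated series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


strongest = best_chance_correlation(n_candidates=1000, n_points=10)

if __name__ == "__main__":
    print(f"strongest |r| among 1000 unrelated series: {strongest:.2f}")
```

Search enough unrelated variables and something will correlate well with your target. An impressive-looking fit found this way is evidence of the search, not of a relationship.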

Page 54: Lies, damned lies & dataviz

Bad dataviz

6. Trying to say too much

Each visualization needs a clear purpose. But some designers and analysts try to include every possible piece of information.

This is not a good idea.

Unnecessary detail and ostentatiously “clever” presentation can obscure the real message.

Page 55: Lies, damned lies & dataviz

http://bit.ly/1gVzeUe

Don’t do this.

Page 56: Lies, damned lies & dataviz

Bad dataviz

7. Tips for developing a critical eye

Here are some techniques you can use for critical analysis.

They are often subjective, debatable, context-dependent and partly based on aesthetics… So don’t expect absolute rules.

Page 57: Lies, damned lies & dataviz

Usability

Does the chart need detailed instructions in order for it to be comprehensible and usable?

● Acceptable if this is a standard visualization method used in a particular domain

● Less acceptable if this is a one-off for general consumption

Page 58: Lies, damned lies & dataviz

First impressions test

What is the first thing you infer from looking at the visualization?

(Don’t stop to read every detail -- see what you get from a glance.)

Does this impression prove to be accurate on closer inspection?

If not, then there may be a problem.

Many people will only glance and never perform the close inspection.

Page 59: Lies, damned lies & dataviz

Return on effort (Kaiser Fung)

http://bit.ly/1dKewGo

Page 60: Lies, damned lies & dataviz

Self-sufficiency test (Kaiser Fung)

Would the chart make sense without the numbers printed on each data point?

If not, the chart has failed the self-sufficiency test.

http://junkcharts.typepad.com/junk_charts/2013/03/blowing-the-whistle-at-bubble-charts.html

Page 61: Lies, damned lies & dataviz

Trifecta checkup (Kaiser Fung)

Ask the following:

● What practical question does the graphic attempt to address?

● What answer does the data imply?

● What answer does the graphic imply?

Can you answer these clearly?

Do the three answers align?

If not, there is something wrong.

http://junkcharts.typepad.com/junk_charts/2014/02/pets-may-need-shelter-from-this-terrible-chart.html

Page 62: Lies, damned lies & dataviz

Data-ink score (Edward Tufte)

Main principle: Remove redundant or uninformative elements from the design, to reduce distraction. High data-ink ratio = clarity.

http://www.infovis-wiki.net/index.php/Data-Ink_Ratio

Page 63: Lies, damned lies & dataviz

And finally…

Ask yourself how much you trust the data.

Professional presentation does not imply reliable numbers.

Is there enough data to be sure of statistical significance?

What are the margins of error?

Is there a plausible mechanism of action?

What about sources of bias (accidental or intentional), confounding factors, missing data, or measurement error (noise)?
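As a rough guide to the margin-of-error question, the sketch below applies the standard normal approximation for a sample proportion, with z = 1.96 for an approximate 95% interval. It assumes a simple random sample, which real data often is not, and `margin_of_error` is a hypothetical helper name.

```python
from math import sqrt


def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation; z = 1.96 gives roughly a 95% interval.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= proportion <= 1:
        raise ValueError("proportion must be between 0 and 1")
    return z * sqrt(proportion * (1 - proportion) / sample_size)


if __name__ == "__main__":
    for n in (100, 1000, 10_000):
        moe = margin_of_error(0.5, n)
        print(f"n = {n:>6}: 50% +/- {100 * moe:.1f} points")
```

A chart built on 100 responses carries an uncertainty of about ten percentage points either way, however polished it looks.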