Data Visualization by David Kretch
Data Visualization
April 3, 2015
• When you should graph
• What you should graph
• Given some data, how would you graph it
2
When should you graph your data?
Always.
Don’t just make graphs for client reports -- graph your data for yourself, so you understand it.
If you use a table in a report, see if you can make it into a graph.
3
Why graphs?
Because of the environment that humans evolved in, we are much better at getting info from color, size, shape, and position than from reading text.
Find the dangerous creatures!
4
Why graphs work
• Color
• Size
• Shape
• Position
5
Why else do people like graphs?
People like cool-looking stuff.
[Figure panels: Not cool, Cool]
6
What are we currently doing?
• Making lots of tables
Group      Mean     25%     50%     75%
Bananas    11.3     2.7     4.6    23.1
Kittens     4.0     0.9     3.6     7.5
Phones     -3.1   -11.0    -2.9     2.2

Variable          Parameter Estimate
Cuteness            0.6***
Ability to Fly      1.4***
Deadliness         11.2***
Telepathy          -9.8***
Big Ears          -17.3***
7
What is wrong with tables?
Tables give only a partial picture – means only tell us so much.
Figuring out what’s bigger, and by how much, requires more work.
The information is not necessarily in any order, so we need to read all the numbers.
8
What kinds of graphs should you make?
• The distribution, instead of giving just mean, median, etc.
• The relationship between two variables – the conditional distribution
• Estimation results: point estimates and confidence intervals
9
What to expect out of this presentation
1. Discussion of the type of graph (e.g. distributions)
2. How the type of graph applies to continuous vs. categorical data
3. Extensions (e.g. graphing more than one at a time)
What not to expect: how to do these in any particular software.
10
Distributions
11
Distributions – Continuous variables
Make density plots/histograms for continuous variables. These give much more information than means, medians, etc.
Two distributions with the same mean, but which are dramatically different.
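The slides say all graphs were made in R and ggplot2, so here is a minimal sketch in that stack of two distributions that share a mean but look very different. The data are simulated and the names (`df`, `group`, `value`) are assumptions for illustration, not the presenter's data.

```r
library(ggplot2)

set.seed(1)  # make the simulated data reproducible

# Hypothetical data: two groups centered on the same value, with different shapes.
narrow  <- rnorm(1000, mean = 10, sd = 1)
bimodal <- c(rnorm(500, mean = 5, sd = 1), rnorm(500, mean = 15, sd = 1))
df <- data.frame(
  group = rep(c("narrow", "bimodal"), each = 1000),
  value = c(narrow, bimodal)
)

# The two group means are close, so a table of means would hide the difference.
group_means <- tapply(df$value, df$group, mean)

# One density curve per group, stacked in separate panels on a shared x axis.
p <- ggplot(df, aes(value)) +
  geom_density() +
  facet_wrap(~group, ncol = 1)
print(p)
```

Swapping `geom_density()` for `geom_histogram()` gives the unsmoothed version of the same picture.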
12
Density vs. histogram
A density plot is basically a smoothed histogram.
13
Distributions – Categorical variables
Make bar charts for categorical variables.
Tip: if your categories don’t have any inherent order, order them from largest to smallest.
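One way to apply the ordering tip in ggplot2 is to set the factor levels in order of size before plotting. This is a sketch with made-up categories (`pets` is hypothetical data), not the code behind the slide.

```r
library(ggplot2)

# Hypothetical categorical data with no inherent order.
pets <- c(rep("cat", 12), rep("dog", 20), rep("fish", 5), rep("bird", 9))

# Count each category and sort the counts from largest to smallest.
counts <- sort(table(pets), decreasing = TRUE)
df <- data.frame(
  pet = factor(names(counts), levels = names(counts)),  # the level order fixes the bar order
  n   = as.vector(counts)
)

# geom_col() draws one bar per row, with height taken from n.
p <- ggplot(df, aes(pet, n)) +
  geom_col()
print(p)
```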
14
Compare distributions using color
Suppose we want to compare the distribution of income among different occupations. Plot all the distributions, distinguished by color, and use transparency to make them all visible simultaneously.
15
Highlighting important facts
Add vertical lines to highlight the means.
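A rough ggplot2 sketch of the last two ideas together: overlapping densities distinguished by color and made transparent, with a vertical line at each group mean. The occupations and incomes are invented for illustration.

```r
library(ggplot2)

set.seed(2)

# Hypothetical incomes (in thousands) for three made-up occupations.
df <- data.frame(
  occupation = rep(c("baker", "plumber", "teacher"), each = 300),
  income     = c(rnorm(300, 40, 8), rnorm(300, 55, 10), rnorm(300, 48, 6))
)

# One mean per occupation, used for the vertical reference lines.
means <- aggregate(income ~ occupation, data = df, FUN = mean)

p <- ggplot(df, aes(income, fill = occupation, color = occupation)) +
  geom_density(alpha = 0.3) +                      # alpha < 1 makes the fills see-through
  geom_vline(data = means,
             aes(xintercept = income, color = occupation),
             linetype = "dashed")                  # one dashed line per group mean
print(p)
```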
16
Relationships
17
Relationships between variables
If we’re asking, for example, what GDP growth looks like at different levels of government spending, we can show this using a scatterplot.
18
How to show trends
We can highlight the trend using scatterplot smoothing, which adapts the shape of the trend line to the data.
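In ggplot2, one common scatterplot smoother is LOESS, available through `geom_smooth()`. The slides do not say which smoother was used, so treat the choice below as an assumption; the data and the column names `spending` and `growth` are simulated stand-ins.

```r
library(ggplot2)

set.seed(3)

# Hypothetical data: a curved relationship plus noise.
df <- data.frame(spending = runif(200, 10, 50))
df$growth <- 5 - 0.004 * (df$spending - 30)^2 + rnorm(200, sd = 0.5)

p <- ggplot(df, aes(spending, growth)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)  # LOESS: a local regression smoother
print(p)
```

Adding `color = some_group` inside `aes()` would draw one set of points and one trend line per group.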
19
How to show multiple groups
We can see if the relationship differs among groups by giving each group a color.
20
Another use for colors
Suppose we want to come up with rules to identify people’s favorite food based on population density and elevation (bear with me).
Can we see this on a graph?
21
Graphing relationships with categorical data
With categorical data, you typically can’t use scatterplots because points fall right on top of each other (‘overplotting’). However, we can use jittering to move the plotted points slightly.
[Figure panels: Without jittering, With jittering]
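A small ggplot2 sketch of the same comparison, using `geom_jitter()`. The survey data are invented, and the jitter amounts (`width`, `height`) are arbitrary choices you would tune by eye.

```r
library(ggplot2)

set.seed(4)

# Hypothetical survey: two categorical variables, so many rows share the same (x, y) pair.
df <- data.frame(
  region = sample(c("north", "south", "west"), 300, replace = TRUE),
  answer = sample(c("no", "maybe", "yes"), 300, replace = TRUE)
)

# Without jittering: 300 rows collapse onto at most 9 visible positions.
p_plain <- ggplot(df, aes(region, answer)) +
  geom_point()

# With jittering: each point is moved by a small random amount in both directions.
p_jitter <- ggplot(df, aes(region, answer)) +
  geom_jitter(width = 0.2, height = 0.2)
print(p_jitter)
```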
22
Graphing relationships with categorical data
The next step beyond jittering is to use a boxplot, which shows:
– The median
– The 25th and 75th percentiles
– Whiskers reaching the most extreme points within 1.5 times the inter-quartile range (IQR) of the box
– Outliers (plotted as points)
[Figure labels: median; 75th percentile; 75th percentile + 1.5 × IQR (upper whisker limit); outlier]
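For reference, a sketch of a boxplot in ggplot2 on simulated data. Note that `geom_boxplot()` draws the median as the center line, and its whiskers stop at the most extreme data point within 1.5 × IQR of the box. The hand computation at the end is only there to show where those numbers come from.

```r
library(ggplot2)

set.seed(5)

# Hypothetical right-skewed values for three made-up groups.
df <- data.frame(
  group = rep(c("a", "b", "c"), each = 100),
  value = c(rexp(100, 1), rexp(100, 0.5), rexp(100, 0.25))
)

# geom_boxplot() draws the median, the 25th and 75th percentiles (the box),
# whiskers reaching at most 1.5 * IQR beyond the box, and outliers as points.
p <- ggplot(df, aes(group, value)) +
  geom_boxplot()
print(p)

# The same quantities for group "a", computed by hand.
a <- df$value[df$group == "a"]
q <- quantile(a, c(0.25, 0.5, 0.75))
upper_fence <- q[["75%"]] + 1.5 * IQR(a)  # points above this are drawn as outliers
```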
23
Looping back
A boxplot isn’t, after all, all that different from the multi-colored density plot we showed earlier. Which is better depends on what you’re trying to show.
24
Use log scale if your data spans a wide range
Let’s say you have a large range of values, but most of your data is concentrated in one part of the range.
It’s easier to see what’s going on when we use a log scale.
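In ggplot2 this amounts to adding a log scale to an existing plot. The sketch below uses simulated log-normal data (an assumption chosen because it spans several orders of magnitude); a log scale only works for strictly positive values.

```r
library(ggplot2)

set.seed(6)

# Hypothetical positive values spanning several orders of magnitude.
df <- data.frame(
  x = rlnorm(500, meanlog = 3, sdlog = 2),
  y = rlnorm(500, meanlog = 3, sdlog = 2)
)

# Linear axes: most points are squeezed into one corner.
p_linear <- ggplot(df, aes(x, y)) +
  geom_point()

# Same data, with both axes on a log10 scale.
p_log <- p_linear + scale_x_log10() + scale_y_log10()
print(p_log)
```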
25
Estimation results
26
Graphing estimation results
We make a lot of regression tables, but we can make them easier to understand by putting them into graphs.
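One possible way to turn a regression table into a graph is a coefficient plot: a point for each estimate and a line for its confidence interval. The sketch below fits an example model to R's built-in `mtcars` data, which is chosen only for illustration and is unrelated to the estimates shown earlier in the slides.

```r
library(ggplot2)

# Fit an example regression on R's built-in mtcars data (for illustration only).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Collect point estimates and 95% confidence intervals, dropping the intercept.
ci <- confint(fit)
est <- data.frame(
  term     = names(coef(fit)),
  estimate = unname(coef(fit)),
  lower    = ci[, 1],
  upper    = ci[, 2]
)
est <- est[est$term != "(Intercept)", ]

# One point per coefficient, with a line spanning its confidence interval.
p <- ggplot(est, aes(term, estimate)) +
  geom_hline(yintercept = 0, linetype = "dashed") +  # reference line at zero
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  coord_flip()                                       # put the terms on the vertical axis
print(p)
```

An interval that crosses the dashed zero line corresponds to an estimate that would not get stars in the table.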
27
ggplot(df, aes(population_density, elevation, color = favorite_food)) + geom_point()
df: dataset
population_density: x variable
elevation: y variable
color = favorite_food: color variable
geom_point(): make scatterplot
All graphs made in R and ggplot2
28
Data Visualization Checklist
• Always graph
• Use color, size, shape, and position
• Three important types of graph:
– Distribution
– Relationship
– Estimation results
• Highlight important facts
• Make it cool-looking