MIT OpenCourseWare http://ocw.mit.edu
Supplemental Resource: Brain and Cognitive SciencesStatistics & Visualization for Data Analysis & InferenceJanuary (IAP) 2009
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
http://ocw.mit.eduhttp://ocw.mit.edu/terms
Statistics and Visualization for Data Analysis
and Visualization
Mike Frank & Ed Vul
IAP 2009
2
Who we are
Ed Vul – 4th year grad student in the Kanwisher lab. Interested in optimal decision-making, resource allocation, and visual attention.
Mike Frank – 4th year grad student in the Gibson lab. Interested in language acquisition, interactions of language and cognition.
3
Why this course?
now what?
statistics and data analysis
Courtesy of xkcd.org
4
Our goals
• summarize and discuss an approach – contrast the “null-hypothesis significance testing”
framework
– to a “model-driven” framework • link data to theory
– “statistical models are models of data” – acknowledge scientific practice in data analysis
and theory development
• get feedback from all of you
5
Your goals
• Name • What you work on / where you work
• Your statistical background • (optional) a question that bothers you
sometimes when you’re analyzing data
6
Approach
data visualization/data modeling: look at your data and try to understand where it came from
1. visualization – creating appropriate and informative pictures of a dataset – iterative exploration of data
2. modeling – don't just test for differences, try to understand the factors – not just "looking for interactions,” main effects too – emphasis on effect size, not significance – use appropriate computational tools, don’t rely on simple analytic
approximations (e.g. t-tests) if they don’t fit 3. experimental design
– choose designs and measures that test questions – don’t choose designs based on arbitrary statistical frameworks (e.g.
ANOVA)
7
Classes
1. Visualization – how can I see what my data show?
2. Resampling – how do I estimate the
uncertainty of my measures?
3. Distributions – how do I summarize what I believe about the world?
4. The Linear Model – how can I create a simple model of my data?
5. Bayesian Modeling – how can I describe the processes that generated my data?
8
Classes
1. Visualization – how can I see what my data show?
2. Resampling –how do I estimate the
uncertainty of my measures?
3. Distributions – how do I summarize what I believe about the world?
4. The Linear Model – how can I create a simple model of my data?
5. Bayesian Modeling – how can I describe the processes that generated my data?
VISUALIZATION
10
Outline
• Why visualize? – to understand data – a worked example
• The visual vocabulary – elements – perceptual motivations
• Conventional modes of combination – taxonomy of visualization
• Tips & Tricks, Tradeoffs, & Trouble
(many slides courtesy of Chris Collins, U of T)
11
Example: Movements of the French Army
Minard, 1861; Tufte, 2001
12
Three principles for visualization:
1. be true to your research – design your
display to illustrate a particular point
2. maximize information, minimize ink –use the simplest possible representation for the bits you want to convey
3. organize hierarchically – what should a viewer see first? what if they look deeper?
13
Worked example
• Participants heard examples from an artificial language
• Three different presentation methods for examples – index cards, list of sentences, mp3 files on ipod
• Task was to spread $100 of “bets” across different continuations for a new example
• Dependent measure was bet on the correct answer
14
Worked example
trial.num bet trial sub expt modality 1 80 1 S1 MNPQ mp3 2 5 2 S1 MNPQ mp3 3 90 3 S1 MNPQ mp3 4 25 4 S1 MNPQ mp3 5 0 1 S2 MNPQ mp3 6 0 2 S2 MNPQ mp3 7 50 3 S2 MNPQ mp3 8 33 4 S2 MNPQ mp3 9 0 1 S3 MNPQ mp3
10 40 2 S3 MNPQ mp3 11 60 3 S3 MNPQ mp3 12 40 4 S3 MNPQ mp3 … … … … … …
191 40 3 S48 MNPQ list 192 50 4 S48 MNPQ list
15
Worked example
bar graph
16
Worked example
bar graph with standard errors
17
17
Worked example
bar graph with 95% CIs
18
Worked example
box plot
19
Worked example
viola plot
20
Worked example
strip chart
21
Worked example
strip chart with means
22
Worked example
strip chart with means
23
Morals of the example
• Summary statistics – almost always necessary – but at what level of analysis?
• Distribution is important – what is the form of the data?
– is your summary misleading?
• Fancier is not always better – pretty pictures are awesome
– but not if they obscure the data
24
Value
11
x variable 9.0
ach x variable 10.0
y variable 7.5
ach y variable 3.75
es 110.0
or of slope 0.1
of squares of Y: 13.8
0.7
m of squares 27.5
etween each x and y0.816
sion line y = 3 + 0.5x
12
10
8
6
4
6 8 10 12 14 16 184
12
10
8
6
4
6 8 10 12 14 16 184
12
10
8
6
4
6 8 10 12 14 16 184
12
10
8
6
4
6 8 10 12 14 16 184
y1y2
y3y4
x4
Property
N
Mean of each
Variance of e
Mean of each
Variance of e
Sum of squar
Standard err
Residual sum
r-squared
Regression su
Correlation bvariable
Linear regres
Anscombe’s Quartet
I II III IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
F. J. Anscombe, 1973
Figure by MIT OpenCourseWare.
25
Conventional visualizations
more discrete dimensions
mor
e co
ntin
uous
dim
ensi
ons
heatmap
venn diagram
Courtesy of Amit Agarwal. Used with permission.
Histogram
•
•
•
Important first way of looking at your data
One dimensional
Shows shape by binning a continuous distribution
27
Grouped histogram
Grouped histogram
Copyright (© 2009) Wiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc. Reprinted with permission of John Wiley & Sons., Inc.
Pie chart
• A whole split into parts
• Emphasizes that all
parts sum to a constant
• Single dimension with
discrete categories
Figure by MIT OpenCourseWare.
Stacked bar graph
• Wholes split into
parts
• Easy to compare – often better than pie
chart
• Can have multiple
discrete dimensions
Courtesy Elsevier, Inc., http://www.sciencedirect.com. Used with permission.
http://www.sciencedirect.com
31
Venn diagram
• Shows overlap between discrete groups
• Sometimes the only way to display overlapping sets
• Unintuitive – no “popout”
LIZARDS
DOGS
HUMANS
mammals
have scales have a tail
found in tropical homes
4 legs
animals
Have 2 legs
Figure by MIT OpenCourseWare.
32
Conventional visualizations
more discrete dimensions
mor
e co
ntin
uous
dim
ensi
ons
heatmap
venn diagram
Courtesy of Amit Agarwal. Used with permission.
33
Scatter plot
• Relationship between
observations on two
continuous dimensions
• Can show multiple
groups
• Can show trend lines
etc.
• Uninformative with
too much data
Courtesy of xkcd.org.
http:xkcd.org
34
Scatter plot … with many discrete items (identity as a dimension)
XKCD
Courtesy of Wikipedia. Used with permission.
35
Line graph
• Also ubiquitous!
• Good for showing one variable (e.g., time) as continuous even though you have discrete measures
• Can compare several discrete groups
Courtesy of American Psychological Association. Used with permission.
Bar graph
• aka “dynamite plot”
• Ubiquitous! • Can be used for
lots of discrete grouping factors
• Natural semantics of grouping
• Conceals data Courtesy Elsevier, Inc., http://www.sciencedirect.com. Used with permission.
Sommervilleet al. (2005)
http://www.sciencedirect.com
37
More bar graphs
Box plot
Very useful for showing individual subject means
not focused on individual subjects Courtesy of National Academy of Sciences, U. S. A. Used with permission.Source: Goldstein et. al. "Social Interaction Shapes Babbling: Testing Parallels Between Birdsong and Speech." PNAS 100, no. 13 (2003): 8030-8035.
Courtesy of American Psychological Association Copyright , 2003, National Academy of Sciences, U.S.A.
Strip chart
Shows the shape of distribution but
Endress, Ansgar D. et. Al. "The Role of Salience in the Extraction of Algebraic Rules.” Journal of Experimental Psychology: General, Vol. 134, No. 3 (2005): 406-419.
©
38
Conventional visualizations
more discrete dimensions
mor
e co
ntin
uous
dim
ensi
ons
heatmap
venn diagram
Courtesy of Amit Agarwal. Used with permission.
Heat map
• Worksvery well when there are natural semantics
• Color mapping can Image removed due to copyright restriction.http://tedlab.mit.edu/~mcfrank/papers/FVJ-cognition.pdfbe problematic
• grayscale usually fine
• Can be unintuitive
3mos
6mos
9mos
adults
Courtesy Elsevier, Inc., http://www.sciencedirect.com. Used with permission.
http://tedlab.mit.edu/~mcfrank/papers/FVJ-cognition.pdfhttp://www.sciencedirect.com
40
Courtesy of National Academy of Sciences, U. S. A. Used with permission.
Bubble plot
• Can be very intuitive • Size is not perfectly
quantitative
Source: Butterworth et. al. "Numerical Thought with and Without
Words: Evidence from Indigenous Australian Children." PNAS 105, no. 35
(2008): 13179–13184.
Copyright 2008, National Academy of Sciences, U.S.A. © ,
41
Trellis plots
Courtesy of American Statistical Institution. Used with permission.
TIPS AND TRICKS
43
Three tricks for doing more with less
• Multiple plots – simple, easily interpretable subplots – can be beautiful but overwhelming
• Hybrid plots – a scatter plot of histograms – or a venn-diagram of histograms, etc.
• Multiple axes – plot two (or more) different things on one graph
44
Hybrid plots
Courtesy of http://addictedtor.free.fr/graphiques/addNote.php?graph=78
http://addictedtor.free.fr/graphiques/addNote.php?graph=78
45
Hybrid plots
46
Hybrid plots
Courtesy of http://addictedtor.free.fr/graphiques/addNote.php?graph=109
http://addictedtor.free.fr/graphiques/addNote.php?graph=109
47
Multiple plots
Courtesy of Cognitive Science Society. Used with permission.
48
Multiple plots
Courtesy of Cognitive Science Society. Used with permission.
Baker, Tenenbaum, & Saxe (2007)
49
Multiple plots
Courtesy of Andrew Gelman. Used with permission.
50
Courtesy of Andrew Gelman. Used with permission.
51
Multiple axes
1999 2000 2001 2002 2003 2004
Num
ber o
f toy
s sol
d
Toy A
Toy B
Toy C
Toy D
Toy E
Toy F
Toy G
Toy H
Toy I
Years
Figure by MIT OpenCourseWare.
52
Multiple axes
P(re
port
) P(
repo
rt)
Log(
obse
rved
/cha
nce
freq
uenc
y)
Guess 1
Guess 2
Serial position (0=target)
TWO TRADEOFFS
54
Two tradeoffs
• Informativeness vs. readability – Too little information can conceal data – But too much information can be overwhelming – Possible solution: hierarchical organization?
• Data-centric vs. viewer-centric – Viewers are accustomed to certain types of
visualization
– But novel visualizations can be truer to data
55
Information vs. readability
• Pirahã people of Brazil –Isolated indigenous group –No words for numbers
• Previous research suggested that they were unable to do simple matching games
• Five matching games, 14 participants, quantities 4-10 (split among participants)
56
Information vs. readibility?
Courtesy Elsevier, Inc., http://www.sciencedirect.com. Used with permission.
http://www.sciencedirect.com
58
Copyright (c) (2008) National Academy of Sciences, U.S.A.
57
Information vs. readability
Brady, Konkle, Alvarez, Oliva (2008)
Courtesy of National Academy of Sciences, U. S. A. Used with permission. Source: Brady et. al. "Visual Long-term Memory has a Massive Storage Capacity for Object Details." PNAS 105, no. 38 (2008): 14325-14329. Copyright 2008, National Academy of Sciences, U.S.A. ©,
Copyright (c) (2008) National Academy of Sciences, U.S.A.
58
Information vs. readability
Courtesy of National Academy of Sciences, U. S. A. Used with permission. Source: Brady et. al. "Visual Long-term Memory has a Massive Storage Capacity for Object Details." PNAS 105, no. 38 (2008): 14325-14329.
Brady, Konkle, Alvarez, Oliva (2008)
Copyright © , 2008, National Academy of Sciences, U.S.A.
Data-centric vs. viewer-centric
• Web study of word learning
– n=700 – lots of noise
• varied number of objects with different properties – asked for bets
• had a model that predicted performance
this one is daxy. what does daxy mean?
60
Data-centric vs. viewer-centric
61
Data-centric vs. viewer-centric objects with
named property
62
Data-centric vs. viewer-centric
63
64
65
Figure by MIT OpenCourseWare.
302010
00 50 100Bet
Res
pons
es 0.50 0.70 0.75 0.75 0.80 0.83
0.740.640.580.590.51
0.51 0.55 0.61 0.70
0.680.560.48
0.52 0.66
0.47
Number of objects in unnamed category
Num
ber of objects in named category
1
1
2
2
3
3
4
4
5
5
6
6
90
90
80
80
70
70
60
60
50
50
Mea
n hu
man
bet
Model predictions
r = .93
TROUBLE
68
High Dimensionality Doesn’t Guarantee Excellence
Courtesy of Robin Whittle. Used with permission.
69
High Dimensionality Doesn’t Guarantee Excellence
informative direct mail
door to door
junk faxes
hated accepted
telemarketing
manipulative
70
1.00
.90
.80
.70
.60
.50
Decision Outcome Decision Outcome Decision OutcomeP = .50 P = .60 P = .70
Change
1.00
.90
.80
.70
.60
.50
Decision Outcome Decision Outcome Decision OutcomeP = .80 P = .90 P = .95
Change
Certainty
Certainty
Sample no data: change, coin flip, etc.
Sample single case
Sample two cases, plus tie breaker if necessary
Sample all data, i.e., total available population
A
B
C
D
Strategy Key
Messy bar graphs
• Sometimes you can discretize way too many variables
Nisbett& Ross (1980) - ?? Figure by MIT OpenCourseWare.
73
Too much data for one plot
Oaksford & Chater, 1994 Courtesy of American Psychological Association. Used with permission.
74
Difficulty of comparison
75
Bad semantics
Image removed due to copyright restriction. (http://junkcharts.typepad.com/junk_charts/2008/02/ordering-and-gr.html)
http://junkcharts.typepad.com/junk_charts/2008/02/ordering-and-gr.html
76
Summary and conclusions
1. Mapping data to a visual representation
1. What dimensions matter
2. What vocabulary elements (color, shape, orientation) will map to those dimensions
2. Three principles 1. be true to your research
2. maximize information, minimize ink
3. organize hierarchically
77
More worked examples
• Short email describing experiment – what is being measured, what is being
manipulated
• .CSV (comma-separated value) file – header row with good variable names
• GOOD: sub.name,trial.type,correct.ans • BAD: num,tt,g
– each row is a single observation • e.g., one or two ys (dependent variable like answer
correctness or RT), many xs (independent variables like subject, condition, etc.)
/ColorImageDict > /JPEG2000ColorACSImageDict > /JPEG2000ColorImageDict > /AntiAliasGrayImages false /CropGrayImages true /GrayImageMinResolution 150 /GrayImageMinResolutionPolicy /OK /DownsampleGrayImages true /GrayImageDownsampleType /Bicubic /GrayImageResolution 150 /GrayImageDepth -1 /GrayImageMinDownsampleDepth 2 /GrayImageDownsampleThreshold 1.50000 /EncodeGrayImages true /GrayImageFilter /DCTEncode /AutoFilterGrayImages true /GrayImageAutoFilterStrategy /JPEG /GrayACSImageDict > /GrayImageDict > /JPEG2000GrayACSImageDict > /JPEG2000GrayImageDict > /AntiAliasMonoImages false /CropMonoImages true /MonoImageMinResolution 1200 /MonoImageMinResolutionPolicy /OK /DownsampleMonoImages true /MonoImageDownsampleType /Bicubic /MonoImageResolution 300 /MonoImageDepth -1 /MonoImageDownsampleThreshold 1.50000 /EncodeMonoImages true /MonoImageFilter /CCITTFaxEncode /MonoImageDict > /AllowPSXObjects true /CheckCompliance [ /None ] /PDFX1aCheck false /PDFX3Check false /PDFXCompliantPDFOnly false /PDFXNoTrimBoxError true /PDFXTrimBoxToMediaBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXSetBleedBoxToMediaBox true /PDFXBleedBoxToTrimBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXOutputIntentProfile (None) /PDFXOutputConditionIdentifier () /PDFXOutputCondition () /PDFXRegistryName (http://www.color.org) /PDFXTrapped /False
/Description >>> setdistillerparams> setpagedevice
Top Related