January (IAP) 2009 For information about citing these ... · 2 Who we are Ed Vul –4th year grad...

78
MIT OpenCourseWare http://ocw.mit.edu Supplemental Resource: Brain and Cognitive Sciences Statistics & Visualization for Data Analysis & Inference January (IAP) 2009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Transcript of January (IAP) 2009 For information about citing these ... · 2 Who we are Ed Vul –4th year grad...

  • MIT OpenCourseWare http://ocw.mit.edu

    Supplemental Resource: Brain and Cognitive SciencesStatistics & Visualization for Data Analysis & InferenceJanuary (IAP) 2009

    For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

    http://ocw.mit.eduhttp://ocw.mit.edu/terms

  • Statistics and Visualization for Data Analysis

    and Visualization

    Mike Frank & Ed Vul

    IAP 2009

  • 2

    Who we are

    Ed Vul – 4th year grad student in the Kanwisher lab. Interested in optimal decision-making, resource allocation, and visual attention.

    Mike Frank – 4th year grad student in the Gibson lab. Interested in language acquisition, interactions of language and cognition.

  • 3

    Why this course?

    now what?

    statistics and data analysis

    Courtesy of xkcd.org

  • 4

    Our goals

    • summarize and discuss an approach – contrast the “null-hypothesis significance testing”

    framework

    – to a “model-driven” framework • link data to theory

    – “statistical models are models of data” – acknowledge scientific practice in data analysis

    and theory development

    • get feedback from all of you

  • 5

    Your goals

    • Name • What you work on / where you work

    • Your statistical background • (optional) a question that bothers you

    sometimes when you’re analyzing data

  • 6

    Approach

    data visualization/data modeling: look at your data and try to understand where it came from

    1. visualization – creating appropriate and informative pictures of a dataset – iterative exploration of data

    2. modeling – don't just test for differences, try to understand the factors – not just "looking for interactions,” main effects too – emphasis on effect size, not significance – use appropriate computational tools, don’t rely on simple analytic

    approximations (e.g. t-tests) if they don’t fit 3. experimental design

    – choose designs and measures that test questions – don’t choose designs based on arbitrary statistical frameworks (e.g.

    ANOVA)

  • 7

    Classes

    1. Visualization – how can I see what my data show?

    2. Resampling – how do I estimate the

    uncertainty of my measures?

    3. Distributions – how do I summarize what I believe about the world?

    4. The Linear Model – how can I create a simple model of my data?

    5. Bayesian Modeling – how can I describe the processes that generated my data?

  • 8

    Classes

    1. Visualization – how can I see what my data show?

    2. Resampling –how do I estimate the

    uncertainty of my measures?

    3. Distributions – how do I summarize what I believe about the world?

    4. The Linear Model – how can I create a simple model of my data?

    5. Bayesian Modeling – how can I describe the processes that generated my data?

  • VISUALIZATION

  • 10

    Outline

    • Why visualize? – to understand data – a worked example

    • The visual vocabulary – elements – perceptual motivations

    • Conventional modes of combination – taxonomy of visualization

    • Tips & Tricks, Tradeoffs, & Trouble

    (many slides courtesy of Chris Collins, U of T)

  • 11

    Example: Movements of the French Army

    Minard, 1861; Tufte, 2001

  • 12

    Three principles for visualization:

    1. be true to your research – design your

    display to illustrate a particular point

    2. maximize information, minimize ink –use the simplest possible representation for the bits you want to convey

    3. organize hierarchically – what should a viewer see first? what if they look deeper?

  • 13

    Worked example

    • Participants heard examples from an artificial language

    • Three different presentation methods for examples – index cards, list of sentences, mp3 files on ipod

    • Task was to spread $100 of “bets” across different continuations for a new example

    • Dependent measure was bet on the correct answer

  • 14

    Worked example

    trial.num bet trial sub expt modality 1 80 1 S1 MNPQ mp3 2 5 2 S1 MNPQ mp3 3 90 3 S1 MNPQ mp3 4 25 4 S1 MNPQ mp3 5 0 1 S2 MNPQ mp3 6 0 2 S2 MNPQ mp3 7 50 3 S2 MNPQ mp3 8 33 4 S2 MNPQ mp3 9 0 1 S3 MNPQ mp3

    10 40 2 S3 MNPQ mp3 11 60 3 S3 MNPQ mp3 12 40 4 S3 MNPQ mp3 … … … … … …

    191 40 3 S48 MNPQ list 192 50 4 S48 MNPQ list

  • 15

    Worked example

    bar graph

  • 16

    Worked example

    bar graph with standard errors

  • 17

    17

    Worked example

    bar graph with 95% CIs

  • 18

    Worked example

    box plot

  • 19

    Worked example

    viola plot

  • 20

    Worked example

    strip chart

  • 21

    Worked example

    strip chart with means

  • 22

    Worked example

    strip chart with means

  • 23

    Morals of the example

    • Summary statistics – almost always necessary – but at what level of analysis?

    • Distribution is important – what is the form of the data?

    – is your summary misleading?

    • Fancier is not always better – pretty pictures are awesome

    – but not if they obscure the data

  • 24

    Value

    11

    x variable 9.0

    ach x variable 10.0

    y variable 7.5

    ach y variable 3.75

    es 110.0

    or of slope 0.1

    of squares of Y: 13.8

    0.7

    m of squares 27.5

    etween each x and y0.816

    sion line y = 3 + 0.5x

    12

    10

    8

    6

    4

    6 8 10 12 14 16 184

    12

    10

    8

    6

    4

    6 8 10 12 14 16 184

    12

    10

    8

    6

    4

    6 8 10 12 14 16 184

    12

    10

    8

    6

    4

    6 8 10 12 14 16 184

    y1y2

    y3y4

    x4

    Property

    N

    Mean of each

    Variance of e

    Mean of each

    Variance of e

    Sum of squar

    Standard err

    Residual sum

    r-squared

    Regression su

    Correlation bvariable

    Linear regres

    Anscombe’s Quartet

    I II III IV

    x y x y x y x y

    10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58

    8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76

    13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71

    9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84

    11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47

    14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04

    6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25

    4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50

    12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56

    7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91

    5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89

    F. J. Anscombe, 1973

    Figure by MIT OpenCourseWare.

  • 25

    Conventional visualizations

    more discrete dimensions

    mor

    e co

    ntin

    uous

    dim

    ensi

    ons

    heatmap

    venn diagram

    Courtesy of Amit Agarwal. Used with permission.

  • Histogram

    Important first way of looking at your data

    One dimensional

    Shows shape by binning a continuous distribution

  • 27

    Grouped histogram

  • Grouped histogram

    Copyright (© 2009) Wiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc. Reprinted with permission of John Wiley & Sons., Inc.

  • Pie chart

    • A whole split into parts

    • Emphasizes that all

    parts sum to a constant

    • Single dimension with

    discrete categories

    Figure by MIT OpenCourseWare.

  • Stacked bar graph

    • Wholes split into

    parts

    • Easy to compare – often better than pie

    chart

    • Can have multiple

    discrete dimensions

    Courtesy Elsevier, Inc., http://www.sciencedirect.com. Used with permission.

    http://www.sciencedirect.com

  • 31

    Venn diagram

    • Shows overlap between discrete groups

    • Sometimes the only way to display overlapping sets

    • Unintuitive – no “popout”

    LIZARDS

    DOGS

    HUMANS

    mammals

    have scales have a tail

    found in tropical homes

    4 legs

    animals

    Have 2 legs

    Figure by MIT OpenCourseWare.

  • 32

    Conventional visualizations

    more discrete dimensions

    mor

    e co

    ntin

    uous

    dim

    ensi

    ons

    heatmap

    venn diagram

    Courtesy of Amit Agarwal. Used with permission.

  • 33

    Scatter plot

    • Relationship between

    observations on two

    continuous dimensions

    • Can show multiple

    groups

    • Can show trend lines

    etc.

    • Uninformative with

    too much data

    Courtesy of xkcd.org.

    http:xkcd.org

  • 34

    Scatter plot … with many discrete items (identity as a dimension)

    XKCD

    Courtesy of Wikipedia. Used with permission.

  • 35

    Line graph

    • Also ubiquitous!

    • Good for showing one variable (e.g., time) as continuous even though you have discrete measures

    • Can compare several discrete groups

    Courtesy of American Psychological Association. Used with permission.

  • Bar graph

    • aka “dynamite plot”

    • Ubiquitous! • Can be used for

    lots of discrete grouping factors

    • Natural semantics of grouping

    • Conceals data Courtesy Elsevier, Inc., http://www.sciencedirect.com. Used with permission.

    Sommervilleet al. (2005)

    http://www.sciencedirect.com

  • 37

    More bar graphs

    Box plot

    Very useful for showing individual subject means

    not focused on individual subjects Courtesy of National Academy of Sciences, U. S. A. Used with permission.Source: Goldstein et. al. "Social Interaction Shapes Babbling: Testing Parallels Between Birdsong and Speech." PNAS 100, no. 13 (2003): 8030-8035.

    Courtesy of American Psychological Association Copyright , 2003, National Academy of Sciences, U.S.A.

    Strip chart

    Shows the shape of distribution but

    Endress, Ansgar D. et. Al. "The Role of Salience in the Extraction of Algebraic Rules.” Journal of Experimental Psychology: General, Vol. 134, No. 3 (2005): 406-419.

    ©

  • 38

    Conventional visualizations

    more discrete dimensions

    mor

    e co

    ntin

    uous

    dim

    ensi

    ons

    heatmap

    venn diagram

    Courtesy of Amit Agarwal. Used with permission.

  • Heat map

    • Worksvery well when there are natural semantics

    • Color mapping can Image removed due to copyright restriction.http://tedlab.mit.edu/~mcfrank/papers/FVJ-cognition.pdfbe problematic

    • grayscale usually fine

    • Can be unintuitive

    3mos

    6mos

    9mos

    adults

    Courtesy Elsevier, Inc., http://www.sciencedirect.com. Used with permission.

    http://tedlab.mit.edu/~mcfrank/papers/FVJ-cognition.pdfhttp://www.sciencedirect.com

  • 40

    Courtesy of National Academy of Sciences, U. S. A. Used with permission.

    Bubble plot

    • Can be very intuitive • Size is not perfectly

    quantitative

    Source: Butterworth et. al. "Numerical Thought with and Without

    Words: Evidence from Indigenous Australian Children." PNAS 105, no. 35

    (2008): 13179–13184.

    Copyright 2008, National Academy of Sciences, U.S.A. © ,

  • 41

    Trellis plots

    Courtesy of American Statistical Institution. Used with permission.

  • TIPS AND TRICKS

  • 43

    Three tricks for doing more with less

    • Multiple plots – simple, easily interpretable subplots – can be beautiful but overwhelming

    • Hybrid plots – a scatter plot of histograms – or a venn-diagram of histograms, etc.

    • Multiple axes – plot two (or more) different things on one graph

  • 44

    Hybrid plots

    Courtesy of http://addictedtor.free.fr/graphiques/addNote.php?graph=78

    http://addictedtor.free.fr/graphiques/addNote.php?graph=78

  • 45

    Hybrid plots

  • 46

    Hybrid plots

    Courtesy of http://addictedtor.free.fr/graphiques/addNote.php?graph=109

    http://addictedtor.free.fr/graphiques/addNote.php?graph=109

  • 47

    Multiple plots

    Courtesy of Cognitive Science Society. Used with permission.

  • 48

    Multiple plots

    Courtesy of Cognitive Science Society. Used with permission.

    Baker, Tenenbaum, & Saxe (2007)

  • 49

    Multiple plots

    Courtesy of Andrew Gelman. Used with permission.

  • 50

    Courtesy of Andrew Gelman. Used with permission.

  • 51

    Multiple axes

    1999 2000 2001 2002 2003 2004

    Num

    ber o

    f toy

    s sol

    d

    Toy A

    Toy B

    Toy C

    Toy D

    Toy E

    Toy F

    Toy G

    Toy H

    Toy I

    Years

    Figure by MIT OpenCourseWare.

  • 52

    Multiple axes

    P(re

    port

    ) P(

    repo

    rt)

    Log(

    obse

    rved

    /cha

    nce

    freq

    uenc

    y)

    Guess 1

    Guess 2

    Serial position (0=target)

  • TWO TRADEOFFS

  • 54

    Two tradeoffs

    • Informativeness vs. readability – Too little information can conceal data – But too much information can be overwhelming – Possible solution: hierarchical organization?

    • Data-centric vs. viewer-centric – Viewers are accustomed to certain types of

    visualization

    – But novel visualizations can be truer to data

  • 55

    Information vs. readability

    • Pirahã people of Brazil –Isolated indigenous group –No words for numbers

    • Previous research suggested that they were unable to do simple matching games

    • Five matching games, 14 participants, quantities 4-10 (split among participants)

  • 56

    Information vs. readibility?

    Courtesy Elsevier, Inc., http://www.sciencedirect.com. Used with permission.

    http://www.sciencedirect.com

  • 58

    Copyright (c) (2008) National Academy of Sciences, U.S.A.

    57

    Information vs. readability

    Brady, Konkle, Alvarez, Oliva (2008)

    Courtesy of National Academy of Sciences, U. S. A. Used with permission. Source: Brady et. al. "Visual Long-term Memory has a Massive Storage Capacity for Object Details." PNAS 105, no. 38 (2008): 14325-14329. Copyright 2008, National Academy of Sciences, U.S.A. ©,

  • Copyright (c) (2008) National Academy of Sciences, U.S.A.

    58

    Information vs. readability

    Courtesy of National Academy of Sciences, U. S. A. Used with permission. Source: Brady et. al. "Visual Long-term Memory has a Massive Storage Capacity for Object Details." PNAS 105, no. 38 (2008): 14325-14329.

    Brady, Konkle, Alvarez, Oliva (2008)

    Copyright © , 2008, National Academy of Sciences, U.S.A.

  • Data-centric vs. viewer-centric

    • Web study of word learning

    – n=700 – lots of noise

    • varied number of objects with different properties – asked for bets

    • had a model that predicted performance

    this one is daxy. what does daxy mean?

  • 60

    Data-centric vs. viewer-centric

  • 61

    Data-centric vs. viewer-centric objects with

    named property

  • 62

    Data-centric vs. viewer-centric

  • 63

  • 64

  • 65

  • Figure by MIT OpenCourseWare.

    302010

    00 50 100Bet

    Res

    pons

    es 0.50 0.70 0.75 0.75 0.80 0.83

    0.740.640.580.590.51

    0.51 0.55 0.61 0.70

    0.680.560.48

    0.52 0.66

    0.47

    Number of objects in unnamed category

    Num

    ber of objects in named category

    1

    1

    2

    2

    3

    3

    4

    4

    5

    5

    6

    6

    90

    90

    80

    80

    70

    70

    60

    60

    50

    50

    Mea

    n hu

    man

    bet

    Model predictions

    r = .93

  • TROUBLE

  • 68

    High Dimensionality Doesn’t Guarantee Excellence

    Courtesy of Robin Whittle. Used with permission.

  • 69

    High Dimensionality Doesn’t Guarantee Excellence

    informative direct mail

    door to door

    junk faxes

    hated accepted

    telemarketing

    manipulative

  • 70

    1.00

    .90

    .80

    .70

    .60

    .50

    Decision Outcome Decision Outcome Decision OutcomeP = .50 P = .60 P = .70

    Change

    1.00

    .90

    .80

    .70

    .60

    .50

    Decision Outcome Decision Outcome Decision OutcomeP = .80 P = .90 P = .95

    Change

    Certainty

    Certainty

    Sample no data: change, coin flip, etc.

    Sample single case

    Sample two cases, plus tie breaker if necessary

    Sample all data, i.e., total available population

    A

    B

    C

    D

    Strategy Key

    Messy bar graphs

    • Sometimes you can discretize way too many variables

    Nisbett& Ross (1980) - ?? Figure by MIT OpenCourseWare.

  • 73

    Too much data for one plot

    Oaksford & Chater, 1994 Courtesy of American Psychological Association. Used with permission.

  • 74

    Difficulty of comparison

  • 75

    Bad semantics

    Image removed due to copyright restriction. (http://junkcharts.typepad.com/junk_charts/2008/02/ordering-and-gr.html)

    http://junkcharts.typepad.com/junk_charts/2008/02/ordering-and-gr.html

  • 76

    Summary and conclusions

    1. Mapping data to a visual representation

    1. What dimensions matter

    2. What vocabulary elements (color, shape, orientation) will map to those dimensions

    2. Three principles 1. be true to your research

    2. maximize information, minimize ink

    3. organize hierarchically

  • 77

    More worked examples

    • Short email describing experiment – what is being measured, what is being

    manipulated

    • .CSV (comma-separated value) file – header row with good variable names

    • GOOD: sub.name,trial.type,correct.ans • BAD: num,tt,g

    – each row is a single observation • e.g., one or two ys (dependent variable like answer

    correctness or RT), many xs (independent variables like subject, condition, etc.)

    /ColorImageDict > /JPEG2000ColorACSImageDict > /JPEG2000ColorImageDict > /AntiAliasGrayImages false /CropGrayImages true /GrayImageMinResolution 150 /GrayImageMinResolutionPolicy /OK /DownsampleGrayImages true /GrayImageDownsampleType /Bicubic /GrayImageResolution 150 /GrayImageDepth -1 /GrayImageMinDownsampleDepth 2 /GrayImageDownsampleThreshold 1.50000 /EncodeGrayImages true /GrayImageFilter /DCTEncode /AutoFilterGrayImages true /GrayImageAutoFilterStrategy /JPEG /GrayACSImageDict > /GrayImageDict > /JPEG2000GrayACSImageDict > /JPEG2000GrayImageDict > /AntiAliasMonoImages false /CropMonoImages true /MonoImageMinResolution 1200 /MonoImageMinResolutionPolicy /OK /DownsampleMonoImages true /MonoImageDownsampleType /Bicubic /MonoImageResolution 300 /MonoImageDepth -1 /MonoImageDownsampleThreshold 1.50000 /EncodeMonoImages true /MonoImageFilter /CCITTFaxEncode /MonoImageDict > /AllowPSXObjects true /CheckCompliance [ /None ] /PDFX1aCheck false /PDFX3Check false /PDFXCompliantPDFOnly false /PDFXNoTrimBoxError true /PDFXTrimBoxToMediaBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXSetBleedBoxToMediaBox true /PDFXBleedBoxToTrimBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXOutputIntentProfile (None) /PDFXOutputConditionIdentifier () /PDFXOutputCondition () /PDFXRegistryName (http://www.color.org) /PDFXTrapped /False

    /Description >>> setdistillerparams> setpagedevice