01 Intro

Hadley Wickham Stat405 Statistical computing & graphics

Transcript of 01 Intro

Page 1: 01 Intro

Hadley Wickham

Stat405 Statistical computing & graphics

Page 2: 01 Intro

1. Introductions

2. Syllabus

3. Introduction to Linux

4. Introduction to R

5. Basic graphics

Page 3: 01 Intro

HELLO my name is Hadley

Page 4: 01 Intro

had.co.nz/stat405 (if you can’t remember, just google stat405)

[email protected]

Page 5: 01 Intro

About me

From New Zealand

Divisional advisor for McMurtry

Major advisor for statistics

Page 6: 01 Intro

Syllabus

Page 7: 01 Intro

Introduction to Linux

Page 8: 01 Intro

Essential tools

The terminal to run R. gedit to edit your R code.

To load the terminal, right-click on the desktop.

To load R, type R in the terminal.

To load gedit, type gedit & in the terminal (the & runs it in the background, so the terminal stays free for other commands).

To open a file in gedit, type gedit filename &

Page 9: 01 Intro

Setup

Work through the instructions at http://had.co.nz/stat405/linux.html.

I’ll circulate and make sure everyone gets set up right.

Page 10: 01 Intro

Terminal essentials

Mouse select = Copy

Middle button = Paste

Ctrl + A = home (start of line)

Ctrl + E = end (end of line)

Alt + tab = change applications

Press tab to complete file names

Page 11: 01 Intro

Introduction to R

Page 12: 01 Intro

Learning a new language is hard!

Page 13: 01 Intro

Scatterplot basics

install.packages("ggplot2")
library(ggplot2)

?mpg
head(mpg)
str(mpg)
summary(mpg)

qplot(displ, hwy, data = mpg)

Page 14: 01 Intro

Scatterplot basics

install.packages("ggplot2")
library(ggplot2)

?mpg
head(mpg)
str(mpg)
summary(mpg)

qplot(displ, hwy, data = mpg)

Always explicitly specify the data
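A note for readers on a recent ggplot2: as far as I know, qplot() has been deprecated since ggplot2 3.4.0, although it still runs with a warning. Below is a minimal sketch of the same scatterplot in the longer ggplot() form. The object name p is only for illustration, and the data frame is still named explicitly, as the slide advises.

```r
library(ggplot2)

# Same scatterplot as qplot(displ, hwy, data = mpg), written in long form.
# The data frame is passed explicitly; displ and hwy are looked up inside it.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# In a script, print() is what actually draws the plot.
print(p)
```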

Page 15: 01 Intro

[Figure: scatterplot of hwy (y-axis, ticks 15 to 40) against displ (x-axis, ticks 2 to 7)]

qplot(displ, hwy, data = mpg)

Page 16: 01 Intro

Additional variables

You can display additional variables with aesthetics (like shape, colour, size) or with faceting (small multiples displaying different subsets of the data).

Page 17: 01 Intro

[Figure: scatterplot of hwy (y-axis, ticks 15 to 40) against displ (x-axis, ticks 2 to 7), points coloured by class; legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv]

qplot(displ, hwy, colour = class, data = mpg)

Page 18: 01 Intro

[Figure: scatterplot of hwy (y-axis, ticks 15 to 40) against displ (x-axis, ticks 2 to 7), points coloured by class; legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv]

Legend chosen and displayed automatically.

qplot(displ, hwy, colour = class, data = mpg)

Page 19: 01 Intro

Your turn

Experiment with colour, size, and shape aesthetics.

What’s the difference between discrete and continuous variables?

What happens when you combine multiple aesthetics?

Page 20: 01 Intro

          Discrete                   Continuous

Colour    Rainbow of colours         Gradient from red to blue

Size      Discrete size steps        Linear mapping between radius and value

Shape     Different shape for each   Doesn’t work
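The table can be checked directly. The sketch below maps class (discrete) and cty (continuous) to colour, then tries cty on shape. The object names are illustrative. The default palettes depend on the ggplot2 version: recent versions, as far as I know, use a dark-to-light blue gradient rather than red to blue.

```r
library(ggplot2)

# Discrete variable: class is text, so each category gets its own hue.
p_discrete <- qplot(displ, hwy, colour = class, data = mpg)

# Continuous variable: cty is numeric, so colour becomes a gradient.
p_continuous <- qplot(displ, hwy, colour = cty, data = mpg)

# Shape has no continuous scale. The plot object can be created,
# but building it (which happens when it is drawn) is expected to fail.
p_shape <- qplot(displ, hwy, shape = cty, data = mpg)
shape_ok <- tryCatch(
  {
    ggplot_build(p_shape)
    TRUE
  },
  error = function(e) FALSE
)
shape_ok
```

Combining aesthetics works the same way, for example colour = class together with size = cty in one call.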

Page 21: 01 Intro

Faceting

Small multiples displaying different subsets of the data.

Useful for exploring conditional relationships. Useful for large data.

Page 22: 01 Intro

Your turn

qplot(displ, hwy, data = mpg) + facet_grid(. ~ cyl)

qplot(displ, hwy, data = mpg) + facet_grid(drv ~ .)

qplot(displ, hwy, data = mpg) + facet_grid(drv ~ cyl)

qplot(displ, hwy, data = mpg) + facet_wrap(~ class)

Page 23: 01 Intro

Summary

facet_grid(): 2d grid, rows ~ cols, . for no split

facet_wrap(): 1d ribbon wrapped into 2d
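To make the summary concrete, here is a small sketch that builds one plot of each kind and counts the panels the data implies. The variable names (base, grid_plot, wrap_plot and the two counters) are made up for the example.

```r
library(ggplot2)

base <- qplot(displ, hwy, data = mpg)

# facet_grid(rows ~ cols): one row per drv value, one column per cyl value.
grid_plot <- base + facet_grid(drv ~ cyl)

# facet_wrap(~ var): one panel per class, laid out in a ribbon that wraps.
wrap_plot <- base + facet_wrap(~ class)

# Panel counts implied by the data.
n_grid_cells  <- length(unique(mpg$drv)) * length(unique(mpg$cyl))  # 3 * 4 = 12
n_wrap_panels <- length(unique(mpg$class))                          # 7
```

facet_grid() draws every row and column combination, even ones with no cars in them, while facet_wrap() only draws panels for values that occur.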

Page 24: 01 Intro

Aside: workflow

Keep a copy of the slides open so that you can copy and paste the code.

For complicated commands, write them in gedit and then copy and paste.

Page 25: 01 Intro

[Figure: scatterplot of hwy (y-axis, ticks 15 to 40) against cty (x-axis, ticks 10 to 35)]

qplot(cty, hwy, data = mpg)

What’s the problem with this plot?

Page 26: 01 Intro

[Figure: jittered scatterplot of hwy (y-axis, ticks 15 to 40) against cty (x-axis, ticks 10 to 35)]

qplot(cty, hwy, data = mpg, geom = "jitter")

Page 27: 01 Intro

[Figure: jittered scatterplot of hwy (y-axis, ticks 15 to 40) against cty (x-axis, ticks 10 to 35)]

qplot(cty, hwy, data = mpg, geom = "jitter")

geom controls “type” of plot
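One way to see why jittering helps is to count how many cars share exactly the same (cty, hwy) pair. This is a sketch of that check, not something from the slides; the variable names and the seed value are arbitrary.

```r
library(ggplot2)

# cty and hwy are whole numbers, so many cars land on the same coordinates
# and are drawn on top of each other in the plain scatterplot.
n_points   <- nrow(mpg)
n_distinct <- nrow(unique(mpg[, c("cty", "hwy")]))
n_hidden   <- n_points - n_distinct  # points hidden behind another point

# geom = "jitter" adds a little random noise so stacked points separate.
p <- qplot(cty, hwy, data = mpg, geom = "jitter")

# The noise is random and generated when the plot is drawn,
# so fix the seed first if you want the same picture each time.
set.seed(405)
print(p)
```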

Page 28: 01 Intro

[Figure: scatterplot of hwy (y-axis, ticks 15 to 40) against class (x-axis, alphabetical order: 2seater, compact, midsize, minivan, pickup, subcompact, suv)]

qplot(class, hwy, data = mpg)

Page 29: 01 Intro

[Figure: scatterplot of hwy (y-axis, ticks 15 to 40) against class (x-axis, alphabetical order: 2seater, compact, midsize, minivan, pickup, subcompact, suv)]

qplot(class, hwy, data = mpg)

How could we improve this plot?

Brainstorm for 1 minute.

Page 30: 01 Intro

[Figure: scatterplot of hwy (y-axis, ticks 15 to 40) against reorder(class, hwy) (x-axis order: pickup, suv, minivan, 2seater, midsize, subcompact, compact)]

Page 31: 01 Intro

[Figure: scatterplot of hwy (y-axis, ticks 15 to 40) against reorder(class, hwy) (x-axis order: pickup, suv, minivan, 2seater, midsize, subcompact, compact)]

qplot(reorder(class, hwy), hwy, data = mpg)

Incredibly useful technique!
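What reorder() does can be inspected without drawing anything. The sketch below assumes only base R plus the mpg data; class_means and class_by_hwy are names chosen for the example.

```r
library(ggplot2)  # only needed here for the mpg data

# Mean highway mileage per class. reorder() uses the mean by default.
class_means <- tapply(mpg$hwy, mpg$class, mean)
sort(class_means)

# reorder() returns a factor whose levels are sorted by those means,
# which is what puts the classes in increasing order along the x-axis.
class_by_hwy <- reorder(mpg$class, mpg$hwy)
levels(class_by_hwy)
# Expected to match the axis on the slide:
# pickup, suv, minivan, 2seater, midsize, subcompact, compact
```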

Page 32: 01 Intro

[Figure: jittered scatterplot of hwy (y-axis, ticks 15 to 40) against reorder(class, hwy) (x-axis order: pickup, suv, minivan, 2seater, midsize, subcompact, compact)]

qplot(reorder(class, hwy), hwy, data = mpg, geom = "jitter")

Page 33: 01 Intro

[Figure: boxplots of hwy (y-axis, ticks 15 to 40) for each class, ordered by reorder(class, hwy) (x-axis order: pickup, suv, minivan, 2seater, midsize, subcompact, compact)]

qplot(reorder(class, hwy), hwy, data = mpg, geom = "boxplot")

Page 34: 01 Intro

[Figure: jittered points with boxplots drawn over them, hwy (y-axis, ticks 15 to 40) against reorder(class, hwy) (x-axis order: pickup, suv, minivan, 2seater, midsize, subcompact, compact)]

qplot(reorder(class, hwy), hwy, data = mpg, geom = c("jitter", "boxplot"))

Page 35: 01 Intro

Your turn

Read the help for reorder. Redraw the previous plots with class ordered by median hwy.

How would you put the jittered points on top of the boxplots?
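One possible answer, offered as a sketch rather than the intended solution: reorder() accepts a FUN argument for the summary function, and qplot() adds layers in the order the geoms are listed. The names class_by_median and p are illustrative.

```r
library(ggplot2)

# FUN = median switches the ordering statistic from the default mean.
class_by_median <- reorder(mpg$class, mpg$hwy, FUN = median)
levels(class_by_median)

# Layers are drawn in the order given, so listing "boxplot" first
# puts the jittered points on top of the boxes.
p <- qplot(reorder(class, hwy, FUN = median), hwy, data = mpg,
           geom = c("boxplot", "jitter"))
```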

Page 36: 01 Intro

Aside: coding strategy

At the end of each interactive session, you want a summary of everything you did. Two options:

1. Save everything you did with savehistory() then remove the unimportant bits.

2. Build up the important bits as you go. (this is how I work)
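For option 1, a minimal sketch is shown below. The file name is a made-up example, and savehistory() only works in an interactive session whose console keeps a history, so the call is guarded.

```r
# Hypothetical file name; choose whatever suits your project.
history_file <- "stat405-intro.Rhistory"

# savehistory() writes the commands typed so far to a file.
# It fails when R is run non-interactively, hence the guard.
if (interactive()) {
  savehistory(history_file)
}
```

The saved file is plain text, so it can be opened in gedit and trimmed down to the commands worth keeping.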