STAT 260: Lecture 9 - stats.otago.ac.nz


STAT 260: Lecture 9

Mik Black

STAT 260: Lecture 9 Slide 1


More ggplot2...

• Today: faceting and lines
• As always, don’t forget to load the ggplot2 package before we start:

library(ggplot2)

• And later I also use dplyr:

library(dplyr)

• Might not get through all these slides today...


Faceting

• Faceting refers to the technique of making a particular plot across the levels of a discrete variable (i.e., a factor in R).
• ggplot gives us the ability to do this in a single plot call via the facet_wrap function.
• We’ll look at this functionality using one of the data sets that are part of the ggplot2 package - the “mpg” data.
• This is a data set that records the gas mileage of automobiles relative to their other characteristics.


MPG data - variables

• manufacturer: name of manufacturer
• model: model name
• displ: engine displacement, in liters
• year: year of manufacture
• cyl: number of cylinders
• trans: type of transmission
• drv (f = front-wheel drive, r = rear wheel drive, 4 = 4wd)
• cty: city miles per gallon
• hwy: highway miles per gallon
• fl: fuel type
• class: “type” of car
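Since faceting splits a plot across the levels of a discrete variable, it can help to check how many distinct values a candidate variable takes before using it. A minimal sketch, using base R's table function (the object names class_counts and drv_counts are just illustrative):

```r
library(ggplot2)

# Count the cars in each vehicle class and in each drive type
class_counts <- table(mpg$class)
drv_counts <- table(mpg$drv)

print(class_counts)
print(drv_counts)
```

Variables with only a handful of levels, such as class and drv, tend to be the most practical choices for faceting.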


MPG data - structure

str(mpg)

## tibble[,11] [234 x 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...


MPG data - first rows

head(mpg)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~


MPG data - scatterplot of highway versus city mileage

ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point()

[Figure: scatterplot of cty (y-axis) against hwy (x-axis)]

Aside - adding jitter (reminder from last lecture)

• there is a lot of overplotting going on - sometimes adding a little noise improves the plot by making the relationship more obvious (i.e., revealing the overplotted data points):

ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position="jitter")

[Figure: jittered scatterplot of cty against hwy]
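Because jitter is random noise, the plot changes slightly every time it is drawn. If a reproducible plot is wanted, one option is to set the random seed first and control the amount of noise explicitly. A small sketch, assuming the geom_jitter shorthand; the seed and the width/height values are arbitrary choices for illustration:

```r
library(ggplot2)

# Fix the random seed so the jittered positions can be reproduced
set.seed(260)

# geom_jitter() is shorthand for geom_point(position="jitter");
# width and height set the maximum noise added in each direction
p_jit <- ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_jitter(width=0.3, height=0.3)
print(p_jit)
```

Keeping the jitter small relative to the spacing of the data (here hwy and cty are whole numbers) reveals the overplotted points without moving them far from their true values.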

Colour by vehicle class

• let’s use colour to add vehicle class information to the plot:

ggplot(data=mpg, aes(x=hwy, y=cty, colour=class)) + geom_point(position="jitter")

[Figure: jittered scatterplot of cty against hwy, points coloured by class (legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv)]

Hard to see what is going on...

• using colour to denote vehicle class does work, but it is hard to see exactly what the relationship is between city and highway mileage for each class.
• this is where “faceting” comes in - we can ask ggplot to make the scatterplot for each type of vehicle.
• to do this we use the facet_wrap function, along with the ~ operator (you’ll learn more about this later in the course):

ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position='jitter') +
  facet_wrap(~class)

Facet by vehicle class

ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position='jitter') +
  facet_wrap(~class)

[Figure: jittered scatterplots of cty against hwy in a 3 x 3 grid, one panel per class: 2seater, compact, midsize / minivan, pickup, subcompact / suv]

Facet by vehicle class

• we can also specify the number of rows to use for faceting:

ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position='jitter') +
  facet_wrap(~class, nrow=2)

[Figure: jittered scatterplots of cty against hwy in two rows of panels: 2seater, compact, midsize, minivan / pickup, subcompact, suv]

Facet mileage histograms by drive type

• we can use faceting for (almost) any sort of plot:

ggplot(data=mpg, aes(x=hwy)) + geom_histogram(bins=15) + facet_wrap(~drv)

[Figure: histograms of hwy (y-axis: count), one panel per drv: 4, f, r]

More information: engine displacement

• engine displacement, displ, is a continuous variable:

ggplot(data=mpg, aes(x=displ)) + geom_histogram(bins=15, colour='black', fill='white')

[Figure: histogram of displ (y-axis: count)]

Engine displacement

• definitely varies by vehicle class:

ggplot(data=mpg, aes(x=class, y=displ)) + geom_boxplot() +
  geom_jitter(width=0.15, alpha=0.3)

[Figure: boxplots of displ for each class (2seater, compact, midsize, minivan, pickup, subcompact, suv), with jittered points overlaid]

Colour by engine displacement

• can also colour by a continuous variable (mentioned this at the end of the last lecture):

ggplot(data=mpg, aes(x=hwy, y=cty, colour=displ)) + geom_point(position="jitter")

[Figure: jittered scatterplot of cty against hwy, points coloured on a continuous displ scale (2 to 7)]

Facet by vehicle class & colour by displacement

• and now let’s facet by class!

ggplot(data=mpg, aes(x=hwy, y=cty, colour=displ)) + geom_point(position='jitter') +
  facet_wrap(~class)

[Figure: jittered scatterplots of cty against hwy, one panel per class, points coloured on a continuous displ scale (2 to 7)]

Linking point size to a variable

• instead of colour we could use point size to include information about a variable:

ggplot(data=mpg, aes(x=hwy, y=cty, size=displ)) + geom_point()

[Figure: scatterplot of cty against hwy, point size scaled by displ (legend: 2 to 7)]

Linking point size to a variable (alpha)

• add transparency via alpha levels:

ggplot(data=mpg, aes(x=hwy, y=cty, size=displ)) +
  geom_point(alpha=0.2)

[Figure: scatterplot of cty against hwy, semi-transparent points with size scaled by displ (legend: 2 to 7)]

Linking point size to a variable (with alpha and jitter)

• now add some jitter...

ggplot(data=mpg, aes(x=hwy, y=cty, size=displ)) + geom_point(alpha=0.2, position='jitter')

[Figure: jittered scatterplot of cty against hwy, semi-transparent points with size scaled by displ (legend: 2 to 7)]

Local aesthetics

• ggplot allows us to specify aesthetics locally (i.e., specific to a geom).
• if the local aesthetics are different to the aes values specified in the main ggplot call, then the local aesthetics will be used for that particular geometric object.
• this becomes useful when customising multiple layers in a single plot - we’ll see an example of this later in the lecture.
• here is an example of specifying the point size within geom_point (it gives the same result as above):

ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_point(aes(size=displ), alpha=0.2, position='jitter')

Local aesthetics

ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_point(aes(size=displ), alpha=0.2, position='jitter')

[Figure: jittered scatterplot of cty against hwy, semi-transparent points with size scaled by displ (legend: 2 to 7)]
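A related point that is easy to trip over: inside aes() a name is treated as a variable to map, whereas outside aes() a value is applied as a constant to every point (as alpha=0.2 is in the code above). The following sketch contrasts the two for point size; the constant size of 3 is an arbitrary value chosen for illustration:

```r
library(ggplot2)

# Mapped locally: size varies with displ (inside aes(), within the geom)
p_mapped <- ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_point(aes(size=displ), alpha=0.2, position='jitter')

# Set to a constant: every point is drawn at size 3 (outside aes())
p_fixed <- ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_point(size=3, alpha=0.2, position='jitter')

print(p_mapped)
print(p_fixed)
```

Only the first plot gets a size legend, because only there is size tied to a variable in the data.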

Adding lines

• another very powerful feature of ggplot is the ability to add lines to a plot.
• in particular, lines that are generated by the application of a statistical procedure to the data in the plot. For example:
  - linear regression
  - local smoothing techniques such as “loess”
• here we are using the geom_smooth geometric object.
• if no method is specified, geom_smooth will choose a method based on sample size: “loess” for n<1000, otherwise a generalised additive model is used (don’t worry about this for now...)
• the syntax is:

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth()

Adding lines

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth()

[Figure: scatterplot of hwy against displ with the default smooth curve and confidence band]
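Rather than relying on the default choice, the smoothing method can also be requested explicitly. The sketch below asks for loess by name and adjusts its span argument, which controls how closely the curve follows the data; the value 0.5 is an arbitrary choice for illustration, not a recommendation:

```r
library(ggplot2)

# mpg has fewer than 1000 rows, so loess is what the default would pick here
n_obs <- nrow(mpg)
print(n_obs)

# Request loess explicitly; a smaller span gives a wigglier curve,
# a larger span gives a smoother one
p_loess <- ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method="loess", formula=y ~ x, span=0.5)
print(p_loess)
```

Stating the method and formula explicitly also avoids the message that geom_smooth prints when it has to choose them itself.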

Adding lines: straight line

• use geom_smooth(method=lm) to fit a linear model (i.e., simple linear regression) to the data:

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(method=lm)

[Figure: scatterplot of hwy against displ with the fitted straight line and confidence band]

Linear regression

• Here the geom_smooth() function is fitting a linear regression, and then adding that line (and confidence interval, if se=TRUE) to the plot. Let’s check manually:

linreg = lm(hwy ~ displ, data=mpg)
summary(linreg)$coefficients

##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 35.697651  0.7203676  49.55477 2.123519e-125
## displ       -3.530589  0.1945137 -18.15085  2.038974e-46
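To see what these coefficients mean for the line on the plot, a prediction can be worked out by hand as intercept + slope x displ. For a 2 litre engine (a value picked only for illustration) this is roughly 35.70 - 3.53 x 2, or about 28.6 miles per gallon. A short sketch comparing the hand calculation with predict:

```r
library(ggplot2)

linreg <- lm(hwy ~ displ, data=mpg)
b <- coef(linreg)

# Predicted highway mileage for a 2 litre engine, by hand
by_hand <- unname(b["(Intercept)"] + b["displ"] * 2)

# The same prediction via predict()
via_predict <- unname(predict(linreg, newdata=data.frame(displ=2)))

print(c(by_hand=by_hand, via_predict=via_predict))
```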

Add regression line to plot (base R)

plot(mpg$displ, mpg$hwy)
abline(linreg)

[Figure: base R scatterplot of mpg$hwy against mpg$displ with the fitted regression line]

Calculating and adding confidence intervals

newx = seq(min(mpg$displ), max(mpg$displ), by = 0.05)
conf_interval = predict(linreg, newdata=data.frame(displ=newx),
                        interval="confidence", level = 0.95)
ci = data.frame(newx, conf_interval)
head(ci)

##   newx      fit      lwr      upr
## 1 1.60 30.04871 29.17768 30.91974
## 2 1.65 29.87218 29.01686 30.72750
## 3 1.70 29.69565 28.85590 30.53540
## 4 1.75 29.51912 28.69479 30.34345
## 5 1.80 29.34259 28.53352 30.15166
## 6 1.85 29.16606 28.37208 29.96005
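For anyone curious about where the lwr and upr columns come from: each limit is the fitted value plus or minus a t quantile (on the residual degrees of freedom) times the standard error of the fitted value. The sketch below rebuilds the limits that way and prints them next to the ones from predict; lwr_hand and upr_hand are illustrative names:

```r
library(ggplot2)

linreg <- lm(hwy ~ displ, data=mpg)
newx <- seq(min(mpg$displ), max(mpg$displ), by=0.05)

# Ask predict() for the fitted values and their standard errors
pred <- predict(linreg, newdata=data.frame(displ=newx), se.fit=TRUE)

# 95% interval: fit +/- t quantile (residual df) * standard error of the fit
tcrit <- qt(0.975, df=pred$df)
lwr_hand <- pred$fit - tcrit * pred$se.fit
upr_hand <- pred$fit + tcrit * pred$se.fit

# Compare with the interval that predict() computes directly
ci <- predict(linreg, newdata=data.frame(displ=newx),
              interval="confidence", level=0.95)
print(head(cbind(newx, lwr_hand, upr_hand, ci)))
```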

Calculating and adding confidence intervals

plot(mpg$displ, mpg$hwy)
abline(linreg, col="lightblue")
lines(ci$newx, ci$lwr, col="blue", lty=2)
lines(ci$newx, ci$upr, col="blue", lty=2)

[Figure: base R scatterplot of mpg$hwy against mpg$displ with the regression line (light blue) and dashed confidence limits (blue)]

Check against ggplot

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', se=TRUE) +
  geom_abline(intercept=linreg$coef[1], slope=linreg$coef[2], colour='red') +
  geom_line(data=ci, aes(x=newx, y=lwr)) + geom_line(data=ci, aes(x=newx, y=upr))

[Figure: scatterplot of hwy against displ with the geom_smooth line and band, the manually fitted regression line (red) and the manually calculated confidence limits (black) overlaid]
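The visual check can be backed up numerically. One possible approach, sketched here, is to pull the values that geom_smooth computed out of the plot with layer_data and compare them with predict at the same x positions; the names sm and max_diff are illustrative:

```r
library(ggplot2)

p <- ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', formula=y ~ x, se=TRUE)

# Layer 2 is the smooth: x, fitted y, and the band limits ymin/ymax
sm <- layer_data(p, 2)

# Recompute the same quantities from the regression at the same x values
linreg <- lm(hwy ~ displ, data=mpg)
ci <- predict(linreg, newdata=data.frame(displ=sm$x),
              interval="confidence", level=0.95)

# Largest discrepancy between ggplot's values and the manual ones
max_diff <- max(abs(sm$y - ci[, "fit"]),
                abs(sm$ymin - ci[, "lwr"]),
                abs(sm$ymax - ci[, "upr"]))
print(max_diff)
```

A difference that is zero up to rounding error confirms that the shaded band is the same 95% confidence interval calculated by hand.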

Adding lines: remove confidence interval

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(se=FALSE)

[Figure: scatterplot of hwy against displ with the default smooth curve and no confidence band]

Colour points by class

• It would be useful to colour the points on the plot by vehicle class (2seater, compact etc)
• Intuitively we can do this by setting colour=class.
• Works when we only have geom_point - what happens when we also have the geom_smooth layer in the plot?

Colour points by class: oops...

ggplot(data=mpg, aes(x=displ, y=hwy, colour=class)) + geom_point() + geom_smooth()

[Figure: scatterplot of hwy against displ with points coloured by class and a separate smooth curve fitted for each class (legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv)]

What happened?

• The colour=class specification in the main ggplot aesthetics was used for all geometric objects in the plot.
• What if we only want it to apply to geom_point but not geom_smooth?
• Remember the example with point size from above...?
• We can specify the colour=class aesthetic within geom_point so that it is only used for that layer:

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point(aes(colour=class)) +
  geom_smooth()

Local aesthetics to the rescue!

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point(aes(colour=class)) +
  geom_smooth()

[Figure: scatterplot of hwy against displ with points coloured by class and a single smooth curve through all the data (legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv)]
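One way to confirm what the global versus local colour specification does to the smooth layer is to count how many separate lines it fitted. This sketch uses method='lm' instead of the default smoother purely to keep the example simple (an assumption made here, since some classes contain very few cars); the names n_global and n_local are illustrative:

```r
library(ggplot2)

# colour=class in the main ggplot call: inherited by geom_smooth as well
p_global <- ggplot(data=mpg, aes(x=displ, y=hwy, colour=class)) +
  geom_point() +
  geom_smooth(method='lm', formula=y ~ x, se=FALSE)

# colour=class only inside geom_point: geom_smooth sees all the data together
p_local <- ggplot(data=mpg, aes(x=displ, y=hwy)) +
  geom_point(aes(colour=class)) +
  geom_smooth(method='lm', formula=y ~ x, se=FALSE)

# Number of separate lines fitted by the smooth layer (layer 2) in each plot
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local <- length(unique(layer_data(p_local, 2)$group))
print(c(global=n_global, local=n_local))
```

With the global specification there is one line per vehicle class; with the local specification there is a single line for the whole data set.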

Lines and facets

• we can also add lines to our faceted plots:

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method=lm, se=FALSE) + facet_wrap(~drv, nrow=1)

[Figure: scatterplots of hwy against displ with fitted straight lines, one panel per drv: 4, f, r]

Caution! Faceting and confidence intervals

• When the geom_smooth function is used to add lines (and confidence intervals), the calculations are performed per facet group.
• This can lead to differences in the confidence intervals that are calculated, compared to a regression model fit to the full data set (with an interaction term for the faceting variable):
  - the regression lines will be the same
  - the confidence intervals will be different
• This occurs because in the full regression model, all of the data points are used to estimate the standard error, whereas in the per-facet model, only the data points from that group are used.
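The difference can be seen in numbers for a single prediction. The sketch below fits a regression to the rear wheel drive cars only (which mimics the per-facet calculation) and a full model with a drv interaction, then predicts for a 5 litre rear wheel drive car from both; the 5 litre value and the object names are arbitrary choices for illustration:

```r
library(ggplot2)

# Per-facet style fit: only the rear wheel drive cars
rwd <- subset(mpg, drv == "r")
fit_rwd <- lm(hwy ~ displ, data=rwd)

# Full model: all cars, with a separate line for each drv group
fit_full <- lm(hwy ~ displ * drv, data=mpg)

# Predict for a 5 litre rear wheel drive car from both models
new_car <- data.frame(displ=5, drv="r")
ci_rwd <- predict(fit_rwd, newdata=new_car, interval="confidence")
ci_full <- predict(fit_full, newdata=new_car, interval="confidence")

# Same fitted value, but the interval limits differ
print(rbind(per_facet=ci_rwd[1, ], full_model=ci_full[1, ]))
```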

Faceting and confidence intervals

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', se=TRUE) + facet_wrap(~drv)

[Figure: scatterplots of hwy against displ with fitted straight lines and confidence bands, one panel per drv: 4, f, r]

Close up for the rear wheel drive group

rwd = filter(mpg, drv=="r")
ggplot(data=rwd, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', se=TRUE) + xlim(3,7) + ylim(10,30)

[Figure: scatterplot of hwy against displ for the rear wheel drive cars only, with the fitted straight line and confidence band]

Regression model, with drv interaction term

linreg2 = lm(hwy ~ displ*drv, data=mpg)
summary(linreg2)$coef

##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 30.6831131  1.0960630  27.993933 1.018637e-75
## displ       -2.8784863  0.2637577 -10.913372 1.392287e-22
## drvf         6.6949631  1.5670461   4.272346 2.841696e-05
## drvr        -4.9033952  4.1821302  -1.172464 2.422346e-01
## displ:drvf  -0.7243016  0.4979149  -1.454669 1.471361e-01
## displ:drvr   1.9550477  0.8147555   2.399552 1.721899e-02
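In this table the (Intercept) and displ rows describe the baseline group (4wd), and the drvr and displ:drvr rows are adjustments for the rear wheel drive group. Adding them gives the rear wheel drive line: intercept 30.683 - 4.903 = 25.780 and slope -2.878 + 1.955 = -0.923. The sketch below does this arithmetic and compares it with a regression fitted to the rear wheel drive cars alone (rwd_line and rwd_only are illustrative names):

```r
library(ggplot2)

linreg2 <- lm(hwy ~ displ * drv, data=mpg)
b <- coef(linreg2)

# Line for the rear wheel drive group: baseline (4wd) terms plus the "r" adjustments
rwd_line <- c(intercept=unname(b["(Intercept)"] + b["drvr"]),
              slope=unname(b["displ"] + b["displ:drvr"]))

# Compare with a regression fitted to the rear wheel drive cars only
rwd_only <- coef(lm(hwy ~ displ, data=subset(mpg, drv == "r")))

print(rwd_line)
print(rwd_only)
```

The two sets of numbers agree, which is why the full model and the per-facet fit draw the same regression line for each group.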

Confidence intervals on ggplot

• Confidence intervals from the full regression model (using all data, with the drv interaction term: black lines) are narrower than the “per-facet” interval calculated by geom_smooth.

[Figure: hwy against displ for the rear wheel drive group, showing the geom_smooth line and confidence band with the full-model confidence limits overlaid as black lines]
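The code for this last plot is not shown on the slide. The following is one way a similar plot might be produced, not necessarily how the original was made: it predicts from the full model over a grid of displacements with drv fixed at "r", then overlays those limits on the rear wheel drive close-up. The names grid, ci2 and p_ci are hypothetical:

```r
library(ggplot2)

rwd <- subset(mpg, drv == "r")
linreg2 <- lm(hwy ~ displ * drv, data=mpg)

# Confidence limits from the full model over a grid of displacements (drv fixed at "r")
grid <- data.frame(displ=seq(3, 7, by=0.05), drv="r")
ci2 <- data.frame(grid, predict(linreg2, newdata=grid, interval="confidence"))

# Per-facet style band from geom_smooth, with the full-model limits as black lines
p_ci <- ggplot(data=rwd, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', formula=y ~ x, se=TRUE) +
  geom_line(data=ci2, aes(y=lwr)) + geom_line(data=ci2, aes(y=upr)) +
  xlim(3, 7) + ylim(10, 30)
print(p_ci)
```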