Practical on Exploratory Data Analysis with R

Dr. Emile Chimusa

Department of Integrative Biomedical Sciences

University of Cape Town

May 9, 2016

Contents

1 Introduction

2 Advanced Data Manipulation with R

2.1 Manipulate Multi-tables

3 Advanced Data Visualization with R

3.1 Data Visualization with ggplot2

3.2 Data Visualization with Lattice Graphics

4 Computing Principal Components

5 Tutorial

1 Introduction

In the previous chapter, we introduced basic manipulation of data. This chapter covers more advanced data cleaning and “tidy data”, and introduces R packages that enable data manipulation, analysis, and visualization. Upon completing this chapter, you will be able to use the dplyr package in R to effectively manipulate and conditionally compute summary statistics over subsets of a “big” dataset containing many observations.

R often uses data frames to store heterogeneous tabular data: tabular, meaning that individuals or observations are typically represented in rows, while variables or features are represented as columns; heterogeneous, meaning that columns/features/variables can be of different classes (one variable, e.g., age, can be numeric, while another, e.g., cause of death, can be text).

In this chapter, we explore the dplyr package (https://github.com/hadley/dplyr), a relatively new R package that makes data manipulation fast and easy. It imports functionality from another package called magrittr that allows you to chain commands together into a pipeline, so that you write code the way you think about the problem.
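As a small illustration of the data frame description above, the sketch below builds a three-row data frame by hand. The object name patients and its values are invented for this example and are not part of the gapminder data.

```r
# Hypothetical toy data: one row per individual, one column per variable.
patients <- data.frame(
  age   = c(34, 61, 47),                      # numeric column
  cause = c("malaria", "stroke", "injury"),   # text column
  stringsAsFactors = FALSE                    # keep text as character, not factor
)

col_classes <- sapply(patients, class)  # class of each column
col_classes
dim(patients)                           # number of rows, number of columns
```

The two columns have different classes, which is what “heterogeneous” means here.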

2 Advanced Data Manipulation with R

In this section, we will mainly explore the gapminder data (gapminder.csv). This particular dataset has 1704 observations on six variables:

(a) country: a categorical variable (aka “factor”) with 142 levels.

(b) continent: a categorical variable with 5 levels.

(c) year: going from 1952 to 2007 in increments of 5 years.

(d) pop: population.

(e) gdpPercap: GDP per capita.

(f) lifeExp: life expectancy.

Let’s load the data first. There are two ways to load the above data. One is to use RStudio’s menus to select a file from your computer (tools -> import dataset -> from text file). The better way is to save the data and read it with an R function such as read.csv. It’s important to tell R that the file has a header, which gives R the names of the columns; we do this with the header=TRUE argument. We can type the name of the object itself (GM_DATA) to view the entire data frame. However, doing this with a large data frame floods the console with output.

> GM_DATA <- read.csv("gapminder.csv", header=TRUE)

> # Alternatively, read directly from the web:

> # GM_DATA <- read.csv(url("http://bioconnector.org/data/gapminder.csv"), header=TRUE)

> class(GM_DATA)

[1] "data.frame"

> #GM_DATA
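When a data frame is too large to print in full, a few base R functions give a compact overview instead. The sketch below uses a made-up stand-in (big) rather than GM_DATA so that it runs without the CSV file; the same calls should apply to GM_DATA.

```r
# Stand-in for a large data frame (values are made up for illustration).
big <- data.frame(id = 1:10000, value = sqrt(1:10000))

dim(big)                 # number of rows and columns
str(big)                 # column names, types, and the first few values
first_rows <- head(big, 3)   # only the first three rows
first_rows
summary(big$value)       # five-number summary plus the mean
```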

Let’s go back to our gapminder data. It’s currently stored in an object called GM_DATA. Remember that GM_DATA is a data.frame object, as we can see from its class(). Also, remember what happens if we try to display GM_DATA on the screen: it’s too big to show, and it fills up our console.

> class(GM_DATA)

[1] "data.frame"

> #GM_DATA

The tbl_df class in the dplyr package is basically an improved version of the data.frame object. The main advantage of using a tbl_df over a regular data frame is the printing: tbl objects only print a few rows and all the columns that fit on one screen, describing the rest as text. We can turn the GM_DATA data frame into a tbl_df using the tbl_df() function. (Newer dplyr releases deprecate tbl_df() in favour of as_tibble(), which performs the same conversion.)

> library(dplyr)

> GM_DATA <- tbl_df(GM_DATA)

> class(GM_DATA)

[1] "tbl_df" "tbl" "data.frame"

> #GM_DATA

You don’t have to turn all your data.frame objects into tbl_df objects, but it does make working with large datasets a bit easier.

The dplyr package gives you a handful of useful verbs for managing data. On their own they don’t do anything that base R can’t do. In addition, they all take a data.frame or tbl_df as their first argument.

(1) filter: if you want to keep only the rows of the data where some condition is true, use the filter() function. The first argument is the data frame or tbl, and the second argument is the condition that the rows must satisfy.

> # Show only stats for the year 1982

> filter(GM_DATA, year==1982)

Source: local data frame [142 x 6]

country continent year lifeExp pop gdpPercap

(fctr) (fctr) (int) (dbl) (int) (dbl)

1 Afghanistan Asia 1982 39.854 12881816 978.0114

2 Albania Europe 1982 70.420 2780097 3630.8807

3 Algeria Africa 1982 61.368 20033753 5745.1602

4 Angola Africa 1982 39.942 7016384 2756.9537

5 Argentina Americas 1982 69.942 29341374 8997.8974

6 Australia Oceania 1982 74.740 15184200 19477.0093

7 Austria Europe 1982 73.180 7574613 21597.0836

8 Bahrain Asia 1982 69.052 377967 19211.1473

9 Bangladesh Asia 1982 50.009 93074406 676.9819

10 Belgium Europe 1982 73.930 9856303 20979.8459

.. ... ... ... ... ... ...

> # Show only stats for the US

> filter(GM_DATA, country=="United States")

1 United States Americas 1952 68.440 157553000 13990.48

> # Show only stats for American countries in 1997

> filter(GM_DATA, continent=="Americas" & year==1997)

1 Argentina Americas 1997 73.275 36203463 10967.282

2 Bolivia Americas 1997 62.050 7693188 3326.143

3 Brazil Americas 1997 69.388 168546719 7957.981

4 Canada Americas 1997 78.610 30305843 28954.926

5 Chile Americas 1997 75.816 14599929 10118.053

6 Colombia Americas 1997 70.313 37657830 6117.362

7 Costa Rica Americas 1997 77.260 3518107 6677.045

8 Cuba Americas 1997 76.151 10983007 5431.990

9 Dominican Republic Americas 1997 69.957 7992357 3614.101

10 Ecuador Americas 1997 72.312 11911819 7429.456

.. ... ... ... ... ... ...

5


> # Show only stats for countries with per-capita GDP of less than 300 OR a

> # life expectancy of less than 30. What happened those years?

> filter(GM_DATA, gdpPercap<300 | lifeExp<30)

2 Congo, Dem. Rep. Africa 2002 44.966 55379852 241.1659

3 Congo, Dem. Rep. Africa 2007 46.462 64606759 277.5519

4 Guinea-Bissau Africa 1952 32.500 580653 299.8503

5 Lesotho Africa 1952 42.138 748747 298.8462

6 Rwanda Africa 1992 23.599 7290203 737.0686

(2) select: The filter() function returns only the rows matching a condition. The select() function, in contrast, subsets the data by column. The first argument is the data, and subsequent arguments are the columns you want. Let’s get just the year and population variables.

> select(GM_DATA, year, pop)

Source: local data frame [1,704 x 2]

year pop

(int) (int)

1 1952 8425333

2 1957 9240934

3 1962 10267083

4 1967 11537966

5 1972 13079460

6 1977 14880372

7 1982 12881816

8 1987 13867957

9 1992 16317921

10 1997 22227415

.. ... ...

(3) mutate: The mutate() function adds new columns to the data. Remember, the variable in our dataset is GDP per capita, which is the total GDP divided by the population size for that country, for that year. Let’s mutate this dataset and add a column called gdp:

> mutate(GM_DATA, gdp=pop*gdpPercap)

country continent year lifeExp pop gdpPercap gdp

(fctr) (fctr) (int) (dbl) (int) (dbl) (dbl)

1 Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330

.. ... ... ... ... ... ... ...

With mutate(), you can add one variable and then, in the same call, add further variables based on it. Let’s make another column that’s GDP in billions.

> mutate(GM_DATA, gdp=pop*gdpPercap, gdpBil=gdp/1e9)


country continent year lifeExp pop gdpPercap gdp gdpBil

(fctr) (fctr) (int) (dbl) (int) (dbl) (dbl) (dbl)

1 Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330 6.567086

.. ... ... ... ... ... ... ... ...

(4) arrange: The arrange() function does what it sounds like. It takes a data frame or tbl and arranges (sorts) it by the column(s) of interest. The first argument is the data, and subsequent arguments are the columns to sort on. Use the desc() function to sort in descending order.

> arrange(GM_DATA, lifeExp)

1 Rwanda Africa 1992 23.599 7290203 737.0686


3 Gambia Africa 1952 30.000 284320 485.2307

4 Angola Africa 1952 30.015 4232095 3520.6103

5 Sierra Leone Africa 1952 30.331 2143249 879.7877


7 Cambodia Asia 1977 31.220 6978607 524.9722

8 Mozambique Africa 1952 31.286 6446316 468.5260

9 Sierra Leone Africa 1957 31.570 2295678 1004.4844

10 Burkina Faso Africa 1952 31.975 4469979 543.2552

.. ... ... ... ... ... ...

> arrange(GM_DATA, year, desc(lifeExp))




1 Norway Europe 1952 72.67 3327728 10095.422

2 Iceland Europe 1952 72.49 147962 7267.688

3 Netherlands Europe 1952 72.13 10381988 8941.572

4 Sweden Europe 1952 71.86 7124673 8527.845

5 Denmark Europe 1952 70.78 4334000 9692.385

6 Switzerland Europe 1952 69.62 4815000 14734.233

7 New Zealand Oceania 1952 69.39 1994794 10556.576

8 United Kingdom Europe 1952 69.18 50430000 9979.508

9 Australia Oceania 1952 69.12 8691212 10039.596

10 Canada Americas 1952 68.75 14785584 11367.161

.. ... ... ... ... ... ...

(5) summarize: The summarize() function collapses multiple values into a single value. On its own, summarize() doesn’t seem all that useful.

> summarize(GM_DATA, mean(pop))


mean(pop)

(dbl)

1 29601212

> summarize(GM_DATA, meanpop=mean(pop))

meanpop

(dbl)

1 29601212

> summarize(GM_DATA, n())


n()

(int)

1 1704

> summarize(GM_DATA, n_distinct(country))


n_distinct(country)

(int)

1 142

(6) group_by: The summarize() function isn’t very useful on its own. The group_by() function, however, takes an existing tbl and converts it into a grouped tbl, on which operations are performed by group.

> GM_DATA

.. ... ... ... ... ... ...

> group_by(GM_DATA, continent)

Groups: continent [5]

.. ... ... ... ... ... ...

> class(group_by(GM_DATA, continent))

[1] "grouped_df" "tbl_df" "tbl" "data.frame"

The real power comes when group_by() and summarize() are used together. Let’s take the same grouped tbl from last time and pass it as input to summarize(), where we get the mean population size per continent. We can also group by more than one variable.

> summarize(group_by(GM_DATA, continent), mean(pop))


continent mean(pop)

(fctr) (dbl)

1 Africa 9916003

2 Americas 24504795

3 Asia 77038722

4 Europe 17169765

5 Oceania 8874672

> group_by(GM_DATA, continent, year)


Groups: continent, year [60]

.. ... ... ... ... ... ...

> summarize(group_by(GM_DATA, continent, year), mean(pop))


Groups: continent [?]

continent year mean(pop)

(fctr) (int) (dbl)

1 Africa 1952 4570010

2 Africa 1957 5093033

3 Africa 1962 5702247

4 Africa 1967 6447875

5 Africa 1972 7305376

6 Africa 1977 8328097

7 Africa 1982 9602857

8 Africa 1987 11054502

9 Africa 1992 12674645

10 Africa 1997 14304480

.. ... ... ...
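To connect the six verbs to the earlier remark that base R can do the same things, here is a sketch that runs several verbs next to one possible base R equivalent. The toy data frame is invented for illustration and only mimics the gapminder column layout; the base R lines are one option among several.

```r
library(dplyr)

# Made-up miniature of the gapminder layout, for illustration only.
toy <- data.frame(
  country   = c("A", "A", "B", "B", "C", "C"),
  continent = c("X", "X", "X", "X", "Y", "Y"),
  year      = c(2002, 2007, 2002, 2007, 2002, 2007),
  pop       = c(10, 12, 20, 22, 30, 36),
  stringsAsFactors = FALSE
)

# filter(): keep rows matching a condition
f_dplyr <- filter(toy, year == 2007)
f_base  <- toy[toy$year == 2007, ]

# select(): keep only some columns
s_dplyr <- select(toy, year, pop)
s_base  <- toy[, c("year", "pop")]

# mutate(): add a column computed from existing ones
m_dplyr <- mutate(toy, pop2 = pop * 2)
m_base  <- transform(toy, pop2 = pop * 2)

# arrange(): sort rows, here by descending population
a_dplyr <- arrange(toy, desc(pop))
a_base  <- toy[order(-toy$pop), ]

# group_by() + summarize(): one summary row per group
g_dplyr <- summarize(group_by(toy, continent), meanpop = mean(pop))
g_base  <- aggregate(pop ~ continent, data = toy, FUN = mean)
```

The dplyr versions read more uniformly because every verb takes the data as its first argument, which is also what makes them easy to chain.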

The dplyr package imports functionality from the magrittr package that lets you pipe the output of one function to the input of another, so you can avoid nesting functions. The pipe looks like this: %>%. You don’t have to load the magrittr package to use it, since dplyr imports its functionality when you load the dplyr package. Think of the head() function. It expects a data frame as its first argument, and the next argument is the number of lines to print. The two commands below are equivalent:

> head(GM_DATA, 5)

> GM_DATA %>% head(5)
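The equivalence between the piped and the nested form can be checked directly. The sketch below uses small made-up objects (x and df) instead of GM_DATA so that it runs on its own.

```r
library(dplyr)  # attaching dplyr makes the %>% operator available

x  <- c(4, 9, 16)
df <- data.frame(a = 1:10)

# x %>% f() is f(x); x %>% f(y) is f(x, y)
same_sqrt  <- identical(x %>% sqrt(), sqrt(x))
same_round <- identical(x %>% round(1), round(x, 1))
same_head  <- identical(df %>% head(5), head(df, 5))

c(same_sqrt, same_round, same_head)
```

In each case the value on the left of %>% becomes the first argument of the function on the right.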









What if we wanted to get the life expectancy and GDP averaged across all Asian countries for each year? We would perform the following steps:

1. Take the GM_DATA dataset

2. mutate() it to add raw GDP

3. filter() it to restrict to Asian countries only

4. group_by() year

5. and summarize() it to get the mean life expectancy and mean GDP.

Translating the above into R code, we have the following:

> mutate(GM_DATA, gdp=gdpPercap*pop)

.. ... ... ... ... ... ... ...

Wrap that whole command with filter().

> filter(mutate(GM_DATA, gdp=gdpPercap*pop), continent=="Asia")

.. ... ... ... ... ... ... ...

Wrap that again with group_by():

> group_by(filter(mutate(GM_DATA, gdp=gdpPercap*pop), continent=="Asia"), year)


Groups: year [12]

.. ... ... ... ... ... ... ...

Finally, wrap everything with summarize():

> library(dplyr)

> summarize(group_by(filter(mutate(GM_DATA, gdp=gdpPercap*pop),continent=="Asia"), year),mean(lifeExp), mean(gdp))


year mean(lifeExp) mean(gdp)

(int) (dbl) (dbl)

1 1952 46.31439 34095762661

2 1957 49.31854 47267432088

3 1962 51.56322 60136869012

4 1967 54.66364 84648519224

5 1972 57.31927 124385747313

6 1977 59.61056 159802590186

7 1982 62.61794 194429049919

8 1987 64.85118 241784763369

9 1992 66.53721 307100497486

10 1997 68.02052 387597655323

11 2002 69.23388 458042336179

12 2007 70.72848 627513635079

The pipe operator %>% allows you to pass data frame or tbl objects from one function to another, so long as those functions expect a data frame or tbl as their first argument. If you read the steps above aloud as “take the data, then mutate, then filter, ...”, writing the code is as simple as replacing the word “then” with the symbol %>%.

> GM_DATA %>%

+ mutate(gdp=gdpPercap*pop) %>%

+ filter(continent=="Asia") %>%

+ group_by(year) %>%

+ summarize(mean(lifeExp), mean(gdp))



year mean(lifeExp) mean(gdp)

(int) (dbl) (dbl)

1 1952 46.31439 34095762661

2 1957 49.31854 47267432088

3 1962 51.56322 60136869012

4 1967 54.66364 84648519224

5 1972 57.31927 124385747313

6 1977 59.61056 159802590186

7 1982 62.61794 194429049919

8 1987 64.85118 241784763369

9 1992 66.53721 307100497486

10 1997 68.02052 387597655323

11 2002 69.23388 458042336179

12 2007 70.72848 627513635079
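The summary columns above get unwieldy names such as mean(lifeExp). A possible refinement, sketched here on a made-up data frame (toy, with invented values), is to name the summaries inside summarize() so that later steps can refer to them easily.

```r
library(dplyr)

# Invented values; only the column names follow the gapminder layout.
toy <- data.frame(
  continent = c("Asia", "Asia", "Asia", "Asia", "Europe", "Europe"),
  year      = c(1952, 1952, 2007, 2007, 1952, 2007),
  lifeExp   = c(40, 50, 70, 72, 65, 80),
  pop       = c(100, 200, 150, 250, 50, 60),
  gdpPercap = c(1, 2, 3, 4, 10, 20),
  stringsAsFactors = FALSE
)

asia_by_year <- toy %>%
  mutate(gdp = gdpPercap * pop) %>%          # raw GDP
  filter(continent == "Asia") %>%            # Asian rows only
  group_by(year) %>%                         # one group per year
  summarize(meanLifeExp = mean(lifeExp),     # named summary columns
            meanGdp     = mean(gdp))

asia_by_year
```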

So far we’ve dealt exclusively with tidy data: data that’s easy to work with, manipulate, and visualize. That’s because our gapminder dataset has two key properties:

(1.) Each column is a variable.

(2.) Each row is an observation.

You can read a lot more about tidy data at http://www.jstatsoft.org/v59/i10/paper. Let’s load some untidy data and see if we can spot the difference. This is some made-up data for five different patients (Jon, Ann, Bill, Kate, and Joe) given three different drugs (A, B, and C), with their heart rate measured on each drug.

> hr <- read.csv("heartrate.csv")

> hr

name a b c

1 jon 60 65 70

2 ann 65 70 75

3 bill 70 75 80

4 kate 75 80 85

5 joe 80 85 90

Notice how, in the gapminder data, each variable (country, continent, year, lifeExp, pop, and gdpPercap) is in its own column. In the heart rate data, we have three variables: name, drug, and heart rate. Name is in a column, but drug is in the header row, and heart rate is scattered around the entire table. If we wanted to do things like filter the dataset where drug=="a" or heartrate>=80, we couldn’t, because these variables aren’t in columns.

The tidyr package helps with this. There are other functions in the tidyr package, but one particularly handy one is gather(), which takes multiple columns and gathers them into key-value pairs: it makes “wide” data longer. We can get a little help with ?gather.

> library(tidyr)

> hr %>% gather(key=drug, value=heartrate, a,b,c)

name drug heartrate

1 jon a 60

2 ann a 65

3 bill a 70

4 kate a 75

5 joe a 80

6 jon b 65

7 ann b 70

8 bill b 75

9 kate b 80

10 joe b 85

11 jon c 70

12 ann c 75

13 bill c 80

14 kate c 85

15 joe c 90

> hr %>% gather(key=drug, value=heartrate, -name)

name drug heartrate

1 jon a 60

2 ann a 65

3 bill a 70

4 kate a 75

5 joe a 80

6 jon b 65

7 ann b 70

8 bill b 75

9 kate b 80

10 joe b 85

11 jon c 70

12 ann c 75

13 bill c 80

14 kate c 85

15 joe c 90
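Recent tidyr releases (1.0.0 and later) also provide pivot_longer(), which the tidyr documentation recommends over gather() for new code. The sketch below applies both to a two-patient excerpt of the heart rate table (hr_demo, typed in by hand so the example runs without heartrate.csv) and checks that they carry the same values.

```r
library(tidyr)

# First two rows of the heart rate table above, entered by hand.
hr_demo <- data.frame(
  name = c("jon", "ann"),
  a = c(60, 65), b = c(65, 70), c = c(70, 75),
  stringsAsFactors = FALSE
)

# gather(): every column except name becomes a drug/heartrate pair
long_gather <- gather(hr_demo, key = drug, value = heartrate, -name)

# pivot_longer(): the same reshaping with the newer interface
long_pivot <- pivot_longer(hr_demo, cols = -name,
                           names_to = "drug", values_to = "heartrate")

long_gather
long_pivot
```

The two results hold the same rows but in a different order: gather() stacks one source column after another, while pivot_longer() walks through the table patient by patient.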

If we create a new data frame that’s a tidy version of hr, we can do the kinds of manipulations we talked about before.

> # Create a new data.frame

> hrtidy <- hr %>%

+ gather(key=drug, value=heartrate, -name)

> hrtidy

name drug heartrate

1 jon a 60

2 ann a 65

3 bill a 70

4 kate a 75

5 joe a 80

6 jon b 65

7 ann b 70

8 bill b 75

9 kate b 80

10 joe b 85

11 jon c 70

12 ann c 75

13 bill c 80

14 kate c 85

15 joe c 90

> # filter

> hrtidy %>% filter(drug=="a")

name drug heartrate

1 jon a 60

2 ann a 65

3 bill a 70

4 kate a 75

5 joe a 80

> hrtidy %>% filter(heartrate>=80)

name drug heartrate

1 joe a 80

2 kate b 80

3 joe b 85

4 bill c 80

5 kate c 85

6 joe c 90

> # analyze

> hrtidy %>%

+ group_by(drug) %>%

+ summarize(mean(heartrate))


drug mean(heartrate)

(chr) (dbl)

1 a 70

2 b 75

3 c 80

2.1 Manipulate Multi-tables

It’s rare that a data analysis involves only a single table of data. You normally have many tables that contribute to an analysis, and you need flexible tools to combine them. The dplyr package has several tools that let you work with multiple tables at once. The most straightforward one is the inner join. First, let’s read in some BMI data from the gapminder project. This is the average body mass index for men only for the years 1980-2008. If we look at the data, we’ll immediately see that it’s in wide format: we’ve got three variables (country, year, and BMI), but just like the heart rate data we looked at before, the year variable is stuck in the column headers while the BMI values are spread throughout the table.

|country | 1980| 1981| 1982| 1983| 1984|

|:-----------|----:|----:|----:|----:|----:|

|Afghanistan | 21.5| 21.5| 21.5| 21.4| 21.4|

|Albania | 25.2| 25.2| 25.3| 25.3| 25.3|

|Algeria | 22.3| 22.3| 22.4| 22.5| 22.6|

|Andorra | 25.7| 25.7| 25.7| 25.8| 25.8|

|Angola | 20.9| 20.9| 20.9| 20.9| 20.9|

There’s also another problem here. Let’s read in the data to see what happens. Let’s make it a tbl_df while we’re at it.

> bmi <- read.csv("malebmi.csv") %>% tbl_df()

Look at that. R puts an “X” in front of each year because, by default, read.csv() converts column names into syntactically valid R names, and a valid name can’t start with a number. You can see where we’re going with this, though. We want to “gather” this data into a long format and join it with our original gapminder data by year and by country, but to do this we’ll first need it in a tidy format, and the year variable can’t have those X’s in front. There’s a function called gsub() that takes three arguments: a pattern to replace, what you want to replace it with, and a character vector. For example:

> tmp <- c("id1", "id2", "id3")

> gsub("id", "", tmp)

[1] "1" "2" "3"

Notice that what we’re left with is a character vector instead of numbers. The last thing we’ll have to do is pass the result through as.numeric() to coerce it to a numeric vector:

> gsub("id", "", tmp) %>% as.numeric

[1] 1 2 3
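The same idea applied to the year columns can be tried without malebmi.csv. The sketch below feeds read.csv() a two-country excerpt of the BMI table as inline text (csv_text is a stand-in for the file) and strips the leading X with sub(), anchoring the pattern with ^ so that only a leading X is removed.

```r
# Inline excerpt of the BMI table above, standing in for malebmi.csv.
csv_text <- "country,1980,1981\nAfghanistan,21.5,21.5\nAlbania,25.2,25.2"

wide <- read.csv(text = csv_text)
wide_names <- names(wide)          # numeric headers get an "X" prefix
wide_names

# Remove only a leading "X", then convert to numbers
years <- as.numeric(sub("^X", "", wide_names[-1]))
years

# check.names = FALSE keeps the headers exactly as they appear in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))
raw_names
```

Keeping the raw names is an alternative, but such columns then have to be quoted with backticks in most R code, so cleaning them after gathering is often more convenient.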

So let’s tidy up our BMI data.

> library(dplyr)

> bmitidy <- bmi %>%

+ gather(key=year, value=bmi, -country) %>%

+ mutate(year=gsub("X", "", year) %>% as.numeric)

> #bmitidy

Now, look up some help with ?inner_join. There are many different kinds of two-table joins in dplyr. An inner join returns all rows from the first table that have matching rows in the second table, and all columns from both tables.

> gmbmi <- inner_join(GM_DATA, bmitidy, by=c("country", "year"))
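To see what “matching rows” means in practice, here is a sketch with two tiny hand-made tables (left_tbl and right_tbl, with invented population values). It also shows left_join() for contrast, as one of the other join types dplyr offers.

```r
library(dplyr)

# Two made-up tables sharing the key columns country and year.
left_tbl <- data.frame(
  country = c("A", "A", "B"),
  year    = c(1980, 1985, 1980),
  pop     = c(1, 2, 3),
  stringsAsFactors = FALSE
)
right_tbl <- data.frame(
  country = c("A", "B", "C"),
  year    = c(1980, 1980, 1980),
  bmi     = c(21.5, 25.2, 22.3),
  stringsAsFactors = FALSE
)

# inner_join(): only country/year pairs present in both tables survive
joined <- inner_join(left_tbl, right_tbl, by = c("country", "year"))
joined

# left_join(): every row of left_tbl is kept; unmatched rows get NA for bmi
kept <- left_join(left_tbl, right_tbl, by = c("country", "year"))
kept
```

Rows without a partner (A in 1985, and country C) are dropped by the inner join, which is worth keeping in mind when the joined table turns out smaller than expected.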

Now we can do all sorts of things that we were doing before, but now with BMI as well.

Figure 1: BMI per year (x-axis: year; y-axis: bmi; colored by continent: Africa, Americas, Asia, Europe, Oceania)

> GM_DATA %>%

+ mutate(gdp=gdpPercap*pop) %>%

+ filter(continent=="Asia") %>%

+ group_by(year) %>%

+ summarize(mean(lifeExp), mean(gdp))



year mean(lifeExp) mean(gdp)

(int) (dbl) (dbl)

1 1952 46.31439 34095762661

2 1957 49.31854 47267432088

3 1962 51.56322 60136869012

4 1967 54.66364 84648519224

5 1972 57.31927 124385747313

6 1977 59.61056 159802590186

7 1982 62.61794 194429049919

8 1987 64.85118 241784763369

9 1992 66.53721 307100497486

10 1997 68.02052 387597655323

11 2002 69.23388 458042336179

12 2007 70.72848 627513635079

3 Advanced Data Visualization with R

3.1 Data Visualization with ggplot2

This section covers fundamental concepts for creating effective data visualizations and introduces tools and techniques for visualizing large, high-dimensional data using R. We will review fundamental concepts for visually displaying quantitative information, such as using series of small multiples, avoiding “chart-junk,” and maximizing the data-ink ratio. After briefly reviewing data visualization using base R graphics, we will introduce the ggplot2 package for advanced high-dimensional visualization. We will cover the grammar of graphics (geoms, aesthetics, stats, and faceting) and use ggplot2 to create plots layer by layer. Upon completing this section, you will be able to use ggplot2 to explore a high-dimensional dataset by faceting and scaling scatter plots in small multiples.

This section additionally requires some experience with data manipulation using the dplyr functions discussed in the previous section. You can also learn the basics of dplyr by spending a few minutes with http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html.

ggplot2 is a widely used R package that extends R’s visualization capabilities. It takes the hassle out of things like creating legends, mapping other variables to scales like color, or faceting plots into small multiples. We’ll learn what all these things mean shortly. Specifically, ggplot2 allows you to build a plot layer by layer by specifying:

1. a geom, which specifies how the data are represented on the plot (points, lines, bars, etc.),

2. aesthetics that map variables in the data to axes on the plot or to plotting size, shape, color, etc.,

3. a stat, a statistical transformation or summary of the data applied prior to plotting,

4. facets, which allow the data to be divided into chunks on the basis of other categorical or continuous variables, with the same plot drawn for each chunk.
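The four ingredients listed above can be seen together in one short sketch. It uses R’s built-in mtcars dataset rather than the gapminder file so that it runs on its own; the choice of variables is arbitrary and only meant to show where each ingredient goes.

```r
library(ggplot2)

p_demo <- ggplot(mtcars, aes(x = wt, y = mpg)) +            # data and aesthetics
  geom_point(aes(color = factor(cyl))) +                    # geom: points
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) + # stat: fitted line
  facet_wrap(~ am)                                          # facets: one panel per am value

p_demo
```

Each `+` adds one layer or specification to the plot object, which is the layer-by-layer approach used throughout this section.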

> GM_DATA <- read.csv("gapminder.csv", header=TRUE)

> class(GM_DATA)

[1] "data.frame"

> #GM_DATA

The qplot() function is a quick and dirty way of making ggplot2 plots. You might see it if you look for help with ggplot2, and it’s even covered extensively in the ggplot2 book. And if you’re used to making plots with built-in base graphics, the qplot() function will probably feel more familiar. But the sooner you abandon the qplot() syntax, the sooner you’ll start to really understand ggplot2’s approach to building up plots layer by layer. So we’re not going to use it at all in this class.

(A.) Plotting bivariate data: continuous Y by continuous X:

The ggplot() function has two required arguments: the data used for creating the plot, and an aesthetic mapping that describes how variables in the data are mapped to things we can see on the plot.

First let’s load the package:

> library(ggplot2)

Now, let’s lay out the plot. If we want to plot a continuous Y variable by a continuous X variable, we’re probably most interested in a scatter plot. Here, we’re telling ggplot that we want to use the GM_DATA dataset, and the aesthetic mapping will map gdpPercap onto the x-axis and lifeExp onto the y-axis.

> #ggplot(GM_DATA, aes(x = gdpPercap, y = lifeExp))

Look at that, we get an error, and it’s pretty clear from the message what the problem is. (ggplot2 versions before 2.0.0 report that the plot has no layers; later versions draw an empty canvas instead.) We’ve laid out a two-dimensional plot specifying what goes on the x and y axes, but we haven’t told it what kind of geometric object to plot. The obvious choice here is a point. Check out http://docs.ggplot2.org/ to see what kinds of geoms are available.

> ggplot(GM_DATA, aes(x = gdpPercap, y = lifeExp)) + geom_point()

[Figure: scatter plot of lifeExp (y-axis) against gdpPercap (x-axis)]

Here, we’ve built our plot in layers. First, we create a canvas for the layers to come using the ggplot() function, specifying which data to use (here, the GM_DATA data frame) and an aesthetic mapping of gdpPercap to the x-axis and lifeExp to the y-axis. We next add a layer to the plot, specifying a geom, or a way of visually representing the aesthetic mapping.

Now, the typical workflow for building up a ggplot2 plot is to first construct the figure and save it to a variable (for example, p); as you’re experimenting, you can keep re-defining the p object as you develop “keeper commands”.

First, let’s construct the graphic. Notice that we don’t have to specify “x=” and “y=” if we give the arguments in the correct order (x first, y second).

> p <- ggplot(GM_DATA, aes(gdpPercap, lifeExp))

Now, if we tried to display p here alone we’d hit the same problem as before, because we don’t have any layers in the plot. Let’s experiment with adding points and a different scale to the x-axis.

> # Experiment with adding points

> p + geom_point()

> # Experiment with a different scale

> p + geom_point() + scale_x_log10()

I like the look of using a log scale for the x-axis. Let’s make that stick.

> p <- p + scale_x_log10()

Then re-plot again with a layer of points:

> p + geom_point()

Now notice what I’ve saved to p at this point: only the basic plot layout and the log10 scale on the x-axis. I didn’t save any layers yet because I want to fiddle around with the points for a bit first. Above, we set the aesthetic mappings for the x- and y-axes to gdpPercap and lifeExp, but we can also add aesthetic mappings to the geoms themselves. For instance, what if we wanted to color the points by the value of another variable in the dataset, say, continent?

> p + geom_point(aes(color=continent))

Notice the difference here. If I wanted the color to be some static value, I wouldn’t wrap it in a call to aes(); I would just specify it outright. The same goes for other features of the points. For example, let’s make all the points huge (size=8), blue (color="blue"), semitransparent (alpha=1/4) triangles (pch=17):

> p + geom_point(color="blue", pch=17, size=8, alpha=1/4)
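The difference between mapping an aesthetic and setting it can be inspected with layer_data(), which returns the values ggplot2 computed for a layer. The sketch below uses a four-row made-up data frame (toy) in place of GM_DATA.

```r
library(ggplot2)

# Made-up data: two groups of two points each.
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5),
                  grp = c("a", "a", "b", "b"))
p0 <- ggplot(toy, aes(x, y))

# Mapped: colour depends on a variable, so it goes inside aes()
p_mapped <- p0 + geom_point(aes(color = grp))

# Set: colour is a fixed value, so it goes outside aes()
p_fixed <- p0 + geom_point(color = "blue", size = 4)

# A common slip: a constant inside aes() is treated as data, not as a colour
p_slip <- p0 + geom_point(aes(color = "blue"))

mapped_cols <- unique(layer_data(p_mapped, 1)$colour)  # one colour per group
fixed_cols  <- unique(layer_data(p_fixed, 1)$colour)   # the literal "blue"
slip_cols   <- unique(layer_data(p_slip, 1)$colour)    # a default palette colour
```

In the last case ggplot2 treats "blue" as a one-level categorical variable and picks a colour from its default palette, which is why the points do not come out blue.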

Now, this time, let’s map the aesthetics of the points to features of the data. For instance, let’s give the points different colors and shapes according to the continent, and map the size of the points onto life expectancy:

> p + geom_point(aes(col=continent, pch=continent, size=lifeExp))

Now, this isn’t a great plot because several of the aesthetic mappings are redundant. Life expectancy is mapped to both the y-axis and the size of the points, so the size mapping is superfluous. Similarly, continent is mapped to both the color and the point character, so the point character is superfluous. Let’s get rid of those, but make the points a little bigger outside of an aesthetic mapping.

> p + geom_point(aes(col=continent), size=4)

Adding layers to the plot: Let’s add a fitted curve to the points.

> p <- ggplot(GM_DATA, aes(gdpPercap, lifeExp)) + scale_x_log10()

> p + geom_point() + geom_smooth()

By default, geom_smooth() fits a loess curve for data with n < 1000 and a generalized additive model (GAM) for data with n >= 1000. We can change that behavior by tweaking the parameters: here we use a thick red line, fit a linear model instead of a GAM, and turn off the standard error stripes.

> p + geom_point() + geom_smooth(lwd=2, se=FALSE, method="lm", col="red")
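Rather than relying on the default choice of smoother, the method can be stated explicitly. The sketch below does this on simulated data (toy, generated with a known slope of 2) so that the fitted line can be checked; the object names are illustrative.

```r
library(ggplot2)

# Simulated data: a straight line with slope 2 plus a little noise.
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- 2 * toy$x + rnorm(50)

p0 <- ggplot(toy, aes(x, y)) + geom_point()

# Explicit loess smoother with its standard error stripe
p_loess <- p0 + geom_smooth(method = "loess", formula = y ~ x)

# Explicit linear model, no standard error stripe
p_lm <- p0 + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The smoothed values drawn by the second layer of p_lm lie on a straight line
fit <- layer_data(p_lm, 2)
fitted_slope <- unname(coef(lm(y ~ x, data = fit))[2])
fitted_slope
```

Passing formula = y ~ x explicitly also avoids the informational message that newer ggplot2 versions print when the formula is left to the default.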

But let’s add back our aesthetic mapping to the continents. Notice what happens here. We’re mapping continent to the color of the points only, so geom_smooth() still fits a single curve to the entire dataset.

> p + geom_point(aes(color = continent)) + geom_smooth()

But notice what happens here: we make the call to aes() outside of the geom_point() call, and the continent variable gets mapped as an aesthetic for every geom. So here, we get a separate smoothing line for each continent. Let’s do it again, but remove the standard error stripes and make the lines a bit thicker.

> p + aes(color = continent) + geom_point() + geom_smooth()

> p + aes(color = continent) + geom_point() + geom_smooth(se=FALSE, lwd=2)
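Whether the smoother sees one group or several can be counted from the layer data. The sketch below uses a small made-up data frame (toy) with two groups and a linear smoother, chosen only to keep the example fast.

```r
library(ggplot2)

# Made-up data: two groups with opposite trends.
toy <- data.frame(
  x   = rep(1:5, 2),
  y   = c(1, 3, 2, 5, 4,  5, 3, 4, 1, 2),
  grp = rep(c("a", "b"), each = 5)
)
p0 <- ggplot(toy, aes(x, y))

# Colour mapped inside geom_point() only: the smoother uses all rows together
p_layer <- p0 + geom_point(aes(color = grp)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Colour mapped at the plot level: every layer, including the smoother, is grouped
p_global <- p0 + aes(color = grp) + geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

n_layer  <- length(unique(layer_data(p_layer, 2)$group))   # groups seen by the smoother
n_global <- length(unique(layer_data(p_global, 2)$group))
c(n_layer, n_global)
```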

Facets display subsets of the data in different panels. There are a couple of ways to do this, but facet_wrap() tries to sensibly wrap a series of facets into a 2-dimensional grid of small multiples. Just give it a formula specifying which variables to facet by. We can continue adding more layers, such as smoothing. If you have a look at the help with ?facet_wrap, you’ll see that we can control how the wrapping is laid out.

> p + geom_point() + facet_wrap(~continent)

> p + geom_point() + geom_smooth() + facet_wrap(~continent)

> p + geom_point() + geom_smooth() + facet_wrap(~continent, ncol=1)

There are a few ways to save ggplots. The quickest way, which works in an interactive session, is to use the ggsave() function. You give it a file name, and by default it saves the last plot that was printed to the screen.

> p + geom_point()

> ggsave(filename="myplot.png")

But if you’re running this through a script, the best way is to pass ggsave() the object containing the plot to be saved. We can also adjust things like the width, height, and resolution. ggsave() also recognizes the file extension and saves the appropriate kind of file. Let’s save a PDF.

> pfinal <- p + geom_point() + geom_smooth() + facet_wrap(~continent, ncol=1)

> ggsave(filename="myplot.pdf", plot=pfinal, width=5, height=15)
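For scripts, it can help to name the arguments and confirm that the file was actually written. The sketch below saves a plot of the built-in mtcars data to a temporary directory; the file name myplot_demo.pdf is arbitrary.

```r
library(ggplot2)

p_demo <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Write into R's temporary directory so the example leaves no files behind
out_file <- file.path(tempdir(), "myplot_demo.pdf")

# width and height are in inches by default
ggsave(filename = out_file, plot = p_demo, width = 5, height = 4)

saved_ok <- file.exists(out_file)
saved_ok
```

Naming plot= explicitly means the call does not depend on which plot happened to be printed last.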

Example: (1) Make a scatter plot of lifeExp on the y-axis against year on the x-axis. (2) Make a series of small multiples faceting on continent. (3) Add a fitted curve, smooth or lm, with and without facets. (4) Using geom_line() and an aesthetic mapping of country to group=, make a “spaghetti plot” showing semitransparent lines connected for each country, faceted by continent. Add a smoothed loess curve with a thick (lwd=3) line and no standard error stripe. Reduce the opacity (alpha=) of the individual black lines.

> p <- ggplot(GM_DATA, aes(year, lifeExp))

> p + geom_point()

> p + geom_point() + geom_smooth()

> p + geom_point() + geom_smooth() + facet_wrap(~continent)

> p + facet_wrap(~continent) + geom_line()

> p + facet_wrap(~continent) + geom_line(aes(group=country))

> p + facet_wrap(~continent) + geom_line(aes(group=country), alpha=.5) + geom_smooth(lwd=3, se=FALSE)

> p + geom_point(color="blue", pch=17, size=8, alpha=1/4)

[Figure: plot of lifeExp (y-axis) against year (x-axis)]

(B.) Plotting bivariate data: continuous Y by categorical X: With the last example we examined the relationship between a continuous Y variable and a continuous X variable. A scatter plot was the obvious kind of data visualization. But what if we wanted to visualize a continuous Y variable against a categorical X variable? We sort of saw what that looked like in the last exercise: year is a continuous variable, but in this dataset it’s recorded at 5-year intervals, so you could almost think of each year as a category. But a better example would be life expectancy against continent or country.

First, let’s set up the basic plot:

> p <- ggplot(GM_DATA, aes(continent, lifeExp))

Then add points:

> p + geom_point()

That’s not terribly useful. There’s a big overplotting problem. We can try to solve it with transparency:

> p + geom_point(alpha=1/4)

But that really only gets us so far. What if we spread things out by adding a little bit of horizontal noise (aka “jitter”) to the data?

> p + geom_jitter()

Note that the little bit of horizontal noise added by the jitter is random. If you run that command over and over again, each time it will look slightly different. The idea is to visualize the density at each vertical position, and spreading out the points horizontally allows you to do that. If there were still a lot of overplotting, you might think about adding some transparency by setting the alpha= value for the jitter.

> p + geom_jitter(alpha=1/2)
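If the plot needs to look the same on every run, the jitter can be given a fixed seed through position_jitter(). The sketch below uses a made-up data frame (toy); the width of 0.2 and the seed value 42 are arbitrary choices, and height = 0 restricts the noise to the horizontal direction.

```r
library(ggplot2)

# Made-up data: two categories, four distinct y values.
toy <- data.frame(grp = rep(c("a", "b"), each = 20), val = rep(1:4, 10))
p0 <- ggplot(toy, aes(grp, val))

# Horizontal noise of at most 0.2 either side, none vertically, fixed seed
jit <- position_jitter(width = 0.2, height = 0, seed = 42)
p_jit <- p0 + geom_point(position = jit, alpha = 1/2)

# Computing the layer twice gives the same jittered positions
x1 <- as.numeric(layer_data(p_jit, 1)$x)
x2 <- as.numeric(layer_data(p_jit, 1)$x)
y1 <- as.numeric(layer_data(p_jit, 1)$y)
identical(x1, x2)
```

By default geom_jitter() adds a small amount of noise in both directions, so setting height = 0 is worth considering whenever the y values themselves should stay exact.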

Probably a more common visualization is to show a box plot:

> p + geom_boxplot()

But why not show the summary and the raw data?

> p + geom_jitter() + geom_boxplot()

Notice how in that example we first added the jitter layer and then the boxplot layer, so the boxplot is drawn over the points. Let’s make the jitter layer go on top instead. Also, look again at the plot with just the boxplots: the outliers are represented as points, so once both layers are shown there’s no distinction between an outlier point from the boxplot geom and all the other points from the jitter geom. Let’s change that by coloring the boxplot outliers. Notice the British spelling of ‘outlier.colour‘.



> p + geom_boxplot(outlier.colour = "red") + geom_jitter(alpha=1/2)

There’s another geom that’s useful here, called a violin plot.

> p + geom_violin()

> p + geom_violin() + geom_jitter(alpha=1/2)

Let’s go back to our boxplot for a moment.


This plot would be a lot more effective if the continents were shown in some sort of order other than alphabetical. To do that, we’ll have to go back to our basic build of the plot again and use the ‘reorder‘ function in our original aesthetic mapping. Here, reorder takes the first variable, which is categorical, and orders its levels by the mean of the second variable, which is continuous. It looks like this:

> p <- ggplot(GM_DATA, aes(x=reorder(continent, lifeExp), y=lifeExp))
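The effect of ‘reorder‘ is easiest to see on the factor levels themselves. The sketch below uses a small made-up data set (the values are hypothetical, not taken from gapminder) and shows the alphabetical default, the ordering by mean, and one way to reverse it.

```r
# Hypothetical toy data: two observations per continent
continent <- factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe"))
lifeExp   <- c(70, 72, 50, 52, 60, 62)

levels(continent)                        # default order is alphabetical

# reorder() sorts the levels by a summary of the second argument;
# the summary function defaults to mean()
by_mean <- reorder(continent, lifeExp)
levels(by_mean)                          # lowest mean first

# Negating the values sorts from highest to lowest mean
by_mean_desc <- reorder(continent, -lifeExp)
levels(by_mean_desc)
```

ggplot2 draws a categorical axis in the order of the factor levels, which is why reordering inside ‘aes()‘ changes the order of the boxes.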


Example: (1) Make a jittered strip plot of GDP per capita against continent. (2) Make a box plot of GDP per capita against continent. (3) Using a log10 y-axis scale, overlay semitransparent jittered points on top of box plots, where outlying points are colored. (4) Try to reorder the continents on the x-axis by GDP per capita. Why isn’t this working as expected? See ‘?reorder‘ for clues.



> p <- ggplot(GM_DATA, aes(continent, gdpPercap))

> p + geom_jitter()


> p <- ggplot(GM_DATA, aes(reorder(continent, gdpPercap), gdpPercap))

> p <- p + scale_y_log10()

> p + geom_boxplot(outlier.colour="red") + geom_jitter(alpha=1/2)

> library(dplyr)

> GM_DATA %>% group_by(continent) %>% summarize(mean(gdpPercap))


  continent mean(gdpPercap)
     (fctr)           (dbl)
1    Africa        2193.755
2  Americas        7136.110
3      Asia        7902.150
4    Europe       14469.476
5   Oceania       18621.609

> GM_DATA %>% group_by(continent) %>% summarize(mean(log10(gdpPercap)))


  continent mean(log10(gdpPercap))
     (fctr)                  (dbl)
1    Africa               3.147894
2  Americas               3.741568
3      Asia               3.505493
4    Europe               4.058228
5   Oceania               4.246647

> p <- ggplot(GM_DATA, aes(reorder(continent, gdpPercap, FUN=function(x) mean(log10(x))), gdpPercap))

> p <- p + scale_y_log10()

> p + geom_boxplot(outlier.colour="red") + geom_jitter(alpha=1/2)
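One way to see why the default ordering can look wrong on a log axis: the mean of the raw values and the mean of their logarithms need not rank groups the same way. The toy data below are hypothetical and chosen so that one extreme value flips the ranking.

```r
# Hypothetical toy data: group "A" has one extreme value, group "B" is uniform
g <- factor(c("A", "A", "A", "B", "B", "B"))
x <- c(1, 1, 1000, 100, 100, 100)

tapply(x, g, mean)                            # raw means: A is larger than B
tapply(x, g, function(v) mean(log10(v)))      # means of log10: B is larger than A

raw_order <- levels(reorder(g, x))                                   # by raw mean
log_order <- levels(reorder(g, x, FUN = function(v) mean(log10(v)))) # by mean of log10
raw_order
log_order
```

The same thing happens with Asia and the Americas in the two summaries above, which is why the summary function passed to ‘reorder‘ should match the scale used on the axis.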

[Figure: gdpPercap (y-axis) by continent (x-axis): Africa, Americas, Asia, Europe, Oceania.]


(C.) Plotting univariate continuous data: What if we just wanted to visualize the distribution of a single continuous variable? A histogram is the usual go-to visualization. Here we only have one aesthetic mapping instead of two.

> p <- ggplot(GM_DATA, aes(lifeExp))

> p + geom_histogram()

When we do this, ggplot lets us know that it is automatically selecting the width of the bins, and that we might want to think about this a little further.

> p + geom_histogram(binwidth=5)

> p + geom_histogram(binwidth=1)

> p + geom_histogram(binwidth=.25)

[Figure: histogram of lifeExp (x-axis) with count on the y-axis.]
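The effect of the bin width can also be checked numerically. The following is a rough sketch using simulated values as a stand-in for lifeExp; the helper ‘count_bins‘ is a hypothetical function written for this illustration and does not reproduce ggplot2's exact binning rules.

```r
set.seed(1)
x <- rnorm(500, mean = 60, sd = 12)    # simulated stand-in for lifeExp

# Count observations in consecutive bins of a given width
count_bins <- function(x, binwidth) {
  lo <- floor(min(x) / binwidth) * binwidth      # first break at or below the minimum
  hi <- ceiling(max(x) / binwidth) * binwidth    # last break at or above the maximum
  breaks <- seq(lo, hi, by = binwidth)
  table(cut(x, breaks = breaks, include.lowest = TRUE))
}

wide   <- count_bins(x, binwidth = 5)
narrow <- count_bins(x, binwidth = 1)

length(wide); length(narrow)           # narrower bins give many more bars
max(wide); max(narrow)                 # and fewer observations per bar
```

Wide bins hide detail, while very narrow bins mostly show sampling noise, so it is worth trying a few widths before settling on one.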

Alternatively, we could plot a smoothed density curve instead of a histogram:

> p + geom_density()
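For intuition about what a smoothed density curve is, here is a small base-R sketch using ‘density()‘ on simulated values (a stand-in for lifeExp, not the gapminder data). It checks that the area under the curve is approximately 1 and that a larger bandwidth flattens the curve.

```r
set.seed(1)
x <- rnorm(500, mean = 60, sd = 12)    # simulated stand-in for lifeExp

d <- density(x)                        # Gaussian kernel, automatically chosen bandwidth
d$bw                                   # the bandwidth plays a role similar to binwidth

# Approximate the area under the curve with a Riemann sum on the evaluation grid
area <- sum(d$y) * diff(d$x[1:2])
area                                   # should be close to 1

# A larger bandwidth (here 3 times the default) gives a smoother, flatter curve
d_smooth <- density(x, adjust = 3)
max(d_smooth$y) < max(d$y)
```

‘geom_density()‘ also accepts an ‘adjust‘ argument, so the same experiment can be repeated on the plot itself.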

Back to histograms. What if we wanted to color this by continent?

> p + geom_histogram(aes(color=continent))

That’s not what we had in mind. That’s just the outline of the bars. We want to change the fill color of the bars.

> p + geom_histogram(aes(fill=continent))

Well, that’s not exactly what we want either. If you look at the help for ‘?geom_histogram‘ you’ll see that by default it stacks overlapping bars. This isn’t really an effective visualization. Let’s change the position argument.

> p + geom_histogram(aes(fill=continent), position="identity")

But the problem there is that the histograms are blocking each other. What if we tried transparency?



> p + geom_histogram(aes(fill=continent), position="identity", alpha=1/3)

That’s somewhat helpful, and might work for two distributions, but it gets cumbersome with 5. Let’s go back and try this with density plots, first changing the color of the line:

> p + geom_density(aes(color=continent))

Then by changing the color of the fill and setting the transparency to 25%:

> p + geom_density(aes(fill=continent), alpha=1/4)

(D.) Using themes: Let’s make a plot we made earlier (life expectancy versus the log of GDP per capita, with points colored by continent and lowess smooth curves overlaid without the standard error ribbon):

> p <- ggplot(GM_DATA, aes(gdpPercap, lifeExp))

> p <- p + scale_x_log10()

> p <- p + aes(col=continent) + geom_point() + geom_smooth(lwd=2, se=FALSE)

Give the plot a title and axis labels:

> p <- p + ggtitle("Life expectancy vs GDP by Continent")

> p <- p + xlab("GDP Per Capita (USD)") + ylab("Life Expectancy (years)")

> p

The "gray" theme is the default background.



> p + theme_gray()

[Figure: "Life expectancy vs GDP by Continent"; x-axis: GDP Per Capita (USD), log scale; y-axis: Life Expectancy (years); legend: continent (Africa, Americas, Asia, Europe, Oceania).]

We could also get a black and white background:

> p + theme_bw()

Or go a step further and remove the gridlines:

> p + theme_classic()

Finally, there’s another package, ggthemes, that gives us lots of different themes. Here we use devtools to install it directly from the source code on GitHub (a released version is also available from CRAN via ‘install.packages("ggthemes")‘).

> install.packages("devtools")

> devtools::install_github("jrnold/ggthemes")



> library(ggthemes)

> p + theme_excel()

> p + theme_excel() + scale_colour_excel()

> p + theme_gdocs() + scale_colour_gdocs()

> p + theme_stata() + scale_colour_stata()

> p + theme_wsj() + scale_colour_wsj()

> p + theme_economist()

> p + theme_fivethirtyeight()

> p + theme_tufte()

[Figure: themed version of "Life expectancy vs GDP by Continent"; x-axis: GDP Per Capita (USD), log scale; y-axis: Life Expectancy (years); legend: continent (Africa, Americas, Asia, Europe, Oceania).]


3.2 Data Visualization with Lattice Graphics

Another approach to visualizing the data is to use lattice graphics, an R package. The first step is to load the lattice package, either using the check box in the Packages tab or using the following command:

> library(lattice) # or require(lattice)

lattice plots make use of a formula interface:

plotname(y ~ x | z, data=dataname, groups=grouping.variable, ...)

1. x is the name of the variable that is plotted along the horizontal (x) axis.

2. y is the name of the variable that is plotted along the vertical (y) axis. (For some plots, this slot is empty because R computes these values from the values of x.)

3. z is a conditioning variable used to split the plot into multiple subplots called panels.

4. grouping.variable is used to display different groups differently (different colors or symbols, for example) within the same panel.

5. ..., means there are many additional arguments to these functions that let you control just how the plots look. (But we’ll focus on the basics for now.)
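As a concrete instance of the formula interface, the sketch below uses the built-in mtcars data set rather than the class data (an assumption made so that the example runs without any external file). It maps y, x, z and the grouping variable to the four roles listed above.

```r
library(lattice)

# mtcars ships with R: mpg = miles per gallon, wt = weight,
# cyl = number of cylinders, am = transmission (0 = automatic, 1 = manual)
p <- xyplot(mpg ~ wt | factor(cyl),   # y ~ x | z: one panel per number of cylinders
            data     = mtcars,
            groups   = am,            # different symbol/colour per transmission type
            auto.key = TRUE)          # add a legend for the groups

class(p)                              # a "trellis" object; type p or print(p) to draw it
```

The plot is stored as an object and only drawn when it is printed, which is why lattice calls inside functions or loops need an explicit ‘print()‘.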

> #install.packages("xtable",repo="http://cran.r-project.org", dep=TRUE)

> library("xtable")

> data_csv = read.csv("class_data.csv", header=TRUE)

> attach(data_csv) # so can use names directly

Below are the names of several lattice plots:

(a) histogram (for histograms).



> histogram(~siblings)

> histogram(~siblings|gender, nint=20) # Need to include ~, to make a 'formula'; nint sets the number of bins

> histogram(~siblings|gender + tongue,type='density')

[Figure: histogram of siblings (x-axis) with Percent of Total on the y-axis.]

(b) bwplot (for boxplots).



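> # par(mfrow=...) only affects base graphics, not lattice; each call below draws its own plot

> bwplot(~siblings|gender + tongue) # Need to include ~, to make a 'formula'

> densityplot(~siblings|gender) # kernel density estimate of siblings for each gender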

[Figure: lattice panels of siblings (x-axis, 0 to 10) by gender (F, M) and tongue (no, yes).]

(c) xyplot (for scatter plots).



> # par(mfrow=c(1,2)) has no effect on lattice plots; use print(plot, split=..., more=TRUE) to arrange them

> xyplot(footlength~tongue | gender)

> xyplot(footlength~tongue, groups=gender)

[Figure: xyplot of footlength (y-axis) against tongue (no, yes) in separate panels for gender F and M.]

(d) qqmath (for quantile-quantile plots).


> qqmath(~siblings | gender) # one-sided formula: sample quantiles of siblings against normal quantiles

[Figure: normal quantile-quantile plot of siblings (y-axis) against qnorm (x-axis) in separate panels for gender F and M.]

There are several ways to save plots, but the easiest is probably the following:

(1) In the Plots tab, click the ”Export” button.

(2) Copy the image to the clipboard using right click.

(3) Go to your Word document and paste in the image.

(4) Resize or reposition your image in Word as needed.
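Plots can also be saved from code, which is easier to repeat than the point-and-click route. The sketch below writes a lattice histogram of the built-in mtcars data to a PDF in the session's temporary directory; the file name is a hypothetical choice.

```r
library(lattice)

out_file <- file.path(tempdir(), "mpg_histogram.pdf")   # hypothetical file name

pdf(out_file, width = 6, height = 4)     # open a PDF graphics device
print(histogram(~ mpg, data = mtcars))   # lattice plots must be printed to be drawn
dev.off()                                # close the device so the file is written

file.exists(out_file)
```

Replacing ‘pdf()‘ with ‘png()‘ gives a bitmap image instead, which is usually what a Word document expects.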

4 Computing Principal Components

Let’s make this precise. Suppose we have a matrix of data, X. The goal is to find a matrix P such that the covariance matrix of PX is diagonal and the entries on the diagonal are in descending order. If we do this, PX will be our new data; the new variables will be linear combinations of the original variables whose weights are given by P. Because the covariance matrix is diagonal, we know that the covariance (and hence correlation) of any pair of distinct variables is zero, and the variance of each of our new variables is listed along the diagonal.

To see what we should choose for P, we’ll need a few results from linear algebra.

38


1. The covariance matrix of an m × n matrix X (m variables in rows, n observations in columns, with the mean of each row subtracted) is the m × m symmetric matrix given by $\frac{1}{n-1} X X^T$.

2. Any m × m symmetric matrix A has a set of m orthogonal eigenvectors $(v_1, \ldots, v_m)$ with associated eigenvalues $(\lambda_1, \ldots, \lambda_m)$, such that:

(a) For any i, $A v_i = \lambda_i v_i$

(b) $\|v_i\| = 1$

(c) $\langle v_i, v_j \rangle = 0$ for $i \neq j$

3. If A is an m × m symmetric matrix and E is the m × m matrix whose ith column is the ith eigenvector of A (where the eigenvectors are ordered by decreasing value of their associated eigenvalues), then there is a diagonal matrix D, with the eigenvalues on its diagonal, such that $A = E D E^T$.

4. If the rows of a square matrix E are orthonormal, then $E^{-1} = E^T$.
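These results can be checked numerically with base R's ‘eigen()‘. The 3 × 3 matrix below is an arbitrary symmetric example chosen for illustration; any symmetric matrix would do.

```r
# A small symmetric matrix (hypothetical values)
A <- matrix(c(4, 1, 0,
              1, 3, 1,
              0, 1, 2), nrow = 3, byrow = TRUE)

e <- eigen(A, symmetric = TRUE)   # eigenvalues are returned in decreasing order
E <- e$vectors                    # the i-th column is the i-th eigenvector
D <- diag(e$values)               # eigenvalues on the diagonal

# Result 2(a): A v = lambda v for the first eigenpair
check_2a <- all.equal(drop(A %*% E[, 1]), e$values[1] * E[, 1])
# Results 2(b) and 2(c): unit length and mutual orthogonality, i.e. E^T E = I
check_2bc <- all.equal(crossprod(E), diag(3))
# Result 3: A = E D E^T
check_3 <- all.equal(E %*% D %*% t(E), A)
# Result 4: the inverse of E is its transpose
check_4 <- all.equal(solve(E), t(E))

c(check_2a, check_2bc, check_3, check_4)
```

Each check compares two quantities up to floating-point tolerance, so small rounding differences do not count as failures.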

We now show how we can choose P so that the covariance matrix of PX is diagonal. Let $A = X X^T$. Let P be the matrix whose ith row is the ith eigenvector of A (in the notation of step 3 above, this means $P = E^T$). Let’s check that this gives us what we want:

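$$
\begin{aligned}
\operatorname{Cov}(PX) &= \frac{1}{n-1} (PX)(PX)^T \\
&= \frac{1}{n-1} P X X^T P^T \\
&= \frac{1}{n-1} P A P^T, \quad A = X X^T \\
&= \frac{1}{n-1} P \left(E D E^T\right) P^T, \quad \text{(step 3 above)} \\
&= \frac{1}{n-1} P \left(P^T D P\right) P^T, \quad P = E^T \\
&= \frac{1}{n-1} P P^T D P P^T, \quad \text{(steps 2 and 4)} \\
&= \frac{1}{n-1} D
\end{aligned}
$$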

Our goal was to find a matrix P such that the covariance matrix of PX is diagonal, and we’ve achieved that. Though it’s not immediately clear from the derivation, by ordering the rows of P in decreasing order of associated eigenvalue, our second goal of having the variances in decreasing order is also achieved. To conduct PCA, we follow these steps:

1. Get X in the proper form. This will probably mean subtracting off the means of each row. If the variances are significantly different in your data, you may also wish to scale each row by dividing by its standard deviation to give the rows a uniform variance of 1.

2. Calculate $A = X X^T$.



3. Find the eigenvectors of A and stack them to make the matrix P .

4. Your new data is PX, the new variables (principal components) are the rows of P .

5. The variance of each principal component can be read off the diagonal of the covariance matrix of PX.

Let’s set up and look at the iris dataset

> #library(rgl)

> data(iris)

> iris$Species <- factor(iris$Species,levels = c("versicolor","virginica","setosa"))

> names(iris)

[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

> round(cor(iris[,1:4]), 2) # look at how correlated the four variables are

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00

We’ll use princomp to do the PCA here. There are many alternative implementations of this technique. Here I choose to use the correlation matrix rather than the covariance matrix, and to generate scores for all the input data observations.



> pc <- princomp(iris[,1:4], cor=TRUE, scores=TRUE)

> summary(pc)

Importance of components:
                          Comp.1    Comp.2     Comp.3      Comp.4
Standard deviation     1.7083611 0.9560494 0.38308860 0.143926497
Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
Cumulative Proportion  0.7296245 0.9581321 0.99482129 1.000000000
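The five PCA steps listed earlier can be followed by hand on the same data. The sketch below is an illustration under the assumption that each variable is centred and scaled to unit variance, which corresponds to working with the correlation matrix; it is not how princomp is implemented internally. The standard deviations it produces should agree with the "Standard deviation" row reported by summary(pc).

```r
data(iris)

# Step 1: variables in rows, each centred and scaled to unit variance
X <- t(scale(iris[, 1:4]))
n <- ncol(X)                           # number of observations

# Step 2: A = X X^T (a 4 x 4 symmetric matrix)
A <- X %*% t(X)

# Step 3: eigenvectors of A, stacked as the rows of P
e <- eigen(A, symmetric = TRUE)        # ordered by decreasing eigenvalue
P <- t(e$vectors)

# Step 4: the new data
Y <- P %*% X

# Step 5: the covariance matrix of the new data is diagonal
C <- Y %*% t(Y) / (n - 1)
round(C, 6)
pc_sd <- sqrt(diag(C))                 # standard deviation of each component
pc_sd
```

The signs of the eigenvectors are arbitrary, so individual scores may be mirrored relative to the princomp scores, but the variances are unaffected.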

> biplot(pc)

[Figure: biplot of the 150 iris observations on Comp.1 (x-axis) and Comp.2 (y-axis), with loading arrows for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.]

Figure 2: Iris PCA plot.

> fit.df <- as.data.frame(pc$scores)

> fit.df$state <- rownames(fit.df)


> library(ggplot2)

> ggplot(data=fit.df, aes(x=Comp.1, y=Comp.2)) + geom_text(aes(label=state), size=3, hjust=0, vjust=0) # fixed size/hjust/vjust go outside aes()

> #ggbiplot(pc, labels = rownames(iris))

> #library(rgl)

> #plot3d(pc$scores[,1:3], col=iris$Species)

> #text3d(pc$scores[,1:3],texts=rownames(iris))

> #text3d(pc$loadings[,1:3], texts=rownames(pc$loadings), col="red")

> #coords <- NULL

> #for (i in 1:nrow(pc$loadings)) {

> # coords <- rbind(coords, rbind(c(0,0,0),pc$loadings[i,1:3]))

> #}

> #lines3d(coords, col="red", lwd=4)

[Figure: the 150 iris observations plotted as row-number labels on Comp.1 (x-axis) and Comp.2 (y-axis).]

Figure 3: ggplot2 version of the Iris PCA plot.

5 Tutorial

(1) Continue the PCA example in 3D visualization.

(2) Work on biological visualization: see "Visualization.R", improve the code, comment it, and report your observations.
