An Introduction to R for Epidemiologists using RStudio: the basics
Steve Mooney (much borrowed from C. DiMaggio)
Department of Epidemiology, Columbia University, New York, NY 10032
Introduction to R Concepts and Object Types
SER Summer 2014
Outline
1 getting our hands dirty
   calculating, assigning, combining
   from calculations to programming
2 how R thinks (vs. SAS and SPSS)
3 data
4 packages
5 help
6 objects
   about objects
   vector
   matrix & array
   list
   dataframe
S. Mooney (Columbia University) R intro 2014 2 / 56
getting our hands dirty calculating, assigning, combining
First steps: Use R as a calculator
math operators and functions
arithmetic + , - , * , /
power ^
convert 68 degrees Fahrenheit to Celsius ($C^\circ = \frac{5}{9}(F^\circ - 32)$)
5/9*(68-32)
First type it directly in the console. Then type it into the editor and send it to the console. (Remember how to do that?)
assignment operator: the ‘memory’ key
<-
y <- 5/9*(68-32) #assignment (no display)
y
(y <- 5/9*(68-32)) #assignment (display)
functions
FunctionName(parameter1, parameter2, ...)
math operators and functions
mathematical functions - sqrt, log, exp, sin, cos, tan
simple functions - max, min, length, sum, mean, var, sort
abs(-23) #absolute value
exp(8) # exponentiation
log(exp(8)) # natural logarithm
sqrt(64) # square root
concatenation function: combine or "vectorize"
c()
x <- c(100,90,80,70,60)
x
y <- c("a", "b", "c", "d")
y
Put it together: Vectorized computations
The calculation you tried with 68 (a scalar) also works with the vector you just created:
5/9*(x-32)
z<-5/9*(x-32)
z
getting our hands dirty from calculations to programming
write your own function: R is a programming language
my.function<-function(x){
5/9*(x-32)
}
my.function(68)
[1] 20
a<-c(134,156,222)
my.function(a)
[1] 56.66667 68.88889 105.55556
We’ll revisit creating functions if we have time...
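A function can also take several arguments and give some of them default values. The sketch below is illustrative only: the name bmi.from.imperial, its argument names, and the rounding default are hypothetical and are not part of the original slides.

```r
# Hypothetical helper: body mass index from weight in pounds
# and height in inches, rounded to `digits` decimal places.
bmi.from.imperial <- function(weight.lb, height.in, digits = 1) {
  bmi <- (weight.lb * 703) / height.in^2
  round(bmi, digits)   # the last evaluated expression is the return value
}

bmi.from.imperial(134, 60)                           # one person
bmi.from.imperial(c(134, 156, 222), c(60, 63, 72))   # vectors work too
bmi.from.imperial(134, 60, digits = 3)               # override the default
```

Because the arithmetic inside the function is vectorized, the same function handles a single value or a whole column of values without any loop.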
how R thinks (vs. SAS and SPSS)
A Quick Warning...
We’re going to shift gears into abstract territory for a little while.
I hope the material that follows will orient you as we learn more concrete pieces of R.
But I want to acknowledge that it gets away from the concrete; please don’t worry if you feel you’re not grasping all the details fully right now.
Programming and Analyzing: How to work with R
In my experience, data analysis is usually an iterative process:
1 Massage data (merge datasets, select the items you want to analyze, ensure measures are created properly, etc.)
2 Call some procedure to do an analytic step (e.g. look at a 2x2 table)
3 Interpret output & generate new questions (back to step 1 or 2)
In SAS (& SPSS(?)), data massage mostly happens in DATA steps and analysis mostly happens in PROC steps.
In R, there’s no formal separation between massage and analysis: we use similar functions for both.
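The three-step loop above might look like this in R. This is a minimal sketch on made-up data: the data frame study and all of its column names and values are hypothetical, chosen only to show that massage and analysis are both ordinary function calls.

```r
# Toy data (hypothetical, for illustration only)
study <- data.frame(
  id      = 1:6,
  age     = c(34, 56, 45, 23, 61, 38),
  exposed = c("yes", "no", "yes", "no", "yes", "no"),
  case    = c("yes", "no", "no", "no", "yes", "yes")
)

# Step 1: massage. Keep participants aged 30+ and derive a new variable.
adults <- subset(study, age >= 30)
adults$age.group <- ifelse(adults$age >= 50, "50+", "30-49")

# Step 2: analyze. A 2x2 table of exposure by case status.
tab <- table(adults$exposed, adults$case)
tab

# Step 3: interpret, then loop back, e.g. stratify by the new variable.
table(adults$exposed, adults$case, adults$age.group)
```

Note that subset(), ifelse(), and table() are all plain functions; nothing marks one as "data step" and another as "procedure".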
Programming and Analyzing: Getting abstract for a moment...
Most data massage and analysis procedures actually have similar components:
1 Three classes of thing you tell the statistical program:
   1 What type of operation to do
   2 How specifically to do it
   3 A dataset to do it on
2 Two types of thing happen when you run the code:
   1 Changes get made to the data
   2 Output or results are returned
Let’s look at how this plays out in SAS, SPSS, and R...
Programming and Analyzing: Function input (SAS example)
In SAS:
1 The type of thing to do is the DATA or PROC XYZ statement.
2 The dataset to use is specified with data=XYZ for a PROC step and set XYZ; for a data step.
3 And everything else is how specifically to do it.
For example, consider the following SAS statement:
PROC FREQ DATA=XYZ; table X*Y/missing; RUN;
FREQ is the type of thing.
XYZ is the dataset (and the X*Y specifies the columns).
/missing is how specifically to do the FREQ.
Programming and Analyzing: Function input (SPSS example)
In SPSS:
1 The type of thing to do is the statement type.
2 The dataset is implicit based on a previous DATA statement.
3 And everything else is how specifically to do it.
For example, consider the following SPSS statement:
crosstabs
/tables X by Y
/missing=report
crosstabs is the type of thing.
The current dataset is the dataset (and X by Y specifies the columns).
/missing=report indicates how specifically to do the crosstab.
Programming and Analyzing: Function input in R
In R:
1 The type of thing to do is the function name.
2 The dataset and how specifically to do it are both parameters to the function.
For example, consider the following R statement:
table(XYZ$X, XYZ$Y, useNA="ifany")
table is the type of thing.
XYZ$X and XYZ$Y are the data.
useNA="ifany" is how specifically to handle missing data (count missing values as their own category, like /missing in SAS).
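A minimal sketch of a table() call on a small made-up data frame, showing how the useNA argument of base R's table() controls whether missing values are counted. The data frame XYZ and its values are hypothetical stand-ins for the dataset named on the slide.

```r
# Hypothetical data frame standing in for the XYZ dataset
XYZ <- data.frame(
  X = c("a", "a", "b", "b", NA, "a"),
  Y = c("u", "v", "u", NA, "v", "u")
)

# Default: pairs with a missing X or Y are silently left out
tab.default <- table(XYZ$X, XYZ$Y)
tab.default

# useNA = "ifany" adds an <NA> row/column when missing values are present
tab.with.na <- table(XYZ$X, XYZ$Y, useNA = "ifany")
tab.with.na
```

Comparing sum(tab.default) with sum(tab.with.na) is a quick way to see how many records the default table dropped.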
Programming and Analyzing: Function output in R
I claimed that analytic steps have up to two kinds of effects:
1 Changes made to the data
2 Output or results returned
In SAS and SPSS:
1 There is an output window that displays the results of an analytic procedure.
2 Some procedures change data and others do not. The programmer knows which procedures modify the data.
In R:
1 An analytic function typically returns an object whose default display is the result of interest.
2 If the programmer wants data modified by the procedure, she or he usually works with the return value of the function in the next programming step.
Programming and Analyzing: Function output in R (continued)
Consider the R statement
table(XYZ$X, XYZ$Y, useNA="ifany")
This is a function that returns an object (a 2x2 matrix, in this case, assuming no values are missing) whose default display looks like a 2x2 table.
If you want a chi-square test on that 2x2 table, you can use the output from the table function as the input to the chisq.test function as follows:
chisq.test(table(XYZ$X, XYZ$Y, useNA="ifany"))
Using return values rather than side effects is characteristic of a functional programming model of language design.
Don’t worry: this may seem complicated or abstract, but it will become clearer after using R more.
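To make "working with the return value" concrete, here is a sketch that stores each result in a named object instead of nesting the calls. The counts in XYZ are made up for illustration; only table() and chisq.test() come from the slides.

```r
# Made-up counts: 50 exposed (30 cases) and 50 unexposed (15 cases)
XYZ <- data.frame(
  X = rep(c("exposed", "unexposed"), times = c(50, 50)),
  Y = c(rep(c("case", "control"), times = c(30, 20)),
        rep(c("case", "control"), times = c(15, 35)))
)

tab <- table(XYZ$X, XYZ$Y)   # the return value is stored, not printed
class(tab)                   # "table"
tab                          # typing the name shows the default display

result <- chisq.test(tab)    # the table is the input to the next function
class(result)                # "htest"
names(result)                # the pieces stored inside the result
result$p.value               # pull out just the p-value
```

Nothing in XYZ is changed by either call; everything the analysis produced lives in tab and result.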
data
the cbind() function: combining vectors into matrices
weight <- c(134, 156, 222)
height <- c(60, 63, 72)
bmi <- (weight*703)/height^2
cbind(weight, height, bmi)
     weight height      bmi
[1,]    134     60 26.16722
[2,]    156     63 27.63114
[3,]    222     72 30.10532
getting your own data into R: "there’s a function for that"
read.table() (or read.csv(), read.fwf()) is how you get data into base R
but RStudio’s Import Dataset can generate the code for you...
cars<-read.table(
"http://www.columbia.edu/~sjm2186/SER2014/cars.txt",
header=TRUE, stringsAsFactors=FALSE)
str(cars)
We will revisit this...
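If the URL above is not reachable, the same call can be tried on inline text. This is a sketch under stated assumptions: the column names and values below are invented and are not the contents of cars.txt; the text= argument of read.table() simply stands in for a file or URL.

```r
# Inline text stands in for a file or URL (made-up values)
raw <- "make mpg cyl
civic 32 4
mustang 19 8
camry 28 4"

demo.cars <- read.table(text = raw, header = TRUE, stringsAsFactors = FALSE)
str(demo.cars)     # make is character; mpg and cyl are read as integer
demo.cars$mpg      # a single column is accessed with $
```

header = TRUE tells R that the first line holds column names rather than data.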
packages
Packages contain code that enables extra functionality in R
Analogous to a SAS file containing several macros
install.packages("epitools")
library(epitools)
epitab(c(10, 20, 30, 40))
We will revisit these as well...
help
getting help
R has a lot of built-in help mechanisms...
help() opens help page
apropos() displays all objects matching topic
library(help=packageName) help on a specific package
vignette(package="packageName") lists the vignettes in a package
help(sample) ; ?sample ; ??sample
apropos("sam")
library(help=epitools)
vignette(package="utils")
vignette("Sweave")
...but web resources can be even more helpful:
tutorial: http://www.ats.ucla.edu/stat/r/
search: http://www.r-project.org/search.html
books: Venables, Aragon, etc.
Two major online communities:
R mailing list archive: http://r.789695.n4.nabble.com/
Stack Overflow: http://stackoverflow.com/questions/tagged/r
objects
5 important objects
objects are "specialized data structures"
1 vector - collection of like elements (numbers, characters...)
2 matrix - 2-dimensional vector
3 array - >2-dimensional vector
4 list - collection of elements of any kind (not necessarily alike)
5 dataframe - tabular data set, each row a record, each column a (like) element or variable
objects for epidemiologists
matrix for contingency, e.g. 2x2, tables
arrays for stratified tables
dataframe for observations and variables
factors for categorical variables
   numeric representation of characters
   read.table converts characters to factors (the default before R 4.0.0; since then, stringsAsFactors=TRUE must be requested)
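A short sketch of what "numeric representation of characters" means for a factor. The variable smoke and its categories are made up for illustration.

```r
# Hypothetical smoking-status variable
smoke <- c("never", "current", "former", "never", "current")

# Convert to a factor with an explicit ordering of the levels
smoke.f <- factor(smoke, levels = c("never", "former", "current"))

levels(smoke.f)       # the category labels, in the order given
as.integer(smoke.f)   # the integer codes stored underneath: 1 3 2 1 3
table(smoke.f)        # counts are reported in level order
```

Setting levels explicitly matters in practice: the first level is typically treated as the reference category in regression models.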
examples of R objects
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
y <- matrix(x, nrow = 2)
z <- array(x, dim = c(2, 3, 2))
mylist <- list(x, y, z)
names <- c("alice", "bob", "charlie")
gender <- c("girl", "boy", "boy")
age <- c(28, 22, 34)
race <- factor(c("Asian", "Asian", "Black"),
levels=c("Asian", "Black", "White"))
data<- data.frame(names, gender, age, race)
objects about objects
modes: atomic vs recursive objects
mode - type of data: numeric, character, logical (a factor is stored as integer codes, so its mode is numeric)
age <- c(34, 20); mode(age)
lt25 <- age<25
lt25
mode(lt25)
atomic - only one mode
   character, numeric, factor, or logical, i.e. vectors, matrices, arrays
   logical (1 for TRUE, 0 for FALSE)
   categorical - appear numeric but stored as factors
recursive - more than one mode
   lists, data frames, functions
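The distinctions above can be checked directly with mode(), class(), is.atomic(), and is.recursive(), all base R functions. The objects below are small made-up examples.

```r
age <- c(34, 20)
mode(age)             # "numeric"
mode(age < 25)        # "logical"
mode(c("a", "b"))     # "character"

f <- factor(c("M", "F"))
mode(f)               # "numeric": a factor is stored as integer codes
class(f)              # "factor": the class is what marks it as categorical

mixed <- list(age, "x", TRUE)
mode(mixed)           # "list"
is.atomic(age)        # TRUE
is.recursive(mixed)   # TRUE
```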
atomic objects: all elements the same
vector: one dimension
y <- c("Tom", "Dick", "Harry") ; y #character
x <- c(1, 2, 3, 4, 5) ; x #numeric
z <- x<3 ; z #logical
matrix: two-dimensional vector
x <- c("a", "b", "c", "d")
y <- matrix(x, 2, 2) ; y
array: n-dimensional vector
x <- 1:8
y <- array(x, dim=c(2, 2, 2)) ; y
recursive objects: elements can differ
list: collections of data
x <- c(1, 2, 3)
y <- c("Male", "Female", "Male")
z <- matrix(1:4, 2, 2)
xyz <- list(x, y, z)
dataframe: tabular (2-dimensional) list
subjno <- c(1, 2, 3, 4) ; age <- c(34, 56, 45, 23)
sex <- c("Male", "Male", "Female", "Male")
case <- c("Yes", "No", "No", "Yes")
mydat <- data.frame(subjno, age, sex, case) ; mydat
coercion: changing an object’s mode
this is important
R will automatically coerce all the elements in an atomic object to a single mode (character > numeric > logical)
c("hello", 4.56, FALSE)
c(4.56, FALSE)
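For reference, this sketch shows what the two calls above return and why logical-to-numeric coercion is useful. The object names mixed and num are hypothetical.

```r
# Mixed types in c(): everything is promoted to the most general mode
mixed <- c("hello", 4.56, FALSE)
mixed            # "hello" "4.56" "FALSE"
mode(mixed)      # "character"

num <- c(4.56, FALSE)
num              # 4.56 0.00
mode(num)        # "numeric"

# Logical-to-numeric coercion is what makes counting with sum() work
sum(c(TRUE, FALSE, TRUE))   # 2
```

A single stray character value in a numeric column is therefore enough to turn the whole column into character data.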
coercing objects: do it yourself
is.xxx / as.xxx - to assess / coerce objects
xxx = vector, matrix, array, list, data.frame, function, character, numeric, factor, na etc...
is.matrix(1:3) # false
as.matrix(1:3)
is.matrix(as.matrix(1:3)) # true
# coercing factor to character
sex <- factor(c("M", "M", "M", "M", "F", "F", "F", "F"))
sex
unclass(sex) #does not coerce into character
as.character(sex) #works
Review: basic characteristics of R objects
Objects - vector, matrix, array, list, dataframe
mode() - "type" of object: numeric, character, logical (factors have mode numeric)
vectors and matrices - atomic, one mode only
lists and data frames - recursive, can be of >1 mode
class() - for simple vectors, often the same as mode
   more complex objects, such as arrays and data frames, have their own class
   affects how objects are printed, plotted and otherwise handled
objects vector
vectors are 1-dimensional strings of like elements
the basic building block of data in R
use them for quick data entry
fun with vectors
y<-1:5 #create a vector of consecutive integers
y+2 #scalar addition
2*y #scalar multiplication
x<-c(1,3,2,10,5)
cumsum(x)
more fun with vectors: vectorized arithmetic
c(1,2,3,4)/2
c(1,2,3,4)/c(4,3,2,1)
log(c(0.1,1,10,100), 10)
c(1,2,3,4) + c(4,3)
c(1,2,3,4) + c(4,3,2)
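The last two lines rely on recycling: the shorter vector is repeated until it matches the longer one. A sketch of what happens in each case (the object names even and uneven are hypothetical):

```r
# Length 4 and length 2: the short vector is reused as 4, 3, 4, 3
even <- c(1, 2, 3, 4) + c(4, 3)
even      # 5 5 7 7

# Length 4 and length 3: recycled as 4, 3, 2, 4, and R issues a warning
# because 4 is not a multiple of 3. suppressWarnings() hides it here.
uneven <- suppressWarnings(c(1, 2, 3, 4) + c(4, 3, 2))
uneven    # 5 5 5 8
```

The warning in the second case is worth taking seriously: it usually means two vectors that should have matched in length did not.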
creating numerical vectors: sequences
the sequence operator :
-9:8
seq() greater flexibility
seq(1, 5, by = 0.5) # specify interval
seq(1, 5, length.out = 8) # specify length
operations on vectors
x <- rnorm(100)
sum(x)
x <- rep(2, 10)
cumsum(x)
mean(x)
sum(x)/length(x)
var(x) #sample variance
sd(x)
sqrt(var(x)) #sample standard deviation
x <- rnorm(100)
y <- rnorm(100)
var(x, y) # covariance
logical vectors: the special vector...
series of TRUEs and FALSEs (Ts and Fs)
created with relational operators:
<, >, <=, >=, ==, !=
used to index, select and subset data
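A brief sketch of a logical vector being used to index and subset. The age and sex vectors are made-up data, not from the slides.

```r
age <- c(34, 56, 45, 23, 61)
sex <- c("M", "F", "F", "M", "F")

older <- age >= 45          # FALSE TRUE TRUE FALSE TRUE
age[older]                  # keeps elements where older is TRUE: 56 45 61
sex[older]                  # the same logical vector subsets another vector
which(older)                # positions of the TRUE values: 2 3 5

sum(older)                  # number meeting the condition: 3
mean(older)                 # proportion meeting the condition: 0.6

age[older & sex == "F"]     # conditions combine with & (and), | (or)
```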
about logical vectors
logical operators are the key to indexing, and indexing is the key to manipulating data
< <= > >= == !=
x<-1:26
temp<- x > 13 #logical vector temp
#same length as vector x
#TRUE= 1, when condition met
#FALSE = 0, when not met
sum(temp)
We will revisit logical vectors when we discuss indexing (coming soon...)
objects matrix & array
a matrix is a 2-dimensional vector: 2x2 and contingency tables
Option 1: define the matrix from raw data:
myMatrix<-matrix(c("a","b","c","d"),2,2)
myMatrix
myMatrix2<-matrix(c("a","b","c","d"),2,2, byrow=T)
myMatrix2
colnames(myMatrix2)<-c("case", "control")
rownames(myMatrix2)<-c("exposed", "unexposed")
myMatrix2
cbind and rbind
Option 2: define the data by binding vectors together:
names<-c("Alice", "Bob", "Charlie")
ages<-c(6,7,8)
names
ages
cbind(names, ages)
rbind(names, ages)
caution: recycling
cbind and rbind - will recycle data
when performing vector or mixed vector and array arithmetic, short vectors are extended by recycling until they match the size of the other operands
R may issue a warning, but still complete the operation
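A sketch of recycling inside cbind(), using made-up vectors. The names ids, m1, and m2 are hypothetical.

```r
ids <- 1:4

# The 2-element vector is recycled silently because 4 is a multiple of 2
m1 <- cbind(ids, group = c("a", "b"))
m1

# 4 is not a multiple of 3: R warns but still returns a 4-row matrix.
# suppressWarnings() hides the warning so the result can be inspected.
m2 <- suppressWarnings(cbind(1:4, 1:3))
m2
```

In m1 the silent case is arguably the more dangerous one: nothing signals that group was shorter than ids. Note also that m1 is a character matrix, since a matrix is atomic and the numbers are coerced.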
creating a matrix from data: the more common scenario...
table() - from characters
titanic<-read.csv(
"http://www.columbia.edu/~sjm2186/SER2014/titanic.csv",
stringsAsFactors=F) #load titanic data
str(titanic) # Check the structure
table(titanic$sex,titanic$survived) # Make the matrix
an array is an n-dimensional vector: stratified epi tables
stratified titanic survival table:
sex vs. survival vs. passenger class
table(titanic$sex,titanic$survived, titanic$pclass)
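The same idea can be tried without downloading anything. This sketch uses the infert data set that ships with R as a stand-in for the titanic file; the choice of variables and the object name strat are assumptions made for illustration.

```r
data(infert)   # built-in case-control data set

# Three-way table: case status by induced abortions, stratified by education
strat <- table(infert$case, infert$induced, infert$education,
               dnn = c("case", "induced", "education"))

dim(strat)                      # three dimensions: rows, columns, strata
strat[, , 1]                    # the 2-dimensional table for the first stratum
apply(strat, 3, sum)            # number of records in each stratum
margin.table(strat, c(1, 2))    # collapse over strata: the crude table
```

Indexing with [, , k] pulls out one stratum as an ordinary matrix, which can then be passed to any function that expects a contingency table.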
objects list
a list is a recursive collection of unlike elements: like epi "variables" and "observations"
often used to ”store” function results
str() is your friend
also, see stackoverflow discussion
x <- 1:5 ; y <- matrix(c("a","c","b","d"), 2,2)
z <- c("Peter", "Paul", "Mary")
mm <- list(x, y, z)
mm
str(mm)
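A sketch of how list elements are retrieved, and of a function result that is itself a list. The element names and the numbers passed to t.test() are made up for illustration.

```r
# Same kind of list as above, but with named elements
mm <- list(ids     = 1:5,
           letters = matrix(c("a", "c", "b", "d"), 2, 2),
           singers = c("Peter", "Paul", "Mary"))

mm$singers        # by name
mm[["ids"]]       # double brackets return the element itself
mm[3]             # single brackets return a (shorter) list
class(mm[3])      # "list"
class(mm[[3]])    # "character"

# Many analytic functions return a list; str() and names() reveal the parts
fit <- t.test(c(5.1, 4.9, 6.2, 5.8, 6.0))
is.list(fit)      # TRUE
names(fit)
fit$conf.int      # extract one component
```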
objects dataframe
dataframes: tabular epi data sets
2-dimensional tabular lists with equal-length fields
   each row is a record or observation
   each column is a field or variable (usually numeric vectors or factors)
data(infert)
str(infert)
head(infert)
"a list that behaves like a matrix"
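Both sides of that description can be seen on the infert data loaded above. A minimal sketch; the particular rows and columns selected are arbitrary.

```r
data(infert)

# Matrix-like behaviour: dimensions and [row, column] indexing
dim(infert)
infert[1:3, c("age", "case")]

# List-like behaviour: each column is an element of a list
is.list(infert)                  # TRUE
head(infert$age)                 # access a column by name
identical(infert$age, infert[["age"]])

# Unlike a matrix, columns can have different classes
sapply(infert[, 1:3], class)
```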
creating data frames
1 data.frame()
x <- data.frame(id=1:2, sex=c("M","F"))
2 read.table(), read.csv(), read.delim(), read.fwf()
titanic<-read.csv(
"http://www.columbia.edu/~sjm2186/SER2014/titanic.csv",
stringsAsFactors=F) #load titanic data
str(titanic)
(caution: by default, R versions before 4.0.0 convert char → factor; whole-number columns are read as numeric → integer)
Exercises
You should now be able to complete exercises 1 and 2 in http://www.columbia.edu/~sjm2186/SER2014/Exercises.pdf