Introduction to R - staff.pubhealth.ku.dkstaff.pubhealth.ku.dk/~pd/tartu/pdf/Intro.pdf ·...
Using R Basics Data Manipulation
Introduction to R
Peter Dalgaard
Department of Biostatistics, University of Copenhagen
Statistical Practice in Epidemiology, Tartu 2006
Outline
Using R
Basics
Data Manipulation
What is R?
- R is an “environment for statistical computing and graphics”
  - Highly flexible graphics routines
  - Statistical functions (standard tests, modelling)
  - Controlled by a programming language
- In this course we use R exclusively
- The first practical is a workbook exercise designed to help you get started with R
- This lecture is intended to give you the broader picture
Basics of R
- What is R?
- Interacting with R
- Extended user interfaces
- Later: Dealing with R’s workspace
Key Points about R
- Environment built around the programming language R (an Open Source dialect of the S language).
- R is Free Software, and runs on a variety of platforms (I’ll be using Linux. Computer labs run on Windows.)
- Command-line execution based on function calls
- Extensible with user functions
- Workspace containing data and functions
- Graphics devices
Interacting with R
- Command line interface (CLI)
- The basic mode of interaction is “read – evaluate – print”
  - User types an expression at the command line,
  - R evaluates it
  - ... and prints the result
- Batch variation: read commands from a file
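The batch variation can be sketched with source(), which reads and evaluates commands from a file. The temporary file and the two commands in it are made up for illustration.

```r
# Write two commands to a temporary script file (hypothetical content)
script <- tempfile(fileext = ".R")
writeLines(c("y <- 2 + 2", "z <- log(10)"), script)

# Read and evaluate the file; echo = TRUE also shows each command as it runs
source(script, echo = TRUE)

# Objects created by the script are now in the workspace
y
```

Outside an interactive session, the same file could be run from the operating system's command line with Rscript or R CMD BATCH.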
Extended Interfaces
- Windows, Macintosh GUI: Fairly simple extensions of the CLI that mostly offload some tasks to a menu interface and add command recall
- Script editing: The ability to work with multiple lines of R code, save them to a file for later use, etc. A simple script editor is built into the R GUI in recent versions.
- External editor interfaces: TINN-R and R-WinEdt add syntax highlighting. Highly recommended.
- R embedded in a text editor (ESS – Emacs Speaks Statistics). Popular on Unix/Linux systems.
Demo 1
2+2
log(10)
help(log)
summary(airquality)
demo(graphics) # pretty pictures...
R packages
- An important new thing in R has been its handling of add-on packages
  - Standard format
  - Easy end-user handling
  - Quality control system (portability, version dependency)
- CRAN – Comprehensive R Archive Network, modelled on CTAN (TeX), CPAN (Perl). Kurt Hornik, Fritz Leisch, TU-Vienna
- Currently (2006) over 500 packages on CRAN.
Language
- R is a programming language – also on the command line
- Basic structure: Functions acting on objects
- (Functions are also a kind of object, operators a kind of function)
- Print an object by typing its name
- Evaluate an expression by entering it on the command line
- Call a function, giving the arguments in parentheses – possibly empty
- Notice ls vs. ls()
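A minimal sketch of the points above: printing versus calling a function, functions as objects, and operators as functions. The name my_mean is hypothetical and used only for illustration.

```r
# Typing a function's name without parentheses prints the function object
ls

# Adding parentheses calls it; here it lists the objects in the workspace
ls()

# Functions are objects: they can be assigned to a new name like any value
my_mean <- mean          # my_mean is a hypothetical name
my_mean(c(1, 2, 3))      # 2

# Operators are functions too: these two calls are equivalent
2 + 3
"+"(2, 3)
```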
Objects
- The basic object type is the vector
- Modes: numeric, integer, character, generic (list)
- Operations are vectorized: you can add entire vectors with a + b
- Recycling of objects: If the lengths don’t match, the shorter vector is reused
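Vectorization and recycling can be illustrated with a few small vectors. The values are arbitrary and chosen only to make the pattern visible.

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

# Vectorized arithmetic works element by element
a + b                # 11 22 33 44

# A length-1 vector is recycled to match the longer one
a * 2                # 2 4 6 8

# A length-2 vector is reused twice: 1+100, 2+200, 3+100, 4+200
short <- c(100, 200)
a + short            # 101 202 103 204
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.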
Demo 2
x <- round(rnorm(10,mean=20,sd=5)) # simulate data
x
mean(x)
m <- mean(x)
x - m # notice recycling
(x - m)^2
sum((x - m)^2)
sqrt(sum((x - m)^2)/9)
sd(x)
Smart indexing
- R has several unusual but highly useful indexing mechanisms:
  - a[5] single element
  - a[5:7] several elements
  - a[-6] all except the 6th
  - a[b>200] logical index
  - a["name"] by name
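The indexing forms above can be tried on a small named vector. The vectors a and b and their values are made up for illustration.

```r
a <- c(11, 12, 13, 14, 15, 16, 17, 18)
names(a) <- c("p", "q", "r", "s", "t", "u", "v", "w")
b <- c(100, 250, 150, 300, 50, 210, 90, 120)

a[5]         # single element: 15
a[5:7]       # several elements: 15 16 17
a[-6]        # all except the 6th
a[b > 200]   # elements of a where b exceeds 200: 12 14 16
a["r"]       # by name: 13
```

Because a has names, each result is printed with the names of the selected elements above the values.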
Lists
- Lists are vectors where the elements can have different types
- Functions often return lists
- lst <- list(A=rnorm(5), B="hello")
- Special indexing:
  - lst$A
  - lst[[1]] first element
  - lst[1] list containing the first element
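The difference between the indexing forms is easiest to see by checking the class of each result. The sketch below reuses lst from the slide; the t.test call is only one convenient example of a function that returns a list.

```r
lst <- list(A = rnorm(5), B = "hello")

lst$A            # the component named A (a numeric vector)
lst[["B"]]       # [[ ]] also accepts a name: "hello"
lst[[1]]         # the first element itself
lst[1]           # a list of length 1 containing the first element

class(lst[[1]])  # "numeric"
class(lst[1])    # "list"

# Functions often return lists; t.test is one example
res <- t.test(lst$A)
names(res)       # the components of the result
res$p.value      # extract one component
```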
Functions
- logit <- function(p) log(p/(1-p))
- logit(0.5)
- Formal arguments
- Actual arguments
- Positional matching: plot(x,y)
- Keyword matching: t.test(x ~ g, mu=2, alternative="less")
- Partial matching: t.test(x ~ g, mu=2, alt="l")
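A small sketch of argument matching. It uses a one-sample t.test on simulated data rather than the formula form on the slide, so that it runs without a grouping variable; the seed and the data are arbitrary.

```r
logit <- function(p) log(p / (1 - p))   # p is the formal argument
logit(0.5)                              # 0.5 is the actual argument; result is 0

set.seed(1)                 # arbitrary seed, only for reproducibility
x <- rnorm(20, mean = 2)    # simulated data, for illustration only

# Keyword matching with the full argument name
full <- t.test(x, mu = 2, alternative = "less")

# Partial matching: alt is expanded to alternative, "l" to "less"
part <- t.test(x, mu = 2, alt = "l")

full$p.value == part$p.value   # TRUE: both calls specify the same test
```

Partial matching saves typing at the command line, but full argument names are easier to read in saved scripts.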
Compound objects
- Attributes (dimensions, dimnames)
- Allows you to define complex data structures
- Matrices, arrays, tables
- Factors (categorical variables)
- Data frames
- Return values from tests, model fits
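As a rough illustration of how attributes turn plain vectors into compound objects, the sketch below builds a matrix, a factor, and a data frame by hand. All names and values are made up.

```r
# A matrix is a vector with a dim attribute
v <- 1:6
dim(v) <- c(2, 3)                       # now a 2 x 3 matrix
dimnames(v) <- list(c("r1", "r2"), c("a", "b", "c"))
attributes(v)                           # shows dim and dimnames

# A factor stores integer codes plus a levels attribute
f <- factor(c("low", "high", "low"))
levels(f)                               # "high" "low"
unclass(f)                              # the underlying integer codes

# A data frame is a list of equal-length columns
d <- data.frame(id = 1:3, group = f)
is.list(d)                              # TRUE
```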
Classes, generic functions
- R objects have classes (there are two different class systems, but ignore that for now)
- Functions can behave differently depending on the class of an object
- E.g. summary(x) or print(x) does different things if x is numeric, a factor, or a linear model fit
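A minimal sketch of class-dependent behaviour, using the older and simpler of the two class systems. The data are arbitrary, and the class name "temperature" with its print method is hypothetical.

```r
x <- c(2.1, 3.4, 5.0, 4.2)
f <- factor(c("a", "b", "a", "a"))
fit <- lm(x ~ f)          # a small linear model fit, for illustration only

class(x)       # "numeric"
class(f)       # "factor"
class(fit)     # "lm"

summary(x)     # quantiles and mean
summary(f)     # counts per level
summary(fit)   # coefficient table and fit statistics

# A print method for a new class: print() dispatches on the class attribute.
# The class name "temperature" is hypothetical.
print.temperature <- function(x, ...) {
  cat(format(unclass(x)), "degrees C\n")
  invisible(x)
}
temp <- structure(21.5, class = "temperature")
print(temp)    # uses print.temperature
```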
Data Manipulation Functions
- Constructors of simple objects
- Single-column modifications
- Modifying and subsetting data frames
Constructors
- R deals with many kinds of objects besides data sets
- Need to have ways of constructing them from the command line
- We have (briefly) seen the c and list functions
- Notice the naming forms c(boys=1.2, girls=1.1)
- Extracting and setting names with names(x)
- For matrices and arrays, use the (surprise) matrix and array functions. data.frame for data frames.
- It is also fairly common to construct a matrix from its columns using cbind
Demo 3
x <- c(boys = 1.2, girls = 1.1)
x
names(x)
names(x) <- c("M", "F")
x
matrix(1:4,ncol=2)
cbind(x=0:3,"exp(x)"=exp(0:3))
The factor Function
- This is typically used when read.table gets it wrong
  - E.g. group codes read as numeric
  - Or read as factors, but with levels in the wrong order (e.g. c("rare", "medium", "well-done") sorted alphabetically.)
- Notice the slightly confusing use of levels and labels arguments.
  - levels are the value codes on input
  - labels are the value codes on output (and become the levels of the resulting factor)
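Both repair cases can be sketched on small made-up vectors. The names doneness and codes, and the labels "control" and "treated", are hypothetical.

```r
# Hypothetical values, read in as plain character strings
doneness <- c("medium", "rare", "well-done", "rare")

# Default: levels are sorted alphabetically
levels(factor(doneness))    # "medium" "rare" "well-done"

# Give the levels explicitly to get the natural order
f <- factor(doneness, levels = c("rare", "medium", "well-done"))
levels(f)                   # "rare" "medium" "well-done"

# levels = codes found in the input, labels = names used in the result
codes <- c(1, 2, 2, 1)      # hypothetical numeric group codes
g <- factor(codes, levels = 1:2, labels = c("control", "treated"))
g
levels(g)                   # "control" "treated"
```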
Demo 4
aq <- airquality
aq$Month <- factor(aq$Month, levels=5:9,
                   labels=month.name[5:9])
aq$Month
levels(aq$Month) <- month.abb[5:9]
aq$Month
The cut Function
- The cut function converts a numerical variable into groups according to a set of break points
- Notice that the number of breaks is one more than the number of intervals
- Notice also that the intervals are left-open, right-closed by default (right=FALSE changes that)
- ... and that the lowest endpoint is not included by default (set include.lowest=TRUE if it bothers you)
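The endpoint rules are easiest to check on a handful of values that sit exactly on the break points. The ages below are made up for illustration and need no add-on package.

```r
age <- c(10, 11.5, 12, 14, 16)   # made-up ages, for illustration only
breaks <- seq(10, 16, 2)         # 4 break points give 3 intervals

# Default: intervals (10,12], (12,14], (14,16]; the value 10 falls outside
table(cut(age, breaks), useNA = "ifany")

# include.lowest = TRUE closes the first interval: [10,12]
table(cut(age, breaks, include.lowest = TRUE))

# right = FALSE gives [10,12), [12,14), [14,16); now 16 falls outside
table(cut(age, breaks, right = FALSE), useNA = "ifany")
```

Values that fall outside every interval become NA, which table drops unless useNA is set.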
Demo 5
library(ISwR); data(juul)
age <- subset(juul, age >= 10 & age <= 16)$age
range(age)
agegr <- cut(age, seq(10,16,2), right=FALSE,
             include.lowest=TRUE)
length(age)
table(agegr)
agegr2 <- cut(age, seq(10,16,2), right=FALSE)
table(agegr2)
Working with Dates
- Dates are usually read as character or factor variables
- Use the as.Date function to convert them to objects of class "Date"
- If data are not in the default format (YYYY-MM-DD) you need to supply a format specification

      > as.Date("11/3-1959",format="%d/%m-%Y")
      [1] "1959-03-11"

- You can calculate differences between Date objects. The result is an object of class "difftime", with a unit of days. You need as.numeric to get the actual number.
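A short sketch of date arithmetic. The first date is the one from the slide; the second date and the variable names are made up for illustration.

```r
# Convert character strings to Date objects
birth <- as.Date("11/3-1959", format = "%d/%m-%Y")
exam  <- as.Date("2006-06-15")     # hypothetical date, already in default format

d <- exam - birth    # difference between two dates
class(d)             # "difftime"
units(d)             # "days"

days <- as.numeric(d)    # the plain number of days
days / 365.25            # approximate age in years
```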