Lecture 2 (week 1): introduction to R - NYU Computer...

Computing with large data sets Richard Bonneau, spring 2009 Lecture 2 (week 1): introduction to R Thursday, January 22, 2009


Computing with large data sets
Richard Bonneau, spring 2009

Lecture 2 (week 1): introduction to R

Thursday, January 22, 2009


other notes, courses, lectures about R and S

v22.0480: computing with data, Richard Bonneau

Ingo Ruczinski and Rafael Irizarry (Johns Hopkins Biostat):

http://www.biostat.jhsph.edu/~bcaffo/statcomp/index.html
http://www.biostat.jhsph.edu/~ririzarr/Teaching/688/

Roger D. Peng (JHU):

http://www.biostat.jhsph.edu/~rpeng/

Read the manual !!:

http://cran.r-project.org/doc/manuals/R-intro.html


S history


S is a language and system for organizing, visualizing, and analyzing data.

S was started at Bell Labs in 1976.

The language has evolved through several major versions to become the most widely used environment for research in data analysis and statistics.

In 1998, S became the first statistical system to receive the Software System Award, the top software award from the ACM.

( For a great account of the early history of S, see the paper on the course website: http://www.research.att.com/areas/stat/doc/94.11.ps )


R history and facts

R is an environment for data analysis and visualization. R is an open source implementation of the S language (S-Plus is a commercial implementation of the S language). The current version of R (September 2004) is 1.9.1. The R Core group consists of Doug Bates, John Chambers, Peter Dalgaard, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Thomas Lumley, Martin Maechler, Guido Masarotto, Paul Murrell, Brian Ripley, Duncan Temple Lang, and Luke Tierney.

Join the R Foundation for Statistical Computing: http://www.r-project.org/

1991  Ross Ihaka and Robert Gentleman begin work on a project that will ultimately become R.
1992  Design and implementation of pre-R.
1993  The first announcement of R.
1995  R available by ftp under the GPL.
1996  A mailing list is started and maintained by Martin Maechler at ETH.
1997  The R core group is formed.
1999  DSC meeting in Vienna, the first time many R core members meet.
2000  R 1.0.0 is released.
2009  R is still very actively developed and available for all platforms, open source, pervasive in bioinformatics and several other fields.


playing around

> x <- 65.56
> y <- 4
> x <- 2.0
> y <- c(2,4,6)
> x * y
[1]  4  8 12
> x * 2
[1] 4
> y * y
[1]  4 16 36
> sqrt( -1 )
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
> sqrt(-1+0i)
[1] 0+1i
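The session above relies on R doing arithmetic element by element and recycling the shorter operand. A minimal sketch of that rule follows; the variable names are illustrative and not from the lecture.

```r
# Element-wise arithmetic with recycling of the shorter vector.
scalar <- 2.0
vec <- c(2, 4, 6)

scaled <- scalar * vec     # the scalar is recycled to length 3
squared <- vec * vec       # same-length vectors multiply pairwise

# Lengths that divide evenly recycle silently: c(1, 10) is reused three times.
recycled <- 1:6 * c(1, 10)

print(scaled)
print(squared)
print(recycled)
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign of a bug.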


playing around

> y <- 1:10
> y ^^ 2
Error: syntax error
> y ^ 2
 [1]   1   4   9  16  25  36  49  64  81 100
> y
 [1]  1  2  3  4  5  6  7  8  9 10
> y <- jitter( y )
> y
 [1]  1.003289  2.011200  2.965646  3.885774  4.909870  5.993501  6.907029  7.956502  8.902033
[10] 10.104556
> class( y )
[1] "numeric"
> class( x )
[1] "numeric"
> length( x )
[1] 1
> length( y )
[1] 10
> dim( y )
NULL
> dim( x )
NULL


playing around

> z <- matrix( sample(y), nrow = 5, ncol = 5)
> z
           [,1]         [,2]       [,3]         [,4]       [,5]
[1,]   90.42972    0.7650916   90.42972    0.7650916   90.42972
[2,] 9743.39636 6225.7870318 9743.39636 6225.7870318 9743.39636
[3,] 3973.01279  250.6684005 3973.01279  250.6684005 3973.01279
[4,]  560.98542 1253.9400470  560.98542 1253.9400470  560.98542
[5,] 2420.98800   15.4225674 2420.98800   15.4225674 2420.98800
> dim(z)
[1] 5 5
> length( z )
[1] 25
> summary( y )
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
   0.7651  130.5000  907.5000 2454.0000 3585.0000 9743.0000


playing around

> hist( y )
> hist( y, nclass = 20 )
> hist( y )
> pdf("hist.l2.pdf")
> hist( y )
> dev.off()
quartz
     2
> x <- 1:20
> y <- runif( length( x ) )
> plot( x, y )
> abline(h=0.5, lty=2, col="green",lwd=2)
> pdf("sample-sesion.pdf")
> plot( x, y )
> abline(h=0.5, lty=2, col="green",lwd=2)
> dev.off()
quartz
     2

[Figure: "Histogram of y" (x-axis: y, y-axis: Frequency) and a scatter plot of y against x]
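The pdf() / dev.off() pair in this session redirects plotting to a file. A small sketch of the same pattern, assuming a temporary file path so that it can run anywhere (the path is not from the lecture):

```r
# Write a plot to a PDF file instead of the screen.
# The output path is a temporary file chosen for this sketch.
out_file <- tempfile(fileext = ".pdf")

x <- 1:20
y <- runif(length(x))

pdf(out_file)                       # open the PDF device
plot(x, y)
abline(h = 0.5, lty = 2, col = "green", lwd = 2)
invisible(dev.off())                # close the device so the file is written

file.exists(out_file)
```

Nothing reaches the file until dev.off() is called, so a forgotten dev.off() is a common reason for an empty or unreadable PDF.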


using built in examples

> ? heatmap   ### then cut and paste in examples
> require(graphics); require(grDevices)
> x <- as.matrix(mtcars)
> rc <- rainbow(nrow(x), start=0, end=.3)
> cc <- rainbow(ncol(x), start=0, end=.3)
> hv <- heatmap(x, col = cm.colors(256), scale="column",
+               RowSideColors = rc, ColSideColors = cc, margins=c(5,10),
+               xlab = "specification variables", ylab= "Car Models",
+               main = "heatmap(<Mtcars data>, ..., scale = \"column\")")

## mtcars is a data structure provided as an example
## of how to use heatmap()

> str( mtcars )
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17.0 18.6 19.4 17.0 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
> ?mtcars   ## for a description of what it actually is

[Figure: heatmap(<Mtcars data>, ..., scale = "column"); columns: specification variables (cyl, am, vs, carb, wt, drat, gear, qsec, mpg, hp, disp); rows: Car Models]


dumping functions : example code galore

> hist   ### type the function name with no "( )" or arguments
function (x, ...)
UseMethod("hist")
<environment: namespace:graphics>

You get info, but not code, if the function is a generic like hist (only the UseMethod dispatch call is shown) or is implemented inside the R core.

> heatmap   ### for higher-level functions
            ### or user-defined functions
            ### you get the code

function (x, Rowv = NULL, Colv = if (symm) "Rowv" else NULL,
    distfun = dist, hclustfun = hclust,
    reorderfun = function(d, w) reorder(d, w),
    add.expr, symm = FALSE, revC = identical(Colv, "Rowv"),
    scale = c("row", "column", "none"), na.rm = TRUE, margins = c(5, 5),
    ColSideColors, RowSideColors, cexRow = 0.2 + 1/log10(nr),
    cexCol = 0.2 + 1/log10(nc), labRow = NULL, labCol = NULL,
    main = NULL, xlab = NULL, ylab = NULL, keep.dendro = FALSE,
    verbose = getOption("verbose"), ...)
{
    scale <- if (symm && missing(scale))
        "none"
    else match.arg(scale)
    if (length(di <- dim(x)) != 2 || !is.numeric(x))
        stop("'x' must be a numeric matrix")
    ...truncated
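Printing hist shows only a UseMethod() call because hist is a generic that dispatches to class-specific methods. One way to reach the code that does the work is sketched below, using the standard methods() and getS3method() functions; hist_default is just a local name chosen here.

```r
# hist is an S3 generic, so printing it shows only the UseMethod() dispatcher.
# The working code lives in its methods.
print(methods(hist))                       # list the methods registered for hist

hist_default <- getS3method("hist", "default")
head(deparse(hist_default), 5)             # first few lines of the method source
```

Typing graphics:::hist.default at the prompt prints the same function in full.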



R basic types / atomic classes of objects

> x <- character()   ### char, strings, vectors or both
> x <- "test1"
> x
[1] "test1"
> x[1] <- "test1"
> x[2] <- "test1"
> x[3] <- "test2"
> x
[1] "test1" "test1" "test2"
> class(x)
[1] "character"

> x <- numeric()   ### double floats, vectors of floats
> x <- complex()   ### complex numbers
> x
complex(0)
> x <- logical()   ## logicals, can be used
                   ## to index other objects
> x <- 12
> class(x)
[1] "numeric"
> x <- 12L   ### force integer
> class(x)
[1] "integer"

> x <- Inf   ## infinity
> x
[1] Inf
> x <- NA   ### missing values are NA or NaN
> x
[1] NA
> is.na( x )   ### built in functions help in dealing
               ### with NAs
[1] TRUE
> x <- logical()
> x
logical(0)
> x <- NaN
> is.na( x )
[1] TRUE


NA, NaN, empty/missing values

Values can be missing for lots of good reasons.

Technical:
- the measurement failed (it was cloudy that night, the probe for that DNA was synthesized incorrectly)

Budgetary/Social:
- we could only afford to measure so many points / attributes
- people will only answer 15 minutes of questions...

Bugs (incorrect explicit type coercion)

Values not filled in YET

see also: is.nan(), is.null(), as.null()

> x <- Inf   ## infinity
> x
[1] Inf      ### this IS a number

> x <- NA   ### missing values are NA or NaN
> x
[1] NA
> is.na( x )   ### built in functions help in dealing
               ### with NAs
[1] TRUE

> ### messed up explicit coercion
> x <- c( "f" , "fg" )
> as.numeric ( x )
[1] NA NA
Warning message:
NAs introduced by coercion
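To make the difference between NA, NaN and Inf concrete, here is a short sketch with a made-up vector; it uses only the built-in predicates named on this slide plus is.finite() and the na.rm argument of mean().

```r
# NA marks a missing value; NaN marks an undefined numeric result.
values <- c(1.5, NA, 3.0, NaN, Inf)

is.na(values)      # TRUE for both NA and NaN
is.nan(values)     # TRUE only for NaN
is.finite(values)  # FALSE for NA, NaN and Inf

mean(values[1:3])                 # NA: one missing value poisons the mean
mean(values[1:3], na.rm = TRUE)   # drops the NA first
```

Most summary functions (mean, sum, max, ...) take na.rm, so missing values can be handled at the point of use rather than by editing the data.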


R basic types: vectors

Integers
> x <- 1:12
> class( x)
[1] "integer"
> x <- c(1L, 2L, 3L)
> x
[1] 1 2 3

Numeric
> x <- c(1, 2, 3.2)
> x
[1] 1.0 2.0 3.2

Logical
> x <- c( TRUE, TRUE, FALSE)
> x
[1]  TRUE  TRUE FALSE

Logical from conditional statement
> x <- c("azure", "red", "green", "red")
> x
[1] "azure" "red"   "green" "red"
> x == "azure"
[1]  TRUE FALSE FALSE FALSE

> x <- c(1, 2, 3.2)
> x
[1] 1.0 2.0 3.2
> x < 2.1
[1]  TRUE  TRUE FALSE

Integer indexes from conditionals
> which ( x < 2.1 )
[1] 1 2
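Both the logical vector and the integer positions from which() can be used as an index. A brief sketch, reusing the colour example with illustrative variable names:

```r
colors <- c("azure", "red", "green", "red")
is_red <- colors == "red"        # logical vector, one entry per element
red_positions <- which(is_red)   # integer positions of the TRUE entries

colors[is_red]           # index with the logical vector
colors[red_positions]    # index with the integer positions; same result
sum(is_red)              # TRUE counts as 1, so this counts the matches
```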


R basic types: vectors

> x <- numeric( 10 )   ## a length 10 numeric vector
> x   ## short for print(x)
 [1] 0 0 0 0 0 0 0 0 0 0

> x <- character( 10 )
> x
 [1] "" "" "" "" "" "" "" "" "" ""
> x[ length(x) + 1] <- "a"
> x
 [1] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "a"
> x <- c(x, "b")
> x
 [1] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "a" "b"
> ### attributes
> length ( x )
[1] 12
> names( x )
NULL
> str( x )
 chr [1:12] "" "" "" "" "" "" "" "" "" "" "a" "b"

> x <- 1:5   ## loading attributes
> names( x ) <- c("one", "two", "three", "four", "five")
> x
  one   two three  four  five
    1     2     3     4     5
> names( x )
[1] "one"   "two"   "three" "four"  "five"
> class( x )
[1] "integer"


creative ways of making nasty bugs

> ## you can, but shouldn't do nutz stuff like this
> x <- c( 1, "two" )
> x
[1] "1"   "two"
> class(x )
[1] "character"
> y <- c(1,0,TRUE, FALSE)
> y
[1] 1 0 1 0
> class( y )
[1] "numeric"
> y <- c( "true", TRUE, FALSE)   ## nuts!
> y
[1] "true"  "TRUE"  "FALSE"
> class( y )
[1] "character"
> ## creative ways of writing nasty nasty bugs

R variables, vectors and matrices take the type first specified OR loaded.

Assigning a different type later in the code silently overrides this initial type.

For example:

> x <- 1:10
> x <- example.function( x )   ## function returns a character vec
> x <- length( x ) * pi
> x <- FALSE

x has been 4 types in 4 lines of code
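The mixed-type vectors above follow a fixed rule: c() converts everything to the most general type present. The sketch below shows that rule and then records the class of x after each reassignment; example_function is a hypothetical stand-in for the example.function() on the slide, which is not defined in the lecture.

```r
# c() coerces its arguments to the most general type present:
# logical -> integer -> double -> character.
class(c(TRUE, 1L))        # "integer"
class(c(1L, 2.5))         # "numeric"
class(c(2.5, "two"))      # "character"

# Hypothetical stand-in for the example.function() mentioned on the slide.
example_function <- function(v) as.character(v)

x <- 1:10
types <- class(x)
x <- example_function(x)
types <- c(types, class(x))
x <- length(x) * pi
types <- c(types, class(x))
x <- FALSE
types <- c(types, class(x))
types
```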


factors ...

Making a factor vector

> youare <- as.factor ( c("M", "F", "F", "U" ) )
> youare
[1] M F F U
Levels: F M U

> youare <- rep( 1, 10)
> youare
 [1] 1 1 1 1 1 1 1 1 1 1
> ?runif
> y <- runif( 10 )
> youare[ y > 0.5 ] <- "big"
> youare[ y <= 0.5 ] <- "small"
> youare
 [1] "big"   "big"   "big"   "big"   "big"   "small" "big"   "small" "big"   "big"
> as.factor(youare)
 [1] big   big   big   big   big   small big   small big   big
Levels: big small

Factors are integers with a label, but the label is stored much more efficiently (once for the whole vector of factors).

Using factors is better in that they have meaningful attributes ... why say 1, 2, 3 as integers when you can say "male", "female", "undetermined"?

Many functions (for example, functions that aim to classify instances based on vectors of mixed attributes) use factors.
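The "integers with a label" description can be checked directly. In this sketch the variable names and the long labels are assumptions made for illustration:

```r
sex <- factor(c("M", "F", "F", "U"))

levels(sex)        # the labels, stored once: "F" "M" "U"
as.integer(sex)    # the integer codes: 2 1 1 3
table(sex)         # counts per level

# Explicit level order and readable labels (the labels are an assumption here).
sex_labeled <- factor(c("M", "F", "F", "U"),
                      levels = c("M", "F", "U"),
                      labels = c("male", "female", "undetermined"))
as.character(sex_labeled)
```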


forcing type conversions, explicit coercion

> ### explicit coercion --- forcing the type
> x <- character( "1", "2", "3", "4", "0", "0" )
Error in character("1", "2", "3", "4", "0", "0") :
  unused argument(s) ("2", "3", "4", "0", "0")
> x <- c( "1", "2", "3", "4", "0", "0" )
> class ( x )
[1] "character"
> x
[1] "1" "2" "3" "4" "0" "0"
> x <- as.numeric( x )
> x
[1] 1 2 3 4 0 0
> str( x)
 num [1:6] 1 2 3 4 0 0
> as.logical( x )
[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
> as.complex( x )
[1] 1+0i 2+0i 3+0i 4+0i 0+0i 0+0i
> as.integer( x )
[1] 1 2 3 4 0 0

* Remember: coercing a value to the type you expect is often a good way of checking that you have read in OR computed what you think you have. For example, coercing a character vector to numeric produces NAs wherever the text is not a number, and those NAs lead you to the bug.

So declaring and coercing types is a good idea even if R doesn't strictly require it.
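A sketch of using coercion as a check: convert, then look for the NAs that coercion introduced. The input vector is invented for the example.

```r
# Hypothetical raw input, as it might arrive from a text file.
raw <- c("1", "2", "three", "4", "", "0")

# as.numeric() yields NA for anything that is not a number.
parsed <- suppressWarnings(as.numeric(raw))

bad <- which(is.na(parsed) & !is.na(raw))   # entries that failed to convert
raw[bad]
```

The warning is suppressed here only because the NAs are inspected explicitly on the next line; in interactive work the warning itself is a useful signal.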


coercion of matrix objects

> x <- c(1,2,3,4,0,0)
> x
[1] 1 2 3 4 0 0
> matrix( x, ncol = 2, nrow = 2 )
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> matrix( x, ncol = 2, nrow = 3 )
     [,1] [,2]
[1,]    1    4
[2,]    2    0
[3,]    3    0
> matrix( x, ncol = 2, nrow = 4 )
     [,1] [,2]
[1,]    1    0
[2,]    2    0
[3,]    3    1
[4,]    4    2
Warning message:
In matrix(x, ncol = 2, nrow = 4) :
  data length [6] is not a sub-multiple or multiple of the number of rows [4]
> ### but it still did it !!!!! is this a feature or a bug waiting to happen?

> x <- c(1,2,3,4,0,0)
> dim(x) <- c(3,2)
> x
     [,1] [,2]
[1,]    1    4
[2,]    2    0
[3,]    3    0
> ### but the dim has to match the length?
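If silent truncation or recycling in matrix() is a worry, one option is a small wrapper that insists on an exact fit. strict_matrix below is a hypothetical helper written for this sketch, not part of R.

```r
# Hypothetical helper: refuse to build a matrix unless the data fill it exactly.
strict_matrix <- function(data, nrow, ncol) {
  if (length(data) != nrow * ncol) {
    stop("data length ", length(data), " does not fill a ",
         nrow, " x ", ncol, " matrix")
  }
  matrix(data, nrow = nrow, ncol = ncol)
}

x <- c(1, 2, 3, 4, 0, 0)
ok <- strict_matrix(x, nrow = 3, ncol = 2)

# A mismatch is now an error rather than a warning plus recycled values.
failed <- tryCatch(strict_matrix(x, nrow = 4, ncol = 2),
                   error = function(e) conditionMessage(e))
ok
failed
```

This gives matrix() the same strictness that dim(x) <- ... already has.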


matrix names

> x <- c(NA, NA, 1)
> x
[1] NA NA  1
> is.na(x)
[1]  TRUE  TRUE FALSE
>
> y <- matrix( x, ncol = 2, nrow = 3 )
>
> dim(y )
[1] 3 2
> y
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
[3,]    1    1
> y[ is.na(y) ] <- 0.01641428
> y
           [,1]       [,2]
[1,] 0.01641428 0.01641428
[2,] 0.01641428 0.01641428
[3,] 1.00000000 1.00000000

> y[1,2] <- 5.67
> rownames( y ) <- c( "eq", "er", "es")
> colnames( y ) <- c("qr", "rq" )
> dimnames( y )
[[1]]
[1] "eq" "er" "es"

[[2]]
[1] "qr" "rq"

> y
           qr         rq
eq 0.01641428 5.67000000
er 0.01641428 0.01641428
es 1.00000000 1.00000000
>


matrices

> ## matrices are filled starting in the upper left corner and then running down
> ## the column. The first index is the row, and the second is the column.
>
> y <- 1:10
> dim(y) <- c(2,5)
> y
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> dim(y) <- c(5,2)
> y
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
> dim(y) <- c(5,5)   ### oops?
Error in dim(y) <- c(5, 5) :
  dims [product 25] do not match the length of object [10]
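The column-first fill order can be switched with the byrow argument of matrix(). A short sketch comparing the two layouts (variable names are illustrative):

```r
by_column <- matrix(1:6, nrow = 2)                 # default: fill down columns
by_row    <- matrix(1:6, nrow = 2, byrow = TRUE)   # fill across rows instead

by_column[1, 2]   # row 1, column 2 -> 3
by_row[1, 2]      # row 1, column 2 -> 2

# t() swaps rows and columns, which relates the two layouts.
identical(by_row, t(matrix(1:6, nrow = 3)))
```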


rbind, cbind

> x <- 1:10
> y <- 10:1
> z <- c(1:5, 5:1)
> xyz <- rbind( x,y,z )
> xyz
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    5    6    7    8    9    10
y   10    9    8    7    6    5    4    3    2     1
z    1    2    3    4    5    5    4    3    2     1
> xyz <- cbind( x,y,z )
> xyz
       x  y z
 [1,]  1 10 1
 [2,]  2  9 2
 [3,]  3  8 3
 [4,]  4  7 4
 [5,]  5  6 5
 [6,]  6  5 5
 [7,]  7  4 4
 [8,]  8  3 3
 [9,]  9  2 2
[10,] 10  1 1
>

> ## adding to a matrix one row at a time
> xyz <-rbind( xyz, c( 3,4,5) )
> xyz
       x  y z
 [1,]  1 10 1
 [2,]  2  9 2
 [3,]  3  8 3
 [4,]  4  7 4
 [5,]  5  6 5
 [6,]  6  5 5
 [7,]  7  4 4
 [8,]  8  3 3
 [9,]  9  2 2
[10,] 10  1 1
[11,]  3  4 5
> ## could do a similar thing with cbind()
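The cbind() counterpart mentioned in the last comment might look like this; the data are the same x, y and z as above.

```r
x <- 1:10
y <- 10:1
xy <- cbind(x, y)          # 10 x 2 matrix with column names "x" and "y"

# Add a third column; naming the argument names the column.
xyz <- cbind(xy, z = c(1:5, 5:1))

dim(xyz)
colnames(xyz)
xyz[1:3, ]
```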


lists

Making a list

> p1 <- list()
> ### Declare a list
> p1$x <- 1
> p2$x <- 2.0
Error in p2$x <- 2 : object "p2" not found
> p1$x <- 2.0
> p2 <- list()
> p2$x <- 3.0
> p2$y <- 2.0
> p2$y <- 2.0
> p1$y <- 2.0
> p1
$x
[1] 2

$y
[1] 2

> p2
$x
[1] 3

$y
[1] 2

Making a list of lists

> all.p <- list()
> all.p[[1]] <- p1
> all.p[[2]] <- p2
> all.p
[[1]]
[[1]]$x
[1] 2

[[1]]$y
[1] 2


[[2]]
[[2]]$x
[1] 3

[[2]]$y
[1] 2

Naming and accessing lists:

> names( all.p ) <- c("p1","p2")
> all.p
$p1
$p1$x
[1] 2

$p1$y
[1] 2


$p2
$p2$x
[1] 3

$p2$y
[1] 2

> all.p$p1
$x
[1] 2

$y
[1] 2

> all.p$p1$x
[1] 2
> all.p[[1]]$x
[1] 2
> all.p[[1]][[2]]
[1] 2
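The three access forms ($, [[ ]] and [ ]) behave differently, which is easy to miss in the printed output. A small sketch with the same two points, using illustrative names:

```r
p1 <- list(x = 2, y = 2)
p2 <- list(x = 3, y = 2)
all_p <- list(p1 = p1, p2 = p2)

all_p[["p2"]]$x          # [[ ]] extracts the element itself
class(all_p["p2"])       # [ ] returns a shorter list, still of class "list"
class(all_p[["p2"]]$x)   # the extracted value is a plain numeric

# sapply() pulls the same field out of every element.
xs <- sapply(all_p, function(p) p$x)
xs
```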


lists are a great way to return and pass data

> x <- rnorm( 20, 1.2, 0.5 )   ## 20 draws from a normal N(1.2, 0.5)
> hist.x <- hist( x )
> hist.x
$breaks
[1] 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0

$counts
[1] 2 3 2 6 2 3 0 2

$intensities
[1] 0.4999999 0.7500000 0.5000000 1.5000000 0.5000000 0.7500000 0.0000000 0.5000000

$density
[1] 0.4999999 0.7500000 0.5000000 1.5000000 0.5000000 0.7500000 0.0000000 0.5000000

$mids
[1] 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9

$xname
[1] "x"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

> class(hist.x )
[1] "histogram"
## so it is not 'just' a list
## more on that later

[Figure: "Histogram of rnorm(2000, 1.2, 0.5)" (x-axis: rnorm(2000, 1.2, 0.5), y-axis: Frequency)]
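hist() hands back everything it computed in one list; your own functions can do the same. The helper describe() below is hypothetical and exists only to show the pattern.

```r
# Hypothetical helper that returns several results at once in a named list.
describe <- function(v) {
  list(n = length(v), mean = mean(v), sd = sd(v), range = range(v))
}

set.seed(1)                       # fixed seed so the sketch is repeatable
draws <- rnorm(20, mean = 1.2, sd = 0.5)
stats <- describe(draws)

names(stats)
stats$n
stats$range[2] - stats$range[1]   # components can be used independently
```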


a strange thing lists do ... name autocompletion

> x <- rnorm( 20, 1.2, 0.5 )   ## 20 draws from a normal N(1.2, 0.5)
> hist.x <- hist( x )
> hist.x
$breaks
[1] 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0

$counts
[1] 2 3 2 6 2 3 0 2

$intensities
[1] 0.4999999 0.7500000 0.5000000 1.5000000 0.5000000 0.7500000 0.0000000 0.5000000

$density
[1] 0.4999999 0.7500000 0.5000000 1.5000000 0.5000000 0.7500000 0.0000000 0.5000000

$mids
[1] 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9

$xname
[1] "x"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

> class(hist.x )
[1] "histogram"
> ## so it is not 'just' a list
> ## more on that later
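The "name autocompletion" in this slide's title refers to partial matching: the $ operator accepts any unique prefix of a component name. A minimal sketch on a small hand-made list (not the hist.x object above):

```r
h <- list(breaks = c(0.4, 0.6, 0.8), counts = c(2, 3))

h$br                        # `$` accepts a unique prefix of the name
h[["br"]]                   # `[[` matches exactly by default, so this is NULL
h[["br", exact = FALSE]]    # partial matching has to be requested explicitly
```

Partial matching is convenient at the prompt but fragile in scripts: adding a component with the same prefix later changes what the short name resolves to, so full names are the safer habit in code.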


dataframes

Data frames are tables of data. Most of the time you get them by reading in tab-delimited tables or flat files with read.table().

Let's look at an example data.frame that most installs of R should have loaded.

> class( USJudgeRatings )
[1] "data.frame"
> str(USJudgeRatings)
'data.frame':	43 obs. of  12 variables:
 $ CONT: num  5.7 6.8 7.2 6.8 7.3 6.2 10.6 7 7.3 8.2 ...
 $ INTG: num  7.9 8.9 8.1 8.8 6.4 8.8 9 5.9 8.9 7.9 ...
 $ DMNR: num  7.7 8.8 7.8 8.5 4.3 8.7 8.9 4.9 8.9 6.7 ...
 $ DILG: num  7.3 8.5 7.8 8.8 6.5 8.5 8.7 5.1 8.7 8.1 ...
 ...
 $ RTEN: num  7.8 8.7 7.8 8.7 4.8 8.6 9 5 8.8 7.9 ...

> pairs( USJudgeRatings[,1:5] )   ## this function knows
                                  ## what to do with a
                                  ## dataframe
> USJudgeRatings$CONT
 [1] 5.7 6.8 7.2 6.8 ...
> USJudgeRatings[,1]
 [1] 5.7 6.8 7.2 6.8 ...

[Figure: pairs() scatterplot matrix of USJudgeRatings; panels CONT, INTG, DMNR, DILG, CFMG]
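Since most data frames come from read.table(), here is a self-contained sketch of that route. The file contents and column names are made up, and the file is written to a temporary location so the example does not depend on any course data.

```r
# Write a tiny tab-delimited table to a temporary file (made-up data).
tab_file <- tempfile(fileext = ".txt")
writeLines(c("gene\tcond1\tcond2",
             "g1\t0.5\t1.2",
             "g2\t2.1\t0.3"), tab_file)

expr <- read.table(tab_file, header = TRUE, sep = "\t",
                   stringsAsFactors = FALSE)

class(expr)
str(expr)
expr$cond1        # a column by name
expr[, 2]         # the same column by position
```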


dataframes

coerce a data.frame to a matrix

> x <- as.matrix ( USJudgeRatings )
> str( x )
 num [1:43, 1:12] 5.7 6.8 7.2 6.8 7.3 6.2 10.6 7 7.3 8.2 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:43] "AARONSON,L.H." "ALEXANDER,J.M." "ARMENTANO,A.J." "BERDON,R.I." ...
  ..$ : chr [1:12] "CONT" "INTG" "DMNR" "DILG" ...
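The coercion above gives a numeric matrix because every column of USJudgeRatings is numeric. A matrix holds a single type, so one non-numeric column changes the result, as this sketch with two invented data frames shows:

```r
numeric_df <- data.frame(a = c(1, 2), b = c(3.5, 4.5))
mixed_df   <- data.frame(a = c(1, 2), label = c("u", "v"),
                         stringsAsFactors = FALSE)

is.numeric(as.matrix(numeric_df))   # all columns numeric -> numeric matrix
is.character(as.matrix(mixed_df))   # any non-numeric column -> character matrix
```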


reading in code

# ~bonneau/v22-class/ > cat mean.vec.R
## function to report the mean of a vector
mean.vec <- function ( x, na.remove = TRUE ) {

  ## is.numeric() is TRUE for both "numeric" and "integer" vectors
  if ( is.numeric( x ) ) {
    return( mean(x, na.rm = na.remove) )
  } else {
    return( NULL )   ## we could also return a NA
  }
}

# ~bonneau/v22-class/ > R

...R startup...

> ? source
> source( file = "mean.vec.R" )   ## you might need a path ...
> mean.vec( c( 2,3) )
[1] 2.5
> mean.vec( c( 2, 3, NA) )
[1] 2.5
> mean.vec( c("ps", "qs") )
NULL
>
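To try source() without the course directory, the sketch below writes a small function to a temporary file and loads it. Both the file and the function row_count are stand-ins invented for this example, not the course's mean.vec.R.

```r
# Write a small function definition to a temporary file, then load it.
# The function and file are stand-ins for mean.vec.R.
script <- tempfile(fileext = ".R")
writeLines(c(
  "row_count <- function(m) {",
  "  if (is.null(dim(m))) return(NULL)",
  "  nrow(m)",
  "}"
), script)

source(script)           # evaluates the file in the current session
row_count(volcano)       # volcano is a built-in 87 x 61 matrix
row_count(1:3)           # plain vectors have no dim, so NULL
```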


homework and reading for next time

1. Read the R manual.

2. Non-graded homework: make a function that:
- given a matrix, returns a vector containing the means of each row
- given a list of numeric vectors, returns the mean of each vector in the list

For test data either use the link to "small test expression matrix" or use a built-in R data object (like volcano):

> dim( volcano )
[1] 87 61
> str( volcano )
 num [1:87, 1:61] 100 101 102 103 104 105 105 106 107 108 ...
> dim(volcano )
[1] 87 61
> ? image

Use loops and don't worry about NAs for now. This is not a graded assignment, but give it a try to get your feet wet.

If you want a hint, stay after class ...

Next lecture we'll play with plotting and graphics. If you're confused there will be time to catch up next week.
