R Examples

56
Set a working directory for reading form a file: session > set working directory > choose directlry 1. Concatenation, sequences Many tasks involving sequences of numbers. Here are some basic examples on how to manipulate and create sequences. The function c, concatenation, is used often in R, as are rep and seq X = 3 Y = 4 c(X,Y) [1] 3 4 The function rep denotes repeat:

description

doc/243530153/Studio-Ddoc/243530153/Studio-Ddoc/243530153/Studio-Ddoc/243530153/Studio-Ddoc/243530153/Studio-D

Transcript of R Examples

Page 1: R Examples

Set a working directory for reading form a file:

session > set working directory > choose directlry

1. Concatenation, sequences

Many tasks involving sequences of numbers. Here are some basic examples on

how to manipulate and create sequences. The function c, concatenation, is used

often in R, as are rep and seq

X = 3

Y = 4

c(X,Y)

[1] 3 4

The function rep denotes repeat:

Page 2: R Examples

print(rep(1,4))

print(rep(2,3))

c(rep(1,4), rep(2,3))

[1] 1 1 1 1

[1] 2 2 2

[1] 1 1 1 1 2 2 2

The function seq denotes sequence. There are various ways of specifying the

sequence.

seq(0,10,length=11)

[1] 0 1 2 3 4 5 6 7 8 9 10

In [10]: %%R

seq(0,10,by=2)

[1] 0 2 4 6 8 10

You can sort and order sequences

In [11]: %%R

X = c(4,6,2,9)

sort(X)

[1] 2 4 6 9

Use an ordering of X to sort a list of Y in the same order

Y = c(1,2,3,4)

o = order(X)

X[o]

Y[o]

[1] 3 1 2 4

1. Ploting 2 variables in a dataset : airquality

Page 3: R Examples

Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6

>Plot(ozone~wind, data = airquality)

2. Simple Regression analysis on the file:

data= read.csv("file_predict.csv",header=TRUE)

reg <- lm(y ~ x, data)

3. In regression analysis we use lm() function where y can be equal to any

function of x like y= x^2 etc so reg<- lm(y ~ x^2, data)

This gives you regression equation output:

Coefficients:

(Intercept) x

-1.45e-15 5.00e-02

You can also check this:

1. plot(reg)

The plots can be made nicer by adding colors and using di_erent symbols.

See the help for function par.

plot(X, Y, pch=21, bg='red')

2. summary(reg)

3. predict(reg)

4. http://www.dummies.com/how-to/content/how-to-predict-new-data-values-

with-r.html

5. coplot(a ~ b | c) produces a number of scatterplots of a against b for given

values of c.

Page 4: R Examples

6. image(x, y, z, ...)

contour(x, y, z, ...)

persp(x, y, z, ...)

Plots of three variables. The image plot draws a grid of rectangles using different

colours to represent the value of z, the contour plot draws contour lines to represent

the value of z, and the persp plot draws a 3D surface.

7. To add connecting lines and shapes:

lines(X[21:40], Y[21:40], lwd=2, lty=3, col='orange')

8. abline(a, b)

abline(h=y)

abline(v=x)

abline(lm.obj)

example:

mean_jordan <- mean(data_jordan$points)

plot(data_jordan, main = "1st NBA season of Michael Jordan")

abline(h = mean_jordan)

Adds a line of slope b and intercept a to the current plot. h=y may be used to

specify y-coordinates for the heights of horizontal lines to go across a plot, and

v=x similarly for the x-coordinates for vertical lines. Also lm.obj may be list with a

coefficients component of length 2 (such as the result of model-fitting functions,)

which are taken as an intercept and slope, in that order.

9. More information, including a full listing of the features available can obtained

from within R using the commands:

> help(plotmath)

Page 5: R Examples

> example(plotmath)

> demo(plotmath)

10. x <- array(1:20, dim=c(4,5)) to Generate a 4 by 5 array.

11. Creating a function:

y <- function(x) {x^2+x+1}

y(2)

12. plot(x, y, pch="+")

produces a scatterplot using a plus sign as the plotting character, without

changing the default plotting character for future plots.

13. Read data from a URL into a dataframe called PE (physical endurance)

PE <- read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt", header = TRUE)

# Summary statistics

describe(PE) or summary(PE)

# Scatter plots

plot(PE$age ~ PE$activeyears)

plot(PE$endurance ~ PE$activeyears)

plot(PE$endurance ~ PE$age)

14. x <- seq(-pi, pi, len=50)

15. Time-series. The function ts creates an object of class "ts" from a vector

(single time-series) or a matrix (multivariate time-series), and some op-

tions which characterize the series. The options, with default values are:

ts(data = NA, start = 1, end = numeric(0), frequency = 1, deltat = 1, ts.eps =

getOption("ts.eps"), class, names)

Eg: ts(1:10, start = 1959)

Eg: ts(1:47, frequency = 12, start = c(1959, 2))

Page 6: R Examples

16. Suppose we want to repeat the integers 1 through 3 twice. That's a simple

command:

c(1:3, 1:3)

17. Now suppose we want these numbers repeated six times, or maybe sixty times.

Writing a function that abstracts this operation begins to make sense. In fact that

abstraction has already been done for us:

rep(1:3, 6)

18. A global assignment can be performed with <<-

19. Package functionality: Suppose you have seen a command that you want to try,

such as fortune('dog')

You try it and get the error message: Error: could not find function "fortune"

You, of course, think that your installation of R is broken. I don't have evidence

that your installation is not broken, but more likely it is because your current

R session does not include the package where the fortune function lives. You

can try:

require(fortune)

Where upon you get the message: Error in library(package, ...) :

there is no package called 'fortune'.

The problem is that you need to install the package onto your computer. Assuming

you are connected to the internet, you can do this with the command:

install.packages('fortune')

After a bit of a preamble, you will get:

Warning message: package 'fortune' is not available

Page 7: R Examples

Now the problem is that we have the wrong name for the package. Capitalization

as well as spelling is important. The successful incantation is:

install.packages('fortunes')

require(fortunes)

fortune('dog')

Installing the package only needs to be done once, attaching the package with

the require function needs to be done in every R session where you want the

functionality.

The command: library() shows you a list of packages that are in your standard

location for packages.

20. If you want to do multiple tests, you don't get to abbreviate. With the x1:

> x1 == 4 | 6

OR

> x1 == (4 | 6)

21. The Apply() Function: apply function returning a vector

If you use apply with a function that returns a vector, that becomes the _rst

dimension of the result. This is likely not what you naively expect if you are

operating on rows:

> matrix(15:1, 3)

[,1] [,2] [,3] [,4] [,5]

[1,] 15 12 9 6 3

[2,] 14 11 8 5 2

[3,] 13 10 7 4 1

> apply(matrix(15:1, 3), 1, sort)

[,1] [,2] [,3]

[1,] 3 2 1

Page 8: R Examples

[2,] 6 5 4

[3,] 9 8 7

[4,] 12 11 10

[5,] 15 14 13

The naive expectation is really arrived at with:

t(apply(matrix(15:1, 3), 1, sort))

But note that no transpose is required if you operate on columns|the naïve

expectation holds in that case.

Examples: Matrix(Row, column)

matrix(15:1, 5,3) [,1] [,2] [,3] [1,] 15 10 5 [2,] 14 9 4 [3,] 13 8 3 [4,] 12 7 2 [5,] 11 6 1 > matrix(15:1, 3,5) [,1] [,2] [,3] [,4] [,5] [1,] 15 12 9 6 3 [2,] 14 11 8 5 2 [3,] 13 10 7 4 1

22. Combining Lists: c(List1,List2)

23. File Encoding for file import: read.table("intro.dat", fileEncoding = "UTF-8")

24. Create a data frame and enter data frame by frame:

df <- data. f rame ( time=numeric (N) , temp=numeric (N) , pr e s sur

e=numeric (N) )

df [ 1 , ] <- c ( 0 , 100 , 80)

df [ 2 , ] <- c (10 , 110 , 87)

OR

Page 9: R Examples

m <- matrix ( nrow=N, ncol=3)

colnames (m) <- c ( "time " , "temp " , "pr e s sur e ")

m[ 1 , ] <- c ( 0 , 100 , 80)

m[ 2 , ] <- c (10 , 110 , 87)

25. Generate random numbers with fixed mean/variance:

R> x <- rnorm(100 , mean = 5 , sd = 2)

R> x <- ( x - mean( x ) ) / s q r t ( var ( x ) )

R> mean( x )

[ 1 ] 1 .385177e-16

R> var ( x )

[ 1 ] 1

and now create your sample with mean 5 and sd 2:

R> x <- x*2 + 5

R> mean( x )

[ 1 ] 5

R> var ( x )

[ 1 ] 4

26. Extract particular columns in database: Take columns x1, x2, and x3.

datSubset <- dat [ , c ( "x1 " , "x2 " , "x3 ") ]

27. select observations for men over age 40 from a column, and sex was coded either

m or M. Use the “subset” function:

maleOver40 <- subset(column_name , sex %in% main_dataset ( "m" , "M") & age > 40)

Examples:

fifththmonth <- subset(airquality, airquality$Month < 6) fifththonly <- subset(airquality$Month, airquality$Month < 6)

28. If you want to omit all rows for which one or more column is NA (missing):

x2 <- na.omit ( x )

Page 10: R Examples

29. If you just need to remove rows 1, 6, and 13, do:

New_data <- old_data [ -c ( 1 , 6 , 1 3 ) , ]

30. Difference betrween “Order” and “Sort” :

For a table data with column “x”

X

1

1

3

2

3

1

1

2

3

4

3

sort(dd$x, decreasing=T)

[1] 4 3 3 3 2 2 1 1 1 1

Hence sort output is the actual sorting of input values.

order(dd$x)

[1] 1 2 5 6 4 7 3 8 10 9

Hence the output of “order” function is the indexes of the values of input variable.

31. Find Mean of each and every row:

For a dataset data_set with 100 rown and 4 columns:

y <- apply(as.matrix(d4),1,mean)

x <- seq(along = y)

And now combine the 2 data frames by:

cbind(x,y)

Page 11: R Examples

32. Increase length of a vector: length ( v ) <- 2* l ength ( v )

33. Select every “n”th item in a vector : seq (n , length ( vec ) , by=n)

34. Find index of missing values:

seq ( along=Pes ) [ i s . n a ( Pes ) ]

or

which ( i s . n a ( Pes ) )

35. Find index of largest item in vector

A[ which (A==max(A, na.rm=TRUE) ) ]

36. Count number of items meeting a criterion: length (which (data_set<3))

37. Inverse of a matrix: solve(A)

38. Regression Example:

x <- rnorm(100)

e <- rnorm (100)

y <- 12 + 4*x +13 * e

mylm <- lm( y _ x )

pl o t (x , y , main = "My r e g r e s s i o n ")

a b l i n e (mylm)

39. Smooth line connecting points:

x <- 1:5

y <- c(1,3,4,2.5,2)

plot(x , y )

sp <- spline(x , y , n = 50)

lines( sp )

40. how to plot several \lines" in one graph?

x <- rnorm(10)

Page 12: R Examples

y1 <- rnorm(10)

y2 <- rnorm(10)

y3 <- rnorm(10)

plot(x,y1,type="l")

lines(x,y2)

lines(x,y3)

41. Code that creates a table and then calls \prop.table" to get percents on the columns:

x<- c ( 1 , 3 , 1 , 3 , 1 , 3 , 1 , 3 , 4 , 4 )

y <- c ( 2 , 4 , 1 , 4 , 2 , 4 , 1 , 4 , 2 , 4 )

hmm <- table (x , y )

hmm_out <- prop.table (hmm, 2 ) * 100

42. If you want to sum the column, there is a function \margin.table()" for that, but it is

just the same as doing a sum manually, as in:

apply(hmm, 2, sum)

43. To get equation of a line from x and y vector coordinates:

x<-c ( 1 , 3 , 2 , 1 , 4 , 2 , 4)

y<-c ( 6, 3 , 4 , 1 , 3 , 1 , 4 )

mod1 <- lm(y~x)

44. To predict the next values of a datadet of x and y coordinates:

x<-c ( 1 , 3 , 2 , 1 , 4 , 1 , 4 ,NA)

y<-c ( 4 , 3 , 4 , 1 , 4 , 1 , 4 , 2 )

mod1 <- lm(y~x)

Page 13: R Examples

testdata <- data.frame (x=c(1))

predict(mod1 , newdata = testdata)

45. To calculate predicted values for a variety of different outputs. Here is where the fun

ction \expand.grid" comes in very handy. If one specifes a list of values to be considered

for each variable, then expand.grid will create a \mix and match" data frame to represen

t all possibilities:

x <- c ( 1 , 2 , 3 , 4 )

y <- c ( 22 .1 , 33 .2 , 44 . 4 )

expand.grid (x , y )

46. Create a table from x and y values: Cbind

table_data <- cbind(x,y)

"cbind" and "rbind" functions that put data frames side by side or on top of each

other: they also work with matrices.

> cbind( c(1,2), c(3,4) )

[,1] [,2]

[1,] 1 3

[2,] 2 4

> rbind( c(1,3), c(2,4) )

[,1] [,2]

[1,] 1 3

[2,] 2 4

47. To use a tabular data in vcov() functions we need to convert table into data frame:

frame_data <- data.frame(table_data)

48. Calculate standard errors in a dataset(x,y) named table_data:

frame_data <- data.frame(table_data)

m <- lm( y~x+x+x,data=frame_data)

vcov(m)

Standard_errors <- sqrt (diag(vcov(m)))

Page 14: R Examples

49. LOOPS:

a.)

for (i in 1:10)

{

print(i^2)

}

b.) for (w in c('red', 'blue', 'green'))

{

print(w)

}

50. The matrix product is %*%, tensor product (aka Kronecker product) is %x%.

A <- matrix(c(1,2,3,4), nr=2, nc=2)

J <- matrix(c(1,0,2,1), nr=2, nc=2)

A

[,1] [,2]

[1,] 1 3

[2,] 2 4

J

[,1] [,2]

[1,] 1 2

[2,] 0 1

> J %x% A

[,1] [,2] [,3] [,4]

[1,] 1 3 2 6

[2,] 2 4 4 8

[3,] 0 0 1 3 51. We can create a factor that follows a certain pattern with the "gl" command.

> gl(1,4)

[1] 1 1 1 1

Levels: 1

> gl(2,4)

[1] 1 1 1 1 2 2 2 2

Levels: 1 2

Page 15: R Examples

> gl(2,4, labels=c(T,F))

[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE

Levels: TRUE FALSE

> gl(2,1,8)

[1] 1 2 1 2 1 2 1 2

Levels: 1 2

> gl(2,1,8, labels=c(T,F))

[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE

Levels: TRUE FALSE

51. The "expand.grid" computes a cartesian product (and yields a data.frame).

> x <- c("A", "B", "C")

> y <- 1:2

> z <- c("a", "b")

> expand.grid(x,y,z)

52. When playing with factors, people sometimes want to turn them into numbers.

This can be ambiguous and/or dangerous.

> x <- factor(c(3,4,5,1))

> as.numeric(x) # No NOT do that

[1] 2 3 4 1

> x

[1] 3 4 5 1

Levels: 1 3 4 5

53. In R, the function rowSums() conveniently calculates the totals for each row of

a matrix. This function creates a new vector:

sum_of_rows_vector <- rowSums(my_matrix)

54. survey_vector <- c("M", "F", "F", "M", "M")

# Encode survey_vector as a factor

factor_survey_vector <- factor(survey_vector)

# Specify the levels of 'factor_survey_vector'

levels(factor_survey_vector) <- c("Female", "Male")

Page 16: R Examples

factor_survey_vector

output: [1] Male Female Female Male Male

Levels: Female Male

summary(factor_survey_vector)

Output: Female Male

2 3

55. str(dataset) function : Another method that is often used to get a rapid

overview of your data is the function str(). The function str() shows you structure

of your data set. For a data frame it tells you:

The total number of observations (e.g. 32 car types)

The total number of variables (e.g. 11 car features)

A full list of the variables names (e.g. mpg, cyl … )

The data type of each variable (e.g. num for car features)

The first observations: Applying the str() function will often be the first thing that

you do when receiving a new data set or data frame. It is a great way to get more in

sight in your data set before diving into the real analysis.

56. The subset( ) function: is the easiest way to select variables and observations.

In the following example, we select all rows that have a value of age greater than

or equal to 20 or age less then 10. We keep the ID and Weight columns.

newdata <- subset(mydata, age >= 20 | age < 10, select=c(ID, Weight))

57. Comparing Vectors:

linkedin <- c(16, 9, 13, 5, 2, 17, 14)

linkedin > 15

Output: [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE

58. Types of Statistical Variables :

Page 17: R Examples
Page 18: R Examples

59. Factor Vector:

> # Create a numeric vector with the identifiers of the participants of your survey

> participants_1 <- c(2, 3, 5, 7, 11, 13, 17)

> # Check what type of values R thinks the vector consists of

> class(participants_1)

[1] "numeric"

> # Transform the numeric vector to a factor vector

> participants_2 <- factor(participants_1)

> # Check what type of values R thinks the vector consists of now

> class(participants_2)

[1] "factor"

60.

Page 19: R Examples

61.

62. Histogram Function: > hist(AirPassengers)

Create a histogram of the verbal_baseline variable. Set "Distribution of verbal

memory baseline scores" as the title, "score" as the x-axis label and "frequency" as

the y-axis label.

hist(verbal_baseline, main = "Distribution of verbal memory baseline

scores", xlab = "score", ylab = "frequency")

63. Print basic statistical properties of the red_wine_data data.frame:

describe(red_wine_data)

64. Negatively skewed distributions characteristically have a longer left tail while

positively skewed distributions have a longer right tail.

65.

66. The Scale() Function

Page 20: R Examples

The scale() function makes use of the following arguments.

x: a numeric object

center: if TRUE, the objects' column means are subtracted from the values in

those columns (ignoring NAs); if FALSE, centering is not performed

scale: if TRUE, the centered column values are divided by the column's

standard deviation (when center is also TRUE; otherwise, the root mean

square is used); if FALSE, scaling is not performed

> x <- matrix(1:9,3,3)

> x

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

> y <- scale(x)

> y

[,1] [,2] [,3]

[1,] -1 -1 -1

[2,] 0 0 0

[3,] 1 1 1

attr(,"scaled:center")

[1] 2 5 8

attr(,"scaled:scale")

[1] 1 1 1

67. To Generate Z scores: Y<- scale(x, center = TRUE, scale = TRUE)

OR

Y <- ( x – mean(x) ) / sd(x)

Page 21: R Examples

68.

Summary(dataset) gives you following information:

Min. 1st Qu. Median Mean 3rd Qu. Max.

Example:

Solution:

That is the probability is 78%.

To find a probability of height being >180 then we need to do this:

1- pnorm(180, mean =172, sd = 10)

69.

Page 22: R Examples
Page 23: R Examples

70. Variance also known as Mean Squares.

Var() function is used to calculate it in R

Example: var(data_jordan$points)

71. Analysis of Variance(ANOVA): ANOVA is used when more than two group

means are compared, whereas a t-test can only compare two group means.

Suppose a dataset has :

Page 24: R Examples

72. Generate density plot of the F-distribution

# Create the vector x

x <- seq(from = 0, to = 2, length = 100)

x

# Simulate the F-distributions

y_1 <- df(x, 1, 1)

y_2 <- df(x, 3, 1)

y_3 <- df(x, 6, 1)

y_4 <- df(x, 3, 3)

y_5 <- df(x, 6, 3)

y_6 <- df(x, 3, 6)

y_7 <- df(x, 6, 6)

# Plot the F-distributions

plot(x, y_1, col = 1, "l")

lines(x, y_2, col = 2, "l")

lines(x, y_3, col = 3, "l")

lines(x, y_4, col = 4, "l")

lines(x, y_5, col = 5, "l")

lines(x, y_6, col = 6, "l")

# Summary statistics by all groups (8 sessions, 12 sessions, 17

sessions, 19 sessions)

describeBy(WM, WM$condition)

# Boxplot IQ versus cond

boxplot(WM$IQ ~ WM$cond, main = "Boxplot", xlab = "Group

(cond)", ylab = "IQ")

# Apply the aov function

anova.WM <- aov(WM$IQ ~ WM$cond)

# Look at the summary table of the result

summary(anova.WM)

Page 25: R Examples

lines(x, y_7, col = 7, "l")

# Add the legend

legend("topright", c("df = (1,1)", "df = (3,1)", "df = (6,1)", "df = (3,3)", "df = (6,3)"

, "df = (3,6)", "df = (6,6)"), title = "F distributions", col = c(1,2,3,4,5,6,7), lty = 1)

73. Correlation and Covariance: cor(A,B) command

R is the correlation coefficient :

Step1: calculate covariance

> #

> # Take a quick peek at both vectors

> A

[1] 1 2 3

> B

[1] 3 6 7

> # Save differences of each vector element with the mean in a new variable

> diff_A <- A - mean(A)

> diff_B <- B - mean(B)

> # Do the summation of the elements of the vectors and divide by N-1 in

order to acquire covariance

> cov <- sum(diff_A * diff_B)/(3 - 1)

Page 26: R Examples

Step2: calculate standard deviations:

Step 3: calculation of r:

OR

Use the cor(A,B) command : cor(A,B)

# Square the differences that were found in the previous step

sq_diff_A <- diff_A^2

sq_diff_B <- diff_B^2

# Take the sum of the elements, divide them by N-1 and consequently take

the square root to acquire the sample standard deviations

sd_A <- sqrt(sum(sq_diff_A)/(3-1))

sd_B <- sqrt(sum(sq_diff_B)/(3-1))

# Combine all the pieces of the puzzle

correlation <- cov/(sd_A*sd_B)

correlation

Page 27: R Examples

74. Multiple Regression

a. Simple regression

# fs is available in your working environment

fs

# Perform the two single regressions and save them in a variable

model_years <- lm(fs$salary ~ fs$years)

model_pubs <- lm(fs$salary ~ fs$pubs)

# Plot both enhanced scatter plots in one plot matrix of 1 by 2

par(mfrow = c(1, 2))

plot(fs$salary ~ fs$years, main = "plot_years", xlab = "years", ylab = "salary")

abline(model_years)

plot(fs$salary ~ fs$pubs, main = "plot_pubs", xlab = "pubs", ylab = "salary")

abline(model_pubs)

b. R2 coefficients of regression models: These coefficients are often used in practi

ce to select the best regression model in case of competing models. TheR2 coeffici

ent of a regression model is defined as the percentage of the variation in the outco

me variable that can be explained by the predictor variables of the model. In genera

l, the R2 coefficient of a model increases when more predictor variables are added

to the model. After all, adding more predictor variables to the model tends to incre

ase the odds of explaining more variation in the outcome variable.

> # fs is available in your working environment

> # Do a single regression of salary onto years of experience and check the output

> model_1 <- lm(fs$salary ~ fs$years)

> summary(model_1)

Page 28: R Examples

Call:

lm(formula = fs$salary ~ fs$years)

Residuals:

Min 1Q Median 3Q Max

-82972 -9537 4305 17703 57949

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 68672.5 8259.0 8.315 5.38e-13 ***

fs$years 2689.9 318.4 8.448 2.78e-13 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30220 on 98 degrees of freedom

Multiple R-squared: 0.4214, Adjusted R-squared: 0.4155

F-statistic: 71.37 on 1 and 98 DF, p-value: 2.779e-13

> # Do a multiple regression of salary onto years of experience and numbers of pub

lications and check the output

> model_2 <- lm(fs$salary ~ fs$years + fs$pubs)

> summary(model_2)

Call:

lm(formula = fs$salary ~ fs$years + fs$pubs)

Residuals:

Min 1Q Median 3Q Max

-67835 -14589 2362 13358 69613

Page 29: R Examples

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 58828.7 7605.9 7.735 9.79e-12 ***

fs$years 1337.4 387.1 3.455 0.000819 ***

fs$pubs 634.9 123.6 5.137 1.44e-06 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26930 on 97 degrees of freedom

Multiple R-squared: 0.5451, Adjusted R-squared: 0.5357

F-statistic: 58.12 on 2 and 97 DF, p-value: < 2.2e-16

> # Save the R squared of both models in preliminary variables

> preliminary_model_1 <- summary(model_1)$r.squared

> preliminary_model_2 <- summary(model_2)$r.squared

> # Round them off while you save them in a vector

> r_squared <- c()

> r_squared[1] <- round(preliminary_model_1, 3)

> r_squared[2] <- round(preliminary_model_2, 3)

> # Print out the vector to see both R squared coefficients

> r_squared

[1] 0.421 0.545

Page 30: R Examples

75. GGVIS Package: Every ggvis graph contains 4 essential components: da

ta, a coordinate system, marks and corresponding properties. By changing the valu

es of each of these components you can create a vast array of unique plots.

TO install ggvis: > install.packages("ggvis")

Syntax: dataset %>% ggvis(~variable1, ~variable2, other options)

Example:

# The ggvis packages is loaded into the workspace already

# Change the code below to make a graph with red points

mtcars %>% ggvis(~wt, ~mpg, fill := "red") %>% layer_points()

# Change the code below draw smooths instead of points

mtcars %>% ggvis(~wt, ~mpg) %>% layer_smooths()

# Change the code below to amke a graph containing both points and a smoothed s

ummary line

mtcars %>% ggvis(~wt, ~mpg) %>% layer_points() %>% layer_smooths()

76. GGVIS Exmple:

# Make a scatterplot of the pressure dataset

pressure %>% ggvis(~temperature, ~pressure) %>% layer_points()

# Adapt the code you wrote for the first challenge: show bars instead of points

pressure %>% ggvis(~temperature, ~pressure) %>% layer_bars()

# Adapt the code you wrote for the first challenge: show lines instead of points

pressure %>% ggvis(~temperature, ~pressure) %>% layer_lines()

Page 31: R Examples

# Adapt the code you wrote for the first challenge: map the fill property to the tem

perature variable

pressure %>% ggvis(~temperature, ~pressure, fill = ~temperature) %>% layer_poi

nts()

# Extend the code you wrote for the previous challenge: map the size property to th

e pressure variable

pressure %>% ggvis(~temperature, ~pressure, fill = ~temperature, size = ~pressure

) %>% layer_points()

77. Importing Data with rxImport Function

A) Introduction for Big Data Import

The first step in this problem is to declare the file paths that point to where the data

is being stored on the server.

i. The file.path function constructs a path to a file, and in this exercise, we will

use this function to define the location of the csv file that we would like to

import. As arguments tofile.path, we pass it the directory where the data

lives, and then the basename of the file we would like to import.

ii. We can use rxGetOption(“sampleDataDir”) to get the appropriate directory.

In the exercise, fill in the appropriate filename for the csv file that contains

the appropriate airline dataset. in this example, we will just use a small

subsample of approximately 2.5% of those observations. These are available

in the file “2007_subset.csv”.

iii. Next, let's import the data. The rxImport function has the following syntax:

iv. rxImport(inData, outFile, overwrite = TRUE)

v. The inData argument is the file we want to convert, so it should be assigned

the csv file path. The outFile argument corresponds to the imported file we

want to create, so it should be defined as the xdf file path.

Page 32: R Examples

vi. If we specify overwrite as TRUE, any existing data in the output file will be

overwritten by the results from this process. You should take extra care

when setting this argument to TRUE!

vii. Once you have run the rxImport() command, you can runlist.files() to make

sure your xdf file has been created!

# Declare the file paths for the csv and xdf files

myAirlineCsv <- file.path(rxGetOption("sampleDataDir"), "2007_subset.csv")

myAirlineXdf <- "2007_subset.xdf"

# Use rxImport to import the data into xdf format

rxImport(inData = myAirlineCsv, outFile = myAirlineXdf, overwrite = TRUE)

list.files()

B) Operations on Big Data: Functions for Summarizing Data

Use the rxGetInfo(), rxSummary(), and rxHistogram() functions to summarize the

flight data.

i. rxGetInfo() provides information about the dataset as a whole — how many

rows and variables there are, the type of compression used, etc. Additionally,

if you set the getVarInfoargument to TRUE, then it will also provide a brief

synopsis of the variables. If you specify the numRows argument, it will also

return the first numRows rows.

ii. rxSummary() provides summaries for individual variables within a dataset.

iii. rxHistogram() will provide a histogram of a single variable.

C) Practical Use Example:

i. First, we will use rxGetInfo() in order to extract meta-data about the dataset.

The syntax of rxGetInfo() is:

Page 33: R Examples

rxGetInfo(data, getVarInfo, numRows)

data corresponds to the dataset that we would like to get information for.

getVarInfo is a boolean argument that determines whether meta-data regardin

g the variables is also returned (default = FALSE).

numRows determines the number of rows of the dataset that is returned (defa

ult = 0).

ii. Use rxGetInfo() to summarize the myAirlineXdf dataset we created in the pr

evious exercise. In this function call, obtain information on all the variables,

and return the first ten rows of data.

iii. Next, we will use rxSummary() to summarize a few variables in the dataset.

The syntax of rxSummary() is:

rxSummary(formula, data)

formula specifies the variables that you want to extract. The variables you

want to summarize should appear on the right-hand side of the formula, and

under most circumstances, will be separated by a + symbol.

data corresponds to dataset from which you would like to extract variables.

iv. Summarize the variables - ActualElapsedTime, AirTime,DepDelay, and Dist

ance.

v. Use rxSummary() to summarize ActualElapsedTime,AirTime, DepDelay, an

d Distance.

vi. Finally, we will use rxHistogram() to visualize the distribution of one of our

variables. The syntax of rxHistogram() is:

rxHistogram(formula, data, …)

formula specifies the variable that you want to visualize. In the simplest case

, it will take the form of ~ variable.

Page 34: R Examples

data corresponds to the dataset from which you would like to extract variabl

es.

… is used to represent additional variables that govern the appearance of the

figure.

Use rxHistogram() to obtain a histogram for the variableDepDelay.

Build a second histogram in which the x axis is limited to departure delays b

etween −100 and 400 minutes. In this histogram, segment the data such that

there is one segment for a minute of delay. When the histogram is plotted ha

ve only ten ticks on the x-axis.

D) Preparing Data For Analysis: Import

Let's get a little more practice with importing and preparing data for analysis. Go a

head and declare the file paths, and use therxImport() command to import the airlin

e data to an xdf file.

Instructions

The first step in this problem is to declare the file paths that point to where the data

is being stored on the server.

The file.path() function constructs a path to a file, and in this problem you must

define the big data directory,sampleDataDir, where the files are being stored.

We can get the sampleDataDir by using the rxGetOption()function.

Once we know where the files are, we can look in that directory to examine what

files exist with list.files().

Once we have found the name of the file (AirlineDemoSmall.csv), we can import

it using rxImport(). In this case, we will also use the argument colInfo in order to

specify that the character variable DayOfWeek should be interpreted as a factor,

and that its levels should have the same order as the days of the week (Monday -

Sunday).

Next, let's import the data. You can use the help() or args() in order to remind

yourself of the arguments to the rxImport()function.

Page 35: R Examples

# Declare the file paths for the csv and xdf files

myAirlineCsv <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv

")

myAirlineXdf <- "ADS.xdf"

# Use rxImport to import the data into xdf format

rxImport(inData = myAirlineCsv,

outFile = myAirlineXdf,

overwrite = TRUE,

colInfo = list(

DayOfWeek = list(

type = "factor",

levels = c("Monday", "Tuesday", "Wednesday",

"Thursday", "Friday", "Saturday", "Sunday")

)

)

)

E) Preparing Data For Analysis: Exploration

If we are interested in predicting Arrival Delay by Day of the week, then the first

thing we might want to do is to explicitly compute the average arrival delay for

each day of the week. We can do this using the rxSummary() function.

rxSummary() provides a summary of variables. In the simplest case, it provides

simple univariate summaries of individual variables. In this example, we will use it

to create summaries of a numeric variable at different levels of a second,

categorical variable.

Page 36: R Examples

Instructions

Use rxSummary() in order to compute the average delay for each day of the week

in the myAirlineXdf dataset.

The basic syntax is: rxSummary(formula, data)

formula - A formula indicating the variables for which you want a summary. If

there is a variable on the left hand side of this formula, then it will compute

summary statistics of that variable for each level of a categorical variable on the

right hand side. If there is no variables on the left hand side, it will compute

summaries of the variables on the right hand side of the formula across the entire

dataset.

data - The dataset in which to look for variables specified informula.

Go ahead use rxSummary() to compute the mean arrival delay for each day of the

week.

After you have viewed the average arrival delay for each day of the week, you

might also want to view the distribution of arrival delays. We can do this

using rxHistogram(). Go ahead and use rxHistogram to visualize the distribution.

# Summarize arrival delay for each day of the week.

rxSummary(formula = ArrDelay ~ DayOfWeek, data = myAirlineXdf)

## Vizualize the arrival delay histogram

rxHistogram(formula = ~ ArrDelay, data = myAirlineXdf)

F) Hint: Remember that you can see what objects are available to you in the works

pace by using ls(), and that you can get help for functions by typing ?FunctionNam

e at the console. For example, ?rxGetInfo will provide a help browser for therxGetI

nfo() function.

For the rxGetInfo() command, remember that the dataargument corresponds to the

dataset you would like to get information about, getVarInfo corresponds to whethe

Page 37: R Examples

r or not information about variables should be extracted, andnumRows corresponds

to the number of rows to extract.

For the rxSummary() command, remember that the formulaargument specifies the

variables you would like to summarize. The variables should be placed on the right

hand side of the formula (You can see more help on formulas with ?formula). Rem

ember that you can get variable names from a dataset by using the rxGetInfo() func

tion with getVarInfo set to TRUE. The data argument should reference the dataset i

n which you want to look for the variables specified in the formula.

For the first rxHistogram command, remember that formulashould specify the vari

able you want to plot, in the format of a right-sided formula. If you can't remember

the name of appropriate variable, you can always run rxGetInfo() function with get

VarInfo set to TRUE. The data argument should reference the dataset in which you

want to look for the variables specified in the formula.

For the more specific histogram, several arguments have been added. You can vie

w the helpf for rxHistogram() with ?rxHistogram. xAxisMinMax controls the lowe

r and upper limits of the x axis. numBreaks controls the number of bins in the histo

gram, and xNumTicks controls the number of ticks on the x axis.

Example:

# Get basic information about your data

rxGetInfo(data = myAirlineXdf,

getVarInfo = TRUE,

numRows = 10)

## Summarize the variables corresponding to actual elapsed time, time in the air, d

eparture delay, flight Distance.

rxSummary(formula = ~ ActualElapsedTime + AirTime + DepDelay + Distance,

Page 38: R Examples

data = myAirlineXdf)

# Histogram of departure delays

rxHistogram(formula = ~DepDelay,

data = myAirlineXdf)

# Use parameters similar to a regular histogram to zero in on the interesting area

rxHistogram(formula = ~DepDelay,

data = myAirlineXdf,

xAxisMinMax = c(-100, 400),

numBreaks = 500,

xNumTicks = 10)

78. Construct a linear model

Use rxLinMod to create a simple linear model

The structure to a call to rxLinMod() is very similar to the lm()function in

the stats package. The syntax is: rxLinMod(formula, data, …)

formula - The model specification.

data - The data in which you want to search for variables informula.

… - Additional arguments

Using the data set myAirlineXdf, go ahead and start by creating a simple linear

model predicting arrival delay by day of the week.

## predict arrival delay by day of the week:

myLM1 <- rxLinMod( ArrDelay ~ DayOfWeek, data = myAirlineXdf)

## summarize the model

summary(myLM1)

Page 39: R Examples

## Use the transforms argument to create a factor variable associated with departur

e time "on the fly,"

## predict Arrival Delay by the interaction between Day of the week and that new f

actor variable

myLM2 <- rxLinMod( ArrDelay ~ DayOfWeek:catDepTime, data = myAirlineXdf

,

transforms = list(

catDepTime = cut(CRSDepTime, breaks = seq(from = 5, to = 23, by

= 2))

),

cube = TRUE

)

## summarize the model

summary(myLM2)

79. Correlation in Big Data: Use rxCor to construct the correlation matrix:

The syntax of rxCor function is: rxCor(formula, data, rowSelection) where

formula - a right-sided formula that specifies the variables you would like

correlate.

data - The dataset in which you want to search for the variables in the

formula.

rowSelection - a logical expression that will be used to select particular rows

for analysis. Rows in which the expression is TRUE will be included in the

correlation values; rows in which hte expression is FALSE will not.

# Correlation for departure delay, arrival delay, and air speed

Page 40: R Examples

rxCor(formula = ~ DepDelay + ArrDelay + airSpeed,

data = myAirlineXdf,

rowSelection = (airSpeed > 50) & (airSpeed < 800))

80. Linear Regression: Use rxLinMod to construct the regression: The syntax

is:rxLinMod(formula, data, rowSelection)

formula - a two-sided formula that specifies the variable you want to predict

on the left side, and the predictor variable(s) on the right side.

data - The dataset in which you want to search for the variables in the

formula.

rowSelection - a logical expression that will be used to select particular rows

for analysis. Rows in which the expression is *

Remember to use summary() to return your analysis results.

Example: Construct a linear regression predicting air speed by departure delay, for

air speeds between 50 and 800. Once you have computed the model, be sure to su

mmarize it in order to get inference results.

# Regression for airSpeed based on departure delay

myLMobj <- rxLinMod(formula = airSpeed ~ DepDelay,

data = myAirlineXdf,

rowSelection = (airSpeed > 50) & (airSpeed < 800))

summary(myLMobj)

81. Generating Predictions and Residuals

Use rxPredict() to generate arrival delay predictions for the model we created in

the prior exercise (myLM2).

Before we start with making predictions. Go ahead and summarize myLM2, so that

you can refresh your memory on the model and results.

Page 41: R Examples

rxPredict() is the RevoScaleR function that allows us to generate predictions and

residuals based on a variety of different models.

The syntax is: rxPredict(modelObject, data, outData, computeResiduals, …)

modelObject - The model you would like to use in order to generate

predictions.

data - The data for which you want to make predictions.

outData - The location where you would like to store the residuals.

computeResiduals - Whether to compute residuals or not.

… - Additional arguments.

In this case, we need to be careful. We need to keep in mind that it will generate as

many predictions as there are observations in the dataset. If you are trying to

generate predictions for a billion observations, your output will also have a billion

predictions. Because of this, the output is, by default not stored in memory, but

rather, stored in an xdf file.

Go ahead and create a new xdf file to store the predicted values of our original

dataset. Like other RevoScaleRfunctions, it can take additional arguments that

control which variables are kept in the creation of new data sets. Since we are

going to create our own copy of the dataset, we should also specify

the writeModelVars argument so that the values of the predictor variables are also

included in the new prediction file.

After using rxPredict() and rxGetInfo() to generate predictions, use the same

methods to generate the residuals.

## summarize model first

summary(myLM2)

## path to new dataset storing predictions

myNewADS <- "myNEWADS.xdf"

## generate predictions

Page 42: R Examples

rxPredict(modelObject = myLM2, data = myAirlineXdf,

outData = myNewADS,

writeModelVars = TRUE)

## get information on the new dataset

rxGetInfo(myNewADS, getVarInfo = TRUE)

## Generate residuals.

rxPredict(modelObject = myLM2, data = myAirlineXdf,

outData = myNewADS,

writeModelVars = TRUE,

computeResiduals = TRUE,

overwrite = TRUE)

## get information on the new dataset

rxGetInfo(myNewADS, getVarInfo = TRUE)

Page 43: R Examples

82. Computing k Means with rxKmeans()

Target: Cluster the mortgage default data into 4 clusters usingrxKmeans().

Use rxKmeans() function to perform k-means clustering on the data.

Before starting, be sure to take explore your workspace, and look at

the mortData dataset with rxGetInfo().

The syntax is: rxKmeans(formula, data, outFile, numClusters, …)

formula - The formula that specifies which variables you would like to

include in the clustering algorithm.

data - The dataset in which to search for variables included in formula.

outFile - The dataset in which you would like to write the cluster IDs.

numClusters - The number (k) of clusters you would like to estimate.

… - Additional arguments, several of which control the k means algorithm

(e.g. algorithm, numStarts, centers; see ?rxKmeans for more information)

Go ahead and use rxKmeans() to estimate a set of 4 clusters in

the mortData dataset. Use the variables corresponding to credit card debt, credit

score, and house age. Use therowSelection argument so that you only estimate the

clusters based on the observations from year 2000. Use theoutFile argument to

create a new dataset in which the model variables and the cluster IDs for each row

are written. Assign the output of the rxKmeans() call to an object namedKMout,

and then print that object, so that you can see some properties of the k-means

solution.

Once you have printed out some properties of the solution, take a look at the new

dataset that you have created. Extract the meta-data from the xdf file, and then

summarize the cluster variable using rxSummary().

Next, we are going to visualize the clustering. Go ahead and use

the rxXdfToDataFrame() function to read some of the new dataset into an

internal data.frame object. The first argument to rxXdfToDataFrame should be the

name of the xdf file you want to read in to memory. While this dataset is not

particularly large, we will go ahead and use

therowSelection and transforms arguments in order to randomly sample a small

subset of the data. In this case, we use transforms to create a new

variable randSamp by randomly sampling the numbers 1 through 10 with

replacement. Using rowSelection to select only those rows where randSamp ==

1 means that we pull approximately 10% of our dataset into memory.

Once you have pulled the data into memory, visualize the clusters by creating each

possible pairwise scatter plot, with the color of each point corresponding to the

cluster it was assigned. The plot.data.frame() method is very useful here, but

remember to remove the variable corresponding to the cluster ID from the plot.

Page 44: R Examples

Can you explain why a single variable drives the clustering?

## Examine the mortData dataset

rxGetInfo(mortData, getVarInfo = TRUE)

## set up a path to a new xdf file

myNewMortData = "myMDwithKMeans.xdf"

## run k-means:

KMout <- rxKmeans(formula = ~ ccDebt + creditScore + houseAge,

data = mortData,

outFile = myNewMortData,

rowSelection = year == 2000,

numClusters = 4,

writeModelVars = TRUE)

print(KMout)

## Examine the variables in the new dataset:

rxGetInfo(myNewMortData, getVarInfo = TRUE)

## summarize the cluster variable:

rxSummary(~ F(.rxCluster), data = myNewMortData)

## read into memory 10% of the data:

mydf <- rxXdfToDataFrame(myNewMortData,

rowSelection = randSamp == 1,

varsToDrop = "year",

transforms = list(randSamp = sample(10, size = .rxNumRows, repla

ce = TRUE)))

## visualize the clusters:

plot(mydf[-1], col = mydf$.rxCluster)

Page 45: R Examples

83. Create some Decision Trees

Create a regression tree predicting default status using rxDTree().

Use rxDTree() to create a regression tree for the mortDatadataset.

rxDTree() is the RevoScaleR function that computes both regression and

classification trees. It decides which kind of tree to compute based on the type of

the dependent variable used in its formula argument. If that variable is numeric,

then it produces a regression tree; if that variable is a factor, then it produces a

classification tree.

The syntax is: rxDTree(formula, data, maxdepth, …)

formula - The model specification

data - The data set in which to search for variables used informula

maxdepth - The maximum depth of the regression tree to estimate.

… - Additional arguments

Go ahead and use rxDTree() in order to produce a decision tree predicting default

status by credit score, credit card debt, years employed, and house age, using a

maximum depth = 5. Use rowSelection to estimate this tree on only the data from

the year 2000. Assign the output to an object named regTreeOut. This can take a

few seconds, so be patient.

Once you have created this object, you can print it in order to view a summary and

textual description of the tree. Go ahead and print regTreeOut and spend a couple

of minutes looking at the output.

Although the text output can be useful, it is usually more intuitive to visualize such

an analysis via a dendrogram. You can produce this type of visualization a few

ways. First, you can make the output of rxDTree() appear to be an object produced

by the rpart() function by usingrxAddInheritance(). Once that is the case, then you

can use all of the methods associated with rpart: Using plot() on the object after

adding inheritance will produce a dendrogram, and then running text() on the

object after adding inheritance will add appropriate labels in the correct locations.

Go ahead and practice producing this dendrogram.

Page 46: R Examples

In most cases, you can also use the RevoTreeView library by loading that library

and running createTreeView() on theregTreeOut object. This does not work in the

datacamp platform, but will typically open an interactive web page where you can

expand nodes of the decision tree by clicking on them, and you can extract

information about each node by mousing over them.

Similar to the other modeling approaches we've seen, we can use rxPredict() in

order to generate predictions based on the model object. In order to generate

predictions, we need specify the same arguments as before: modelOject, data,

and outData. Since we will create a new dataset, we will also need to make sure to

write the model variables as well. And since we could generate predictions later as

well, let's go ahead and give the name of the new variable something more

specific. We can use teh predVarNames argument to specify the name of the

predicted values to be default_RegPred. Go ahead and try this. Once you have

created the variables, be sure to get information on your new variables

usingrxGetInfo() on the new dataset.

Another useful visualization for regression trees is the Receiver Operating

Characteristic (ROC) curve. This curve plots the "Hit" (or True Positive) rate as a

function of the False Positive rate for different criteria of classifying a positive or

negative. A good model will have a curve that deviates strongly from the identity

line y = x. One measure that is based on this ROC curve is the area under the curve

(AUC), which has a range between 0 and 1, with 0.5 corresponding to chance. We

can compute and display an ROC curve for a regression tree using rxRocCurve().

If we want to create a classification tree rather than a regression tree, we simply

need to convert our dependent measure into a factor. We can do this using

the transformsargument within the call to rxDTree() itself.

Page 47: R Examples

## regression tree:

regTreeOut <- rxDTree(default ~ creditScore + ccDebt + yearsEmploy + houseAge

,

rowSelection = year == 2000,

data = mortData,

maxdepth = 5)

## print out the object:

print(regTreeOut)

## plot a dendrogram, and add node labels:

plot(rxAddInheritance(regTreeOut))

text(rxAddInheritance(regTreeOut))

## Another visualization:

# library(RevoTreeView)

# createTreeView(regTreeOut)

## predict values:

myNewData = "myNewMortData.xdf"

rxPredict(regTreeOut,

data = mortData,

outData = myNewData,

writeModelVars = TRUE,

predVarNames = "default_RegPred")

## visualize ROC curve

rxRocCurve(actualVarName = "default",

predVarNames = "default_RegPred",

data = myNewData)

Page 48: R Examples

84. R Examples

a. Str(dataset) : for getting structure of data

b. Head(dataset): for getting top rows of data

c. Fivenum(dataset$column): Returns Tukey's five number summary (minimum, l

ower-hinge, median, upper-hinge, maximum) for the input data.

d. IQR(dataset$column) : difference bwtween 75th and 25th quartile values.

e. Boxplot(dataset$column) : diagramitic analysis of min, max, median etc.

f. Newdataset <- edit(dataset) : to edit individual values in dataset.

g. Confidence and Support:

We will make some assumptions for what we might find in an experiment and find

the resulting confidence interval using a normal distribution. Here we assume that t

he sample mean is 5, the standard deviation is 2, and the sample size is 20. In the e

xample below we will use a 95% confidence level and wish to find the confidence i

nterval. The commands to find the confidence interval in R are the following:

> a <- 5

> s <- 2

> n <- 20

> error <- qnorm(0.975)*s/sqrt(n)

> left <- a-error

> right <- a+error

> left

[1] 4.123477

> right

[1] 5.876523

The true mean has a probability of 95% of being in the interval between 4.12 and 5.88

assuming that the original random variable is normally distributed, and the samples are

independent.

Calculating a Confidence Interval From a t Distribution:

Calculating the confidence interval when using a t-test is similar to using a normal

distribution. The only difference is that we use the command associated with the t-

Page 49: R Examples

distribution rather than the normal distribution. Here we repeat the procedures above,

but we will assume that we are working with a sample standard deviation rather than an

exact standard deviation.

Again we assume that the sample mean is 5, the sample standard deviation is 2, and the

sample size is 20. We use a 95% confidence level and wish to find the confidence

interval. The commands to find the confidence interval in R are the following:

> a <- 5

> s <- 2

> n <- 20

> error <- qt(0.975,df=n-1)*s/sqrt(n)

> left <- a-error

> right <- a+error

> left

[1] 4.063971

> right

[1] 5.936029

The true mean has a probability of 95% of being in the interval between 4.06 and 5.94

assuming that the original random variable is normally distributed, and the samples are

independent.

We now look at an example where we have a univariate data set and want to find the

95% confidence interval for the mean. In this example we use one of the data sets given

in the data input chapter. We use the w1.dat data set:

> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)

> summary(w1)

vals

Min. :0.130

1st Qu.:0.480

Median :0.720

Mean :0.765

3rd Qu.:1.008

Max. :1.760

> length(w1$vals)

[1] 54

Page 50: R Examples

> mean(w1$vals)

[1] 0.765

> sd(w1$vals)

[1] 0.3781222

We can now calculate an error for the mean:

> error <- qt(0.975,df=length(w1$vals)-1)*sd(w1$vals)/sqrt(length(w1$vals))

> error

[1] 0.1032075

The confidence interval is found by adding and subtracting the error from the mean:

> left <- mean(w1$vals)-error

> right <- mean(w1$vals)+error

> left

[1] 0.6617925

> right

[1] 0.8682075

There is a 95% probability that the true mean is between 0.66 and 0.87 assuming that

the original random variable is normally distributed, and the samples are independent.

Calculating Many Confidence Intervals From a t Distribution

Suppose that you want to find the confidence intervals for many tests. This is a common

task and most software packages will allow you to do this.

We have three different sets of results:

Comparison 1

Mean Std. Dev. Number (pop.)

Group I 10 3 300

Group II 10.5 2.5 230

Page 51: R Examples

Comparison 2

Mean Std. Dev. Number (pop.)

Group I 12 4 210

Group II 13 5.3 340

Comparison 3

Mean Std. Dev. Number (pop.)

Group I 30 4.5 420

Group II 28.5 3 400

For each of these comparisons we want to calculate the associated confidence interval

for the difference of the means. For each comparison there are two groups. We will refer

to group one as the group whose results are in the first row of each comparison above.

We will refer to group two as the group whose results are in the second row of each

comparison above. Before we can do that we must first compute a standard error and a

t-score. We will find general formulae which is necessary in order to do all three

calculations at once.

We assume that the means for the first group are defined in a variable called m1. The

means for the second group are defined in a variable called m2. The standard deviations

for the first group are in a variable called sd1. The standard deviations for the second

group are in a variable called sd2. The number of samples for the first group are in a

variable called num1. Finally, the number of samples for the second group are in a

variable called num2.

With these definitions the standard error is the square root

of (sd1^2)/num1+(sd2^2)/num2. The R commands to do this can be found below:

> m1 <- c(10,12,30)

> m2 <- c(10.5,13,28.5)

Page 52: R Examples

> sd1 <- c(3,4,4.5)

> sd2 <- c(2.5,5.3,3)

> num1 <- c(300,210,420)

> num2 <- c(230,340,400)

> se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)

> error <- qt(0.975,df=pmin(num1,num2)-1)*se

To see the values just type in the variable name on a line alone:

> m1

[1] 10 12 30

> m2

[1] 10.5 13.0 28.5

> sd1

[1] 3.0 4.0 4.5

> sd2

[1] 2.5 5.3 3.0

> num1

[1] 300 210 420

> num2

[1] 230 340 400

> se

[1] 0.2391107 0.3985074 0.2659216

> error

[1] 0.4711382 0.7856092 0.5227825

Now we need to define the confidence interval around the assumed differences. Just as

in the case of finding the p values in previous chapter we have to use the pmin command

to get the number of degrees of freedom. In this case the null hypotheses are for a

difference of zero, and we use a 95% confidence interval:

> left <- (m1-m2)-error

> right <- (m1-m2)+error

> left

[1] -0.9711382 -1.7856092 0.9772175

> right

Page 53: R Examples

[1] -0.02886177 -0.21439076 2.02278249

This gives the confidence intervals for each of the three tests. For example, in the first

experiment the 95% confidence interval is between -0.97 and -0.03 assuming that the

random variables are normally distributed, and the samples are independent.

h. if-else example:

if ( x < 0.2)

{ x <- x + 1

cat("increment that number!\n")

} else

{

x <- x - 1

cat("nah, make it smaller.\n");

}

nah, make it smaller.

For loop example: for (lupe in seq(0,1,by=0.3))

{ cat(lupe,"\n");

}

0

0.3

0.6

0.9

>

> x <- c(1,2,4,8,16)

> for (loop in x)

{

cat("value of loop: ",loop,"\n");

}

value of loop: 1

value of loop: 2

value of loop: 4

value of loop: 8

Page 54: R Examples

value of loop: 16

h.

sum(1, 3, 5)

rep("Yo ho!", times = 3)

i. You can list the files in the current directory from within R, by calling

the list.files function

j. To run a script, pass a string with its name to the source function.

source("bottle1.R")

k. barplot(1:100)

l. Vector Math:

a <- c(1, 2, 3)

> a + 1

m. sum can take an optional named argument, na.rm. It's set to FALSE by default,

but if you set it to TRUE, all NA arguments will be removed from the vector

before the calculation is performed.

Try calling sum again, with na.rm set to TRUE:

RedoComplete

> sum(a, na.rm = TRUE)

n. a <- 1:12

matrix(a, 3, 4)

o. The dim assignment function sets dimensions for a matrix. It accepts a vector

with the number of rows and the number of columns to assign

dim(plank) <- c(2, 4)

p. contour map of the values simply by passing the matrix to the contour function

contour(matrix name)

q. create a 3D perspective plot with the persp function : persp(matrix name)

r. > persp(matrix name, expand=0.2)

Page 55: R Examples

s. image(matrix name) : to create a heat map.

t. add a line on the plot to show one standard deviation above the mean or TO

show two ablines: > abline(h = meanValue + deviation)

u. adding a line on the plot to show one standard devation below the mean:

abline(h = meanValue - deviation)

85. DATA FRAMES:

The weights, prices, and types data structures are all deeply tied together, if you

think about it. If you add a new weight sample, you need to remember to add a new

price and type, or risk everything falling out of sync. To avoid trouble, it would be

nice if we could tie all these variables together in a single data structure.

Fortunately, R has a structure for just this purpose: the data frame. You can think

of a data frame as something akin to a database table or an Excel spreadsheet. It

has a specific number of columns, each of which is expected to contain values of a

particular type. It also has an indeterminate number of rows - sets of related values

for each column.

Our vectors with treasure chest data are perfect candidates for conversion to a data

frame. And it's easy to do. Call the data.frame function, and pass weights, prices,

and types as the arguments. Assign the result to thetreasure variable:

> treasure <- data.frame(weights, prices, types)

> print(treasure)

weights prices types

1 300 9000 gold

2 200 5000 silver

3 100 12000 gems

4 250 7500 gold

5 150 18000 gems

a. You can get individual columns by providing their index number in double-

brackets. Try getting the second column (prices) of treasure:

Page 56: R Examples

> treasure[[2]]

b. You could instead provide a column name as a string in double-brackets. (This

is often more readable.) Retrieve the "weights" column:

> treasure[["weights"]]

c. Typing all those brackets can get tedious, so there's also a shorthand notation:

the data frame name, a dollar sign, and the column name (without quotes). Try

using it to get the "prices" column:

> treasure$prices

d. You can load a CSV file's content into a data frame by passing the file name to

the read.csv function. Try it with the "targets.csv" file:

RedoComplete

> read.csv("targets.csv")

e. Merge Data Frames: We want to loot the city with the most treasure and the

fewest guards. Right now, though, we have to look at both files and match up the

rows. It would be nice if all the data for a port were in one place...

R's merge function can accomplish precisely that. It joins two data frames together,

using the contents of one or more columns. First, we're going to store those file

contents in two data frames for you, targets and infantry.

The merge function takes arguments with an x frame (targets) and a y frame

(infantry). By default, it joins the frames on columns with the same name (the

two Port columns). See if you can merge the two frames:

RedoComplete

> targets <- read.csv("targets.csv")

> infantry <- read.table("infantry.txt", sep="\t", header=TRUE)

> merge(x = targets, y = infantry)