Introduction to R

by Debby Kermer

George Mason University

Library Data Services Group

http://dataservices.gmu.edu

[email protected]

http://dataservices.gmu.edu/workshop/r

1. Create a folder with your name on the S: Drive

2. Copy titanic.csv from R Workshop Files to that folder

History

R ≈ S ≈ S-Plus

Open Source, Free

www.r-project.org

Comprehensive R Archive Network

Download R from:

cran.rstudio.com

RStudio: www.rstudio.com

R Console

Console

See…
> prompt for new command
+ waiting for rest of command
R "guesses" whether you are done, considering () and symbols

Type…
↑ [Up] to get previous command
Type in everything shown in Courier New font:

3+2
3-
2

Objects

Objects

nine <- 9

nine

three <- nine / 3

three

my.school <- "gmu"

my.school

Historical Conventions

Use <- to assign values

Use . to separate names

Current Capabilities

= is okay now in most cases

_ is okay now in most cases

RStudio: Press Alt - (minus) to insert "assignment operator"
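A minimal sketch of both assignment styles and both naming styles. The name my_school is a hypothetical addition for comparison; it is not in the slides.

```r
# Both operators create the same kind of object at the top level.
nine <- 9
three = nine / 3

# "." and "_" are both ordinary characters inside a name.
my.school <- "gmu"
my_school <- "gmu"   # hypothetical second name, for comparison only

three                              # 3
identical(my.school, my_school)    # TRUE
```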

Global Environment

Script Files

Vectors & Lists

numbers <- c(101,102,103,104,105)
numbers <- 101:105
numbers <- c(101:104,105)
(these three assignments give the same values)

numbers[ 2 ]
numbers[ c(2,4,5)]
numbers[-c(2,4,5)]
numbers[ numbers > 102 ]
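The indexing forms above can be tried on the same vector. This is only a sketch; the comments show what each line returns.

```r
numbers <- c(101:104, 105)

numbers[2]                 # 102
numbers[c(2, 4, 5)]        # 102 104 105
numbers[-c(2, 4, 5)]       # 101 103 (a negative index drops those positions)
numbers[numbers > 102]     # 103 104 105 (a logical index keeps the TRUE positions)

# The three ways of building the vector hold equal values,
# although 101:105 is stored as integer and the others as double.
all(numbers == 101:105)    # TRUE
```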

RStudio: Press Ctrl-Enter to run the current line

or press the button:

Vector Variable

Files & Functions

Files Pane

1. Browse Drives

2. Choose the "S" Drive

3. Choose your folder

Working Directory

Functions

read.table( datafile, header=TRUE, sep = ",")

read.table : Function
datafile : Positional Argument
header=TRUE : Named Argument
sep = "," : Named Argument

Reading Files

Thus, these would be identical:
titanic <- read.table( datafile, header=TRUE, sep = "," )
titanic <- read.csv( datafile )

If you have set a working directory, the file name is enough:
titanic <- read.csv("titanic.csv")

If you have not set a working directory, use the whole path:
titanic <- read.csv("S:/name/titanic.csv")
titanic <- read.csv("S:\\name\\titanic.csv")
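A self-contained sketch of the comparison, assuming a tiny stand-in file written to a temporary location. The three columns and two rows are hypothetical, not the workshop's titanic.csv.

```r
# Write a small CSV to a temporary file (hypothetical data).
datafile <- tempfile(fileext = ".csv")
writeLines(c("name,age,pclass",
             "Passenger A,29,1",
             "Passenger B,2,3"), datafile)

# read.csv() is read.table() with header = TRUE and sep = "," preset,
# plus a few other CSV-friendly defaults (quote, fill, comment.char).
a <- read.table(datafile, header = TRUE, sep = ",")
b <- read.csv(datafile)

all.equal(a, b)   # TRUE for this simple file
```

Files containing apostrophes or `#` characters can read differently under the two calls, because those other defaults differ.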

RStudio: Import Dataset

Working with Data

Data Frames

str(titanic)   (think "structure")

titanic$pclass

titanic <- read.csv("titanic.csv", as.is = "name")

int / num = Numeric (Interval / Ratio)

Factor = Categorical (Nominal / Ordinal)

Factors - Categorical Variables

titanic$pclass.f <- factor( titanic$pclass,
    levels = c(1,2,3),                                   # current values
    labels = c("1st Class", "2nd Class", "3rd Class"),   # labels in the same order
    ordered = TRUE )                                     # ordinal variable

It is a convention to add .f to the name; you can also give a new name or overwrite the original.
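A small sketch of the same factor() call on a stand-alone vector. The five values are hypothetical stand-ins for titanic$pclass.

```r
pclass <- c(3, 1, 2, 3, 1)   # hypothetical class codes

pclass.f <- factor(pclass,
                   levels  = c(1, 2, 3),
                   labels  = c("1st Class", "2nd Class", "3rd Class"),
                   ordered = TRUE)

levels(pclass.f)             # "1st Class" "2nd Class" "3rd Class"
pclass.f[1] > pclass.f[2]    # TRUE: "3rd Class" sorts after "1st Class"
table(pclass.f)              # counts per class: 2, 1, 2
```

Comparisons such as `>` only work because ordered = TRUE; on an unordered factor they return NA with a warning.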

NA and NULL

Delete Variable
titanic$embarked <- NULL

Set Values to Missing
titanic$age[titanic$age == 99] <- NA

Same thing while reading in data (note: this treats 99 as missing in every column, not only age):
titanic <- read.csv("titanic.csv", na.strings = "99")

Ignore NAs Option

na.rm = TRUE primarily needed for base R functions
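A short sketch of why na.rm matters, using a hypothetical age vector in which 99 is the missing-value code.

```r
age <- c(22, 99, 4, 99, 35)   # hypothetical ages; 99 means "missing"

age[age == 99] <- NA          # recode 99 as NA
mean(age)                     # NA: one missing value makes the result missing
mean(age, na.rm = TRUE)       # mean of 22, 4 and 35 only
sum(is.na(age))               # 2
```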

Review

Words with Stuff

word (Object)

word[ stuff ] (Object Part)

word( stuff ) (Function)

"word" (String)

Words that are not Objects
TRUE or T
FALSE or F
NaN (Not a Number)
NA (Not Available)
NULL (Empty)
Inf (Infinity)
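A few lines that show how these special words behave. This is a sketch for experimenting in the console, not part of the original slides.

```r
0 / 0            # NaN
1 / 0            # Inf
is.na(NA)        # TRUE
is.na(NaN)       # TRUE: NaN also counts as missing
length(NULL)     # 0: NULL is an empty object
length(NA)       # 1: NA is a single value that happens to be missing

# T and F are ordinary variables that can be overwritten; TRUE and FALSE cannot.
T <- 0
isTRUE(T)        # FALSE now, which is why TRUE is safer than T
rm(T)            # remove the override so T means TRUE again
```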

Packages

Packages

Packages must be both Installed and Loaded

To Install: install.packages("name")

To Load: library( name ), or require( name ), or check the box

Confirm these are installed: dplyr, tidyr, descr, ggplot2

[Screenshot: RStudio Packages pane, with callouts for the Install button, the Loaded checkboxes, and the Installed list]
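One way to confirm from the console which of the four packages are installed. This is a sketch; the install line is commented out so that nothing downloads by accident.

```r
wanted <- c("dplyr", "tidyr", "descr", "ggplot2")

# requireNamespace() returns TRUE or FALSE without attaching the package.
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
installed

# Install only what is missing:
# install.packages(wanted[!installed])
```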

Data Frame Alternatives

data.frame (base R)
data.table (data.table package)
tbl_df, tbl_dt (dplyr package)

titanic
library(dplyr)
tt <- tbl_df(titanic)
tt
str(tt)

History Fact: Hadley Wickham, who created dplyr, works at RStudio

Choose Variables

base
tt$name
tt[,"name"]
tt[,c("age","gender")]

dplyr
select( tt, name)
select( tt, -name)
select( tt, age, gender)
select( tt, gender : pclass)
select( tt, starts_with("p"))

contains starts_with ends_with matches distinct

Choose Cases

base
tt[tt$age < 5 , ]
titanic[titanic$age < 5 , ]

attach(titanic)
titanic[age < 5 , ]
titanic[age < 5 & is.na(age) == F , ]

dplyr
filter(tt, age < 5 )
filter(tt, age < 5, pclass.f == "1st Class" )
filter(tt, age < 5 | pclass == 1 )
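The is.na() check in the base version matters because a missing age makes the comparison NA, and indexing with NA returns an all-NA row. A base-R sketch on a hypothetical three-row data frame:

```r
df <- data.frame(name = c("A", "B", "C"),
                 age  = c(2, NA, 40))        # hypothetical data

young.all  <- df[df$age < 5, ]                    # 2 rows: "A" plus an all-NA row
young.real <- df[df$age < 5 & !is.na(df$age), ]   # 1 row: "A"
young.sub  <- subset(df, age < 5)                 # 1 row: subset() drops NA comparisons

nrow(young.all)    # 2
nrow(young.real)   # 1
```

dplyr's filter() behaves like subset() here: rows where the condition is NA are dropped.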

Change data

base
tt$child <- tt$age <= 12
tt$totfam <- tt$sibsp + tt$parch
tt$bigfam <- as.numeric(tt$totfam > 4)

dplyr
tt <- mutate(tt, child = age <= 12)
tt <- mutate(tt, totfam = sibsp + parch,
                 bigfam = as.numeric(totfam > 4) )

Chaining / Piping

%>%
RStudio: Ctrl+Shift+M
Read "then…"

select(tt, name, age)
vs
tt %>% select(name, age)

tt %>% filter(age<5) %>% select(name, age)

works anytime the 1st argument is the dataset
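A sketch comparing nested calls with a chain, assuming dplyr is installed. The four-row data frame is hypothetical and stands in for the titanic data.

```r
library(dplyr)

tt <- data.frame(name  = c("A", "B", "C", "D"),   # hypothetical passengers
                 age   = c(2, 30, 4, NA),
                 sibsp = c(1, 0, 4, 0),
                 parch = c(2, 0, 1, 0))

# Nested: read from the inside out.
nested  <- select(filter(tt, age < 5), name, age)

# Chained: read left to right, "then ... then ...".
chained <- tt %>% filter(age < 5) %>% select(name, age)

identical(nested, chained)   # TRUE

# mutate() fits in the same chain because its first argument is also the dataset.
fam <- tt %>%
  mutate(totfam = sibsp + parch,
         bigfam = as.numeric(totfam > 4)) %>%
  select(name, totfam, bigfam)
fam
```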

History Fact: %>% comes from magrittr; dplyr's original chaining operator was %.%

https://github.com/smbache/magrittr

Hadley Wickham's Packages

library(dplyr)

select : Choose variables

filter : Choose cases

mutate : Change values

summarize : Aggregate values

group_by : Create groups

arrange : Order cases

History Fact: dplyr is an update of plyr for data tables

library(tidyr)
spread, gather, separate, bind_rows

library(lubridate)
library(stringr)

Descriptive Statistics

Summarize

base
summary(tt)
mean(tt$age)
mean(tt$age , na.rm = T )
sd(tt$age , na.rm = T )

dplyr
summarize(tt, xbar=mean(age, na.rm=T))
summarize(tt, n=n(), sd=sd(sibsp))

descr Package

library(descr)
freq(tt$pclass)
freq(tt$pclass.f)
freq(tt$age)
CrossTable(tt$pclass, tt$survived)
CrossTable(tt$pclass, tt$survived,
    prop.t = F, prop.c = F, prop.r = T,   # T is default for all three
    digits = 2 )

Pivot Table

library(dplyr) library(tidyr)

tt %>%

group_by(pclass, gender) %>%

summarize( pct = mean(survived) ) %>%

spread( gender, pct )
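A runnable sketch of the same pivot on hypothetical data. It assumes dplyr 1.0 or later (for the .groups argument) and tidyr 1.0 or later, and it uses pivot_wider(), the newer tidyr counterpart to spread(), rather than the call on the slide.

```r
library(dplyr)
library(tidyr)

tt <- data.frame(                                  # hypothetical passengers
  pclass   = c(1, 1, 2, 2, 1, 2),
  gender   = c("female", "male", "female", "male", "female", "male"),
  survived = c(1, 0, 1, 0, 1, 1))

pivot <- tt %>%
  group_by(pclass, gender) %>%
  summarize(pct = mean(survived), .groups = "drop") %>%   # one row per class/gender pair
  pivot_wider(names_from = gender, values_from = pct)     # genders become columns

pivot   # one row per pclass, with a female and a male column
```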

ggplot2

plot( tt$age)
plot( tt$age, tt$fare )

library(ggplot2)
qplot(age, data=tt)
qplot(age, fare, data=tt)

See full documentation at:

www.ggplot2.org

History Fact: Created by Hadley Wickham, based on the book "The Grammar of Graphics" by Leland Wilkinson

Levels of Measurement

plot( tt$pclass, tt$survived )
tt$survived.f <- factor(tt$survived, labels = c("Died","Survived") )
levels(tt$gender) <- c("Males", "Females")   # new labels must follow the order shown by levels(tt$gender)
plot( tt$pclass.f, tt$survived.f )

qplot

qplot(pclass.f, data=tt, fill=survived.f)
qplot(age,fare, data=tt, color=survived.f)
qplot(age, data=tt, fill = survived.f, alpha = I(0.3), position = "identity")

qplot( x, y, data=,
    color=, shape=, size=, fill=,
    method=, formula=,
    alpha=,          # transparency
    geom=,           # type
    facets=,         # matrix
    xlim=, ylim=,    # axis ranges
    xlab=, ylab=,    # axis labels
    main=, sub=      # titles
)

Referring to Variables

mean( tt$age )
qplot( age, data=tt )
select( tt, age )
attach(titanic)

names(df) <- c( "var1","var2","var3" )
names(df)[ 1 ] <- "var1"
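A small sketch of renaming with names(). The data frame and the new column names are hypothetical.

```r
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)   # hypothetical data frame

names(df) <- c("var1", "var2", "var3")        # rename every column at once
names(df)[2] <- "second"                      # rename one column by position
names(df)[names(df) == "var3"] <- "third"     # rename one column by its old name

names(df)                                     # "var1" "second" "third"
```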

Statistical Analysis

Writing Models

Statistical Equation                               R Formula
Yᵢ = β₀ + β₁Xᵢ + εᵢ                                Y ~ X
Yᵢ = β₀ + β₁Xᵢ + β₂Zᵢ + εᵢ                         Y ~ X + Z
Yᵢ = β₀ + β₁Xᵢ + β₂Zᵢ + β₃XᵢZᵢ + εᵢ                Y ~ X * Z   or   Y ~ X + Z + X:Z

t.test( fare ~ gender, data = tt )

: interaction

* factorial

~ predicted from

+ include
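The formula operators can be tried with lm() on mtcars, a dataset that ships with R. Here mpg, wt and hp stand in for Y, X and Z; that pairing is an assumption for illustration only.

```r
m1 <- lm(mpg ~ wt,              data = mtcars)   # Y ~ X
m2 <- lm(mpg ~ wt + hp,         data = mtcars)   # Y ~ X + Z
m3 <- lm(mpg ~ wt * hp,         data = mtcars)   # Y ~ X * Z
m4 <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)   # the same model, written out

names(coef(m3))                  # "(Intercept)" "wt" "hp" "wt:hp"
all.equal(coef(m3), coef(m4))    # TRUE
```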

Analysis Objects

tt.anova <- aov(fare ~ gender*pclass , data=tt )
summary(tt.anova)

tt.logistic <- glm( survived ~ gender + pclass + gender:pclass + age + child ,
    family = binomial, data = tt )
summary(tt.logistic)

gender*pclass is the same as gender + pclass + gender:pclass

More with Analysis Objects
plot(tt.logistic)
tt.pred <- predict(tt.logistic)
tt.resid <- residuals(tt.logistic)
plot(tt.pred, tt.resid)
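A sketch of working with an analysis object, again assuming the built-in mtcars data in place of titanic. The model of am (transmission type) from wt (weight) is purely illustrative.

```r
fit <- glm(am ~ wt, family = binomial, data = mtcars)

class(fit)                                # "glm" "lm"
coef(fit)                                 # intercept and slope, on the log-odds scale

pred <- predict(fit)                      # log-odds (the default type = "link")
prob <- predict(fit, type = "response")   # fitted probabilities between 0 and 1
res  <- residuals(fit)                    # deviance residuals by default

length(pred) == nrow(mtcars)              # TRUE: one prediction per row
```

Storing the model in an object means that summary(), predict(), residuals() and plot() can all reuse the same fit without re-estimating it.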

More Capabilities

R Markdown

Analysis Environments

R Commander

Separate Interface

More/Better Statistics

www.rcommander.com

install.packages("Rcmdr")
library(Rcmdr)

Deducer

Adds to R Interface (not RStudio!)

Easier Data Management

www.deducer.org

install.packages("Deducer")
library(Deducer)

Deducer Plot Builder

Data Mining GUI

install.packages("rattle")
require("rattle")
rattle()

Next Steps

Finding Packages

Swirl

install.packages("swirl")
require("swirl")
install_from_swirl("Course")
swirl()

http://swirlstats.com

Tutorials

http://dataservices.gmu.edu/software/r

http://tryr.codeschool.com/

https://www.datacamp.com/

Slides are available at:

http://dataservices.gmu.edu/workshops/r

[email protected]

[email protected]

© 2015 by Debby Kermer, Mason Library Data Services

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-nc-sa/4.0/
