Introduction to R
by Debby Kermer
George Mason University
Library Data Services Group
http://dataservices.gmu.edu
http://dataservices.gmu.edu/workshop/r
1. Create a folder with your name on the S: Drive
2. Copy titanic.csv from R Workshop Files to that folder
History
R ≈ S ≈ S-Plus
Open Source, Free
www.r-project.org
Comprehensive R Archive Network
Download R from:
cran.rstudio.com
RStudio: www.rstudio.com
R Console
Console
See…
> prompt for new command
+ waiting for rest of command
R "guesses" whether you are done, considering () and symbols
Type…
↑ [Up] to get previous command
Type in everything in Courier New font:
3+2
3-
2
Objects
nine <- 9
nine
three <- nine / 3
three
my.school <- "gmu"
my.school
Historical Conventions
Use <- to assign values
Use . to separate names
Current Capabilities
= is okay now in most cases
_ is okay now in most cases
RStudio: Press Alt - (minus) to insert "assignment operator"
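The two assignment styles can be compared side by side. This is a minimal sketch; the names nine_eq, my_school and avg are made up for illustration and are not part of the workshop files. At the top level <- and = store a value in the same way, while inside a function call = names an argument instead.

```r
# Top-level assignment: both operators store the value 9.
nine <- 9
nine_eq = 9                      # hypothetical name, for comparison only
same_value <- identical(nine, nine_eq)

# Both naming styles are valid object names.
my.school <- "gmu"
my_school <- "gmu"               # hypothetical underscore variant

# Inside a call, = supplies the argument named x;
# it does not create an object called x.
avg <- mean(x = c(1, 2, 3))

print(same_value)
print(avg)
```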
Global Environment
Script Files
Vectors & Lists
numbers <- c(101,102,103,104,105)
numbers <- 101:105
numbers <- c(101:104,105)
(the three lines above give the same values)
numbers[ 2 ]
numbers[ c(2,4,5)]
numbers[-c(2,4,5)]
numbers[ numbers > 102 ]
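The indexing forms can be checked on the same five numbers. This is a sketch, and the list at the end (info) is a made-up illustration of the "Lists" half of the heading. One detail worth knowing: 101:105 is stored as integer while c(101, 102, ...) is stored as double, so the values are equal even though the storage type differs.

```r
numbers <- 101:105
from_c  <- c(101, 102, 103, 104, 105)

all(numbers == from_c)   # TRUE: same values
typeof(numbers)          # "integer"
typeof(from_c)           # "double"

second   <- numbers[2]               # one position: 102
picked   <- numbers[c(2, 4, 5)]      # several positions: 102 104 105
dropped  <- numbers[-c(2, 4, 5)]     # everything except those: 101 103
filtered <- numbers[numbers > 102]   # logical test: 103 104 105

# A list can hold items of different types (illustrative only).
info <- list(school = "gmu", ids = numbers)
info$school        # "gmu"
info[["ids"]][1]   # 101
```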
RStudio: Press Ctrl-Enter to run the current line
or press the Run button
Vector Variable
Files & Functions
Files Pane
1. Browse Drives
2. Choose the "S" Drive
3. Choose your folder
Working Directory
Functions
read.table( datafile, header=TRUE, sep = ",")
read.table : Function
datafile : Positional Argument
header=TRUE : Named Argument
sep = "," : Named Argument
Reading Files
Thus, these would be identical:
titanic <- read.table( datafile, header=TRUE, sep = "," )
titanic <- read.csv( datafile )
With a working directory set, the file name is enough:
titanic <- read.csv("titanic.csv")
If you have not set a working directory, use the whole path:
titanic <- read.csv("S:/name/titanic.csv")
titanic <- read.csv("S:\\name\\titanic.csv")
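The equivalence can be tried without the workshop file. The sketch below writes a tiny made-up CSV (toy.csv, with invented rows) to a temporary folder and reads it both ways. Note that read.csv also uses different defaults for quote, fill and comment.char, so on messier files the two calls may not match exactly.

```r
# Write a small, made-up CSV to a temporary folder.
csv_path <- file.path(tempdir(), "toy.csv")
writeLines(c("name,age,pclass", "Ann,29,1", "Bob,2,3"), csv_path)

# Full path, two ways of reading the same file.
via_table <- read.table(csv_path, header = TRUE, sep = ",")
via_csv   <- read.csv(csv_path)
same_result <- identical(via_table, via_csv)
print(same_result)

# With the working directory set, the bare file name is enough.
old_wd <- setwd(tempdir())      # setwd() returns the previous directory
via_relative <- read.csv("toy.csv")
setwd(old_wd)                   # restore the original directory
```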
RStudio: Import Dataset
Working with Data
Data Frames
str(titanic)   # think "structure"
titanic$pclass
titanic <- read.csv("titanic.csv", as.is = "name")
int / num = Numeric (Interval / Ratio)
Factor = Categorical (Nominal / Ordinal)
Factors - Categorical Variables
titanic$pclass.f <- factor( titanic$pclass,
    levels = c(1,2,3),                                   # current values
    labels = c("1st Class", "2nd Class", "3rd Class"),   # labels in the same order
    ordered = TRUE )                                     # ordinal variable
It is convention to add .f; you can also give a new name or overwrite the original.
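The effect of levels, labels and ordered can be seen on a short made-up vector (the values below are invented, not taken from titanic.csv). This is a sketch of the same factor() call on toy data.

```r
pclass <- c(3, 1, 2, 3, 1)   # toy values standing in for titanic$pclass

pclass_f <- factor(pclass,
                   levels  = c(1, 2, 3),
                   labels  = c("1st Class", "2nd Class", "3rd Class"),
                   ordered = TRUE)

levels(pclass_f)       # the labels, in level order
table(pclass_f)        # counts per class
as.integer(pclass_f)   # underlying codes: 3 1 2 3 1

# Because the factor is ordered, comparisons are meaningful.
below_third <- pclass_f < "3rd Class"
print(below_third)
```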
NA and NULL
Delete Variable
titanic$embarked <- NULL
Set Values to Missing
titanic$age[titanic$age == 99] <- NA
same thing while reading in data (note: na.strings applies to every column, not just age):
titanic <- read.csv("titanic.csv", na.strings = "99")
Ignore NAs Option
na.rm = TRUE primarily needed for base R functions
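A small sketch of the missing-value steps, using invented numbers rather than the Titanic data. It also shows that na.strings is applied to all columns when reading, so a legitimate 99 in another column (here a made-up fare) would be turned into NA as well.

```r
age <- c(29, 2, 99, 40, 99)   # 99 used as a missing-value code
age[age == 99] <- NA

n_missing    <- sum(is.na(age))          # 2
mean_with_na <- mean(age)                # NA, because NAs propagate
mean_no_na   <- mean(age, na.rm = TRUE)  # mean of 29, 2 and 40

# na.strings affects every column, not only age.
toy <- read.csv(text = "age,fare\n29,99\n99,10", na.strings = "99")
print(toy)
```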
Review
Words with Stuff
word (Object)
word[ stuff ] (Object Part)
word( stuff ) (Function)
"word" (String)
Words that are not Objects
TRUE or T
FALSE or F
NaN (Not a Number)
NA (Not Available)
NULL (Empty)
Inf (Infinity)
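These special words can be explored directly at the console. The sketch below also illustrates one reason many style guides prefer TRUE and FALSE over T and F: T and F are ordinary variables that can be overwritten, while TRUE and FALSE are reserved.

```r
not_a_number <- 0 / 0    # NaN
infinite     <- 1 / 0    # Inf

is.nan(not_a_number)     # TRUE
is.na(NaN)               # TRUE: NaN also counts as missing
is.nan(NA)               # FALSE: NA is not NaN
is.null(NULL)            # TRUE
length(NULL)             # 0: NULL is empty
length(NA)               # 1: NA is a value that happens to be missing

# T can be reassigned, which silently changes later code.
T <- FALSE
isTRUE(T)                # FALSE
rm(T)                    # remove our copy so T means TRUE again
isTRUE(T)                # TRUE
```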
Packages
Packages must be both Installed and Loaded
To Install: install.packages("name")
To Load: library( name ) or, require( name ) or, check the box
Confirm these are installed: dplyr, tidyr, descr, ggplot2
In the RStudio Packages pane: listed = Installed, checked = Loaded; use Install to add a package
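The same check can be done from the console instead of the Packages pane. This is one possible sketch using base R's installed.packages(); the install step is left commented out because it needs a network connection.

```r
wanted <- c("dplyr", "tidyr", "descr", "ggplot2")

# TRUE for each package found in the local library.
is_installed <- wanted %in% rownames(installed.packages())
names(is_installed) <- wanted
print(is_installed)

# Packages still to be installed (possibly none).
to_install <- wanted[!is_installed]
# install.packages(to_install)   # uncomment to install the missing ones
```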
Data Frame Alternatives
data.frame
data.table
package
dplyr
tbl_df
tbl_dt
package
titanic
library(dplyr)
tt <- tbl_df(titanic)
tt
str(tt)
History Fact Hadley Wickham,
who created dplyr, works at RStudio
Choose Variables
base
tt$name
tt[,"name"]
tt[,c("age","gender")]
dplyr
select( tt, name)
select( tt, -name)
select( tt, age, gender)
select( tt, gender : pclass)
select( tt, starts_with("p"))
contains starts_with ends_with matches distinct
Choose Cases
base
tt[tt$age < 5 , ]
titanic[titanic$age < 5 , ]
attach(titanic)
titanic[age < 5 , ]
titanic[age < 5 & !is.na(age) , ]
dplyr
filter(tt, age < 5 )
filter(tt, age < 5, pclass.f == "1st Class" )
filter(tt, age < 5 | pclass == 1 )
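The extra is.na() test in the base version matters because a missing age makes the comparison NA, and base bracket indexing returns an all-NA row for each NA. A sketch with invented rows shows the difference; subset() and which() are other base options that drop those rows, as dplyr's filter() does.

```r
tt <- data.frame(name = c("Ann", "Bob", "Cat", "Dan"),
                 age  = c(29, 2, NA, 4),
                 stringsAsFactors = FALSE)   # toy data, not titanic.csv

with_na    <- tt[tt$age < 5, ]                    # 3 rows: Bob, an NA row, Dan
no_na      <- tt[tt$age < 5 & !is.na(tt$age), ]   # 2 rows: Bob, Dan
via_subset <- subset(tt, age < 5)                 # 2 rows, NAs dropped
via_which  <- tt[which(tt$age < 5), ]             # 2 rows, NAs dropped

print(with_na)
print(no_na)
```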
Change data
base
tt$child <- tt$age <= 12
tt$totfam <- tt$sibsp + tt$parch
tt$bigfam <- as.numeric(tt$totfam > 4)
dplyr
tt <- mutate(tt, child = age <= 12)
tt <- mutate(tt, totfam = sibsp + parch,
                 bigfam = as.numeric(totfam > 4) )
Chaining / Piping
%>%
RStudio: Ctrl+Shift+M
Read "then…"
select(tt, name, age)
vs
tt %>% select(name, age)
tt %>% filter(age<5) %>% select(name, age)
works anytime the 1st argument is the dataset
History Fact: %>% comes from magrittr; dplyr originally used %.%
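A sketch of the same chain written nested and piped, on a small invented data frame (it assumes dplyr is installed). The piped form reads top to bottom as "take tt, then filter, then select".

```r
library(dplyr)

tt <- data.frame(name  = c("Ann", "Bob", "Cat", "Dan"),
                 age   = c(29, 2, 40, 4),
                 sibsp = c(1, 3, 0, 4),
                 parch = c(0, 2, 0, 2),
                 stringsAsFactors = FALSE)   # toy data

# Nested calls are read inside-out...
nested <- select(filter(tt, age < 5), name, age)

# ...while the pipe passes the result on as the first argument.
piped <- tt %>%
  filter(age < 5) %>%
  select(name, age)

same <- identical(nested, piped)
print(same)

# mutate() and arrange() chain in the same way.
with_family <- tt %>%
  mutate(totfam = sibsp + parch,
         bigfam = as.numeric(totfam > 4)) %>%
  arrange(desc(totfam))
print(with_family)
```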
Hadley Wickham's Packages
library(dplyr)
select : Choose variables
filter : Choose cases
mutate : Change values
summarize : Aggregate values
group_by : Create groups
arrange : Order cases
History Fact: dplyr is an update of plyr, focused on data frames
library(tidyr)
spread gather separate
bind_rows (provided by dplyr)
library(lubridate)
library(stringr)
Descriptive Statistics
Summarize
base
summary(tt)
mean(tt$age)
mean(tt$age , na.rm = T )
sd(tt$age , na.rm = T )
dplyr
summarize(tt, xbar=mean(age, na.rm=T))
summarize(tt, n=n(), sd=sd(sibsp))
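summarize() becomes more useful after group_by(), which makes it return one row per group. A sketch on invented data (assuming dplyr is installed):

```r
library(dplyr)

tt <- data.frame(pclass = c(1, 1, 2, 3, 3),
                 age    = c(29, NA, 40, 2, 4))   # toy data

by_class <- tt %>%
  group_by(pclass) %>%
  summarize(n    = n(),                      # rows per class
            xbar = mean(age, na.rm = TRUE))  # mean age, ignoring NAs

print(by_class)
```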
descr Package
library(descr)
freq(tt$pclass)
freq(tt$pclass.f)
freq(tt$age)
CrossTable(tt$pclass, tt$survived)
CrossTable(tt$pclass, tt$survived,
    prop.t = F, prop.c = F, prop.r = T,   # T is default for all
    digits = 2 )
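For comparison, a row-percentage cross-tabulation can also be sketched in base R with table() and prop.table(). This is not what descr does internally, only a rough equivalent on invented data.

```r
pclass   <- c(1, 1, 2, 3, 3, 3)   # toy values
survived <- c(1, 0, 1, 0, 0, 1)

counts  <- table(pclass, survived)                     # cell counts
row_pct <- round(prop.table(counts, margin = 1), 2)    # proportions within each row

print(counts)
print(row_pct)
```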
Pivot Table
library(dplyr)
library(tidyr)
tt %>%
group_by(pclass, gender) %>%
summarize( pct = mean(survived) ) %>%
spread( gender, pct )
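The same kind of pivot table can be sketched in base R with tapply(), shown here on invented data. In current tidyr releases spread() is superseded by pivot_wider(), so the last step of the chain could instead be written as pivot_wider(names_from = gender, values_from = pct).

```r
tt <- data.frame(
  pclass   = c(1, 1, 2, 2, 3, 3, 3),
  gender   = c("female", "male", "female", "male", "female", "male", "male"),
  survived = c(1, 0, 1, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)   # toy data, not titanic.csv

# Rows = pclass, columns = gender, cells = proportion surviving.
pivot <- with(tt, tapply(survived, list(pclass, gender), mean))
print(pivot)
```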
ggplot2
plot( tt$age)
plot( tt$age, tt$fare )
library(ggplot2)
qplot(age, data=tt)
qplot(age, fare, data=tt)
See full documentation at:
www.ggplot2.org
History Fact Created by
Hadley Wickham, based on the book
"Grammar of Graphics" by Leland Wilkinson
Levels of Measurement
plot( tt$pclass, tt$survived )
tt$survived.f <- factor(tt$survived, labels = c("Died","Survived") )
levels(tt$gender) <- c("Males", "Females")   # gender must be a factor; new names follow the order of levels(tt$gender)
plot( tt$pclass.f, tt$survived.f )
qplot
qplot(pclass.f, data=tt, fill=survived.f)
qplot(age,fare, data=tt, color=survived.f)
qplot(age, data=tt, fill = survived.f, alpha = I(0.3), position = "identity")
qplot( x, y, data=,
    color=, shape=, size=, fill=,
    method=, formula=,
    alpha=,           #transparency
    geom=,            #type
    facets=,          #matrix
    xlim=, ylim=,     #axis ranges
    xlab=, ylab=,     #axis labels
    main=, sub=       #titles
)
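qplot() is deprecated in recent ggplot2 releases (3.4.0 and later), so it may help to see how two of the calls above could be written with ggplot(). This is a sketch on invented data, assuming ggplot2 is installed; the bins value is an arbitrary choice.

```r
library(ggplot2)

tt <- data.frame(
  age        = c(29, 2, 40, 4),
  fare       = c(211, 151, 26, 8),
  survived.f = factor(c(1, 0, 0, 1), labels = c("Died", "Survived"))
)   # toy data

# Roughly qplot(age, fare, data=tt, color=survived.f)
p_scatter <- ggplot(tt, aes(x = age, y = fare, color = survived.f)) +
  geom_point()

# Roughly qplot(age, data=tt, fill=survived.f, alpha=I(0.3), position="identity")
p_hist <- ggplot(tt, aes(x = age, fill = survived.f)) +
  geom_histogram(alpha = 0.3, position = "identity", bins = 10)

# print(p_scatter); print(p_hist)   # uncomment to draw the plots
```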
Referring to Variables
mean( tt$age )
qplot( age, data=tt )
select( tt, age )
attach(titanic)
names(df) <- c( "var1","var2","var3" )
names(df)[ 1 ] <- "var1"
Statistical Analysis
Writing Models
Statistical Equation                                        R Formula
Y_i = β_0 + β_1 X_i + ε_i                                   Y ~ X
Y_i = β_0 + β_1 X_i + β_2 Z_i + ε_i                         Y ~ X + Z
Y_i = β_0 + β_1 X_i + β_2 Z_i + β_3 X_i Z_i + ε_i           Y ~ X * Z  or  Y ~ X + Z + X:Z
t.test( fare ~ gender, data = tt )
: interaction
* factorial
~ predicted from
+ include
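The formula operators can be checked with lm() on simulated data. In this sketch the variables X, Z and Y are generated at random (they are not Titanic columns), and the two ways of writing the interaction model give the same coefficients.

```r
set.seed(1)
d <- data.frame(X = rnorm(50), Z = rnorm(50))          # simulated predictors
d$Y <- 1 + 2 * d$X - d$Z + 0.5 * d$X * d$Z + rnorm(50) # simulated outcome

fit_star <- lm(Y ~ X * Z,         data = d)   # factorial shorthand
fit_long <- lm(Y ~ X + Z + X:Z,   data = d)   # main effects plus interaction

names(coef(fit_star))   # "(Intercept)" "X" "Z" "X:Z"
same_fit <- isTRUE(all.equal(coef(fit_star), coef(fit_long)))
print(same_fit)
```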
Analysis Objects
tt.anova <- aov(fare ~ gender*pclass , data=tt )
summary(tt.anova)
tt.logistic <- glm( survived ~ gender + pclass + gender:pclass + age + child ,
                    family = binomial, data = tt )
summary(tt.logistic)
(gender*pclass and gender + pclass + gender:pclass are the same)
More with Analysis Objects
plot(tt.logistic)
tt.pred <- predict(tt.logistic)
tt.resid <- residuals(tt.logistic)
plot(tt.pred, tt.resid)
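A detail that is easy to miss with logistic models: predict() on a glm object returns values on the link (log-odds) scale by default, and residuals() returns deviance residuals. The sketch below uses simulated data (the variable female and the coefficients are invented) to show how to get probabilities instead.

```r
set.seed(42)
n <- 200
d <- data.frame(age    = runif(n, 1, 70),
                female = rbinom(n, 1, 0.4))             # simulated predictors
p_true <- plogis(-1 + 2 * d$female - 0.02 * d$age)      # assumed true model
d$survived <- rbinom(n, 1, p_true)

fit <- glm(survived ~ female + age, family = binomial, data = d)

link <- predict(fit)                      # default: log-odds
prob <- predict(fit, type = "response")   # probabilities between 0 and 1
res  <- residuals(fit)                    # default: deviance residuals

# The two prediction scales are related by the inverse logit.
scales_agree <- isTRUE(all.equal(prob, plogis(link)))
print(scales_agree)
```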
More Capabilities
R Markdown
Analysis Environments
R Commander Separate Interface
More/Better Statistics
www.rcommander.com
install.packages("Rcmdr")
library(Rcmdr)
Deducer Adds to R Interface (not RStudio!)
Easier Data Management
www.deducer.org
install.packages("Deducer")
library(Deducer)
Deducer Plot Builder
Data Mining GUI
install.packages("rattle")
require ("rattle")
rattle()
Next Steps
Finding Packages
Swirl
install.packages("swirl")
require ("swirl")
install_from_swirl("Course")
swirl()
http://swirlstats.com
Tutorials
http://dataservices.gmu.edu/software/r
http://tryr.codeschool.com/
https://www.datacamp.com/
Slides are available at:
http://dataservices.gmu.edu/workshops/r
© 2015 by Debby Kermer, Mason Library Data Services
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-nc-sa/4.0/