Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with...

35
Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN www.amelia.mn University of St Thomas Department of Computer and Information Sciences

Transcript of Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with...

Page 1: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

Working with categorical data in R without losing your mind

Amelia McNamara @AmeliaMN www.amelia.mn

University of St Thomas Department of Computer and Information Sciences

Page 2: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

https://peerj.com/collections/50-practicaldatascistats/

Page 3: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

https://peerj.com/collections/50-practicaldatascistats/

• Data organization in spreadsheets

• Packaging data analytical work reproducibly using R (and friends)

• Forecasting at scale

• How to share data for collaboration

• Opinionated analysis development

• Wrangling categorical data in R

• Lessons from between the white lines for isolated data scientists

• Teaching stats for data science

• Documenting and evaluating Data Science contributions in academic promotion in Departments of Statistics and Biostatistics

• Modeling offensive player movement in professional basketball

• Excuse me, do you have a moment to talk about version control?

• The democratization of data science education

• Extending R with C++: A Brief Introduction to Rcpp

• How R helps Airbnb make the most of its data

• Infrastructure and tools for teaching computing throughout the statistical curriculum

• Declutter your R workflow with tidy tools

Page 4: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

https://peerj.com/collections/50-practicaldatascistats/

• Data organization in spreadsheets

• Packaging data analytical work reproducibly using R (and friends)

• Forecasting at scale

• How to share data for collaboration

• Opinionated analysis development

• Wrangling categorical data in R

• Lessons from between the white lines for isolated data scientists

• Teaching stats for data science

• Documenting and evaluating Data Science contributions in academic promotion in Departments of Statistics and Biostatistics

• Modeling offensive player movement in professional basketball

• Excuse me, do you have a moment to talk about version control?

• The democratization of data science education

• Extending R with C++: A Brief Introduction to Rcpp

• How R helps Airbnb make the most of its data

• Infrastructure and tools for teaching computing throughout the statistical curriculum

• Declutter your R workflow with tidy tools

Page 5: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

http://www.amelia.mn/sds220/project.html

Page 6: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

http://gss.norc.org/

Page 7: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

factors

Adapted from 'Master the tidyverse' CC by RStudio

R’s representation of categorical data. Consists of: 1. A set of values 2. An ordered set of valid levels

eyes <- factor(x = c("blue", "green", "green"), levels = c("blue", "brown", "green"))eyes## [1] blue green green## Levels: blue brown green

Page 8: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 9: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

https://github.com/dsscollection/factor-mgmt

Page 10: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

Me @askdrstats

http://bit.ly/WranglingCats

Page 11: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 12: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

😱

Page 13: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 14: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 15: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/

Page 17: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 18: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

But sometimes, you still need factors…

In particular, for modeling (changing reference levels, etc) and plotting

(reordering elements)

Page 19: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

https://github.com/COSTDataExpo2013/AmeliaMN

Page 20: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 21: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 22: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 23: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 24: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

https://github.com/tidyverse/forcats

Page 25: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

Level manipulation functions

CC BY Charlotte Wickham

fct_recode() Relabel levels "by hand"

fct_relevel() Reorder levels "by hand"

fct_reorder() Reorder levels by another variable

fct_collapse() Collapse levels "by hand"

fct_lump() Lump levels with small counts together

fct_other() Replace levels with "Other"

Values change to match levels

Page 26: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

SUMMARY STATISTICS: one continuous variable: mosaic::mean(~mpg, data=mtcars)

one categorical variable: mosaic::tally(~cyl, data=mtcars)

two categorical variables: mosaic::tally(cyl~am, data=mtcars)

one continuous, one categorical: mosaic::mean(mpg~cyl, data=mtcars)

SUMMARY STATISTICS:

one continuous variable: mean(mtcars$mpg)

one categorical variable: table(mtcars$cyl)

two categorical variables: table(mtcars$cyl, mtcars$am)

one continuous, one categorical: mean(mtcars$mpg[mtcars$cyl==4]) mean(mtcars$mpg[mtcars$cyl==6]) mean(mtcars$mpg[mtcars$cyl==8])

PLOTTING: one continuous variable:

hist(mtcars$disp)

boxplot(mtcars$disp)

one categorical variable: barplot(table(mtcars$cyl))

two continuous variables: plot(mtcars$disp, mtcars$mpg)

two categorical variables: mosaicplot(table(mtcars$am, mtcars$cyl))

one continuous, one categorical: histogram(mtcars$disp[mtcars$cyl==4]) histogram(mtcars$disp[mtcars$cyl==6]) histogram(mtcars$disp[mtcars$cyl==8])

boxplot(mtcars$disp[mtcars$cyl==4]) boxplot(mtcars$disp[mtcars$cyl==6]) boxplot(mtcars$disp[mtcars$cyl==8])

WRANGLING: subsetting:

mtcars[mtcars$mpg>30, ]

making a new variable: mtcars$efficient[mtcars$mpg>30] <- TRUE mtcars$efficient[mtcars$mpg<30] <- FALSE

R Syntax Comparison : : CHEAT SHEET

RStudio® is a trademark of RStudio, Inc. • CC BY Amelia McNamara • [email protected] • @AmeliaMN • science.smith.edu/~amcnamara/ • Updated: 2018-01

Dollar sign syntax

tilde

Formula syntax Tidyverse syntaxgoal(data$x, data$y) goal(y~x|z, data=data, group=w) data %>% goal(x)

PLOTTING: one continuous variable: lattice::histogram(~disp, data=mtcars)

lattice::bwplot(~disp, data=mtcars)

one categorical variable: mosaic::bargraph(~cyl, data=mtcars)

two continuous variables: lattice::xyplot(mpg~disp, data=mtcars)

two categorical variables: mosaic::bargraph(~am, data=mtcars, group=cyl)

one continuous, one categorical: lattice::histogram(~disp|cyl, data=mtcars)

lattice::bwplot(cyl~disp, data=mtcars)

SUMMARY STATISTICS: one continuous variable:

mtcars %>% dplyr::summarize(mean(mpg))

one categorical variable: mtcars %>% dplyr::group_by(cyl) %>% dplyr::summarize(n())

two categorical variables: mtcars %>% dplyr::group_by(cyl, am) %>% dplyr::summarize(n())

one continuous, one categorical: mtcars %>% dplyr::group_by(cyl) %>%

dplyr::summarize(mean(mpg))

PLOTTING: one continuous variable: ggplot2::qplot(x=mpg, data=mtcars, geom = "histogram")

ggplot2::qplot(y=disp, x=1, data=mtcars, geom="boxplot")

one categorical variable: ggplot2::qplot(x=cyl, data=mtcars, geom="bar")

two continuous variables: ggplot2::qplot(x=disp, y=mpg, data=mtcars, geom="point")

two categorical variables: ggplot2::qplot(x=factor(cyl), data=mtcars, geom="bar") +

facet_grid(.~am)

one continuous, one categorical: ggplot2::qplot(x=disp, data=mtcars, geom = "histogram") +

facet_grid(.~cyl)

ggplot2::qplot(y=disp, x=factor(cyl), data=mtcars, geom="boxplot")

WRANGLING: subsetting: mtcars %>% dplyr::filter(mpg>30)

making a new variable: mtcars <- mtcars %>% dplyr::mutate(efficient = if_else(mpg>30, TRUE, FALSE))

the pipe

The variety of R syntaxes give you many ways to “say” the same thing

read across the cheatsheet to see how different syntaxes approach the same problem

https://www.rstudio.com/resources/cheatsheets/

Page 27: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

Compact but fragile (base R)

Robust but verbose (base R)

Direct and robust (tidyverse)

Page 28: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 29: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 30: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 31: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

Defensive coding

Page 32: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 33: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of
Page 34: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

Takeaways: • Use forcats• Practice defensive coding • summary() is your friend • assertthat and testthat • Check out http://bit.ly/WranglingCats

Page 35: Working with categorical data in R without losing your mind · 2020. 7. 7. · Working with categorical data in R without losing your mind Amelia McNamara @AmeliaMN University of

Thank youAmelia McNamara @AmeliaMN

www.amelia.mn University of St Thomas Department of Computer and Information Sciences