
Introduction Key attributes Capabilities Examples Summary

Why I love R

Alastair Sanderson

School of Physics & Astronomy, University of Birmingham

2012-03-20


Outline

1 Introduction

2 Key attributes

3 Capabilities

4 Examples

5 Summary


Outline

Key attributes of R:

free/open source

widely used

well documented

a high level programming language

Capabilities of R:

data handling

statistical/numerical analysis & modelling

data visualisation

reproducible research

Some examples using R

Summary


R is... Freely available

Free/open source software

... with free/open source IDEs (integrated development environments), e.g.:

RStudio: http://rstudio.org/

Emacs Speaks Statistics (ESS): http://ess.r-project.org/

According to John Chambers, in his excellent book "Software for Data Analysis - Programming with R", the mission of R is

... to enable the best and most thorough exploration of data possible

... with the associated prime directive, that

the computations and the software should be trustworthy: they should do what they claim, and be seen to do so.


R is... Widely used (1/2)

Mature software (v1.0 released in 2000); runs on a wide range of platforms; with an annual development update cycle

Large & growing user base (1-2 million)

Many user contributed packages (>3500), very easy to install:

install.packages("mypkg")

library("mypkg")

Figure: Google searches for "R plot": solidly growing interest


R is... Widely used (2/2)

R is used by major institutions, e.g. Google, Facebook, NYTimes, New Scientist etc.

Track R's popularity: http://r4stats.com/popularity

R is the favourite tool for users of Kaggle (a platform for predictive modelling and analytics competitions): http://blog.kaggle.com/2011/11/27/kagglers-favorite-tools


R is... Well documented

Excellent help pages, richly cross-linked:

help(package="base") # List package contents

?data.frame # help on a specific task

?Syntax # help on a general topic

news(package="ggplot2") # details of recent changes

demo("plotmath") # demonstrate maths annotation

browseVignettes() # view index of vignettes in web browser

See "documentation" links at http://www.r-project.org for manuals, wiki, R journal, books etc.

CRAN "task views": http://cran.r-project.org/web/views/

Type function name without brackets to view its R source code

Many R bloggers, aggregated at http://www.r-bloggers.com/
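As a small illustration of the tip above about viewing source code, the sketch below prints a base R function and captures its text. The choice of `sd` and `sum` is only for illustration; functions implemented in compiled code show a short stub rather than R source.

```r
# Typing a function name without brackets prints its definition
sd

# The same text can be captured as a character vector with deparse()
src <- deparse(sd)
length(src) > 0

# Primitive functions such as sum() are implemented in C,
# so printing them shows only a stub rather than R code
is.primitive(sum)
```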


R is... A high level programming language

Functional, object oriented language:

http://cran.r-project.org/doc/manuals/R-lang.html

http://cran.r-project.org/doc/manuals/R-ints.html

Debugger, code profiling (Rprof) etc.: http://cran.r-project.org/doc/manuals/R-exts.html

Can link to compiled code (C, C++, Fortran), see "?Foreign"; e.g. seamless integration of R with C++: http://cran.r-project.org/web/packages/Rcpp/index.html

Support for parallel computing: http://cran.r-project.org/web/views/HighPerformanceComputing.html

Writing packages in R: http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf
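To make "functional, object oriented" a little more concrete, here is a minimal sketch using a closure and the S3 class system. All names (`make_power`, `area`, the `circle` and `rectangle` classes) are invented for this example and do not come from the talk.

```r
# Functional style: functions are values and can be returned from functions (closures)
make_power <- function(p) {
  function(x) x^p
}
square <- make_power(2)
squares <- sapply(1:5, square)   # apply the closure to each element

# S3 object orientation: a generic function dispatches on the class attribute
area <- function(shape, ...) UseMethod("area")
area.circle <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h

shapes <- list(
  structure(list(r = 1), class = "circle"),
  structure(list(w = 2, h = 3), class = "rectangle")
)
areas <- sapply(shapes, area)    # each element is routed to its own method
```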


Data handling

Data input/output: http://cran.r-project.org/doc/manuals/R-data.html

Data structures: e.g. vectors, factors, arrays/matrices, lists/data frames: http://cran.r-project.org/doc/manuals/R-intro.html

?Extract # details of operators to extract/replace parts of data structures

?apply; ?sapply; ?aggregate; ?sweep # vectorised operations in R

?subset; ?transform; ?match; ?merge # data access & join commands

?regex # details of regular expressions capabilities

install.packages("stringr") # make it easier to work with strings

install.packages("RODBC") # Open Database Connectivity interface

Very powerful & convenient data manipulation packages:

plyr : http://cran.r-project.org/web/packages/plyr/index.html

reshape2 : http://cran.r-project.org/web/packages/reshape2/index.html
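The base R commands listed on this slide can be tried on a small data frame. The sketch below uses made-up values and hypothetical column names (`id`, `type`, `mass`, `nobs`) purely for illustration.

```r
# Toy data (made up for illustration)
stars <- data.frame(
  id   = 1:6,
  type = factor(c("dwarf", "giant", "dwarf", "giant", "dwarf", "giant")),
  mass = c(0.8, 5.0, 1.1, 8.0, 0.5, 6.5)
)
obs <- data.frame(id = c(1L, 2L, 3L, 5L), nobs = c(10, 3, 7, 2))

massive     <- subset(stars, mass > 1)                  # select rows by condition
stars       <- transform(stars, log_mass = log10(mass)) # add a derived column
mean_mass   <- aggregate(mass ~ type, data = stars, FUN = mean)  # group-wise mean
joined      <- merge(stars, obs, by = "id")             # inner join on the id column
col_classes <- sapply(stars, class)                     # apply a function to each column
```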


Statistical analysis & modelling

A brief summary:

?Distributions # details of supported statistical distributions

?RNG # details of random number generation

help(package="stats") # list contents of base stats package

?NA # support for missing data

library("cluster") # cluster analysis

library("boot") # bootstrap resampling

library("survival") # survival analysis

?lm; ?nls; ?anova # linear/non-linear regression/ANOVA

See also these CRAN task views:

http://cran.r-project.org/web/views/Distributions.html

http://cran.r-project.org/web/views/Multivariate.html

http://cran.r-project.org/web/views/MachineLearning.html

http://cran.r-project.org/web/views/Spatial.html

http://cran.r-project.org/web/views/Robust.html
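A brief sketch of a few of the functions named above, run on simulated data. The model, the sample size and the seed are arbitrary choices made for this example, not taken from the talk.

```r
set.seed(42)                              # make the simulated data reproducible
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.3)   # true intercept 2, true slope 0.5

fit <- lm(y ~ x)                          # linear regression
coef(fit)                                 # estimated intercept and slope
anova(fit)                                # ANOVA table for the fitted model

y[c(3, 7)] <- NA                          # introduce two missing values
mean(y)                                   # NA: missing values propagate by default
mean(y, na.rm = TRUE)                     # drop the NAs explicitly
```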


Numerical analysis & modelling

Wide variety of capabilities, e.g:

?integrate # numerical integration

?optim; ?nlminb # general-purpose optimisation

?D # symbolic differentiation

install.packages("deSolve") # solve differential equations

library("Matrix") # sparse and dense matrix classes and methods

?spline; ?smooth.spline # spline interpolation

library("splines") # regression spline functions and classes

?prcomp # Principal Components Analysis

See also the following CRAN task views:

http://cran.r-project.org/web/views/Optimization.html

http://cran.r-project.org/web/views/HighPerformanceComputing.html

http://cran.r-project.org/web/views/ChemPhys.html
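Several of the base functions listed on this slide can be exercised in a few lines. The integrand, objective function and expression below are arbitrary examples chosen for this sketch.

```r
# Numerical integration: area under the standard normal density between -1.96 and 1.96
prob <- integrate(dnorm, lower = -1.96, upper = 1.96)$value

# General-purpose optimisation: minimise a quadratic whose minimum is at (1, -2)
f <- function(p) (p[1] - 1)^2 + (p[2] + 2)^2
opt <- optim(par = c(0, 0), fn = f)

# Symbolic differentiation of sin(x)^2 with respect to x
dfdx <- D(expression(sin(x)^2), "x")
slope_at_0.3 <- eval(dfdx, list(x = 0.3))   # evaluate the derivative at x = 0.3

# Spline interpolation through a handful of points
pts <- spline(x = 1:5, y = (1:5)^2, n = 9)
```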


Data visualisation

help(package="graphics") # contents of base graphics package

?Devices # details of available output devices

library(lattice) # excellent for highly structured data

library(grid) # lower-level hierarchical graphics

install.packages("ggplot2") # fantastic graphics package!

install.packages("rgl") # interactive graphics

CRAN task view on graphics: http://cran.r-project.org/web/views/Graphics.html

ggplot2 web resources:

http://had.co.nz/ggplot2/

http://crantastic.org/packages/ggplot2

see Andy's talk
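As a minimal sketch of base graphics and output devices (see ?Devices), the following draws a simple curve to a PDF file. The temporary file name and the plotted function are arbitrary choices for this example.

```r
# Draw a base graphics plot to a PDF device, writing to a temporary file
out <- tempfile(fileext = ".pdf")
pdf(out, width = 6, height = 4)          # open the output device
x <- seq(0, 2 * pi, length.out = 100)
plot(x, sin(x), type = "l", xlab = "x", ylab = "sin(x)")
idx <- seq(1, 100, by = 10)
points(x[idx], sin(x[idx]), pch = 19)    # overlay every tenth point
dev.off()                                # close the device to finish writing the file
```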


Reproducible research

"The term reproducible research was first proposed by Jon Claerbout at Stanford University and refers to the idea that the ultimate product of research is the paper along with the full computational environment used to produce the results in the paper such as the code, data, etc. necessary for reproduction of the results and building upon the research."

quote from http://en.wikipedia.org/wiki/Reproducibility

Sweave (see "?Sweave"): http://www.statistik.lmu.de/~leisch/Sweave/

xtable package - export tables to LaTeX or HTML

Using Emacs Org mode (http://orgmode.org) with R:

http://orgmode.org/worg/org-contrib/babel/languages/ob-doc-R.html

http://orgmode.org/worg/org-contrib/babel/uses.html

RC package (see Ian's talk next meeting)

CRAN task view: http://cran.r-project.org/web/views/ReproducibleResearch.html


Hybrid R plot/image graphics

Figure: Integrating spatial data with a map, using ggplot2 in R

http://blog.revolutionanalytics.com/2012/02/what-are-the-most-popular-bike-routes-in-london.html


A New York Times graphic using R

Figure: Michael Jackson's billboard rankings vs. the Beatles (top, in red) and U2 (bottom, in red)

These charts were done mostly in R and were published within hours of Michael Jackson's death:

http://blog.revolutionanalytics.com/2009/06/nyt-charts-michael-jacksons-pop-hits.html

The full, interactive graphic is here:

http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html


Dates & time series

> dt <- as.Date("2012-03-20")

> dt - as.Date("01/Jan/12", format="%d/%b/%y")

Time difference of 79 days

> months(dt)

[1] "March"

> weekdays(dt)

[1] "Tuesday"

For more information:

?DateTimeClasses; ?Dates

Time series handling & modelling:

?acf # auto-correlation

?ccf # cross-correlation

?arima # Fit ARIMA models to univariate time series

install.packages("zoo") # } very useful time series

install.packages("forecast") # } packages

CRAN task view: http://cran.r-project.org/web/views/TimeSeries.html
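The time series functions above can be tried on simulated data. This is only a sketch: the AR(1) coefficient, series length and seed are arbitrary choices for the example.

```r
set.seed(1)
# Simulate 500 points from an AR(1) process with coefficient 0.7
x <- arima.sim(model = list(ar = 0.7), n = 500)

r <- acf(x, plot = FALSE)             # auto-correlation function, without plotting
fit <- arima(x, order = c(1, 0, 0))   # fit an AR(1) model to the series
coef(fit)[["ar1"]]                    # estimate of the AR coefficient (true value 0.7)

# Date sequences are built with seq(): four dates, one week apart
days <- seq(as.Date("2012-03-20"), by = "week", length.out = 4)
```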


Visualising multivariate galaxy data

2D kernel density estimates; semi-transparency; colour/shape/size encoding

http://www.sr.bham.ac.uk/~ajrs/talks/SandersonAlastair_user2011_talk.pdf

Example (galaxy spatial distribution)

Figure: Declination against Right Ascension, with legends for Luminosity (Solar), Scaled velocity and Morphology (Early, Late, ?)

Example (histogram of galaxy velocities)

Figure: Number of galaxies against Galaxy velocity (km/s), by Morphology (Early, Late, ?)


World Bank data demo - part 1

Load package, download & save data

require("WDI") # load R package to access World Bank data

MSdata <- WDI(indicator="IT.CEL.SETS.P2", start=1990, end=2010)

save(MSdata, file="MSdata.RData")

Plot curves for each country & median line for all countries

## Use shorter name for data column:

MSdata <- transform(MSdata, MSpc = IT.CEL.SETS.P2)

require(ggplot2)

ggplot(data=MSdata, aes(year, MSpc, group=country)) +
  geom_line(alpha=0.1) + # semi-transparent lines: 10% of normal
  ## add median curve for all countries (i.e. over-ride grouping);
  ## "fun" replaces the "fun.y" argument used by ggplot2 before version 3.3.0:
  stat_summary(fun=median, geom="line", aes(group=1), colour="blue") +
  scale_y_log10() +
  ylab("Mobile cellular subscriptions per 100 people")


Fraction of population with mobile phones

Figure: Mobile cellular subscriptions per 100 people (log scale) against year, 1990-2010


World Bank data demo - part 2

Plot 20 most recent countries to get mobile phones

require(plyr) # extremely useful package

## Year when a country with no subscriptions first registered some.
## Note: any(MSpc>0) is evaluated over the whole data set, not per country,
## so it does not by itself exclude countries that never register any.
firstMS <- ddply(subset(MSdata, MSpc==0 & any(MSpc>0)), .(country),
                 summarise, year1 = max(year) + 1)

## sort by the year mobile phone use first starts:
firstMS <- firstMS[order(firstMS$year1), ]

## Create dotplot:
ggplot(data=tail(firstMS, 20), aes(year1, reorder(country, year1))) +
  geom_point() +
  xlab("Year of first mobile cellular subscriptions") + ylab("")
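The "first year with subscriptions" step can also be sketched in base R on a toy data set, which makes the per-country logic explicit and needs neither a download nor extra packages. The countries, years and values below are made up, and `first_year` is a hypothetical helper written for this example; it is not the code used in the talk.

```r
# Made-up data: subscriptions per 100 people for three hypothetical countries
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 4),
  year    = rep(2000:2003, times = 3),
  MSpc    = c(0, 0, 1.2, 3.0,    # A: first non-zero value in 2002
              0, 0.5, 0.9, 2.0,  # B: first non-zero value in 2001
              0, 0, 0, 0)        # C: never registers any subscriptions
)

# Year after the last zero-subscription year, for one country's rows;
# NA if the country has no zero years or no non-zero years
first_year <- function(d) {
  zero <- d$MSpc == 0
  if (!any(zero) || !any(d$MSpc > 0)) return(NA_integer_)
  max(d$year[zero]) + 1L
}

year1 <- sapply(split(toy, toy$country), first_year)  # one value per country
year1 <- sort(year1[!is.na(year1)])                   # drop non-qualifying, sort by year
```

Here the check for non-zero values is made within each country's rows, so a country such as "C" that never registers any subscriptions is dropped.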


Year that mobile phone usage starts

Figure: Dot plot of the year of first mobile cellular subscriptions (axis from 1998 to 2008) for 20 countries: Swaziland, Ethiopia, Nepal, Syrian Arab Republic, Chad, Liberia, Mauritania, Sierra Leone, Somalia, Afghanistan, Iraq, Mayotte, Micronesia, Fed. Sts., Sao Tome and Principe, Bhutan, Comoros, Guinea-Bissau, Eritrea, Tuvalu, Korea, Dem. Rep.


Conclusions

It is free/open source software

It has a rapidly growing user base (1-2 million)

It is widely used in academia, business & industry

"Today, all of the Fortune 500 companies use R for their data analyses"

http://www.r-bloggers.com/open-source-is-opening-data-to-predictive-analytics/

I love R because it empowers the individual by enabling cutting-edge processing, analysis & visualisation of data, based on trustworthy computations and software.


Birmingham R User Meeting (BRUM)

http://www.birminghamR.org

Alastair Sanderson: http://www.sr.bham.ac.uk/~ajrs