My journey to R - Budapest BI Forum
2014.budapestbiforum.hu/letoltes/2014_budapestbi/...
Nov 26, 2014

My journey to R 26 Nov 2014 Budapest BI Forum Matt Dowle


My journey to R

26 Nov 2014

Budapest BI Forum

Matt Dowle


Background

Graduated 1996: Applied Maths and Computing

Joined Lehman Brothers: Sybase SQL & Visual Basic

1999 : Joined Salomon Brothers

They used S-PLUS already.

Patrick Burns (an S-PLUS guru and author) was a member of the team ...


Pat shows me S-PLUS

> DF <- data.frame(
    A = letters[1:3],
    B = c(1,3,5) )
> DF
  A B
1 a 1
2 b 3
3 c 5


Rows are stored in order

> DF[2:3,]
  A B
2 b 3
3 c 5

(unlike SQL and very useful)


My first thought

> DF[2:3,]
  A B
2 b 3
3 c 5
> DF[2:3, sum(B)]
[1] 8


No

sum(DF[2:3,"B"])

or

with(DF[2:3,], sum(B))


3 years pass, 2002

One day S-PLUS crashes

It's not my code, but a corruption in S-PLUS

Support: Are you sure it's not your code?

Matt: Yes. See, here's how you reproduce it.

Support: Yes, you're right. We'll fix it, thanks!

Matt: Great, when?


When

Support: Immediately for the next release.

Matt: Great, when's that?

Support: 6 months

Matt: Can you do a patch quicker?

Support: No because it's just you with the problem.

Matt: But I'm at Salomon/Citigroup, the biggest financial corporation in the world!

Support: True but it's still just you, Matt.


When continued

Matt: I understand. Can you send me the code and I'll fix it? I don't mind - I'll do it for free. I just want to fix it to get my job done.

Support: Sorry, can't do that. Lawyer says no.

Matt: Pat, any ideas?

Pat: Have you tried R?

Matt: What's R?


R in 2002

I took the code I had in S-PLUS and ran it in R.

Not only didn't it crash, but it took 1 minute instead of 1 hour.

R had improved the speed of for loops (*) and was in-memory rather than on-disk.

(*) The code generated random portfolios and couldn't be vectorized, due to its nature.


Even better

If R does error or crash, I can fix it. We have the source code! Or we can hire someone to fix it for us.

I can get my work done and not wait 6 months for a fix.

And it has packages.

I start to use R.


My first thought, again

Matt: Pat, remember how I first thought [.data.frame should work?

DF[2:3, sum(B)]


2004, day 1

DF[2:3, sum(B)] is born.

Only possible because R (uniquely) has lazy evaluation.
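
A minimal sketch of why lazy evaluation matters here, in base R. A function can receive sum(B) unevaluated, recover the expression with substitute(), and then evaluate it with the columns of the chosen rows in scope. The helper name eval_in_rows is hypothetical and exists only for illustration; this is not how data.table is implemented internally.

```r
# Hypothetical helper (illustration only, not data.table's implementation).
eval_in_rows <- function(df, i, j) {
  j_expr <- substitute(j)              # capture the expression, e.g. sum(B), unevaluated
  rows   <- df[i, , drop = FALSE]      # subset the rows first
  eval(j_expr, rows, parent.frame())   # look up names in `rows`, then in the caller
}

DF <- data.frame(A = letters[1:3], B = c(1, 3, 5))
result <- eval_in_rows(DF, 2:3, sum(B))
print(result)  # 8
```

Without substitute(), R would try to evaluate sum(B) in the caller's scope, where B does not exist as a variable.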


2004, day 2

I do the same for i

DF[type=="books", sum(sales)]


2004, day 3

I realise I need group by :

DF[type=="books", sum(sales), by=country]


2004, day 4

I realise chaining comes for free:

DF[, sum(sales), by=country][order(-V1), ]


Aug 2008

I release data.table 1.1 to CRAN :

DT[ where, select, group by ][ … ][ … ]
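
The where / select / group by form can be tried on a small table. The sketch below assumes the data.table package is installed; the column names follow the earlier slides, and the data values are made up for illustration.

```r
library(data.table)

# Made-up sales data (illustrative values only).
DT <- data.table(
  country = c("HU", "HU", "UK", "UK", "UK"),
  type    = c("books", "music", "books", "books", "music"),
  sales   = c(10, 20, 30, 40, 50)
)

books_total   <- DT[type == "books", sum(sales)]                # where + select
books_by_ctry <- DT[type == "books", sum(sales), by = country]  # where + select + group by
ranked        <- DT[, sum(sales), by = country][order(-V1)]     # chained: group, then sort

print(books_total)    # 80
print(books_by_ctry)  # HU 10, UK 70
print(ranked)         # UK 120, HU 30
```

An unnamed j expression gets the default column name V1, which is why the chained step can order by -V1.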


2011

I define := in j to do assignment by reference, combined with subset and grouping

DT[ where, select | update, group by ][ … ][ … ]

From v1.6.3 NEWS :

for (i in 1:1000) DF[i,1] <- i # 591s

for (i in 1:1000) DT[i,V1:=i] # 1s
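
A small sketch of := combined with a subset and with grouping, assuming the data.table package is installed. The table and column names are made up for illustration, and no timings are claimed.

```r
library(data.table)

# Made-up data (illustrative values only).
DT <- data.table(id = c("a", "a", "b"), x = c(1, 2, 3))

DT[, total := sum(x), by = id]  # add a per-group column in place
DT[x > 1, x := x * 10]          # update only the rows matching the where clause

print(DT)
```

Both statements modify DT itself rather than returning a modified copy, so there is no `DT <-` on the left-hand side.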


June 2012


data.table answer

NB: It isn't just the speed, but the simplicity. It's easy to write and easy to read.


User's reaction

"data.table is awesome! That took about 3 seconds for the whole thing [was 30mins]!!!"

Davy Kavanagh, 15 Jun 2012


Present day ...


Vector languages

Number of questions

R        71,471
S-PLUS       22
Matlab   39,127
Python  354,437
Julia       542

Quick demo of a vector language
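
The slide does not record the demo itself. As an assumed stand-in, the sketch below shows the basic idea of a vector language in base R: operations apply to whole vectors without an explicit loop.

```r
x <- c(1, 3, 5, 7)

doubled <- x * 2      # arithmetic applies element by element
big     <- x[x > 2]   # a logical vector selects elements
total   <- sum(x)     # reductions take the whole vector

print(doubled)  # 2 6 10 14
print(big)      # 3 5 7
print(total)    # 16
```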


Number of R user groups

Asia        21
Canada       7
Europe      44
UK           8
Australia    7
US          55
           ===
           142


R Task Views (1)

Bayesian Inference
Chemometrics and Computational Physics
Clinical Trial Design, Monitoring, and Analysis
Cluster Analysis & Finite Mixture Models
Differential Equations
Probability Distributions
Computational Econometrics
Analysis of Ecological and Environmental Data


R Task Views (2)

Design of Experiments
Empirical Finance
Statistical Genetics
Graphic Displays
High-Performance and Parallel Computing
Machine Learning & Statistical Learning
Medical Image Analysis
Meta-Analysis


R Task Views (3)

Multivariate Statistics
Natural Language Processing
Numerical Mathematics
Official Statistics & Survey Methodology
Optimization and Mathematical Programming
Analysis of Pharmacokinetic Data
Phylogenetics, Especially Comparative Methods
Psychometric Models and Methods


R Task Views (4)

Reproducible Research
Robust Statistical Methods
Statistics for the Social Sciences
Analysis of Spatial Data
Handling and Analyzing Spatio-Temporal Data
Survival Analysis
Time Series Analysis
Web Technologies and Services


Fast and friendly file reading

e.g. 50MB .csv, 1 million rows x 6 columns

read.csv("test.csv") # 30-60s

read.csv("test.csv", colClasses=, nrows=, etc...) # 10s

fread("test.csv") # 3s

e.g. 20GB .csv, 200 million rows x 16 columns

read.csv("big.csv", ...) # hours

fread("big.csv") # 8m
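
A self-contained sketch of the two reading styles compared above, assuming the data.table package is installed. It writes a small temporary file, so it only shows the calls and checks that both readers return the same data; it does not reproduce the timings on the slide. The column names and values are made up.

```r
library(data.table)

set.seed(1)
n   <- 1000L
tmp <- tempfile(fileext = ".csv")

# Made-up data written to a temporary file.
write.csv(
  data.frame(id = seq_len(n), x = rnorm(n), g = sample(letters, n, TRUE)),
  tmp, row.names = FALSE
)

# read.csv with hints supplied by hand
base_df <- read.csv(tmp,
                    colClasses = c("integer", "numeric", "character"),
                    nrows = n)

# fread detects the separator, column types and number of rows itself
dt <- fread(tmp)

unlink(tmp)
print(dim(dt))  # 1000 3
```

To compare speed, wrap each call in system.time() and use a much larger file.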


Benchmarks

Use large data (don't loop on small data)

Vary data
    large groups vs small groups
    types: integer, character, numeric, factor
    random input or (partially) ordered

Vary queries
    Grouping : sum, mean, median, …
    Joining, Updating, Subsetting

Easy to reproduce

Incomplete, but this is how far I've got ...
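
One way the "vary data" idea might look in practice, as a sketch only and assuming the data.table package is installed. The function name time_grouping and the sizes are hypothetical; N is kept small so the example runs quickly, whereas a real benchmark would use far more rows.

```r
library(data.table)

set.seed(42)
N <- 100000L  # deliberately small; increase for a meaningful benchmark

# Hypothetical helper: time a grouped sum for a given number of groups.
time_grouping <- function(n_groups) {
  # rep_len() guarantees every group id occurs; sample() shuffles the row order
  DT <- data.table(id = sample(rep_len(seq_len(n_groups), N)), v = runif(N))
  elapsed <- system.time(res <- DT[, .(total = sum(v)), by = id])[["elapsed"]]
  list(groups = nrow(res), seconds = elapsed)
}

few_large  <- time_grouping(10L)     # 10 groups of 10,000 rows each
many_small <- time_grouping(10000L)  # 10,000 groups of 10 rows each

print(few_large$groups)   # 10
print(many_small$groups)  # 10000
```

Fixing the seed and generating the data inside the script keeps the run easy to reproduce.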


1 billion rows (50GB)


1 billion rows (50GB)


Example from Landmark, an Insurance Services company in the U.K.

(with permission) ...

© 2013 DMGT

LiDAR data


Individual grid files

Building Outline (red)

Building Outline reduced by 1m

Generated ascii grid file


Reading ascii grid files

library(SDMTools)

asc <- read.asc(file="./0001000012290222.asc",gz=FALSE)

library(raster)

r <- raster("./0001000012290222.asc", crs=CRS("+init=epsg:27700")) # OSGB36


Slope & aspect

[Figure: slope and aspect diagram labelled 15°, NE, SW and N]


Reading ascii grid files

library(SDMTools)

asc <- read.asc(file="./0001000012290222.asc",gz=FALSE)

library(raster)

r <- raster("./0001000012290222.asc", crs=CRS("+init=epsg:27700")) # OSGB36


Output

[Figure: roof output labelled 15°, NE, SW, N and 1.13m]

31m² ≈ (8m x 4m)

Slope = atan(1.13m / 4m) ≈ 16°
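
The slope arithmetic can be checked in base R. atan() returns radians, so the result is converted to degrees; the variable names are chosen here for illustration.

```r
rise <- 1.13  # metres, from the slide
run  <- 4     # metres, from the slide

slope_deg <- atan(rise / run) * 180 / pi  # radians to degrees
print(round(slope_deg, 1))  # 15.8
```

This agrees with the rounded 16° on the slide and is close to the 15° label on the figure.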


Pre Calculated Lidar Points (1m Posting)


The data.table query used

in.lid.bld.dsm.dt[, c(
    tot.dsm.count      = length(which(!is.na(dsm)))
  , tot.dsm.slp.count  = length(which(!is.na(dsm.slp)))
  , min.dsm.distance   = min(distance, na.rm = TRUE)
  , min.dsm            = min(dsm, na.rm = TRUE)
  , max.dsm            = max(dsm, na.rm = TRUE)
  , dsm.diff           = round(max(dsm)-min(dsm),4)
  , mean.dsm           = round(mean(dsm, na.rm = TRUE),2)
  , max.dsm.dtm.diff   = round(max(dsm.dtm.diff),4)
  , mean.dsm.dtm.diff  = round(mean(dsm.dtm.diff, na.rm = TRUE),4)
  , mean.dsm.slp       = round(mean(dsm.slp, na.rm = TRUE),2)
  , mean.dsm.slp.gt15  = round(MeanSlopeGT15(dsm.slp),2)
  , mean.dsm.dtm.diff2 = round(mean(diff(dsm.dtm.diff,1), na.rm = TRUE, trim = 0.1),4)
  , max.dsm.dtm.diff2  = round(max(diff(dsm.dtm.diff,1)),4)
  , flat.cells         = CountFlatCells(dsm.slp)
  , flat.cells.0.5     = CountFlatCells.0.5(dsm.slp)
  , flat.cells.5.10    = CountFlatCells.5.10(dsm.slp)
  , flat.cells.10.15   = CountFlatCells.10.15(dsm.slp)
  , RoofOrientation(dsm.slp, dsm.asp.head, dsm.asp.head.comp)   # RoofOrientation() returns a list of 13 more columns
  ), by=buiding.id ]   # all 20 million properties in the U.K.


Kim Edmunds

Chief Data Scientist

Landmark Insurance Services, U.K.



"10 R packages to win Kaggle competitions", useR! 2014

data.table

By Xavier Conort of DataRobot.com

http://www.slideshare.net/DataRobot/final-10-r-xc-36610234


data.table course on DataCamp


setkey(PRC, id, date)

1. PRC[.("SBRY")]                                           # all 3 rows
2. PRC[.("SBRY",20080502),price]                            # 391.50
3. PRC[.("SBRY",20080505),price]                            # NA
4. PRC[.("SBRY",20080505),price,roll=TRUE]                  # 391.50
5. PRC[.("SBRY",20080601),price,roll=TRUE]                  # 389.00
6. PRC[.("SBRY",20080601),price,roll=TRUE,rollends=FALSE]   # NA
7. PRC[.("SBRY",20080601),price,roll=20]                    # NA
8. PRC[.("SBRY",20080601),price,roll=40]                    # 389.00

PRC
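
The contents of PRC are not captured in this transcript, so the sketch below rebuilds a table that is consistent with the annotated results above. It assumes the data.table package is installed, that the date column is a true date type (IDate here) so that roll=20 and roll=40 count days, and that the three rows fall on the dates shown; the first price and the exact dates are assumptions.

```r
library(data.table)

# Assumed table: dates and the first price are illustrative guesses.
PRC <- data.table(
  id    = "SBRY",
  date  = as.IDate(c("2008-05-01", "2008-05-02", "2008-05-06")),
  price = c(380.50, 391.50, 389.00)
)
setkey(PRC, id, date)

d <- function(s) as.IDate(s)  # small convenience wrapper

exact    <- PRC[.("SBRY", d("2008-05-02")), price]               # 391.5, exact match
gap      <- PRC[.("SBRY", d("2008-05-05")), price]               # NA, no row on that date
locf     <- PRC[.("SBRY", d("2008-05-05")), price, roll = TRUE]  # 391.5, last value carried forward
past_end <- PRC[.("SBRY", d("2008-06-01")), price, roll = TRUE]  # 389, rolls past the last row
no_ends  <- PRC[.("SBRY", d("2008-06-01")), price, roll = TRUE, rollends = FALSE]  # NA
roll20   <- PRC[.("SBRY", d("2008-06-01")), price, roll = 20]    # NA, last row is 26 days old
roll40   <- PRC[.("SBRY", d("2008-06-01")), price, roll = 40]    # 389, within 40 days
```

A numeric roll limits how stale a carried-forward value may be, while rollends controls whether values roll beyond the first or last observation of a group.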


roll = "nearest"

x   y  value
A   2    1.1
A   9    1.2
A  11    1.3
B   3    1.4

setkey(DT, x, y)

DT[.("A",7), roll="nearest"]
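
The example above can be run as written once DT is built from the four rows shown, assuming the data.table package is installed. Within group A, y = 7 is 2 away from 9 and 5 away from 2, so the row with value 1.2 is matched.

```r
library(data.table)

# The four rows from the slide.
DT <- data.table(
  x     = c("A", "A", "A", "B"),
  y     = c(2, 9, 11, 3),
  value = c(1.1, 1.2, 1.3, 1.4)
)
setkey(DT, x, y)

nearest <- DT[.("A", 7), roll = "nearest"]
print(nearest)  # x = A, y = 7, value = 1.2
```

The y column of the result holds the value that was looked up (7), not the y of the matched row (9).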


Overlapping range joins
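
The slide gives only the title. The sketch below assumes it refers to data.table's foverlaps() function, which joins two tables on overlapping [start, end] intervals; the tables and values are made up for illustration.

```r
library(data.table)

# Made-up query intervals.
x <- data.table(start = c(5L, 31L), end = c(8L, 50L))

# Made-up reference intervals; foverlaps() needs y keyed, interval columns last.
y <- data.table(start = c(10L, 20L, 30L), end = c(15L, 35L, 45L), val = 1:3)
setkey(y, start, end)

res <- foverlaps(x, y, type = "any")  # any overlap counts as a match
print(res)
```

The interval 5-8 overlaps nothing in y and comes back with NA in the y columns, while 31-50 overlaps both 20-35 and 30-45 and so produces two rows. Columns from x that clash with y's names are prefixed with `i.`.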


28hrs down to 4mins


data.table support


Client / Server Demo

Watch on YouTube (no audio)


Further links

https://github.com/Rdatatable/data.table/wiki

http://stackoverflow.com/questions/tagged/data.table

3 hour data.table tutorial at useR! 2014, Los Angeles:
http://user2014.stat.ucla.edu/files/tutorial_Matt.pdf

https://www.datacamp.com/courses/data-analysis-the-data-table-way

(10% discount code: MY-DATATABLE-COUPON)