THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible...
Transcript of THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible...
![Page 1: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/1.jpg)
THE #RDATATABLE PACKAGE
OPEN ANALYTICS Data Analyst
for fast, flexible and memory efficient data wrangling
Arun SrinivasanUseR’15
![Page 2: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/2.jpg)
• Bioinformatician, Comp. Biologist
• Started using R in mid 2011
WHO AM I?
OPEN ANALYTICS
![Page 3: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/3.jpg)
• Tried plyr (>= 30GB total data size). Struggled with its speed, even with “.parallel” argument.
• Discovered data.table. Never looked back
WHO AM I?
OPEN ANALYTICS
![Page 4: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/4.jpg)
• data.table developer since late 2013
• Data analyst at Open Analytics since Feb 2015
WHO AM I?
OPEN ANALYTICS
![Page 5: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/5.jpg)
COME VISIT US
OPEN ANALYTICS
![Page 6: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/6.jpg)
OVERVIEW
• Straightforward operations, without compromising in efficiency.
• For more check: data.table vs dplyr post on StackOverflow
OPEN ANALYTICS
![Page 7: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/7.jpg)
• sep, colClasses, rows automatically detected
1. FREAD
fread(‘file.csv’)s 1.8.8, March 2013 67 ISSUES closed - 26 features - 17 bug fixes
OPEN ANALYTICS
![Page 8: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/8.jpg)
BENCHMARK
Method Run Time Threadedness
h2o.importFile 50s Multiple
fread 5m Single
readr::read_csv 12m Single
23GB .csv, 500 million rows, 9 columns
OPEN ANALYTICS
![Page 9: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/9.jpg)
FWRITE?
• Lot of work, and out of our free time
• readr reimple- menting fread has demotivated us
OPEN ANALYTICS
![Page 10: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/10.jpg)
year val2014 52015 12
year val2013 42014 22014 32015 12015 52015 6
2. GROUPED AGGREGATIONS
OPEN ANALYTICS
![Page 11: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/11.jpg)
year val2014 52015 12
year val2014 52015 12
year val2013 42014 22014 32015 12015 52015 6
year val2013 42014 22014 32015 12015 52015 6
+
+
X[year %in% 2014:2015, .(val = sum(val)), by = year]s
2. GROUPED AGGREGATIONS
OPEN ANALYTICS
![Page 12: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/12.jpg)
X[year %in% 2014:2015, .(val = sum(val)), by = year]s
i, on which rows?
j, What to do?
by, Grouped by what?
2. GROUPED AGGREGATIONS
OPEN ANALYTICS
![Page 13: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/13.jpg)
VIGNETTEShttps://github.com/Rdatatable/data.table/wiki/Getting-started
OPEN ANALYTICS
![Page 14: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/14.jpg)
year val2013 42014 22014 32015 12015 52015 6
year val2013 42014 22014 32015 12015 52015 6
year mul2014 202015 10
year val2014 1002015 120
year mul2014 202015 10
3. AGGREGATIONS ON JOINS
OPEN ANALYTICS
![Page 15: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/15.jpg)
year val2013 42014 22014 32015 12015 52015 6
year val2013 42014 22014 32015 12015 52015 6
year mul2014 202015 10
+
+
**
year val2014 1002015 120
year val2013 42014 22014 32015 12015 52015 6
year mul2014 202015 10
year val2014 1002015 120
year mul2014 202015 10
X[Y, .(val = sum(val) * mul), by = .EACHI]s
3. AGGREGATIONS ON JOINS
OPEN ANALYTICS
![Page 16: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/16.jpg)
X[Y, .(val = sum(val) * mul), by = .EACHI]s
on which rows?
What to do?
Grouped by what?
Joins as Subsets
3. AGGREGATIONS ON JOINS
OPEN ANALYTICS
![Page 17: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/17.jpg)
4. RESHAPINGid A1 A2 B1 B2
a 1 3 q s
b 2 4 r t
id num A Ba 1 1 qb 1 2 ra 2 3 sb 2 4 t
melt split varcast
Why not directly??
expensive
id var val id var num val
coerced to Char unclear
OPEN ANALYTICS
![Page 18: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/18.jpg)
4. RESHAPING
id A1 A2 B1 B2
a 1 3 q s
b 2 4 r t
id num A Ba 1 1 qb 1 2 ra 2 3 sb 2 4 t
melt(DT, measure = patterns(“^A”, “^B”))
NEW feature in 1.9.6
OPEN ANALYTICS
![Page 19: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/19.jpg)
BENCHMARK
Method Run Time
base::reshape 4.48s
melt (v1.9.6) 0.05s
tidyr 9.41s
27MB .csv, 500 thousand rows, 8 cols
OPEN ANALYTICS
![Page 20: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/20.jpg)
id A1 A2 B1 B2
a 1 3 q s
b 2 4 r t
dcast(DT, id ~ num, value.var = c(“A”, “B”))id num A Ba 1 1 qb 1 2 ra 2 3 sb 2 4 t
NEW feature in 1.9.6
4. RESHAPING
OPEN ANALYTICS
![Page 21: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/21.jpg)
THANKS TO
Ananda Mahto for ideas on reshaping enhancements.Also check out splitstackshape.
OPEN ANALYTICS
![Page 22: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/22.jpg)
5. OVERLAPPING RANGE JOINS
chr start end1 5 111 10 202 1 42 25 522 50 60
Axid yid1 NA2 23 34 35 3
result
foverlaps(A, B, which=TRUE)
since v1.9.4. Built using Rolling joins.
chr start end
1 1 41 15 182 1 55
B1:2:3:4:5:
1:2:3:
OPEN ANALYTICS
![Page 23: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/23.jpg)
id start end1 0.5 1.11 1 22 0.1 0.42 2.5 5.22 5 6
Axid yid1 NA2 23 34 35 3
result
foverlaps(A, B, which=TRUE)id start end
1 0.1 0.41 1.5 1.82 0.1 5.5
B1:2:3:4:5:
1:2:3:
Not limited to integer ranges, but numeric, POSIXct, Date ranges work just the same.
5. OVERLAPPING RANGE JOINS
OPEN ANALYTICS
![Page 24: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/24.jpg)
6. AUTO-INDEXING
data frame DF[DF$id %in% c(“KIC”, “HEJ”), ]
Run 1 4.7s
Run 2 4.8s
datatable DT[id %in% c(“KIC”, “HEJ”)]
Run 1 3.7s
Run 2 0.002s
1.2GB .csv, 100 million rows, 2 columns
Indexing + subset
since v1.9.4
OPEN ANALYTICS
![Page 25: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/25.jpg)
VERSION 1.9.6
• ~80 bug fixes and >20 new features• many new functions : rleid(), tstrsplit(), shift(), frank(), na.omit()
• melt() and dcast() can handle multiple columns• Join without setting keys now, using “on” argument, e.g., DT1[DT2, on = “colA”]
OPEN ANALYTICS
![Page 26: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/26.jpg)
ACKNOWLEDGEMENTS
• Thanks to users and contributors• Package authors who use data.table (CRAN - 89, bioconductor - 32)
• My colleagues and Matt• And you for listening! :-)
OPEN ANALYTICS
![Page 27: THE #RDATATABLE PACKAGE · THE #RDATATABLE PACKAGE OPEN ANALYTICS Data Analyst for fast, flexible and memory efficient data wrangling Arun Srinivasan UseR’15 •Bioinformatician,](https://reader034.fdocuments.us/reader034/viewer/2022050223/5f68c9ec2fb11c1cd30fd92d/html5/thumbnails/27.jpg)
QUESTIONS?
OPEN ANALYTICS