R Software development - How to write and maintain 30K+ LOC in R and survive?

Post on 22-Jan-2018

Copyright (c) WLOG Solutions

R software development: How to write and maintain 30K+ LOC in R and survive?

Wit Jakuczun, WLOG Solutions

2017-06-20

The world of analytics has changed.

4000x4 elastic-net models (5-fold CV) for a 45K x 10K dataset in 1.5 minutes!

Join the 21st centuRy today!

What is R?

A dynamic, interpreted, general-purpose programming language

Stable open-source product, developed since about 1995 and now supported by the R Foundation.

Created for data analysis.

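library(dplyr)
library(nycflights13)  # provides the flights data set

# Mean arrival and departure delay per day; keep days where either exceeds 30 minutes
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)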

z <- scaled_input %>%
  layer_convolution2D(c(5,5), 32, pad = TRUE) %>%
  layer_max_pooling(c(3,3), c(2,2)) %>%
  layer_convolution2D(c(3,3), 48) %>%
  layer_max_pooling(c(3,3), c(2,2)) %>%
  layer_convolution2D(c(3,3), 64) %>%
  layer_dense(96) %>%
  layer_dropout(0.5) %>%
  layer_dense(num_output_classes, activation = activation_softmax())

R is a community.

CRAN: 10K+ packages

GitHub: more and more popular

http://githut.info

R is really popular

Tiobe Index, 2017

Estimated 2M+ users all over the world.

Sounds like Python?

R: package reticulate

Python: package rpy2

R Software Development

What is large scale?

R software development vs R scripting

Large scale ~ 10K+ LOC

Small scale ~ 1K LOC

[Diagram: CRAN (MRAN), GitHub, other sources; R environment; installed packages; local CRAN; source code repo]

R Software Development

Best practices by WLOG

Always run the final test from the command line.

Rscript my_script.R
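
A minimal sketch of a script that is meant to be run this way. The file name my_script.R comes from the slide; the `parse_args` and `main` helpers and the `n_days` argument are illustrative assumptions, not the author's code.

```r
# my_script.R: hypothetical master script, run as `Rscript my_script.R 30`

# Turn raw command-line arguments into a named list (illustrative helper).
parse_args <- function(args) {
  n_days <- if (length(args) >= 1) suppressWarnings(as.integer(args[[1]])) else 365L
  if (is.na(n_days) || n_days <= 0L) {
    stop("n_days must be a positive integer")
  }
  list(n_days = n_days)
}

main <- function(args = commandArgs(trailingOnly = TRUE)) {
  params <- parse_args(args)
  message("Running with n_days = ", params$n_days)
  invisible(params)
}

# Run only in a non-interactive session (e.g. under Rscript). An
# uncaught error then ends the process with a non-zero exit status.
if (!interactive()) {
  main()
}
```

Running from a clean Rscript session catches problems that an interactive session hides, such as objects left over in the workspace or packages that were attached by hand.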

Put all logic into packages.

Package help system

Package dependency system

External data in packages

Vignettes

Tests
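
As a rough illustration of moving logic out of scripts, the sketch below wraps one function in a package skeleton using base R's `utils::package.skeleton`. The function `mean_delay` and the package name `mylogic` are made up for this example; real projects may well use other tooling to create packages.

```r
# Hypothetical piece of business logic that would otherwise sit in a script.
mean_delay <- function(delays) {
  mean(delays, na.rm = TRUE)
}

# Put the function into its own environment so only it gets packaged.
env <- new.env()
assign("mean_delay", mean_delay, envir = env)

# Generate the skeleton in a temporary directory ("mylogic" is illustrative).
pkg_root <- tempfile("pkgs")
dir.create(pkg_root)
suppressMessages(
  utils::package.skeleton(name = "mylogic", list = "mean_delay",
                          environment = env, path = pkg_root)
)

# The skeleton holds a DESCRIPTION file (dependencies), R/ (code) and
# man/ (help pages), matching the package features listed above.
list.files(file.path(pkg_root, "mylogic"))
```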

Use any source code version control system.

Yes, even if you are working alone. :)

print is not for logging.

Forbidden

logging::loginfo("Phase 1 passed")
logging::logdebug("Iter %d done", i)
logging::logwarning("Are you sure?")
logging::logerror("I failed :(")
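
The calls above only produce output once a handler is configured. A minimal sketch of one possible setup, assuming the CRAN package logging is installed; the log file location and the chosen levels are illustrative assumptions.

```r
# Hypothetical log location; a real project would use its logs folder.
log_file <- file.path(tempdir(), "run.log")

# Skipped quietly when the 'logging' package is not installed.
if (requireNamespace("logging", quietly = TRUE)) {
  # Console handler on the root logger: DEBUG and above are shown.
  logging::basicConfig(level = "DEBUG")
  # Extra file handler that keeps only INFO and above.
  logging::addHandler(logging::writeToFile, file = log_file, level = "INFO")

  for (i in 1:3) {
    logging::logdebug("Iter %d done", i)  # console only
  }
  logging::loginfo("Phase 1 passed")      # console and file
}
```

Unlike print, each message carries a level and a timestamp, and the verbosity can be changed in one place without touching the code that emits the messages.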

Select external packages carefully. And control their versions!

data.table
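
One simple way to control versions is to fail fast when an installed package is older than expected. The sketch below uses base R only; the `required` table and the `check_versions` helper are hypothetical, and the packages named in it (stats, utils) were chosen only because they ship with R.

```r
# Hypothetical requirements: package name -> minimum acceptable version.
required <- c(stats = "3.0.0", utils = "3.0.0")

# Stop with a clear message if any requirement is not met.
check_versions <- function(required) {
  ok <- vapply(names(required), function(pkg) {
    requireNamespace(pkg, quietly = TRUE) &&
      utils::packageVersion(pkg) >= required[[pkg]]
  }, logical(1))
  if (!all(ok)) {
    stop("Unmet package requirements: ",
         paste(names(ok)[!ok], collapse = ", "))
  }
  invisible(ok)
}

check_versions(required)
```

A check like this belongs at the top of a master script, so a version mismatch surfaces before any computation starts.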

Use configuration files.

PARAMETERS

SnapshotDate: 2015-11-01
PackagesPath: packages
LocalRepoPath: repository
ScriptPath: executionScripts
Project: XXX
ZipVersion:
Artifacts:

CONFIG

LogLevel: INFO
work_path: ../work
data_path: ../data
export_path: ../export
N_days: 365
solver_max_iterations: 10
solver_opt_horizon: 8
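
Settings in this "key: value" form can be read with a few lines of base R. The sketch below is an assumption about how such a file might be parsed, not the mechanism the author's tooling uses; `read_config` is a hypothetical helper and the sample file contains only some of the keys shown above.

```r
# Read "key: value" lines into a named list of strings (illustrative helper).
read_config <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  lines <- lines[nzchar(lines)]                 # drop empty lines
  keys <- trimws(sub(":.*$", "", lines))        # text before the first colon
  values <- trimws(sub("^[^:]*:", "", lines))   # text after the first colon
  as.list(stats::setNames(values, keys))
}

# Write a small sample config to a temporary file and read it back.
config_file <- tempfile(fileext = ".txt")
writeLines(c("LogLevel: INFO",
             "work_path: ../work",
             "N_days: 365",
             "solver_max_iterations: 10"), config_file)

cfg <- read_config(config_file)
n_days <- as.integer(cfg$N_days)  # values arrive as strings; convert explicitly
```

Keeping paths and tuning parameters in a file like this lets the same code run unchanged in development, test and production.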

Use standard project structure.

Master scripts

Project local packages

Tests

External packages

Logs

Work

Import

Export

Configuration
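
A standard structure is easiest to keep when it is created by code rather than by hand. The sketch below builds one folder per item in the list above; all folder names and the `create_project` helper are illustrative assumptions, since the slides do not give the actual names.

```r
# Folder names are made up; each maps to one item of the structure above.
project_dirs <- c(
  "R",         # master scripts
  "packages",  # project local packages
  "tests",     # tests
  "libs",      # external packages
  "logs", "work", "import", "export",
  "config"     # configuration
)

# Create the folders under `root` and return their paths invisibly.
create_project <- function(root, dirs = project_dirs) {
  paths <- file.path(root, dirs)
  for (p in paths) {
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

root <- file.path(tempdir(), "my_project")
created <- create_project(root)
```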

Automate building, deploying, testing, etc.

Example Jenkins pipeline

Go to hell :)

Summary

A well-deployed R-based analytical platform must have the following features:

Seamless integration with existing systems and IT infrastructure

Dev/Test/Prod processes according to current software development standards

Fast development-to-production cycle

Continuous integration & deployment

Repositories: models, builds, code, dependencies, configuration

Controllable distributed job scheduling

Resource usage monitoring

Secure access control, protected password repositories

Wit Jakuczun, PhD

wit.jakuczun@wlogsolutions.com

WLOG R Suite™

Field-tested R ecosystem for Enterprise