Copyright (c) WLOG Solutions
R software development: How to write and maintain 30K+ LOC in R and survive?
Wit Jakuczun, WLOG Solutions
2017-06-20
The world of analytics has changed.
4000x4 elastic-net models (CV-5) for a 45Kx10K dataset in 1.5 minutes!
Join the 21st centuRy today!
What is R?
Dynamic, interpreted, general-purpose programming language
Stable open-source product developed by the R Foundation since ~1995.
Created for data analysis.
# Data analysis with a dplyr pipeline
# (flights is assumed to be the nycflights13::flights data set)
library(dplyr)

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

# A convolutional network defined with the same pipe syntax
z <- scaled_input %>%
  layer_convolution2D(c(5,5), 32, pad = TRUE) %>%
  layer_max_pooling(c(3,3), c(2,2)) %>%
  layer_convolution2D(c(3,3), 48) %>%
  layer_max_pooling(c(3,3), c(2,2)) %>%
  layer_convolution2D(c(3,3), 64) %>%
  layer_dense(96) %>%
  layer_dropout(0.5) %>%
  layer_dense(num_output_classes, activation = activation_softmax())
R is a community.
CRAN: 10K+ packages
GitHub: more and more popular
http://githut.info
R is really popular
Tiobe Index, 2017
Estimated 2M+ users all over the world.
Sounds like Python?
R: package reticulate
Python: package rpy2
R Software Development: What is large scale?
R software development vs. R scripting
Large scale ~ 10K+ LOC
Small scale ~ 1K LOC
[Diagram: CRAN (MRAN), GitHub, Other; R environment; Installed packages; Local CRAN; Source code repo]
R Software Development: Best practices by WLOG
Always run the final test from the command line.
Rscript my_script.R
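A minimal sketch of a script that is meant to be started with Rscript rather than from an interactive session. The function names parse_args and main and the --n_days option are hypothetical, chosen only for illustration; the point is that all input arrives through commandArgs() and nothing depends on objects left over in a workspace.

```r
# my_script.R (illustrative content, not taken from the talk)

# Turn "--key=value" command-line arguments into a named list.
parse_args <- function(args) {
  kv <- grep("^--[^=]+=", args, value = TRUE)
  keys <- sub("^--([^=]+)=.*$", "\\1", kv)
  values <- sub("^--[^=]+=", "", kv)
  as.list(stats::setNames(values, keys))
}

# Entry point: everything the script needs comes from its arguments.
main <- function(args = commandArgs(trailingOnly = TRUE)) {
  opts <- parse_args(args)
  # Hypothetical option with a default value.
  n_days <- if (is.null(opts$n_days)) 365L else as.integer(opts$n_days)
  cat("Running with n_days =", n_days, "\n")
  invisible(n_days)
}

main()
```

It could then be run as: Rscript my_script.R --n_days=30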
Put all logic into packages.
Package help system
Package dependency system
External data in packages
Vignettes
Tests
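As a rough illustration of what moving logic into a package buys you, the sketch below shows one function written the way it might live in a package's R/ directory. The function mean_delay is hypothetical, and the roxygen2-style comments are an assumption about tooling (the talk does not name a documentation tool); they would feed the package help system, while the check at the end stands in for a package test.

```r
#' Mean delay in minutes, ignoring missing values.
#'
#' @param delays Numeric vector of delays in minutes.
#' @return A single numeric value.
#' @export
mean_delay <- function(delays) {
  # Fail early on wrong input instead of returning a silent NA.
  stopifnot(is.numeric(delays))
  mean(delays, na.rm = TRUE)
}

# What a test for this function could check (kept in the package's tests).
stopifnot(mean_delay(c(10, NA, 30)) == 20)
```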
Use any source code version control system.
Yes, even if you are working alone. :)
print is not for logging.
Forbidden
logging::loginfo("Phase 1 passed")
logging::logdebug("Iter %d done", i)
logging::logwarning("Are you sure?")
logging::logerror("I failed :(")
Select external packages carefully. And control their versions!
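One simple way to control versions is to compare what is installed against a pinned list before any real work starts. The sketch below is only an illustration: check_versions and the pinned vector are hypothetical, and base packages are used so that the example runs anywhere (the utils entry is deliberately wrong to show a mismatch).

```r
# Compare installed package versions against a pinned list.
# required: named character vector, package name -> expected version.
check_versions <- function(required) {
  installed <- vapply(
    names(required),
    function(pkg) as.character(utils::packageVersion(pkg)),
    character(1)
  )
  data.frame(
    package = names(required),
    required = unname(required),
    installed = unname(installed),
    ok = unname(installed == required),
    stringsAsFactors = FALSE
  )
}

# Hypothetical pinned versions: stats matches the running R version,
# utils is intentionally pinned to a version that cannot match.
pinned <- c(stats = as.character(getRversion()), utils = "0.0.1")
report <- check_versions(pinned)
print(report)
```

A master script could stop with an error whenever any row of the report has ok equal to FALSE.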
data.table
Use configuration files.
PARAMETERS
SnapshotDate: 2015-11-01
PackagesPath: packages
LocalRepoPath: repository
ScriptPath: executionScripts
Project: XXX
ZipVersion:
Artifacts:

CONFIG
LogLevel: INFO
work_path: ../work
data_path: ../data
export_path: ../export
N_days: 365
solver_max_iterations: 10
solver_opt_horizon: 8
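The "key: value" settings shown for CONFIG can be read with base R alone. The sketch below assumes the file is plain text with one setting per line and uses read.dcf(); whether the original project loads its configuration this way is not stated in the slides, and read_config is a hypothetical helper.

```r
# Write a small example configuration to a temporary file.
config_file <- tempfile(fileext = ".txt")
writeLines(c(
  "LogLevel: INFO",
  "work_path: ../work",
  "N_days: 365",
  "solver_max_iterations: 10"
), config_file)

# Read "key: value" pairs into a named list of strings.
read_config <- function(path) {
  raw <- read.dcf(path)  # character matrix, one column per key
  as.list(raw[1, ])
}

config <- read_config(config_file)

# Values arrive as text, so convert numeric settings explicitly.
n_days <- as.integer(config$N_days)
max_iter <- as.integer(config$solver_max_iterations)
```

Keeping such values out of the code means the same scripts can run in development and production with different files.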
Use standard project structure.
Master scripts
Project local packages
Tests
External packages
Logs
Work
Import
Export
Configuration
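A standard structure is easiest to keep when it is created by code rather than by hand. The sketch below builds a skeleton that mirrors the folders listed above; the directory names and the function create_project_skeleton are hypothetical, since the slides name the folders' roles but not their paths.

```r
# Create an empty project skeleton under `root`.
create_project_skeleton <- function(root) {
  dirs <- c(
    "R",         # master scripts
    "packages",  # project local packages
    "tests",     # tests
    "libs",      # external packages
    "logs",      # logs
    "work",      # intermediate results
    "import",    # input data
    "export",    # output data
    "config"     # configuration
  )
  for (d in dirs) {
    dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  }
  invisible(file.path(root, dirs))
}

# Example: build the skeleton in a temporary location.
root <- file.path(tempdir(), "my_project")
created <- create_project_skeleton(root)
print(list.files(root))
```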
Automate building, deploying, testing, etc.
Jenkins exemplary pipeline
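An automated pipeline needs a step it can call without a human watching. As a sketch under stated assumptions, the hypothetical run_all_tests below sources every test script in a directory and counts failures; a CI job (Jenkins or any other) could run it through Rscript and turn the count into an exit status. The demo files are created only to make the example self-contained.

```r
# Run every *.R file in test_dir and return the number of failing files.
run_all_tests <- function(test_dir) {
  files <- list.files(test_dir, pattern = "\\.R$", full.names = TRUE)
  failures <- 0L
  for (f in files) {
    ok <- tryCatch({
      source(f, local = new.env())  # isolate each test file
      TRUE
    }, error = function(e) {
      message("FAILED: ", basename(f), ": ", conditionMessage(e))
      FALSE
    })
    if (!ok) failures <- failures + 1L
  }
  failures
}

# Demo: one passing and one failing test file in a temporary directory.
test_dir <- file.path(tempdir(), "tests_demo")
dir.create(test_dir, showWarnings = FALSE)
writeLines("stopifnot(1 + 1 == 2)", file.path(test_dir, "test_pass.R"))
writeLines("stopifnot(1 + 1 == 3)", file.path(test_dir, "test_fail.R"))

failures <- run_all_tests(test_dir)
# In a CI step one would finish with: quit(status = if (failures > 0) 1 else 0)
```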
Go to hell :)
“If you are using R and you think you’re in hell, this is a map for you.”
Patrick Burns, “R Inferno”, 2011
Summary
A well-deployed R-based analytical platform must have the following features:
Seamless integration with existing systems and IT infrastructure
Dev/Test/Prod processes according to current software development standards
Fast development-to-production cycle
Continuous integration & deployment
Repositories: models, builds, code, dependencies, configuration
Controllable distributed job scheduling
Resource usage monitoring
Secure access control, protected password repositories
Wit Jakuczun, PhD
WLOG R Suite™: Field-tested R ecosystem for Enterprise