Computational Techniques for the Statistical Analysis of Big Data in R

A Case Study of the rlme Package

Herb Susmann, Yusuf Bilgic

April 12, 2014

Workflow
- Identify
- Rewrite
- Benchmark
- Test

Case Study: rlme
- Identify
- Wilcoxon Tau Estimator
- Pairup
- Covariance Estimator

Summary

Keeping Ahead

Motivation

- Case study: rlme package
- Rank based regression and estimation of two- and three-level nested effects models.
- Goals: faster, less memory, more data
- Before: 5,000 rows of data
- After: 50,000 rows of data

Section 1

Workflow

- Identify
- Rewrite
- Benchmark
- Test

Identify

- Know your big O! (O(n^2) memory usage? Probably not so good for big data.)
- Look for error messages
- Profiling with Rprof
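To make the O(n^2) memory point concrete, here is a rough back-of-the-envelope sketch. It assumes the pairwise step stores every pair in a two-column matrix of doubles (8 bytes each); `pair_memory_gb` is a hypothetical helper, not part of rlme.

```r
# Approximate memory needed to hold all choose(n, 2) pairs
# in a two-column numeric matrix (assumption: 8 bytes per double).
pair_memory_gb <- function(n) {
  n_pairs <- choose(n, 2)
  bytes <- n_pairs * 2 * 8
  bytes / 1e9
}

pair_memory_gb(5000)    # about 0.2 GB
pair_memory_gb(50000)   # about 20 GB
```

Ten times more rows means roughly a hundred times more memory, which is why the pairwise steps are the first place to look.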

Rewrite

High level design

- Algorithm design
- Statistical techniques: bootstrapping

Rewrite

Microbenchmarking

- Know what R is good at
- Avoid loops in favor of vectorization
- Preallocation
- Arguments are by value, not by reference
- Embrace C++

Be careful!

Vectorizing

## Bad
vec = 1:100
for (i in 1:length(vec)) {
  vec[i] = vec[i]^2   # one element at a time
}

## Better
sapply(vec, function(x) x^2)

## Best
vec^2   # arithmetic operators are already vectorized

Preallocation

## Bad
vec = c()
for (i in 1:100) {
  vec = c(vec, i)   # grows (and copies) the vector on every iteration
}

## Better
vec = numeric(100)
for (i in 1:100) {
  vec[i] = i        # fills a preallocated slot
}

Pass by value

square <- function(x) {
  x <- x^2   # changes only the function's local copy
  return(x)
}

x <- 1:100
square(x)    # returns the squares; the caller's x is unchanged
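A small sketch of what "by value" means in practice, using a hypothetical helper `zero_first` that modifies its argument. The function works on its own copy, so the change only survives if the result is assigned back.

```r
# Hypothetical helper: overwrite the first element of a vector.
zero_first <- function(v) {
  v[1] <- 0   # modifies the function's copy, not the caller's vector
  v
}

x <- c(5, 6, 7)
y <- zero_first(x)

# The caller's x is untouched; only the returned value carries the change.
stopifnot(identical(x, c(5, 6, 7)), identical(y, c(0, 6, 7)))

# To keep the change, assign the result back.
x <- zero_first(x)
```

Each such modification can copy the whole vector, so large objects that are changed inside functions are worth a second look.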

Benchmark

- Write several versions of a slow function
- Test them against each other
- Package: microbenchmark
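A minimal sketch of this step, using three hypothetical versions of a squaring function as stand-ins for a real slow function. It uses microbenchmark when the package is installed and falls back to base `system.time` otherwise; the input size and repetition count are arbitrary choices.

```r
vec <- as.numeric(1:10000)

# Three candidate implementations of the same computation.
square_loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}
square_sapply <- function(v) sapply(v, function(x) x^2)
square_vectorized <- function(v) v^2

# Only compare timings of versions that agree on the answer.
stopifnot(isTRUE(all.equal(square_loop(vec), square_vectorized(vec))),
          isTRUE(all.equal(square_sapply(vec), square_vectorized(vec))))

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    loop = square_loop(vec),
    sapply = square_sapply(vec),
    vectorized = square_vectorized(vec),
    times = 20
  ))
} else {
  # Fallback: coarse timings from base R.
  print(system.time(for (k in 1:20) square_loop(vec)))
  print(system.time(for (k in 1:20) square_sapply(vec)))
  print(system.time(for (k in 1:20) square_vectorized(vec)))
}
```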

Test

- Regressions
- Unit Testing
- Package: testthat
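A sketch of a regression test in this spirit: the rewritten function is checked against the original on a few inputs. Both functions are hypothetical stand-ins, and the test only runs if testthat is installed.

```r
# Original (slow) and rewritten (fast) versions of the same function.
square_loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}
square_vectorized <- function(v) v^2

if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("rewritten square matches the original", {
    v <- c(-2, 0, 1.5, 10)
    testthat::expect_equal(square_vectorized(v), square_loop(v))
    # Edge case: empty input should give empty output.
    testthat::expect_equal(square_vectorized(numeric(0)), numeric(0))
  })
}
```

Keeping the original implementation around as a reference makes it cheap to check every later rewrite against it.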

Section 2

Case Study: rlme

Identify

Over to R!

Rprof("profile")

fit.rlme = rlme(...)

Rprof(NULL)

summaryRprof("profile")

Wilcoxon Tau Estimator

- Rank based scale estimator of residuals
- Uses pairup (so already O(n^2))

Original:

dresd <- sort(abs(temp[, 1] - temp[, 2]))
dresd = dresd[(p + 1):choose(n, 2)]

What’s wrong? Bad algorithm (the sort is at least O(n log n)), and the variable gets copied multiple times.

Updated with C++:

dresd = remove.k.smallest(dresd)
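The implementation of `remove.k.smallest` is not shown here. As a rough illustration of the idea in plain R, dropping the k smallest values does not need a full sort: a partial sort is enough. `drop_k_smallest` below is a hypothetical name, and it assumes the order of the remaining values does not matter.

```r
# Drop the k smallest values of x without fully sorting it.
# Assumption: the caller does not need the remaining values in sorted order.
drop_k_smallest <- function(x, k) {
  stopifnot(k >= 0, k <= length(x))
  if (k == 0) return(x)
  # Partial sort: position k holds the k-th smallest value and
  # everything before it is less than or equal to it.
  partially_sorted <- sort(x, partial = k)
  tail(partially_sorted, length(x) - k)
}

set.seed(1)
x <- runif(20)
kept <- drop_k_smallest(x, 5)
length(kept)   # 15 values remain
```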

Test with 2,000 residuals: better!

Wilcoxon Tau

- But what about really huge inputs?
- Bootstrapping: when over 5,000 rows, repeat the estimate on 1,000 sampled points 100 times
- Not about speed, but about memory
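A minimal sketch of this subsampling scheme. The slide does not say how the 100 estimates are combined or whether sampling is with replacement; averaging and sampling without replacement are assumptions here. `subsample_estimate` is a hypothetical name, and `mad` is used only as a stand-in scale estimator.

```r
# Estimate on the full data when it is small; otherwise repeat the
# estimate on random subsamples so the O(n^2) step never sees all rows.
subsample_estimate <- function(resid, estimator,
                               threshold = 5000, size = 1000, reps = 100) {
  if (length(resid) <= threshold) {
    return(estimator(resid))
  }
  estimates <- vapply(
    seq_len(reps),
    function(i) estimator(sample(resid, size)),  # without replacement (assumption)
    numeric(1)
  )
  mean(estimates)  # assumption: combine the repeated estimates by averaging
}

set.seed(42)
resid <- rnorm(20000)
subsample_estimate(resid, mad)
```

Each call to the estimator only ever works with 1,000 points, so peak memory stays bounded no matter how many rows come in.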

Pairup

- Pairup function: generates every possible pair from input vector
- Some rank-based estimators require pairwise operations
- O(n^2) complexity

Pairup

- Original version: vectorized (14 LOC)
- Loop version (12 LOC)
- "combn" version (core R function, 1 LOC)
- C++ version (12 LOC)
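For orientation, here is roughly what three of these variants could look like in plain R. These are illustrative sketches, not the rlme code; the function names are hypothetical and the input is assumed to be a numeric vector of length 2 or more.

```r
# Loop version with a preallocated result matrix.
pairup_loop <- function(x) {
  n <- length(x)
  out <- matrix(NA_real_, nrow = choose(n, 2), ncol = 2)
  k <- 1
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      out[k, ] <- c(x[i], x[j])
      k <- k + 1
    }
  }
  out
}

# Vectorized version: build both index vectors up front.
pairup_vectorized <- function(x) {
  n <- length(x)
  i <- rep(seq_len(n - 1), times = (n - 1):1)
  j <- i + sequence((n - 1):1)
  cbind(x[i], x[j])
}

# combn version: one line using the built-in combination generator.
pairup_combn <- function(x) t(combn(x, 2))

x <- c(2.5, 1, 4, 3)
stopifnot(isTRUE(all.equal(pairup_loop(x), pairup_combn(x))),
          isTRUE(all.equal(pairup_vectorized(x), pairup_combn(x))))
```

All three return a choose(n, 2) by 2 matrix, so they can be benchmarked against each other directly.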

Pairup

Over to R!

Covariance Estimator

- n × n covariance matrix
- Change to preallocation

Summary

- Identify
- Rewrite
- Benchmark
- Test

Keeping Ahead

- Parallelism
- Cluster: Rmpi, snow
- GPU: rpud
- Probably not Hadoop, maybe Apache Spark?
- Julia Language
- Hadley Wickham (plyr, ggplot, testthat, ...)
- “Advanced R Programming”

Questions?
