Computational Techniques for the Statistical Analysis of Big Data in R A Case Study of the rlme Package Herb Susmann, Yusuf Bilgic April 12, 2014

DESCRIPTION

A talk presented at UP-Stat 2014 on techniques for optimizing R code for large data sets

TRANSCRIPT

Page 1: Computational Techniques for the Statistical Analysis of Big Data in R

Computational Techniques for the Statistical Analysis of Big Data in R

A Case Study of the rlme Package

Herb Susmann, Yusuf Bilgic

April 12, 2014

Workflow
  - Identify
  - Rewrite
  - Benchmark
  - Test

Case Study: rlme
  - Identify
  - Wilcoxon Tau Estimator
  - Pairup
  - Covariance Estimator

Summary

Keeping Ahead

Motivation

- Case study: rlme package
- Rank-based regression and estimation of two- and three-level nested effects models
- Goals: faster, less memory, more data
- Before: 5,000 rows of data
- After: 50,000 rows of data

Section 1

Workflow

Workflow

- Identify
- Rewrite
- Benchmark
- Test

Identify

- Know your big O! (O(n^2) memory usage? Probably not so good for big data)
- Look for error messages
- Profiling with Rprof
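
To make the big-O point concrete, here is a rough back-of-the-envelope sketch of what O(n^2) memory means for a dense n x n matrix of doubles (8 bytes per entry). The helper name `matrix_gib` is made up for illustration, and the row counts are the "before" and "after" figures from the Motivation slide.

```r
# Approximate size of a dense n x n matrix of doubles, in GiB.
# Ignores R's small per-object overhead.
matrix_gib <- function(n) {
  n^2 * 8 / 2^30
}

sizes <- c(5000, 50000)
footprint <- data.frame(rows = sizes, GiB = round(matrix_gib(sizes), 2))
print(footprint)
```

Ten times the rows means a hundred times the memory: about 0.19 GiB at 5,000 rows versus about 18.6 GiB at 50,000 rows.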

Rewrite

High level design

- Algorithm design
- Statistical techniques: bootstrapping

Rewrite

Microbenchmarking

- Know what R is good at
- Avoid loops in favor of vectorization
- Preallocation
- Arguments are by value, not by reference
- Embrace C++

Be careful!

Vectorizing

## Bad: loop over every element
vec = 1:100
for (i in seq_along(vec)) {
  vec[i] = vec[i]^2
}

## Better: apply a function to each element
sapply(vec, function(x) x^2)

## Best: vectorized arithmetic
vec^2

Preallocation

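## Bad: the vector is copied and regrown on every iteration
vec = c()
for (i in 1:100) {
  vec = c(vec, i)
}

## Better: allocate the full length once, then fill in place
vec = numeric(100)
for (i in 1:100) {
  vec[i] = i
}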

Pass by value

square <- function(x) {
  x <- x^2    # modifies a local copy only
  return(x)
}

x <- 1:100
square(x)     # returns the squares; the caller's x is unchanged
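
A minimal sketch of what pass-by-value means in practice: a function that modifies its argument works on its own copy, so the caller's object stays as it was. The helper `zero_first` is hypothetical and exists only for this illustration. For large inputs the copy matters, because modifying an argument inside a function means holding a second full-size vector in memory.

```r
# Hypothetical helper: overwrites the first element of its argument.
zero_first <- function(v) {
  v[1] <- 0   # R copies v here; the caller's vector is not touched
  v
}

x <- c(5, 6, 7)
y <- zero_first(x)

print(x)  # the original values
print(y)  # the modified copy
```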

Benchmark

- Write several versions of a slow function
- Test them against each other
- Package: microbenchmark
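
The steps above could look roughly like the following sketch. The three function names are invented for this example, and the code assumes the microbenchmark package is installed (the timing step is skipped otherwise). The versions are checked for equal output first, since a timing comparison only makes sense between functions that compute the same thing.

```r
# Three ways to square a vector, in the spirit of the "Vectorizing" slide.
square_loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}
square_sapply <- function(v) sapply(v, function(x) x^2)
square_vec <- function(v) v^2

v <- as.numeric(1:1000)

# All versions must agree before their timings are worth comparing.
stopifnot(isTRUE(all.equal(square_loop(v), square_vec(v))),
          isTRUE(all.equal(square_sapply(v), square_vec(v))))

# Assumes the microbenchmark package is installed; skipped otherwise.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  timings <- microbenchmark::microbenchmark(
    loop = square_loop(v),
    sapply = square_sapply(v),
    vectorized = square_vec(v),
    times = 100
  )
  print(timings)
}
```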

Test

- Regressions
- Unit Testing
- Package: testthat
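
As a hedged illustration of guarding against regressions, the sketch below keeps a slow reference implementation around and uses testthat to assert that the optimized version still returns the same answer. Both functions are hypothetical stand-ins, not code from rlme, and the example assumes testthat is installed.

```r
library(testthat)

# Reference (slow) and optimized (fast) versions of the same computation.
# Both are hypothetical stand-ins for a function being rewritten.
sum_squares_slow <- function(v) {
  total <- 0
  for (x in v) total <- total + x^2
  total
}
sum_squares_fast <- function(v) sum(v^2)

test_that("optimized version matches the reference", {
  set.seed(1)
  v <- rnorm(500)
  # expect_equal() compares with a small numeric tolerance.
  expect_equal(sum_squares_fast(v), sum_squares_slow(v))
  # Edge case: empty input.
  expect_equal(sum_squares_fast(numeric(0)), 0)
})
```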

Section 2

Case Study: rlme

Identify

Over to R!

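Rprof("profile")           # start the sampling profiler, writing to the file "profile"
fit.rlme = rlme(...)       # the call being profiled
Rprof(NULL)                # stop profiling
summaryRprof("profile")    # summarize time spent per function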
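
For readers without the rlme data at hand, here is a self-contained sketch of the same profiling pattern. `slow_fit` is a made-up, deliberately slow stand-in for the model fit, and the 10 ms sampling interval is an arbitrary choice. It assumes an R build with profiling support, which is the usual case.

```r
# Deliberately slow stand-in for a model fit: grows a vector by repeated c().
slow_fit <- function(n) {
  vec <- c()
  for (i in seq_len(n)) vec <- c(vec, i)
  sum(vec)
}

profile_file <- tempfile(fileext = ".out")
Rprof(profile_file, interval = 0.01)  # sample the call stack every 10 ms
result <- slow_fit(30000)
Rprof(NULL)                           # stop profiling

# by.self ranks functions by the time spent in their own code.
prof <- summaryRprof(profile_file)
print(head(prof$by.self))
```

Functions near the top of `by.self` are the first candidates for a rewrite.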

Wilcoxon Tau Estimator

- Rank-based scale estimator of residuals
- Uses pairup (so already O(n^2))

Wilcoxon Tau Estimator

Original:

dresd <- sort(abs(temp[, 1] - temp[, 2]))
dresd = dresd[(p + 1):choose(n, 2)]

What’s wrong? Bad algorithm: the full sort is at least O(n log n) in the length of the sorted vector, although only the p smallest values need to be dropped. The variable also gets copied multiple times.

Updated with C++:

dresd = remove.k.smallest(dresd)
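
The talk's replacement is written in C++ and its source is not shown here. As a rough pure-R sketch of the same idea (drop the k smallest values without fully sorting), one option is R's partial sort. The function name `remove_k_smallest` and its two-argument signature are assumptions made for this example, and the sketch assumes downstream code does not need the remaining values in sorted order.

```r
# Hypothetical R stand-in for the talk's remove.k.smallest(); this is not
# the authors' C++ code. Returns x without its k smallest values.
# Assumption: the caller does not need the result in sorted order.
remove_k_smallest <- function(x, k) {
  if (k <= 0) return(x)
  if (k >= length(x)) return(x[0])
  # Partial sort: position k holds the k-th smallest value, with smaller
  # values before it and larger values after it (in no particular order).
  partially_sorted <- sort(x, partial = k)
  partially_sorted[-seq_len(k)]
}

set.seed(42)
x <- abs(rnorm(1000))
k <- 3
kept <- remove_k_smallest(x, k)

# Same values as the full-sort approach, up to ordering.
stopifnot(isTRUE(all.equal(sort(kept), sort(x)[(k + 1):length(x)])))
```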

Wilcoxon Tau Estimator

Test with 2,000 residuals: better!

Wilcoxon Tau

- But what about really huge inputs?
- Bootstrapping: when over 5,000 rows, repeat the estimate on 1,000 sampled points 100 times
- Not about speed, but about memory
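
The bootstrapping rule above might be sketched as follows. Several details are assumptions, not taken from the talk: points are drawn with replacement, the 100 estimates are combined with `mean()`, and `mad()` is used only as a placeholder scale estimator rather than the Wilcoxon tau estimator. The function name and its arguments are likewise invented for this example.

```r
# Sketch of the subsampling idea. Assumptions (not from the talk): sampling
# is with replacement and the repeated estimates are averaged.
# `estimator` is any function that returns a single number.
subsampled_estimate <- function(x, estimator, max_n = 5000,
                                sample_size = 1000, repeats = 100) {
  if (length(x) <= max_n) {
    return(estimator(x))  # small input: estimate on the full data
  }
  estimates <- numeric(repeats)  # preallocate the result vector
  for (r in seq_len(repeats)) {
    estimates[r] <- estimator(sample(x, sample_size, replace = TRUE))
  }
  mean(estimates)
}

set.seed(1)
residuals_big <- rnorm(50000, sd = 2)
# mad() is only a stand-in scale estimator here.
print(subsampled_estimate(residuals_big, mad))
```

Each repeat only ever touches 1,000 points, so a pairwise estimator would need memory for choose(1000, 2) pairs at a time instead of choose(50000, 2).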

Pairup

- Pairup function: generates every possible pair from the input vector
- Some rank-based estimators require pairwise operations
- O(n^2) complexity

Pairup

- Original version: vectorized (14 LOC)
- Loop version (12 LOC)
- "combn" version (core R function, 1 LOC)
- C++ version (12 LOC)
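
The package's actual pairup implementations are not reproduced in the slides, so the following is only a sketch of what a combn-based version and a preallocated loop version could look like. Both function names are hypothetical, and the rlme code may differ in output layout.

```r
# Hypothetical reimplementations; rlme's own pairup code may differ.

# combn version: combn() returns a 2 x choose(n, 2) matrix, so transpose it
# to get one pair per row. Intended for vectors of length 2 or more.
pairup_combn <- function(x) t(combn(x, 2))

# Loop version with a preallocated result matrix.
pairup_loop <- function(x) {
  n <- length(x)
  pairs <- matrix(0, nrow = choose(n, 2), ncol = 2)
  row <- 1
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      pairs[row, ] <- c(x[i], x[j])
      row <- row + 1
    }
  }
  pairs
}

x <- c(1.5, 2.0, 4.0, 8.0)
print(pairup_combn(x))

# Both versions should produce the same pairs in the same order.
stopifnot(isTRUE(all.equal(pairup_combn(x), pairup_loop(x))))
```

Either way the result has choose(n, 2) rows, which is where the O(n^2) cost comes from.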

Over to R!

Covariance Estimator

- n × n covariance matrix
- Change to preallocation

Summary

- Identify
- Rewrite
- Benchmark
- Test

Keeping Ahead

- Parallelism
- Cluster: Rmpi, snow
- GPU: rpud
- Probably not Hadoop, maybe Apache Spark?
- Julia Language
- Hadley Wickham (plyr, ggplot, testthat, ...)
- “Advanced R Programming”

Questions?