big data, bigger data & big r data

Big Data, Bigger Data & Big R Data

Birmingham R Users Meeting, 23rd April 2013

Andy Pryke, [email protected] / @AndyPryke

www.the-data-mine.co.uk

My Bias…

I work in commercial data mining, data analysis and data visualisation

Background in computing and artificial intelligence

Use R to write programs which analyse data


What is Big Data?

Depends who you ask. Answers are often “too big to…”

• …load into memory
• …store on a hard drive
• …fit in a standard database

Plus:

• “Fast changing”
• Not just relational


My “Big Data” Definition

“Data collections big enough to require you to change the way you store and process them.”

- Andy Pryke


Data Size Limits in R

Standard R packages use a single thread, with data held in memory (RAM). See help("Memory-limits").

• Vectors limited to 2 billion (2^31 - 1) items
• Memory limit of ~128 TB

Servers with 1 TB+ memory are available

• Also, Amazon EC2 servers up to 244 GB
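As a rough illustration of these limits, the base R sketch below works out how much RAM a single maximum-length numeric vector would need under the 2 billion item limit. It assumes 8 bytes per double and ignores the small object header; the variable names are only illustrative.

```r
# Largest length of a standard (non-long) vector: 2^31 - 1 items
max_len <- .Machine$integer.max

# A numeric (double) value takes 8 bytes, so a full-length numeric
# vector needs roughly this many GiB of RAM (object header ignored)
bytes_per_double <- 8
full_vector_gib <- max_len * bytes_per_double / 2^30

# object.size() reports the memory actually used by a smaller vector
small_vec <- numeric(1e6)
small_vec_bytes <- as.numeric(object.size(small_vec))

print(max_len)
print(round(full_vector_gib, 1))
print(small_vec_bytes)
```

So one full-length numeric vector already needs about as much memory as a well-equipped desktop has in total, well before the address-space limit matters.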


Overview

• Problems using R with Big Data
• Processing data on disk
• Hadoop for parallel computation and Big Data storage / access
• “In Database” analysis
• What next for Birmingham R User Group?


Background: R matrix class

“matrix”
- Built in (package base)
- Stored in RAM
- “Dense” (takes up memory to store zero values)

Can be replaced by…


Sparse / Disk Based Matrices

• Matrix – Package Matrix. Sparse. In RAM
• big.matrix – Package bigmemory / bigmemoryExtras & VAM. On disk. VAM allows access from parallel R sessions
• Analysis – Packages irlba, bigalgebra, biganalytics (R-Forge list) etc.

More details? “Large-Scale Linear Algebra with R”, Bryan W. Lewis, Boston R Users Meetup

https://r-forge.r-project.org/R/?group_id=556

http://illposed.net/boston_r_meetup_2012.pdf
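To make the dense versus sparse difference concrete, here is a minimal sketch using the Matrix package (a recommended package shipped with standard R installs). The matrix size and the three non-zero cells are made up purely for illustration.

```r
library(Matrix)  # provides sparseMatrix()

n <- 1000

# Dense base matrix: every cell, including zeros, takes 8 bytes
dense <- matrix(0, nrow = n, ncol = n)
dense[1, 1] <- 1
dense[500, 20] <- 2
dense[1000, 1000] <- 3

# Sparse equivalent: only the three non-zero cells are stored
sparse <- sparseMatrix(i = c(1, 500, 1000),
                       j = c(1, 20, 1000),
                       x = c(1, 2, 3),
                       dims = c(n, n))

dense_bytes  <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))

# Same values, very different memory footprint
same_values <- all(as.matrix(sparse) == dense)

print(c(dense = dense_bytes, sparse = sparse_bytes))
```

The saving depends entirely on how many cells are zero; for a mostly non-zero matrix the sparse form can end up larger than the dense one.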


Commercial Versions of R

Revolution Analytics have specialised versions of R for parallel execution & big data

I believe many if not most components are also available under Free Open Source licences, including the RHadoop set of packages

Plenty more info here

http://www.revolutionanalytics.com/products/enterprise-big-data.php


Background: Hadoop

• Parallel data processing environment based on Google’s “MapReduce” model

• “Map” – Divide up the data and send it to multiple nodes for processing
• “Reduce” – Combine the results

Plus:

• Hadoop Distributed File System (HDFS)
• HBase – Distributed database like Google’s BigTable
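The map and reduce idea can be sketched in plain base R without Hadoop at all. This is only an analogy on one machine: the four chunks stand in for nodes, and the data and names are invented for the example.

```r
set.seed(42)
values <- runif(1000)

# "Map": split the data into chunks (one per pretend node) and let
# each chunk compute a small partial result independently
chunks <- split(values, rep(1:4, length.out = length(values)))
partials <- lapply(chunks, function(chunk) {
  c(total = sum(chunk), n = length(chunk))
})

# "Reduce": combine the partial results into the final answer
combined <- Reduce(`+`, partials)
chunked_mean <- unname(combined["total"] / combined["n"])

print(chunked_mean)
```

Each chunk only returns two numbers, which is why this pattern scales: the full data never has to sit in one place, only the small partial results do.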


RHadoop – Revolution Analytics

Packages: rmr2, rhbase, rhdfs

• Example code using RMR (R Map-Reduce)
• R and Hadoop – Step by Step Tutorials
• Install and Demo RHadoop (Google for more of these online)
• Data Hacking with RHadoop

https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md

http://blog.revolutionanalytics.com/2012/03/r-and-hadoop-step-by-step-tutorials.html



http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/

http://www.slideshare.net/edkohlwey/data-hacking-with-rhadoop


RHadoop

library(rmr2)  ## provides keyval() and mapreduce()

wc.map <- function(., lines) {
  ## split "lines" of text into a vector of individual "words"
  words <- unlist(strsplit(x = lines, split = " "))
  keyval(words, 1)  ## each word occurs once
}

wc.reduce <- function(word, counts) {
  ## Add up the counts, grouping them by word
  keyval(word, sum(counts))
}

wordcount <- function(input, output = NULL) {
  mapreduce(
    input = input,
    output = output,
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = TRUE)
}

E.g. Function Output

wc.map output:
## In, 1
## the, 1
## beginning, 1
## ...

wc.reduce output:
## the, 2345
## word, 987
## beginning, 123
## ...
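The word count logic can be checked locally in base R before going anywhere near a cluster. This is a sketch of the same two steps using strsplit() and tapply() rather than rmr2, and the two input lines are made up for the example.

```r
lines <- c("In the beginning was the word",
           "and the word was data")

# Map step: split each line into words, pairing every word with a 1
words <- unlist(strsplit(x = lines, split = " "))
ones <- rep(1, length(words))

# Reduce step: add up the counts, grouping them by word
counts <- tapply(ones, words, sum)

print(counts)
```

Here tapply() does the grouping that Hadoop performs between the map and reduce phases, so the per-word sums correspond to what the reduce function sees.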


Other Hadoop libraries for R

Other packages: hive, segue, RHIPE…

segue – easy way to distribute CPU intensive work
- Uses Amazon’s Elastic Map Reduce service, which costs money
- Not designed for big data, but easy and fun

Example follows…

http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/


segue

# first, let's generate a 10-element list of
# 999 random numbers + 1 NA:
> myList <- getMyTestList()

# Take the mean of each set of 999 numbers
> outputLocal <- lapply(myList, mean, na.rm=T)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm=T)
RUNNING - 2011-01-04 15:16:57
RUNNING - 2011-01-04 15:17:27
RUNNING - 2011-01-04 15:17:58
WAITING - 2011-01-04 15:18:29

## Check local and cluster results match
> all.equal(outputEmr, outputLocal)
[1] TRUE

# The key is the emrlapply() function. It works just like lapply(),
# but automagically spreads its work across the specified cluster


Oracle R Connector for Hadoop

• Integrates with Oracle Db, “Oracle Big Data Appliance” (sounds expensive!) & HDFS
• Map-Reduce is very similar to the rmr example
• Documentation lists examples for linear regression, k-means and working with graphs, amongst others
• Introduction to Oracle R Connector for Hadoop
• Oracle also offer some in-database algorithms for R via Oracle R Enterprise (overview)

https://blogs.oracle.com/R/entry/introduction_to_oracle_r_connector

http://docs.oracle.com/cd/E36174_01/doc.11/e36049/orch.htm#autoId14



http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html

http://www.theregister.co.uk/2012/02/10/oracle_advanced_analytics/


Teradata Integration

Package: teradataR

• Teradata offer in-database analytics, accessible through R
• These include k-means clustering, descriptive statistics and the ability to create and call in-database user defined functions

http://developer.teradata.com/applications/articles/in-database-analytics-with-teradata-r


What Next?

I propose an informal “big data” Special Interest Group, where we collaborate to explore big data options within R, producing example code etc.

“R” you interested?
