Big Data, Bigger Data & Big R Data
DESCRIPTION
My recent talk at the Birmingham R User Meeting (BRUM) was on Big Data in R. Different people have different definitions of big data. For this talk, my definition of big data is: “Data collections big enough to require you to change the way you store and process them.” - Andy Pryke
I discuss the factors which can limit the size of data analysed using R and a variety of ways to address these, including moving data structures out of RAM and onto disk; using in-database processing / analytics; and harnessing the power of Hadoop to allow massively parallel R.
TRANSCRIPT
Big Data, Bigger Data
& Big R Data
Birmingham R Users Meeting
23rd April 2013
Andy Pryke / @AndyPryke
www.the-data-mine.co.uk
My Bias…
I work in commercial data mining, data analysis and data visualisation
Background in computing and artificial intelligence
Use R to write programs which analyse data
What is Big Data?
Depends who you ask.
Answers are often “too big to ….”
…load into memory
…store on a hard drive
…fit in a standard database
Plus
“Fast changing”
Not just relational
My “Big Data” Definition
“Data collections big enough to require you to change the way you store and process them.”
- Andy Pryke
Data Size Limits in R
Standard R packages use a single thread, with data held in memory (RAM)
help("Memory-limits")
• Vectors limited to 2 billion items (2^31 - 1)
• Memory limit of ~128TB
Servers with 1TB+ memory are available
• Also, Amazon EC2 servers up to 244GB
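The limits above can be checked from an R session. This is a minimal sketch using only base R; the variable names are illustrative, and the arithmetic assumes a dense vector of doubles at 8 bytes per element.

```r
# Largest length of a standard (non-long) vector: 2^31 - 1
max_len <- .Machine$integer.max

# RAM needed to hold a dense double vector of that length (8 bytes each)
bytes_needed <- max_len * 8
gb_needed <- bytes_needed / 1024^3

# Measure how much memory a real, much smaller object takes
x <- numeric(1e6)
print(object.size(x), units = "Mb")
```

So a single maximum-length numeric vector already needs roughly 16GB of RAM, well before the address-space limit matters.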
Overview
• Problems using R with Big Data
• Processing data on disk
• Hadoop for parallel computation and Big Data storage / access
• “In Database” analysis
• What next for Birmingham R User Group?
Background: R matrix class
“matrix”
- Built in (package base)
- Stored in RAM
- “Dense” (takes up memory to store zero values)
Can be replaced by…..
Sparse / Disk Based Matrices
• Matrix – Package Matrix. Sparse. In RAM
• big.matrix – Package bigmemory / bigmemoryExtras & VAM. On disk. VAM allows access from parallel R sessions
• Analysis – Packages irlba, bigalgebra, biganalytics (R-Forge list) etc.
More details? “Large-Scale Linear Algebra with R”, Bryan W. Lewis, Boston R Users Meetup
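To illustrate the dense versus sparse difference, here is a small sketch, assuming the Matrix package is installed (it is a recommended package in standard R distributions). The matrix size and number of non-zero entries are arbitrary choices for the example, not figures from the talk.

```r
library(Matrix)

set.seed(42)
n <- 2000   # matrix is n x n
k <- 5000   # number of non-zero entries to place

# Random positions for the non-zero values (duplicate positions are summed)
i <- sample.int(n, k, replace = TRUE)
j <- sample.int(n, k, replace = TRUE)

# Sparse representation: only the non-zero entries are stored
sparse <- sparseMatrix(i = i, j = j, x = rep(1, k), dims = c(n, n))

# Dense base-R representation of the same data: every zero takes 8 bytes
dense <- as.matrix(sparse)

dense_bytes  <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))
cat("dense:", dense_bytes, "bytes; sparse:", sparse_bytes, "bytes\n")
```

The disk-backed big.matrix route is not shown here; it trades RAM for file storage rather than skipping zeros.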
Commercial Versions of R
Revolution Analytics have specialised versions of R for parallel execution & big data
I believe many if not most components are also available under Free Open Source licences, including the RHadoop set of packages
Plenty more info here
Background: Hadoop
• Parallel data processing environment based on Google’s “MapReduce” model
• “Map” – divide up the data and send it to multiple nodes for processing
• “Reduce” – combine the results
Plus:
• Hadoop Distributed File System (HDFS)
• HBase – Distributed database like Google’s BigTable
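The map and reduce steps can be imitated in a single R session. This is only a local sketch of the idea in base R, with no Hadoop involved; the four chunks stand in for data sent to four nodes, and all names are illustrative.

```r
x <- 1:100

# "Map": divide the data into chunks and process each one independently.
# Each chunk returns a partial result (its sum and its item count).
chunks <- split(x, rep(1:4, each = 25))
mapped <- lapply(chunks, function(chunk) c(sum = sum(chunk), n = length(chunk)))

# "Reduce": combine the partial results into a single answer
reduced <- Reduce(`+`, mapped)
overall_mean <- reduced[["sum"]] / reduced[["n"]]
overall_mean
```

Returning sums and counts rather than per-chunk means keeps the reduce step correct even when chunks have different sizes.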
RHadoop – Revolution Analytics
Packages: rmr2, rhbase, rhdfs
• Example code using RMR (R Map-Reduce)
• R and Hadoop – Step by Step Tutorials
• Install and Demo RHadoop (Google for more of these online)
• Data Hacking with RHadoop
RHadoop
library(rmr2) ## provides keyval() and mapreduce()

wc.map <- function(., lines) {
  ## split "lines" of text into a vector of individual "words"
  words <- unlist(strsplit(x = lines, split = " "))
  keyval(words, 1) ## each word occurs once
}

wc.reduce <- function(word, counts) {
  ## Add up the counts, grouping them by word
  keyval(word, sum(counts))
}

wordcount <- function(input, output = NULL) {
  mapreduce(
    input = input,
    output = output,
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = TRUE)
}
E.g. Function Output
## In, 1
## the, 1
## beginning, 1
## ...

## the, 2345
## word, 987
## beginning, 123
## ...
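The same word count can be tried without a Hadoop cluster. The sketch below is a base-R stand-in, not rmr2: the input sentences are made up for illustration, and tapply() plays the part of the shuffle and reduce stages.

```r
# Hypothetical input: two lines of text
lines <- c("In the beginning was the word",
           "and the word was good")

# Map step: split lines into words and pair each word with a count of 1
words <- unlist(strsplit(x = lines, split = " "))
ones  <- rep(1, length(words))

# Shuffle + reduce step: group the 1s by word and add them up
counts <- tapply(ones, words, sum)
counts
```

On a real cluster the grouping by key is done by Hadoop between the map and reduce functions, which is why wc.reduce only has to sum.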
Other Hadoop libraries for R
Other packages: hive, segue, RHIPE…
segue – easy way to distribute CPU intensive work
- Uses Amazon’s Elastic Map Reduce service, which costs money.
- Not designed for big data, but easy and fun.
Example follows…
segue
# first, let's generate a 10-element list of
# 999 random numbers + 1 NA:
> myList <- getMyTestList()
# Take the mean of each set of 999 numbers
> outputLocal <- lapply(myList, mean, na.rm=T)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm=T)
RUNNING - 2011-01-04 15:16:57
RUNNING - 2011-01-04 15:17:27
RUNNING - 2011-01-04 15:17:58
WAITING - 2011-01-04 15:18:29
## Check local and cluster results match
> all.equal(outputEmr, outputLocal)
[1] TRUE
# The key is the emrlapply() function. It works just like lapply(),
# but automagically spreads its work across the specified cluster
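The "drop-in replacement for lapply()" pattern can be tried for free on one machine. This sketch uses parLapply() from the base parallel package as a local stand-in for emrlapply(); it is not segue and does not touch Amazon EMR. The test list is generated inline because getMyTestList() is not defined in the slides, and the two-worker cluster size is an arbitrary assumption.

```r
library(parallel)

# Build a 10-element list, each holding 999 random numbers + 1 NA
set.seed(1)
myList <- lapply(1:10, function(i) c(rnorm(999), NA))

# Ordinary, single-process version
outputLocal <- lapply(myList, mean, na.rm = TRUE)

# Parallel version: same arguments, plus a cluster of 2 local worker processes
cl <- makeCluster(2)
outputPar <- parLapply(cl, myList, mean, na.rm = TRUE)
stopCluster(cl)

# Check local and parallel results match
all.equal(outputPar, outputLocal)
```

As with segue, this helps with CPU-intensive work rather than data too big for memory, since every chunk still passes through the master session.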
Oracle R Connector for Hadoop
• Integrates with Oracle Db, “Oracle Big Data Appliance” (sounds expensive!) & HDFS
• Map-Reduce is very similar to the rmr example
• Documentation lists examples for Linear Regression, k-means, working with graphs amongst others
• Introduction to Oracle R Connector for Hadoop.
• Oracle also offer some in-database algorithms for R via Oracle R Enterprise (overview)
Teradata Integration
Package: teradataR
• Teradata offer in-database analytics, accessible through R
• These include k-means clustering, descriptive statistics and the ability to create and call in-database user defined functions
What Next?
I propose an informal “big data” Special Interest Group, where we collaborate to explore big data options within R, producing example code etc.
“R” you interested?