Big Data, Bigger Data & Big R Data. Birmingham R Users Meeting, 23rd April 2013. Andy Pryke, [email protected] / @AndyPryke


DESCRIPTION

My recent talk at the Birmingham R User Meeting (BRUM) was on Big Data in R. Different people have different definitions of big data. For this talk, my definition is: “Data collections big enough to require you to change the way you store and process them.” - Andy Pryke. I discuss the factors which can limit the size of data analysed using R, and a variety of ways to address them, including moving data structures out of RAM and onto disk, using in-database processing/analytics, and harnessing the power of Hadoop to allow massively parallel R.

TRANSCRIPT

Page 1: Big Data, Bigger Data & Big R Data

Big Data, Bigger Data & Big R Data

Birmingham R Users Meeting, 23rd April 2013

Andy Pryke, [email protected] / @AndyPryke

Page 2: Big Data, Bigger Data & Big R Data

www.the-data-mine.co.uk

My Bias…

I work in commercial data mining, data analysis and data visualisation

Background in computing and artificial intelligence

Use R to write programs which analyse data

Page 3: Big Data, Bigger Data & Big R Data


What is Big Data?

Depends who you ask. Answers are often “too big to…”

• …load into memory
• …store on a hard drive
• …fit in a standard database

Plus: “fast changing”, and not just relational

Page 4: Big Data, Bigger Data & Big R Data


My “Big Data” Definition

“Data collections big enough to require you to change the way you store and process them.”

- Andy Pryke

Page 5: Big Data, Bigger Data & Big R Data


Data Size Limits in R

Standard R packages use a single thread, with data held in memory (RAM). See help("Memory-limits"):

• Vectors limited to 2 billion items
• Memory limit of ~128Tb of address space

Servers with 1Tb+ of memory are available; Amazon EC2 offers servers with up to 244Gb.
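A quick way to get a feel for these limits on your own machine (a sketch; exact sizes vary slightly by OS and R build):

```r
## Each double takes 8 bytes, so a numeric vector's memory cost grows fast
x <- numeric(1e6)              # one million doubles
print(object.size(x))          # roughly 8 MB

## The classic per-vector element ceiling (~2 billion items)
print(.Machine$integer.max)    # 2147483647
```

Scaling this up, a billion-element numeric vector needs around 8Gb of RAM before any analysis even starts, which is why the approaches in the rest of this talk matter.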

Page 6: Big Data, Bigger Data & Big R Data


Overview

• Problems using R with Big Data
• Processing data on disk
• Hadoop for parallel computation and Big Data storage / access
• “In Database” analysis
• What next for Birmingham R User Group?

Page 7: Big Data, Bigger Data & Big R Data


Background: R matrix class

“matrix” – built in (package base)
• Stored in RAM
• “Dense” – takes up memory to store zero values

Can be replaced by…

Page 8: Big Data, Bigger Data & Big R Data


Sparse / Disk Based Matrices

• Matrix – package Matrix. Sparse. In RAM
• big.matrix – packages bigmemory / bigmemoryExtras & VAM. On disk. VAM allows access from parallel R sessions
• Analysis – packages irlba, bigalgebra, biganalytics (R-Forge list), etc.

More details? “Large-Scale Linear Algebra with R”, Bryan W. Lewis, Boston R Users Meetup
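A rough illustration of why sparse storage matters (a sketch assuming only the Matrix package, which is distributed with R):

```r
library(Matrix)

## A 1000 x 1000 matrix holding a single non-zero value
m_dense <- matrix(0, nrow = 1000, ncol = 1000)
m_dense[1, 1] <- 1

## The same data in Matrix's sparse representation
m_sparse <- Matrix(m_dense, sparse = TRUE)

print(object.size(m_dense))    # ~8 MB: every zero is stored
print(object.size(m_sparse))   # a few KB: only non-zeros are stored
```

For the mostly-zero matrices common in text mining and recommender-style problems, this difference alone can move a dataset from “won’t fit” to “fits comfortably in RAM”.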

Page 9: Big Data, Bigger Data & Big R Data


Commercial Versions of R

Revolution Analytics offer specialised versions of R for parallel execution & big data.

I believe many, if not most, components are also available under free open source licences, including the RHadoop set of packages.

Plenty more info here

Page 10: Big Data, Bigger Data & Big R Data


Background: Hadoop

• Parallel data processing environment based on Google’s “MapReduce” model
• “Map” – divide up the data and send it to multiple nodes for processing
• “Reduce” – combine the results

Plus:
• Hadoop Distributed File System (HDFS)
• HBase – distributed database, like Google’s BigTable
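The map/reduce flow can be simulated in plain R, without Hadoop, to make the model concrete (a conceptual sketch using the word-count example that follows):

```r
lines <- c("in the beginning", "the word")

## "Map": split each line into words, conceptually emitting a (word, 1)
## pair for every word seen
words <- unlist(strsplit(lines, split = " "))

## "Reduce": sum the 1s, grouped by word
counts <- tapply(rep(1, length(words)), words, sum)

print(counts["the"])   # "the" appears twice
```

Hadoop does exactly this, except the map step runs on many nodes over many chunks of input, and the grouped sums are combined across nodes.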

Page 12: Big Data, Bigger Data & Big R Data


RHadoop

wc.map <- function(., lines) {
  ## split "lines" of text into a vector of individual "words"
  words <- unlist(strsplit(x = lines, split = " "))
  keyval(words, 1)  ## each word occurs once
}

wc.reduce <- function(word, counts) {
  ## Add up the counts, grouping them by word
  keyval(word, sum(counts))
}

wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output,
            input.format = "text",
            map = wc.map, reduce = wc.reduce,
            combine = TRUE)
}

E.g. function output

Map output:
## In, 1
## the, 1
## beginning, 1
## ...

Reduce output:
## the, 2345
## word, 987
## beginning, 123
## ...

Page 13: Big Data, Bigger Data & Big R Data


Other Hadoop libraries for R

Other packages: hive, segue, RHIPE…

segue – an easy way to distribute CPU-intensive work
- Uses Amazon’s Elastic MapReduce service, which costs money
- Not designed for big data, but easy and fun

Example follows…

Page 14: Big Data, Bigger Data & Big R Data


segue

# first, let's generate a 10-element list of
# 999 random numbers + 1 NA:
> myList <- getMyTestList()

# Add up each set of 999 numbers
> outputLocal <- lapply(myList, mean, na.rm=T)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm=T)
RUNNING - 2011-01-04 15:16:57
RUNNING - 2011-01-04 15:17:27
RUNNING - 2011-01-04 15:17:58
WAITING - 2011-01-04 15:18:29

## Check local and cluster results match
> all.equal(outputEmr, outputLocal)
[1] TRUE

# The key is the emrlapply() function. It works just like lapply(),
# but automagically spreads its work across the specified cluster

Page 15: Big Data, Bigger Data & Big R Data


Oracle R Connector for Hadoop

• Integrates with Oracle DB, “Oracle Big Data Appliance” (sounds expensive!) & HDFS
• Map-Reduce is very similar to the rmr example
• Documentation lists examples for linear regression, k-means and working with graphs, amongst others
• Introduction to Oracle R Connector for Hadoop
• Oracle also offer some in-database algorithms for R via Oracle R Enterprise (overview)

Page 16: Big Data, Bigger Data & Big R Data


Teradata Integration

Package: teradataR
• Teradata offer in-database analytics, accessible through R
• These include k-means clustering, descriptive statistics and the ability to create and call in-database user-defined functions

Page 17: Big Data, Bigger Data & Big R Data


What Next?

I propose an informal “big data” Special Interest Group, where we collaborate to explore big data options within R, producing example code etc.

“R” you interested?