tessera: open source tools for big data analysis in...

Tessera: Open Source Tools for Big DataAnalysis in R

David Zeitler - Grand Valley State University Statistics

August 12, 2015

Attribution

This presentation is based work done for the June 30, 2015 useR!Conference by

I Ryan Hafen([@hafenstats](https://twitter.com/hafenstats) )

I Stephen F. ElstonI Amanda M. White

https://twitter.com/hafenstats

Deep Analysis of Large, Complex Data l

I Data most often do not come with a modelI If we already (think we) know the algorithm / model to apply

and simply apply it to the data and nothing else, we are notdoing analysis, we are processing

I Deep analysis meansI detailed, comprehensive analysis that does not lose important

information in the dataI learning from the data, not forcing our preconceptions on the

dataI being willing and able to use any of the 1000s of statistical,

machine learning, and visualization methods as dictated by thedata

I trial and error, an iterative process of hypothesizing, fitting,validating, learning

I a lot of visualization

Deep Analysis of Large, Complex Data

Large complex data has any or all of the following:

I Large number of recordsI Many variablesI Complex data structures not readily put into tabular form of

cases by variablesI Intricate patterns and dependencies that require complex

models and methods of analysisI Does not conform to simple assumptions made by many

algorithms

The Goal of Tessera

Provide an environment that allows us to do the following with largecomplex data:

I Work completely in RI Have access to R’s 1000s of statistical, ML, and vis methods

ideally with no need to rewrite scalable versionsI Be able to apply any ad-hoc R codeI Minimize time thinking about code or distributed systemsI Maximize time thinking about the dataI Be able to analyze it with nearly as much flexibility and ease as

small data

Tessera Packages

Users interact primarily with two R packages:

I datadr: data analysis R package implementing the Divide &Recombine paradigm that allows data scientists to leverageparallel data and processing back-ends such as Hadoop andSpark through a simple consistent interface

I Trelliscope: visualization package that enables flexible detailedscalable visualization of large, complex data

Back End Agnostic

Interface stays the same regardless of back end

Tessera Fundamentals: D&R

Tessera Fundamentals: TrelliscopeI Trelliscope: a viz tool that enables scalable, detailed

visualization of large dataI Data is split into meaningful subsets, and a visualization

method is applied to each subsetI The user can sort and filter plots based on “cognostics” -

summary statistics of interest - to explore the data (example)

https://www.youtube.com/watch?v=KatkJx9Ui0o

The Current Tessera Distributed Computing Stack

I trelliscope: visualization of subsets of data, web interfacepowered by Shiny http://shiny.rstudio.com

I datadr: interface for divide and recombine operationsI RHIPE: The R and Hadoop Integrated Programming

EnvironmentI Hadoop: Framework for managing data and computation

distributed across multiple hardrives in a clusterI HDFS: Hadoop Distributed File System

http://shiny.rstudio.com

More Information

I http://tessera.ioI http://github.com/tesseradataI @TesseraIOI Google user groupI Try it out

I If you have some applications in mind, give it a try!I You don’t need big data or a cluster to use TesseraI Development team is eager for feedback and ready to help

http://tessera.io

http://github.com/tesseradata

https://twitter.com/TesseraIO

https://groups.google.com/forum/#!forum/tessera-users

Introduction to datadr

Installing the Tessera packages

install.packages("devtools") # if not installedlibrary(devtools)install_github("tesseradata/datadr")install_github("tesseradata/trelliscope")install_github("hafen/housingData") # demo data

Housing Data

I Housing sales and listing data in the United StatesI Between 2008-10-01 and 2014-03-01I Aggregated to the county levelI Zillow.com data provided by Quandl

(https://www.quandl.com/c/housing)

https://www.quandl.com/c/housing

Housing Data Variables

Variable Description

fips Federal Information Processing Standard a 5 digit count codecounty US county namestate US state nametime date (the data is aggregated monthly)nSold number sold this monthmedListPriceSqft median list price per square footmedSoldPriceSqft median sold price per square foot

datadr data representation

I Fundamentally, all data types are stored in a back-end askey/value pairs

I Data type abstractions on top of the key/value pairsI Distributed data frame (ddf):

I A data frame that is split into chunksI Each chunk contains a subset of the rows of the data frameI Each subset may be distributed across the nodes of a cluster

I Distributed data object (ddo):I Similar to distributed data frameI Except each chunk can be an object with any structureI Every distributed data frame is also a distributed data object

Back-ends

datadr data back-end options:

I In memoryI Local diskI HDFSI Spark (under development)

Data ingest

# similar to read.table function:my.data <- drRead.table(

hdfsConn("/home/me/dir/datafile.txt", header=TRUE, sep="\t"))

# similar to read.csv function:my.data2 <- drRead.csv(

localDiskConn("c:/my/local/data.csv"))

# convert in memory data.frame to ddf:my.data3 <- ddf(some.data.frame)

In memory example

# Load necessary librarieslibrary(datadr)library(trelliscope)library(housingData)

# housing data frame is in the housingData packagehousingDdf <- ddf(housing)

Division

I A common thing to do is to divide a dataset based on thevalue of one or more variables

I Another option is to divide data into random replicatesI Use random replicates to estimate a GLM fit by applying GLM

to each replicate subset and taking the mean coefficientsI Random replicates can be used for other all-data model fitting

approaches like bag of little bootstraps, concensus MCMC, etc.

Divide example

Divide the housing data set by the variables “county” and “state”

(This kind of data division is very similar to the functionalityprovided by the plyr package)

byCounty <- divide(housingDdf,by = c("county", "state"), update = TRUE)

Divide example

byCounty

#### Distributed data frame backed by 'kvMemory' connection#### attribute | value## ----------------+-----------------------------------------------------------## names | fips(cha), time(Dat), nSold(num), and 2 more## nrow | 224369## size (stored) | 15.73 MB## size (object) | 15.73 MB## # subsets | 2883#### * Other attributes: getKeys(), splitSizeDistn(), splitRowDistn(), summary()## * Conditioning variables: county, state

Other possibilities

byState <- divide(housing, by="state", update = TRUE)

byMonth <- divide(housing, by="time", update=TRUE)

Exploring the ddf data objectData divisions can be accessed by index or by key name

byCounty[[1]]

## $key## [1] "county=Abbeville County|state=SC"#### $value## fips time nSold medListPriceSqft medSoldPriceSqft## 1 45001 2008-10-01 NA 73.06226 NA## 2 45001 2008-11-01 NA 70.71429 NA## 3 45001 2008-12-01 NA 70.71429 NA## 4 45001 2009-01-01 NA 73.43750 NA## 5 45001 2009-02-01 NA 78.69565 NA## ...

byCounty[["county=Benton County|state=WA"]]

Exploring the ddf data object

I summary(byCounty)I names(byCounty)I length(byCounty)I getKeys(byCounty)

Transformations

I The addTransform function applies a function to eachkey/value pair in a ddf

I E.g. to calculate a summary statistic

I The transformation is not applied immediately, it is deferreduntil:

I A function that kicks off a map/reduce job is called (e.g.recombine)

I A subset of the data is requested (e.g. byCounty[[1]])I drPersist function explicitly forces transformation

computation

Transformation example

# Function to calculate a linear model and extract# the slope parameterlmCoef <- function(x) {

coef(lm(medListPriceSqft ~ time, data = x))[2]}

# Best practice tip: test transformation# function on one divisionlmCoef(byCounty[[1]]$value)

## time## -0.0002323686

# Apply the transform function to the ddfbyCountySlope <- addTransform(byCounty, lmCoef)

Transformation example

byCountySlope[[1]]

## $key## [1] "county=Abbeville County|state=SC"#### $value## time## -0.0002323686

Examples

# example 1totalSold <- function(x) {

sum(x$nSold, na.rm=TRUE)}byCountySold <- addTransform(byCounty, totalSold)

# example 2timeRange <- function(x) {

range(x$time)}byCountyTime <- addTransform(byCounty, timeRange)

RecombinationI Combine transformation results together againI Example

countySlopes <- recombine(byCountySlope,combine=combRbind)

head(countySlopes)

## county state val## time Abbeville County SC -0.0002323686## time1 Acadia Parish LA 0.0019518441## time2 Accomack County VA -0.0092717711## time3 Ada County ID -0.0030197554## time4 Adair County IA -0.0308381951## time5 Adair County KY 0.0034399585

Recombination options

combine parameter controls the form of the result

I combine=combRbind: rbind is used to combine results intodata.frame, this is the most frequently used option

I combine=combCollect: results are collected into a listI combine=combDdo: results are combined into a ddo object

divide

Divide two new datasets geoCounty and wikiCounty by countyand state

# look at the data firsthead(geoCounty)head(wikiCounty)

# use divide function on each

geoByCounty <- divide(geoCounty,by=c("county", "state"))

wikiByCounty <- divide(wikiCounty,by=c("county", "state"))

Data operations: drJoin

Join together multiple data objects based on key

joinedData <- drJoin(housing=byCounty,slope=byCountySlope,geo=geoByCounty,wiki=wikiByCounty)

Distributed data objects vs distributed data frames

I In a ddf the value in each key/value is always a data.frameI A ddo can accomodate values that are not data.frames

class(joinedData)

## [1] "ddo" "kvMemory"

Distributed data objects vs distributed data framesjoinedData[[176]]

## $key## [1] "county=Benton County|state=WA"#### $value## $housing## fips time nSold medListPriceSqft medSoldPriceSqft## 1 53005 2008-10-01 137 106.6351 106.2179## 2 53005 2008-11-01 80 106.9650 NA## 3 53005 2008-11-01 NA NA 105.2370## 4 53005 2008-12-01 95 107.6642 105.6311## 5 53005 2009-01-01 73 107.6868 105.8892## 6 53005 2009-02-01 97 108.3566 NA## 7 53005 2009-02-01 NA NA 104.3273## 8 53005 2009-03-01 125 107.1968 103.2748## 9 53005 2009-04-01 147 107.7649 102.2363## 10 53005 2009-05-01 192 108.6823 NA## 11 53005 2009-05-01 NA NA 103.8925## 12 53005 2009-06-01 256 108.5143 105.1873## 13 53005 2009-07-01 232 108.4902 104.8865## 14 53005 2009-08-01 250 108.5763 NA## 15 53005 2009-08-01 NA NA 105.6069## 16 53005 2009-09-01 185 109.4762 111.4779## 17 53005 2009-10-01 208 108.3172 111.3014## 18 53005 2009-11-01 171 109.1082 NA## 19 53005 2009-11-01 NA NA 111.3912## 20 53005 2009-12-01 116 109.6998 109.8340## 21 53005 2010-01-01 112 109.8148 109.8369## 22 53005 2010-02-01 130 109.1902 NA## 23 53005 2010-02-01 NA NA 109.5266## 24 53005 2010-03-01 230 109.1489 111.3277## 25 53005 2010-04-01 257 108.9922 111.9392## 26 53005 2010-05-01 263 109.1406 NA## 27 53005 2010-05-01 NA NA 112.3360## 28 53005 2010-06-01 277 109.7328 112.4703## 29 53005 2010-07-01 143 109.3155 112.5845## 30 53005 2010-08-01 147 109.5679 NA## 31 53005 2010-08-01 NA NA 113.5515## 32 53005 2010-09-01 145 109.9034 114.7505## 33 53005 2010-10-01 158 111.1056 111.6532## 34 53005 2010-11-01 119 110.9722 NA## 35 53005 2010-11-01 NA NA 111.1267## 36 53005 2010-12-01 96 112.1773 109.0577## 37 53005 2011-01-01 84 113.0760 109.4568## 38 53005 2011-02-01 106 114.2857 NA## 39 53005 2011-02-01 NA NA 111.6570## 40 53005 2011-03-01 157 113.9522 111.6075## 41 53005 2011-04-01 178 112.5166 111.9766## 42 53005 2011-05-01 176 113.6156 NA## 43 53005 2011-05-01 NA NA 110.9980## 44 53005 2011-06-01 160 113.2360 110.8176## 45 53005 2011-07-01 175 112.5824 113.7276## 46 53005 2011-08-01 153 NA NA## 47 53005 2011-08-01 NA 119.9161 114.2078## 48 53005 2011-09-01 157 118.8687 114.6581## 49 53005 2011-10-01 148 118.1545 113.7756## 50 53005 2011-11-01 70 NA NA## 51 53005 2011-11-01 NA 118.0064 112.0518## 52 53005 2011-12-01 102 118.2796 105.5297## 53 53005 2012-01-01 72 118.4511 107.1666## 54 53005 2012-02-01 82 NA NA## 55 53005 2012-02-01 NA 118.1705 107.7065## 56 53005 2012-03-01 108 118.5601 106.8745## 57 53005 2012-04-01 100 118.3301 108.6945## 58 53005 2012-05-01 134 NA NA## 59 53005 2012-05-01 NA 118.3333 109.9282## 60 53005 2012-06-01 116 118.0660 111.0290## 61 53005 2012-07-01 115 118.2432 110.6429## 62 53005 2012-08-01 119 NA NA## 63 53005 2012-08-01 NA 118.4626 111.5133## 64 53005 2012-09-01 103 118.8302 111.9573## 65 53005 2012-10-01 129 117.9672 111.9468## 66 53005 2012-11-01 121 NA NA## 67 53005 2012-11-01 NA 117.8074 110.9557## 68 53005 2012-12-01 122 118.1281 108.4398## 69 53005 2013-01-01 110 117.2370 106.6948## 70 53005 2013-02-01 NA 117.2093 105.8374## 71 53005 2013-03-01 NA 117.7890 108.0909## 72 53005 2013-04-01 NA 118.4573 108.3922## 73 53005 2013-05-01 NA 119.0476 109.3556## 74 53005 2013-06-01 NA 118.8369 112.1253## 75 53005 2013-07-01 NA 118.5484 113.4037## 76 53005 2013-08-01 NA 119.8686 113.7689## 77 53005 2013-09-01 NA 120.1890 112.2339## 78 53005 2013-10-01 NA 119.2271 113.0176## 79 53005 2013-11-01 NA 118.5382 114.7066## 80 53005 2013-12-01 NA 120.0107 112.5436## 81 53005 2014-01-01 NA 120.5357 109.4688## 82 53005 2014-02-01 NA 120.5588 109.4023## 83 53005 2014-03-01 NA 121.6829 108.9446#### $slope## time## 0.007726753#### $geo## fips lon lat rMapState rMapCounty## 1 53005 -119.5155 46.23413 washington benton#### $wiki## fips pop2013 href## 1 53005 184486 http://en.wikipedia.org/wiki/Benton_County,_Washington

Data operations: drFilter

Filter a ddf or ddo based on key and/or value

# Note that a few county/state combinations do# not have housing sales data:names(joinedData[[2884]]$value)

## [1] "geo" "wiki"

# We want to filter those out thosejoinedData <- drFilter(joinedData,

function(v) {!is.null(v$housing)

})

Other data operations

I drSample: returns a ddo containing a random sample (i.e. aspecified fraction) of key/value pairs

I drSubset: applies a subsetting function to the rows of a ddfI drLapply: applies a function to each subset and returns the

results in a ddo

Using Tessera with a Hadoop cluster

Differences from in memory computation:

I Data ingest: use hdfsConn to specify a file location to read inHDFS

I Each data object is stored in HDFSI Use output parameter in most functions to specify a location

in HDFS to store data

housing <- drRead.csv(file=hdfsConn("/hdfs/data/location"),output=hdfsConn("/hdfs/data/second/location"))

byCounty <- divide(housing, by=c("state", "county"),output=hdfsConn("/hdfs/data/byCounty"))

Introduction to trelliscope

Trelliscope

I Divide and recombine visualization toolI Based on Trellis displayI Apply a visualization method to each subset of a ddf or ddoI Interactively sort and filter plots

Trelliscope panel function

I Define a function to apply to each subset that creates a plotI Plots can be created using base R graphics, ggplot, lattice,

rbokeh, conceptually any htmlwidget

# Plot medListPriceSqft and medSoldPriceSqft by timetimePanel <- function(x) {

xyplot(medListPriceSqft + medSoldPriceSqft ~ time,data = x$housing, auto.key = TRUE,ylab = "Price / Sq. Ft.")

}

Trelliscope panel function# test the panel function on one divisiontimePanel(joinedData[[176]]$value)

time

Pric

e / S

q. F

t.

105

110

115

120

2009 2010 2011 2012 2013 2014

medListPriceSqftmedSoldPriceSqft

Visualization database (vdb)

I Trelliscope creates a directory with all the data to render theplots

I Can later re-launch the Trelliscope display without all the priordata analysis

vdbConn("housing_vdb", autoYes=TRUE)

Creating a Trelliscope display

makeDisplay(joinedData,name = "list_sold_vs_time_datadr",desc = "List and sold price over time",panelFn = timePanel,width = 400, height = 400,lims = list(x = "same")

)

view()

Trelliscope demo

Cognostics and display organization

I Cognostic:I a value or summary statisticI calculated on each subsetI to help the user focus their attention on plots of interest

I Cognostics are used to sort and filter plots in TrelliscopeI Define a function to apply to each subset to calculate desired

valuesI Return a list of named elementsI Each list element is a single value (no vectors or complex data

objects)

Cognostics functionpriceCog <- function(x) {

st <- getSplitVar(x, "state")ct <- getSplitVar(x, "county")zillowString <- gsub(" ", "-", paste(ct, st))list(

slope = cog(x$slope, desc = "list price slope"),meanList = cogMean(x$housing$medListPriceSqft),meanSold = cogMean(x$housing$medSoldPriceSqft),lat = cog(x$geo$lat, desc = "county latitude"),lon = cog(x$geo$lon, desc = "county longitude"),wikiHref = cogHref(x$wiki$href, desc="wiki link"),zillowHref = cogHref(

sprintf("http://www.zillow.com/homes/%s_rb/",zillowString),

desc="zillow link"))

}

Use the cognostics function in trelliscope

makeDisplay(joinedData,name = "list_sold_vs_time_datadr2",desc = "List and sold price with cognostics",panelFn = timePanel,cogFn = priceCog,width = 400, height = 400,lims = list(x = "same")

)

Trelliscope demo

Running a Local Cluster

I Local cluster can use multiple cores: each running an R processI R process requires memory for the chunks being processedI Buffer size limits number of chunks, but large chunks must be

loaded into memoryHint: look at arguments of localDiskControlHint: look at process size, ps -aux on Linux or Windows TaskManager

I Local calculations often disk-I/O boundI Clusters with HDFS achieve greater I/O bandwidth

tessera: open source tools for big data analysis in...

Documents