TRANSCRIPT
Driving New Value from Big Data Investments
An Introduction to Using R with Hadoop
Jeffrey Breen, Principal, Think Big Academy
[email protected] · http://www.thinkbigacademy.com/
February 2013
Building Modern Analytics Solutions to Monetize Big Data Investments
• IMAGINE: Strategy and Roadmap
• ILLUMINATE: Training and Education
• IMPLEMENT: Hands-On Data Science and Data Engineering
Leading Provider of Innovative Big Analytics Services
We Accelerate Your Time to Value
THINK BIG Analytics Methodology: experiment-driven short projects with nimble test-solution cycles
• Breaking down business and IT barriers
• Discrete projects with a beginning and an end
• Early releases to validate ROI and ensure long-term success
IMAGINE → ILLUMINATE → IMPLEMENT: Innovation and Value
Enable Your IT Staff with New Skills
• Expert training/courses, e.g. Hadoop Developer; HBase; Pig and Hive for Modelers
• Joint application development
• Side-by-side mentoring
[Diagram: role transitions: Data Architect → Big Data Architect; Database Administrator → Big Data Administrator; Business Analyst → Data Science Math Modeler; Developers → Big Data Engineering; Big Data Monitoring]
ILLUMINATE: Training and Education
• Build capabilities to manage the rapid innovation needed with big data
• Invest in and scale skills to create a data-driven organization
THINK BIG Analytics
Agenda
• Why R?
• What is Hadoop?
• Counting words with MapReduce
• Writing MapReduce jobs with RHadoop
• Data Warehousing with Hive
• Big Data ≠ Hadoop
• Want to learn more?
• Q&A
Why R?
Revolution Confidential
[Image slides: http://thebalancedguy.blogspot.com/2010/09/with-3-boys-and-having-been-cub-scout.html and the Wenger Giant Knife, http://www.wengerna.com/giant-knife-16999]
Number of R Packages Available
How many R packages are there now? At the R command line, enter:
> dim(available.packages())
Slide courtesy of John Versotek, organizer of the Boston Predictive Analytics Meetup
What is Hadoop?
2003: Google File System is the storage.
2004: MapReduce is the framework.
Enter Hadoop
About this time, Doug Cutting, the creator of Lucene, was working on Nutch.
Nutch Timeline
Year       Topics
2003       Google's GFS paper.
2004       Nutch Distributed File System (NDFS).
2004       Google's MapReduce paper.
2004-2005  Nutch MapReduce implementation.
Hadoop Timeline
Year  Topics
2006  NDFS and Nutch MapReduce extracted into the separate Apache Hadoop project.
2008  Hadoop becomes a top-level Apache project. Yahoo! announces a 10,000-core cluster.
Hadoop Design Goals
• Optimize disk I/O performance: minimize disk head seeks!
• Redundant data storage and processing to eliminate many kinds of data loss.
• Horizontal scalability.
• Run on commodity, server-class hardware.
[Slide from Jeff Dean, based on Peter Norvig's http://norvig.com/21-days.html]
What is Hadoop?
• An open source project designed to support large-scale data processing
• Inspired by Google's MapReduce-based computational infrastructure
• Comprised of several components:
  - Hadoop Distributed File System (HDFS)
  - MapReduce processing framework, job scheduler, etc.
  - Ingest/outgest services (Sqoop, Flume, etc.)
  - Higher-level languages and libraries (Hive, Pig, Cascading, Mahout)
• Written in Java, first opened up to alternatives through its Streaming API → if your language of choice can handle stdin and stdout, you can use it to write MapReduce jobs
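That stdin/stdout contract is easy to picture. As a sketch (not from the slides), here is roughly what a word-count job looks like as a raw Hadoop Streaming script in Python; Hadoop runs the mapper over the input, sorts its output by key, then runs the reducer:

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map: emit one tab-separated "word<TAB>1" record per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop sorts and shuffles mapper output by key, so all records
    # for a given word arrive at the reducer contiguously.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(n) for _, n in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Pick a phase by argument, e.g. to test the pipeline locally:
    #   cat input.txt | wordcount.py map | sort | wordcount.py reduce
    phase = mapper if sys.argv[1] == "map" else reducer
    for record in phase(sys.stdin):
        print(record)
```

The script name and the local `... | sort | ...` pipeline are illustrative; on a cluster the same two phases would be wired up with the hadoop-streaming jar's `-mapper` and `-reducer` options.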
Hadoop cluster components
[Diagram from Think Big Academy's Hadoop Developer Course. Client servers submit MapReduce jobs via Hive, Pig, cron+bash, Azkaban, etc.; ingest/outgest services (Sqoop, Scribe, ...) move data between SQL stores, logs, and the cluster, alongside monitoring and management tools. The primary master server runs the Job Tracker and Name Node, a secondary master server runs the Secondary Name Node, and each slave server runs a Task Tracker and a Data Node over its local disks.]
Hadoop's distributed file system
• Services: Name Node, Data Nodes
• 64 MB blocks
• 3x replication
[Same cluster diagram, highlighting the Name Node on the primary master server and the Data Nodes on the slave servers; from Think Big Academy's Hadoop Developer Course.]
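Those two numbers determine a file's physical footprint. A back-of-the-envelope sketch (my illustration, not from the course material), assuming the 64 MB block size and 3x replication above:

```python
import math

BLOCK_SIZE_BYTES = 64 * 1024 * 1024   # HDFS default block size cited above
REPLICATION = 3                       # default replication factor cited above

def hdfs_footprint(file_bytes):
    """Blocks allocated and raw bytes consumed for a single file."""
    blocks = math.ceil(file_bytes / BLOCK_SIZE_BYTES)
    raw_bytes = file_bytes * REPLICATION  # every block is stored 3 times
    return blocks, raw_bytes

# A 1 GB file splits into 16 blocks and consumes 3 GB of raw cluster disk.
blocks, raw_bytes = hdfs_footprint(1024 ** 3)
```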
Counting words with MapReduce
True confession: I was wrong about MapReduce
• When the Google paper was published in 2004, I was running a typical enterprise IT department
• Big hardware (Sun, EMC) + big applications (Siebel, PeopleSoft) + big databases (Oracle, SQL Server) = big licensing & support costs
• Loved the scalability, COTS components, and price, but missed the fact that keys (and values) could be compound & complex
• ... and examples like Wordcount didn't help!
Source: Hadoop: The Definitive Guide, Second Edition, p. 20
Copyright © 2011-2013, Think Big Analytics, All Rights Reserved

Hadoop uses MapReduce: we need to convert the Input into the Output.

Input (three lines of text):
  There is a Map phase
  Hadoop uses MapReduce
  There is a Reduce phase

Mappers receive each line keyed by its offset, (N, "…"), and emit a (word, 1) pair for every word:
  (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
  (hadoop, 1), (uses, 1), (mapreduce, 1)
  (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)

[Image: http://blog.stackoverflow.com/wp-content/uploads/then-a-miracle-occurs-cartoon.png]

The Sort, Shuffle step partitions the pairs by key (here into the ranges 0-9,a-l; m-q; r-z) and groups the values for each key:
  (a, [1,1]), (hadoop, [1]), (is, [1,1])
  (map, [1]), (mapreduce, [1]), (phase, [1,1])
  (reduce, [1]), (there, [1,1]), (uses, [1])

Reducers then collapse each group into the Output:
  a 2, hadoop 1, is 2
  map 1, mapreduce 1, phase 2
  reduce 1, there 2, uses 1

Map: transform one input to 0-N outputs.
Reduce: collect multiple inputs into one output.

from Think Big Academy's Hadoop Developer Course
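The whole flow above can be simulated in a few lines of ordinary Python (an in-memory sketch of the idea, not Hadoop itself):

```python
from collections import defaultdict

def run_mapreduce(lines, mapper, reducer):
    # Map: each input record produces 0-N (key, value) pairs.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Sort/Shuffle: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: collapse each key's value list into one output record.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

mapper = lambda line: [(word, 1) for word in line.lower().split()]
reducer = lambda word, counts: sum(counts)

lines = ["There is a Map phase",
         "Hadoop uses MapReduce",
         "There is a Reduce phase"]
counts = run_mapreduce(lines, mapper, reducer)
```

Swapping in a different mapper and reducer pair gives a different job; the `run_mapreduce` driver plays the role Hadoop plays for real data.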
Writing MapReduce jobs with RHadoop
Enter RHadoop
• RHadoop is an open source project sponsored by Revolution Analytics
• Package overview:
  - rmr2: all MapReduce-related functions
  - rhdfs: interaction with Hadoop's HDFS file system
  - rhbase: access to the NoSQL HBase database
• rmr2 uses Hadoop's Streaming API to allow R users to write MapReduce jobs in R: it handles all of the I/O and job submission for you (no while(<stdin>)-like loops!)
RHadoop Advantages
• Modular
  - Packages group similar functions
  - Only load (and learn!) what you need
  - Minimizes prerequisites and dependencies
• Open source
  - Cost: low (no) barrier to start using
  - Transparency: development, issue tracker, wiki, etc. hosted on GitHub at https://github.com/RevolutionAnalytics/RHadoop/
• Supported
  - Sponsored by Revolution Analytics
  - Training & professional services available
  - Support available with Revolution R Enterprise subscriptions
wordcount: code

library(rmr2)

map = function(k, lines) {
  words.list = strsplit(lines, '\\s')
  words = unlist(words.list)
  return( keyval(words, 1) )
}

reduce = function(word, counts) {
  keyval(word, sum(counts))
}

wordcount = function(input, output = NULL) {
  mapreduce(input = input, output = output,
            input.format = "text",
            map = map, reduce = reduce)
}

from Revolution Analytics' Getting Started with RHadoop course
wordcount: submit job and fetch results

Submit job:
> hdfs.root = 'wordcount'
> hdfs.data = file.path(hdfs.root, 'data')
> hdfs.out = file.path(hdfs.root, 'out')
> out = wordcount(hdfs.data, hdfs.out)

Fetch results from HDFS:
> results = from.dfs(out)
> results.df = as.data.frame(results, stringsAsFactors = F)
> colnames(results.df) = c('word', 'count')
> head(results.df)
       word count
1 greatness     2
2    damned     3
3       tis     5
4      jade     1
5  magician     1

from Revolution Analytics' Getting Started with RHadoop course
Code notes
• Scalable
  - Hadoop and MapReduce abstract away system details
  - Code runs on 1 node or 1,000 nodes without modification
• Portable
  - You write normal R code, interacting with normal R objects
  - RHadoop's rmr2 library abstracts away Hadoop details
  - All the functionality you expect is there, including Enterprise R's
• Flexible
  - Only the mapper deals with the data directly
  - All components communicate via key-value pairs
  - The key-value "schema" is chosen for each analysis rather than as a prerequisite to loading data into the system
rmr2 Function Overview
• Convenience
  - keyval(): creates a key-value pair from any two R objects; used to generate output from input formatters, mappers, reducers, etc.
• Input/output
  - from.dfs(), to.dfs(): read/write data from/to HDFS
  - make.input.format(): provides common file parsing (text, CSV) or wraps a user-supplied function
• Job execution
  - mapreduce(): submits the job and returns an HDFS path to the results if successful
rhdfs function overview
• File & directory manipulation
  - hdfs.ls(), hdfs.list.files()
  - hdfs.delete(), hdfs.del(), hdfs.rm()
  - hdfs.dircreate(), hdfs.mkdir()
  - hdfs.chmod(), hdfs.chown(), hdfs.file.info()
  - hdfs.exists()
• Copying, moving & renaming files to/from/within HDFS
  - hdfs.copy(), hdfs.move(), hdfs.rename()
  - hdfs.put(), hdfs.get()
• Reading files directly from HDFS
  - hdfs.file(), hdfs.read(), hdfs.write(), hdfs.flush()
  - hdfs.seek(), hdfs.tell(), hdfs.close()
  - hdfs.line.reader(), hdfs.read.text.file()
• Misc.
  - hdfs.init(), hdfs.defaults()
rhbase function overview
• Initialization
  - hb.init()
• Create and manage tables
  - hb.list.tables(), hb.describe.table()
  - hb.new.table(), hb.delete.table()
• Read and write data
  - hb.insert(), hb.insert.data.frame()
  - hb.get(), hb.get.data.frame(), hb.scan()
  - hb.delete()
• Administrative, etc.
  - hb.defaults(), hb.set.table.mode()
  - hb.regions.table(), hb.compact.table()
Big Data Warehousing with Hive
• Hive supplies a SQL-like query language: very familiar for those with relational database experience
• But Hive compiles, optimizes, and executes these queries as MapReduce jobs on the Hadoop cluster
• Can be used in conjunction with other Hadoop jobs, such as those written with rmr2
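To see why a SQL-style aggregation compiles naturally to MapReduce, consider a GROUP BY with COUNT(*): the grouping column becomes the map output key, and the aggregate becomes the reduce function. This toy Python sketch is my illustration of the idea, not Hive's actual planner:

```python
from collections import defaultdict

def group_by_count(rows, key_column):
    """Evaluate SELECT key, COUNT(*) ... GROUP BY key as a map/reduce pair."""
    # Map: emit (grouping key, 1) for every row.
    mapped = [(row[key_column], 1) for row in rows]
    # Shuffle: bring together all values for each key.
    groups = defaultdict(list)
    for key, one in mapped:
        groups[key].append(one)
    # Reduce: apply the aggregate per key (COUNT(*) is a sum of ones).
    return {key: sum(ones) for key, ones in groups.items()}

# Hypothetical rows standing in for a Hive table with a 'dept' column.
rows = [{"dept": "eng"}, {"dept": "sales"}, {"dept": "eng"}]
result = group_by_count(rows, "dept")
```

Other aggregates (SUM, MAX, AVG) follow the same shape with a different reduce step, which is why Hive can translate so much of SQL onto the cluster.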
Hive architecture & access
[Diagram: clients reach Hive through a terminal (CLI), a browser (HWI), or the Thrift Server, which exposes JDBC and ODBC for tools such as RODBC and RJDBC. The Hive Driver (compiles, optimizes, executes) consults the Metastore and runs the resulting jobs on the Hadoop master's Job Tracker, Name Node, and DFS.]
Accessing Hive via ODBC/JDBC

library(RJDBC)

# set the classpath to include the JDBC driver location, plus commons-logging
[...]
class.path = c(hive.class.path, commons.class.path)
drv = JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
           classPath = class.path, "`")

# make a connection to the running Hive Server:
conn = dbConnect(drv, "jdbc:hive://localhost:10000/default")

# setting the database name in the URL doesn't help,
# so issue a 'use databasename' command:
res = dbSendQuery(conn, 'use mydatabase')

# submit the query and fetch the results as a data.frame:
df = dbGetQuery(conn, 'SELECT name, sub FROM employees
                       LATERAL VIEW explode(subordinates) subView AS sub')
Other ways to use R and Hadoop
• HDFS
  - Revolution Enterprise R can read and write files directly on the distributed file system
  - Files can include ScaleR's XDF-formatted data sets
• MapReduce
  - Many other R packages have been written to use R and Hadoop together, including RHIPE, segue, Oracle's R Connector for Hadoop, etc.
• Hive
  - Hadoop Streaming is also available for Hive to leverage functionality external to Hadoop and Java
  - RHive leverages RServe to connect the two: http://cran.r-project.org/web/packages/RHive/
Big Data ≠ Hadoop
• NoSQL databases offer low-latency, random access to key-values:
  - HBase
  - Cassandra
  - CouchDB
  - MongoDB
  - Accumulo
• Next week, Think Big's Douglas Moore will be presenting at the Boston Storm Meetup: "Predictive Analytics with Storm, Hadoop, R and AWS", http://www.meetup.com/Boston-Storm-Users/events/103506142/
Want to learn more?
• Upcoming public Getting Started with RHadoop 1-day classes
  - Hands-on examples and exercises covering rhdfs, rhbase, and rmr2
  - Algorithms and data include wordcount, analysis of airline flight data, and collaborative filtering using structured and unstructured data from text, CSV files, and Twitter
  - February 25, 2013: Palo Alto, CA
  - March 13, 2013: Boston, MA
  - 25% off with "useR" discount code @ http://bit.ly/rhadoop0313
• Revolution Analytics Quick Start Program for Hadoop
  - Private Getting Started with RHadoop training
  - Onsite consulting assistance for initial use case
  - Revolution R for Hadoop licenses and support
  - More info @ http://bit.ly/rhadoopqs