R Hadoop integration
Post on 15-Apr-2017
R INTEGRATION WITH HADOOP
NGUYEN PHAN DZUNG
MARCH 2016
AGENDA
- Objectives
- Contents:
  • Introduction of R
  • Implementation of R integration with Hadoop
  • When to use R in combination with Hadoop
  • Examples using Hadoop
- Q&A
- References
Security Classification: Internal
Objectives
• Understand R
• Understand when to use R in combination with Hadoop
• Understand the implementation of integration
Introduction of R
Introduction of R – What is R?
• Software for Statistical Data Analysis
• Based on S
• Programming Environment
• Interpreted Language
• Data Storage, Analysis, Graphing
• Free and Open Source Software
Introduction of R – Why R?
• Free and Open Source
• Strong User Community
• Highly extensible, flexible
• Implementation of high-end statistical methods
• Flexible graphics and intelligent defaults
But:
• Steep learning curve
• Slow for large datasets
Introduction of R – A little bit of demo
Command to demo.txt
R integration with Hadoop
R integration with Hadoop – Integration purposes
• Use Hadoop to execute R code
• Use R to access data stored in Hadoop
R integration with Hadoop – When to use?
No | Factor | Mantra | Guideline
1 | R's natural strength | Use R for statistical computing | Consider integrating when your project can be solved using code available in R, or when it is not easily solved in other languages
2 | Hadoop's natural strength | Use Hadoop for distributed storage & batch computing | Consider integrating when your problem requires lots of storage or when it could benefit from parallelization
3 | Coding effort | Work smart, not hard | R and Hadoop are tools, not "cure-all" panaceas. Consider not integrating if it is easier to solve your problem with other tools
4 | Processing time | Work smart, not hard | Although some problems can benefit from parallelization, consider not integrating if the gains are negligible, since this can help you reduce the complexity of your project
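The "Processing time" guideline can be checked with a back-of-envelope estimate before committing to an integration. The sketch below uses Amdahl's law, which the slides do not mention and which is brought in here only as one common way to estimate parallel gains. The function name, the `overhead_fraction` parameter, and the sample numbers are hypothetical. It is written in Python purely so it runs without a cluster.

```python
def estimated_speedup(parallel_fraction, workers, overhead_fraction=0.0):
    """Amdahl's-law estimate of the speedup from parallelizing a job.

    parallel_fraction: share of the single-machine runtime that can run
        in parallel (between 0 and 1).
    workers: number of parallel workers.
    overhead_fraction: hypothetical extra cost of distribution (job startup,
        data movement), as a share of the single-machine runtime.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # Runtime after parallelizing, relative to a single-machine runtime of 1.
    new_runtime = serial_fraction + parallel_fraction / workers + overhead_fraction
    return 1.0 / new_runtime


if __name__ == "__main__":
    # Illustrative numbers only: 20 workers, 5% distribution overhead.
    for fraction in (0.5, 0.9, 0.99):
        speedup = estimated_speedup(fraction, workers=20, overhead_fraction=0.05)
        print(f"parallel fraction {fraction}: about {speedup:.1f}x")
```

Under these assumed numbers, a job that is 90% parallelizable gains only about 5x on 20 workers, which is the kind of result that may argue against the added complexity.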
R integration with Hadoop – Example applications
No | Scenario | Use R/Hadoop? | Why? | Example
1 | Analyzing small data stored in Hadoop | Y | R can quickly download the data and analyze it locally | Want to analyze summary datasets derived from MapReduce jobs done in Hadoop
2 | Extracting complex features from large data stored in Hadoop | Y | R has more built-in and contributed functions that analyze data than many standard programming languages | R is a natural language to use to write an algorithm or classifier that extracts information about objects contained in images
3 | Applying prediction and classification models to datasets | Y | R is better at modeling than many standard programming languages | Using a logistic regression model to generate predictions in a large dataset
4 | Implementing an "iteration-based" machine learning algorithm | Maybe | 1) Other languages may be faster than R for your analysis 2) Hadoop reads and writes a lot of data to disk; other "big data" tools, like Spark (and SparkR), are designed for speed in these scenarios by working in memory | Training a k-means clustering algorithm or logistic regression on a large dataset
5 | Simple preprocessing of large data stored in Hadoop | N | Standard programming languages are much faster than R at executing many basic text and image processing tasks | Pre-processing Twitter tweets for use in a natural language processing project
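Scenario 3 suits Hadoop because each record is scored independently, so the work can run as a map-only job with no reduce step. The following is a minimal sketch of that idea, not the presenter's code: the coefficients, field names, and chunking are hypothetical, and Python is used only so the example runs without a cluster (in practice the mapper would be R code).

```python
import math

# Hypothetical coefficients of an already-fitted logistic regression model.
COEFFICIENTS = {"intercept": -1.0, "age": 0.02, "visits": 0.3}


def predict_probability(record):
    """Apply the logistic model to one record (a dict of feature values)."""
    linear = COEFFICIENTS["intercept"]
    for name, weight in COEFFICIENTS.items():
        if name != "intercept":
            linear += weight * record[name]
    return 1.0 / (1.0 + math.exp(-linear))


def map_chunk(chunk):
    """Mapper: score every record in one chunk of input; no reducer is needed."""
    return [(record["id"], predict_probability(record)) for record in chunk]


if __name__ == "__main__":
    # Each inner list stands in for one input split handled by one map task.
    chunks = [
        [{"id": 1, "age": 50, "visits": 0}],
        [{"id": 2, "age": 30, "visits": 4}],
    ]
    for chunk in chunks:
        for record_id, probability in map_chunk(chunk):
            print(record_id, round(probability, 3))
```

Because the mapper needs only the fitted coefficients, the model can be trained once on a sample and then applied across the full dataset in parallel.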
R integration with Hadoop – How? – RHadoop (1)
R integration with Hadoop – How? – RHadoop (2)
rhdfs:
• Manipulate HDFS directly from R
• Mimic as much of the HDFS Java API as possible
• Examples:
  – Read an HDFS text file into a data frame
  – Serialize/deserialize a model to HDFS
  – Write an HDFS file to local storage
• rhdfs/pkg/inst/unitTests
• rhdfs/pkg/inst/examples
R integration with Hadoop – How? – RHadoop (3)
rhbase:
• Manipulate HBase tables and their content
• Uses the Thrift C++ API as the mechanism to communicate with HBase
• Examples:
  – Create a data frame from a collection of rows and columns in an HBase table
  – Update an HBase table with values from a data frame
R integration with Hadoop – How? – RHadoop (4)
rmr:
• Designed to be the simplest and most elegant way to write MapReduce programs
• Gives the R programmer the tools necessary to perform data analysis in a way that is “R”-like
• Provides an abstraction layer to hide the implementation details
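To make concrete what that abstraction layer hides, the sketch below runs the three MapReduce phases (map, shuffle, reduce) in a single process on a word-count example. It is not rmr's API: all function names are hypothetical, and Python is used only so the example runs without Hadoop installed.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in-process over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["R and Hadoop", "Hadoop and HDFS"]))
```

With a framework such as rmr, the programmer would supply only the map and reduce functions, while the shuffle and the distribution across nodes are handled by Hadoop.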
R integration with Hadoop – How? – RHive
R integration with Hadoop – How? – BigR
R integration with Hadoop – How? – Ricardo
R integration with Hadoop – How? – SparkR
R integration with Hadoop – How? – RevoR ScaleR
R integration with Hadoop – How? – ORCH
R integration with Hadoop – How? – MS HDInsight
Q & A
References
- http://cran-rproject.org
- http://revolutionanalytics.com
- Hadoop for dummies
- R – a brief introduction, Gilberto Câmara
Thank you for your attention!