www.edureka.co/apache-spark-scala-training
Apache Spark: Beyond Hadoop MapReduce
Agenda
By the end of this webinar, you will know about:
Strengths of MapReduce
Things beyond MapReduce
How MapReduce limitations can be overcome
How Spark fits the bill
Other exciting features in Spark
Strengths of MapReduce
• Simple: Independent of the language of choice; jobs can be written in Java, C++ or Python.
• Scalability: Can process petabytes of data stored in HDFS on one cluster.
• Fault tolerance: MapReduce recovers from failures using the replicated copies of the data.
• Minimal data motion: Processing moves to the data rather than the data to the processing, which minimizes data transfer and I/O.
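As a rough illustration of the programming model behind these strengths, the sketch below runs a word count as separate map, shuffle and reduce steps in plain Python on a single machine. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example and are not part of Hadoop's API; a real MapReduce job would distribute the same steps across a cluster.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)

def word_count(lines):
    # Run the three steps in order on an in-memory list of lines.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["spark and hadoop", "hadoop mapreduce", "spark"])
print(counts)
```

Because each map call looks at one line only and each reduce call looks at one key only, the framework is free to run them on whichever node already holds the data.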
Limitations Of MR
• Real-time processing
• Complex algorithms
• Re-reading and parsing data
• Minimal data motion
• Graph processing
• Iterative tasks
• Random access
Feature Comparison with Spark
Hadoop MapReduce          Spark
Fast                      Up to 100x faster than MapReduce
Batch processing          Batch and real-time processing
Stores data on disk       Stores data in memory
Written in Java           Written in Scala
Source: Databricks
Overcoming MR limitations
Solution: Cutting down on the number of reads and writes to disk
Addresses: real-time processing; overhead of reading and parsing data
Overcoming MR limitations
Solution: Libraries for machine learning and streaming
Addresses: graph processing; complex algorithms
Overcoming MR limitations
Solution: Cyclic data flows
Addresses: iterative tasks; random access
How Spark Implements Features To Make Its Architecture Better Than MR
Spark Cuts Down Read/Write I/O To Disk
Spark tries to keep data in the memory of its distributed workers, allowing significantly faster, lower-latency computations, whereas MapReduce writes intermediate results to disk and reads them back between steps.
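The effect on an iterative job can be sketched without Spark itself. The toy example below counts how often the input is parsed when every iteration re-reads it (MapReduce-style) versus when it is parsed once and kept in memory (Spark-style). All names here (`parse`, `mean_step`, `run_rereading`, `run_cached`, `RAW`) are hypothetical, and `parse` only stands in for real disk reads.

```python
import json

parse_calls = 0  # counts how often the raw records are parsed

def parse(raw_records):
    # Stand-in for reading and parsing the input from disk.
    global parse_calls
    parse_calls += 1
    return [json.loads(r) for r in raw_records]

def mean_step(points, guess):
    # One iteration of a toy algorithm: move the guess halfway toward the mean.
    mean = sum(p["x"] for p in points) / len(points)
    return guess + 0.5 * (mean - guess)

RAW = ['{"x": 1.0}', '{"x": 2.0}', '{"x": 3.0}']

def run_rereading(iterations):
    # MapReduce-style: every iteration re-reads and re-parses the input.
    guess = 0.0
    for _ in range(iterations):
        guess = mean_step(parse(RAW), guess)
    return guess

def run_cached(iterations):
    # Spark-style: parse once, keep the parsed data in memory, reuse it.
    points = parse(RAW)
    guess = 0.0
    for _ in range(iterations):
        guess = mean_step(points, guess)
    return guess

parse_calls = 0
a = run_rereading(5)
rereading_parses = parse_calls

parse_calls = 0
b = run_cached(5)
cached_parses = parse_calls

print(rereading_parses, cached_parses)  # 5 1
```

Both runs compute the same result; only the number of parse passes differs, and that gap grows with the number of iterations.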
Libraries For ML, Graph Programming …
• Machine learning library (MLlib)
• Graph programming (GraphX)
• Spark interface for RDBMS lovers (Spark SQL)
• Utility for continuous ingestion of data (Spark Streaming)
Cyclic Data Flows
• All jobs in Spark comprise a series of operators and run on a set of data.
• All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
• The DAG is optimized by rearranging and combining operators where possible.
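The idea of recording operators first and combining them before execution can be sketched in plain Python. The `Pipeline` class below is a hypothetical toy, not Spark's API or its actual optimizer: it handles only a linear chain of operators (the simplest possible DAG) and performs just one optimization, merging adjacent map operators into a single composed map.

```python
class Pipeline:
    """Toy lazy pipeline: operators are recorded first and run only on collect()."""

    def __init__(self, data, ops=()):
        self.data = data
        self.ops = tuple(ops)  # recorded operators, in order

    def map(self, fn):
        # Record a map operator; nothing is executed yet.
        return Pipeline(self.data, self.ops + (("map", fn),))

    def filter(self, pred):
        # Record a filter operator; nothing is executed yet.
        return Pipeline(self.data, self.ops + (("filter", pred),))

    def plan(self):
        # Combine runs of adjacent map operators into one composed map.
        fused = []
        for kind, fn in self.ops:
            if kind == "map" and fused and fused[-1][0] == "map":
                prev = fused[-1][1]
                fused[-1] = ("map", lambda x, f=prev, g=fn: g(f(x)))
            else:
                fused.append((kind, fn))
        return fused

    def collect(self):
        # Run the optimized plan in a single pass over the data.
        plan = self.plan()
        out = []
        for item in self.data:
            for kind, fn in plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    break  # item rejected by a filter
            else:
                out.append(item)
        return out

job = (Pipeline(range(6))
       .map(lambda x: x + 1)
       .map(lambda x: x * 10)
       .filter(lambda x: x > 20))

print(len(job.ops), len(job.plan()))  # 3 2
print(job.collect())                  # [30, 40, 50, 60]
```

Three recorded operators become a two-step plan, and the data is traversed once instead of once per operator.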
Spark Features/Modules In Demand
Source: Typesafe
New Features In 2015
DataFrames
• Similar API to data frames in R and Pandas
• Automatically optimised via Spark SQL
• Released in Spark 1.3
SparkR
• Released in Spark 1.4
• Exposes DataFrames, RDDs and the ML library in R
Machine Learning Pipelines
• High-level API
• Featurization
• Evaluation
• Model tuning
External Data Sources
• Platform API to plug data sources into Spark
• Pushes logic into sources
Source: Databricks