spark rev1

7
 resentation on Apache Spark By- B V S Mridula (1039060) Milind Baluni (100319) !" !anesh () #ilip Payra () Sa$eer %ayak ()

Upload: mridula-bvs

Post on 13-Apr-2018

248 views

Category:

Documents


0 download

TRANSCRIPT

7/23/2019 Spark Rev1

http://slidepdf.com/reader/full/spark-rev1 1/7

 

resentation on Apache Spark

By-

B V S Mridula (1039060)Milind Baluni (100319)

!" !anesh ()

#ilip Payra ()

Sa$eer %ayak ()

7/23/2019 Spark Rev1

http://slidepdf.com/reader/full/spark-rev1 2/7

 

&ntroduction to Apache Spark

It is a framework for performing general data analytics on distributedcomputing cluster like Hadoop● It provides in memory computations for increase speed and data

process over MapReduce.● Runs on top of existing Hadoop cluster and access Hadoop data store

(HDF!● "an also process structured data in Hive and treaming data from

HDF#Flume#$afka#%witter 

7/23/2019 Spark Rev1

http://slidepdf.com/reader/full/spark-rev1 3/7

 

& 'ative support for multiplelanguages wit identical

 )*Is

& +se of closures# iterations#and oter common

language constructs to

minimi,e code

& +nified )*I for batc and

streaming

High-Productivity LanguageSupport

Pythonlines = sc.textFile(...)lines.filter(lam bda s: “ERROR” in s).count()

Scalaval lines = sc.textFile(...)lines.filter(s = >s.contains(“ERROR”)).count()

 Java

JavaR! "trin#> lines = sc.textFile(...)$lines.filter(new  Function! "trin#% &oolean> ()' &oolean call("trin# s) '  return s.contains(“error”)$ ).count()$

7/23/2019 Spark Rev1

http://slidepdf.com/reader/full/spark-rev1 4/7

 

'hy Apache Spark is used●  )pace park is best for performing iterative -obs like macine learning.

●  )pace spark maintains an inmemory copy of map reduce -ob using RDD(ResilientDistributed Datasets! 

● /it tis# if te map reduce -ob is once loaded into inmemory copy.

● %e need for loading tat map reduce -ob again and again from memory getsreduced. /it tis tere is a tremendous increase in speed.

● %e inmemory copy olds te fre0uently used map reduce -ob witin itself.

● Run programs up to 122x faster tan Hadoop MapReduce in memory# or 12x fasteron disk.

7/23/2019 Spark Rev1

http://slidepdf.com/reader/full/spark-rev1 5/7

 

Spark liraries

Spark SQL:

It is a park3s module for working wit structured data.

Spark Streaming: 

/ic makes it easy to build scalable faulttolerant streamingapplications.

MLlib: 

It is )pace park3s scalable macine learning library.

GraphX:

It is )pace park3s )*I for graps and grapparallel computation.

7/23/2019 Spark Rev1

http://slidepdf.com/reader/full/spark-rev1 6/7

 

&s Apache Spark *oin* to replace +adoop

● Hadoop essentially consists of a MapReduce pase and a file system(HDF! wereas park is a framework tat executes -obs# sopractically park can only replace te MapReduce pase in teHadoop 4cosystem.

● park was mainly designed to run on top of Hadoop so as to minimi,e

te -ob execution time.

● park is an alternative to te traditional MapReduce model tat usedto work only in batc mode. park supports bot batc as well asrealtime processing.

● park mainly utili,es te primary memory of te system to provideefficient output. %us# it re0uires a igend macine to execute -obs.Hadoop# on te oter and# can easily run on commodity ardware.

● park3s way of andle fault tolerance is very fast as compared to

Hadoop3s. %is minimi,es network I56 and guarantees fault tolerance.

7/23/2019 Spark Rev1

http://slidepdf.com/reader/full/spark-rev1 7/7

 

%ank 7ou8