apache spark intro

Apache Spark Intro workshop

BigData Romania

Apache Spark Intro

★ Apache Spark history
★ RDD
★ Transformations
★ Actions
★ Hands-on session

Apache Spark History

https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/



Where to learn Spark?

http://spark.apache.org/

http://shop.oreilly.com/product/0636920028512.do




Spark architecture

Easy ways to run Spark
★ Your IDE (e.g. Eclipse or IDEA)
★ Standalone Deploy Mode: the simplest way to deploy Spark on a single machine
★ Docker & Zeppelin
★ EMR
★ Hadoop vendors (Cloudera, Hortonworks)

http://spark.apache.org/docs/latest/spark-standalone.html

http://tudorlapusan.ro/perfect-fit-apache-spark-zeppelin-and-docker/

https://aws.amazon.com/elasticmapreduce/

http://www.cloudera.com/

http://hortonworks.com/

Supported languages

Spark basics

★ RDD
★ Operations: Transformations and Actions

RDD

An RDD is simply an immutable distributed collection of objects!


https://www.quora.com/Why-is-a-spark-RDD-immutable

Creating RDD (I)

Python
lines = sc.parallelize(["workshop", "spark"])

Scala
val lines = sc.parallelize(List("workshop", "spark"))

Java
JavaRDD<String> lines = sc.parallelize(Arrays.asList("workshop", "spark"));

Creating RDD (II)

Python
lines = sc.textFile("/path/to/file.txt")

Scala
val lines = sc.textFile("/path/to/file.txt")

Java
JavaRDD<String> lines = sc.textFile("/path/to/file.txt");

RDD persistence
★ MEMORY_ONLY
★ MEMORY_AND_DISK
★ MEMORY_ONLY_SER
★ MEMORY_AND_DISK_SER
★ DISK_ONLY
★ MEMORY_ONLY_2
★ MEMORY_AND_DISK_2
★ OFF_HEAP

Other data structures in Spark

★ Paired RDD
★ DataFrame
★ Dataset

Paired RDD

Paired RDD = an RDD of key/value pairs

user1 user2 user3 user4 user5

id1/user1 id2/user2 id3/user3 id4/user4 id5/user5
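A minimal sketch of the idea in plain Python, with no Spark required: a list of tuples stands in for the paired RDD. Deriving each id from the user name is an assumption made for illustration. In PySpark the same step would typically be a `map` that returns a tuple, or `keyBy`.

```python
# Plain-Python stand-in for an RDD of users (names taken from the slide).
users = ["user1", "user2", "user3", "user4", "user5"]


def to_pair(user):
    """Build a (key, value) pair; deriving the id from the name is an assumption."""
    return ("id" + user[len("user"):], user)


# In PySpark the same step would be users_rdd.map(to_pair).
pairs = [to_pair(u) for u in users]

# Once every element carries a key, key-based operations become possible.
by_id = dict(pairs)
print(pairs[0])      # ('id1', 'user1')
print(by_id["id3"])  # user3
```

Key-based transformations such as reduceByKey, sortByKey and join only work on RDDs shaped like `pairs`.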

Spark operations

(Diagram: RDD 1 through RDD 6, with arrows labelled Transformation and Action.)

Transformations

Transformations describe how to transform an RDD into another RDD.

RDD 1 -> RDD 2

Transformations

RDD{1,2,3,4,5,6}
map x => x + 1 gives MapRDD{2,3,4,5,6,7}
filter x => x != 4 gives FilterRDD{1,2,3,5,6}
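The same two transformations can be modelled in plain Python, where a list stands in for the RDD. This is only a local analogy, not Spark itself; the PySpark calls named in the comments are the usual equivalents.

```python
# Plain-Python model of the slide; a list stands in for the RDD.
input_rdd = [1, 2, 3, 4, 5, 6]

# PySpark equivalent: input_rdd.map(lambda x: x + 1)
map_rdd = [x + 1 for x in input_rdd]

# PySpark equivalent: input_rdd.filter(lambda x: x != 4)
filter_rdd = [x for x in input_rdd if x != 4]

print(map_rdd)     # [2, 3, 4, 5, 6, 7]
print(filter_rdd)  # [1, 2, 3, 5, 6]
print(input_rdd)   # [1, 2, 3, 4, 5, 6]  (the source is never modified)
```

Both results are new collections derived from the same input, which mirrors the fact that an RDD is immutable.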

Popular transformations
★ map: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-transformations/#map
★ filter: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-transformations/#filter
★ sample: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-transformations/#sample
★ union: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-transformations/#union
★ distinct: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-transformations/#distinct
★ groupByKey: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-transformations/#groupByKey
★ reduceByKey: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-transformations/#reduceByKey
★ sortByKey: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-transformations/#sortByKey
★ join: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-transformations/#join

Actions

Actions compute a result from an RDD!

Actions

InputRDD{1,2,3,4,5,6}
map x => x + 1 gives MapRDD{2,3,4,5,6,7}
filter x => x != 4 gives FilterRDD{1,2,3,5,6}
Actions: count() = 6, take(2) = {1,2}, saveAsTextFile()
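A rough plain-Python model of these three actions on the input collection. The temporary output path is an assumption for illustration, and real Spark writes a directory of part files rather than a single file.

```python
import os
import tempfile

# A list stands in for the input RDD from the slide.
input_rdd = [1, 2, 3, 4, 5, 6]

count = len(input_rdd)     # PySpark: rdd.count()
first_two = input_rdd[:2]  # PySpark: rdd.take(2)

# PySpark: rdd.saveAsTextFile(path) writes one element per line.
# Spark produces a directory of part files; a single file is used here.
out_path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(out_path, "w") as f:
    f.writelines(f"{x}\n" for x in input_rdd)

print(count)      # 6
print(first_two)  # [1, 2]
```

Unlike a transformation, each of these calls returns a plain value to the driver program or writes to storage instead of producing another RDD.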

Popular actions
★ collect: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-actions/#collect
★ count: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-actions/#count
★ first: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-actions/#first
★ take: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-actions/#take
★ takeSample: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-actions/#takeSample
★ countByKey: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-actions/#countByKey
★ saveAsTextFile: http://www.supergloo.com/fieldnotes/apache-spark-examples-of-actions/#saveAsTextFile

Transformations and Actions

users -> filter() -> administrators -> take(3)

users -> filter() -> administrators -> take(3), saveAsTextFile()

users -> filter() -> administrators -> persist() -> take(3), saveAsTextFile()
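Why persist() helps when two actions reuse the same RDD can be sketched in plain Python. Without persistence Spark rebuilds the RDD from its lineage for every action; the model below counts how often the filter predicate runs. The sample rows and the `calls` counter are made up for illustration, and the model always scans every row, whereas Spark's take may stop early.

```python
calls = {"n": 0}  # counts how many times the filter predicate runs


def is_admin(user):
    calls["n"] += 1
    return user["occupation"] == "administrator"


# Made-up sample rows, not taken from MovieLens.
users = [
    {"id": 1, "occupation": "administrator"},
    {"id": 2, "occupation": "student"},
    {"id": 3, "occupation": "administrator"},
    {"id": 4, "occupation": "administrator"},
]


def administrators():
    """Unpersisted lineage: rebuilt from `users` each time it is requested."""
    return [u for u in users if is_admin(u)]


# Two actions without persist(): the filter runs over all users twice.
first_three = administrators()[:3]                # models take(3)
saved = [str(u["id"]) for u in administrators()]  # models saveAsTextFile()
calls_without_persist = calls["n"]

# With persist(): compute once, then reuse the stored result.
calls["n"] = 0
cached = administrators()
first_three_cached = cached[:3]
saved_cached = [str(u["id"]) for u in cached]
calls_with_persist = calls["n"]

print(calls_without_persist, calls_with_persist)  # 8 4
```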

Lazy evaluation

Transformations are lazy: nothing is computed until an action such as take(3) runs.

users -> filter() -> administrators -> take(3)
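Python generators behave in a similar way, so they make a convenient local analogy for this laziness. In the sketch below, which uses made-up user names and a hypothetical `log` list, defining the filter does no work, and pulling three results inspects only as many elements as needed.

```python
from itertools import islice

log = []  # records every element the filter actually inspects


def is_admin(user):
    log.append(user)
    return user.startswith("admin")


# Made-up user names for illustration.
users = ["admin_a", "guest_b", "admin_c", "admin_d", "admin_e", "guest_f"]

# "Transformation": building the generator does no work yet.
administrators = (u for u in users if is_admin(u))
inspected_before_action = len(log)

# "Action": pulling three results finally triggers the computation.
first_three = list(islice(administrators, 3))

print(inspected_before_action)  # 0
print(first_three)              # ['admin_a', 'admin_c', 'admin_d']
print(len(log))                 # 4
```

Only four of the six users were inspected, because the third administrator was found at the fourth element.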

How Spark Executes Your Program

Hands-on session

MovieLens

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

Download link: http://grouplens.org/datasets/movielens/

MovieLens dataset

user: user_id, age, gender, occupation, zipcode

user_rating: user_id, movie_id, rating, timestamp

movie: movie_id, title, release_date, video_release, imdb_url, genres...

Exercises already solved!
★ Return only the users with occupation 'administrator'
★ Increase the age of each user by one
★ Join user and rating datasets by user id
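The logic of these three exercises can be sketched in plain Python before running it on Spark. This is not the workshop's own solution: the rows below are made-up samples that only follow the user and user_rating column names, and the comments name the PySpark operations each step corresponds to.

```python
# Made-up sample rows following the dataset's column names; not real MovieLens data.
users = [
    {"user_id": 1, "age": 30, "gender": "M", "occupation": "technician"},
    {"user_id": 2, "age": 41, "gender": "F", "occupation": "administrator"},
    {"user_id": 3, "age": 27, "gender": "M", "occupation": "administrator"},
]
ratings = [
    {"user_id": 1, "movie_id": 10, "rating": 4},
    {"user_id": 2, "movie_id": 10, "rating": 5},
    {"user_id": 2, "movie_id": 11, "rating": 3},
]

# 1. filter: keep only administrators (PySpark: users.filter(...)).
admins = [u for u in users if u["occupation"] == "administrator"]

# 2. map: build new rows with age + 1; the originals stay untouched (PySpark: users.map(...)).
older = [{**u, "age": u["age"] + 1} for u in users]

# 3. join by user id (PySpark: key both RDDs by user_id, then call join).
users_by_id = {u["user_id"]: u for u in users}
joined = [
    (r["user_id"], (users_by_id[r["user_id"]], r))
    for r in ratings
    if r["user_id"] in users_by_id
]

print([u["user_id"] for u in admins])  # [2, 3]
print([u["age"] for u in older])       # [31, 42, 28]
print(len(joined))                     # 3
```

The join result uses the (key, (left, right)) shape that a join on paired RDDs produces, and user 3 is absent from it because that user has no ratings.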

Exercises to solve
★ How many men/women are registered with MovieLens?
★ Age distribution of the male/female users registered with MovieLens
★ Which movie titles have rating x?
★ Average rating per movie
★ Sort users by their occupation

Congrats if you reached this slide!
