
Page 1: Introduction to Apache Spark

Introduction to Apache Spark

These training materials were produced as part of the İstanbul Big Data Eğitim ve Araştırma Merkezi Project (no. TR10/16/YNY/0036), carried out under the İstanbul Kalkınma Ajansı (İstanbul Development Agency) 2016 Innovative and Creative İstanbul Financial Support Program. Sole responsibility for the content lies with Bahçeşehir Üniversitesi; it does not reflect the views of İSTKA or the Ministry of Development.

Page 2: A Major Step Backwards?

A Major Step Backwards?

MapReduce is a step backward in database access:

Schemas are good

Separation of the schema from the application is good

High-level access languages are good

MapReduce is a poor implementation

Brute force and only brute force (no indexes, for example)

MapReduce is not novel

MapReduce is missing features

Bulk loader, indexing, updates, transactions…

MapReduce is incompatible with DBMS tools

Source: Blog post by DeWitt and Stonebraker

Page 3: Need for High-Level Languages

Need for High-Level Languages

MapReduce is great for one-pass, large-scale data processing

But writing Java programs for everything is verbose and slow

Data scientists don't want to write Java

And MapReduce is inefficient for multi-pass algorithms:

No efficient primitives for data sharing and iterative tasks

State between steps goes to the distributed file system

Slow due to replication & disk storage

Page 4: Move it onto a cluster

Move it onto a cluster

In a cluster setting, using MapReduce/Hadoop is slow

Hadoop writes to disk and is complex


How to improve this?

Page 5: Example: Iterative Apps

Example: Iterative Apps

[Diagram: in an iterative job, each iteration (iter. 1, iter. 2, …) reads its input from the file system and writes its results back to the file system, so the next iteration begins with another file system read. In interactive use, each query (query 1, 2, 3 → result 1, 2, 3) likewise re-reads the input from the file system.]

Commonly spend 90% of time doing I/O

Page 6: What we need is…

What we need is…

Resilient

Checkpointing

Fast, does not always store to disk

Replayable

Embarrassingly Parallel

Page 7: Scala

Scala

Scala has many other nice features:

A type system that makes sense.

Traits.

Implicit conversions.

Pattern Matching.

XML literals, Parser combinators, ...

Page 8: Spark: A Brief History

Spark: A Brief History

2002: MapReduce @ Google

2004: MapReduce paper

2006: Hadoop @ Yahoo!

2008: Hadoop Summit

2010: Spark paper

2014: Apache Spark becomes a top-level Apache project

Page 9: Why better than Hadoop?

Why better than Hadoop?

• In-memory rather than on-disk processing

• Data can be cached in memory or on disk for future use

• Fast: up to 100 times faster on some workloads, since it works in memory instead of on disk

• Easier to use than Hadoop while still functional; runs a general DAG of operations

• APIs in Java, Scala, Python, R

Page 10: WordCount MapReduce vs Spark

WordCount MapReduce vs Spark

WordCount in 50+ lines of Java MapReduce vs. WordCount in 3 lines of Spark

Page 11: Word Count Example

Word Count Example

text_file = sc.textFile("hdfs://.../wordsList.txt")

wordcounts = text_file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda x, y: x + y) \
                      .collect()

Result: [('rat', 2), ('elephant', 1), ('cat', 2)]

Page 12: Disk vs Memory

Disk vs Memory

L1 cache reference:     0.5 ns
L2 cache reference:     7 ns
Mutex lock/unlock:      100 ns
Main memory reference:  100 ns
Disk seek:              10,000,000 ns

A single disk seek costs roughly 100,000 main memory references, which is why keeping working data in memory pays off.

Page 13: Network vs Local

Network vs Local

Send 2K bytes over 1 Gbps network:    20,000 ns
Read 1 MB sequentially from memory:   250,000 ns
Round trip within same datacenter:    500,000 ns
Read 1 MB sequentially from network:  10,000,000 ns
Read 1 MB sequentially from disk:     30,000,000 ns
Send packet CA->Netherlands->CA:      150,000,000 ns

Page 14: Spark Architecture

Spark Architecture

Page 15: Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)

RDD: Spark “primitive” representing a collection of records

Immutable

Partitioned (the D in RDD)

Transformations operate on an RDD to create another RDD

Coarse-grained manipulations only

RDDs keep track of lineage

Persistence

RDDs can be materialized in memory or on disk

OOM or machine failures: What happens?

Fault tolerance (the R in RDD):

RDDs can always be recomputed from stable storage (disk)
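
A minimal PySpark sketch of these ideas (assumes an existing SparkContext sc; the path is a placeholder):

    rdd = sc.textFile("hdfs://.../events.log")         # base RDD, backed by stable storage
    errors = rdd.filter(lambda line: "ERROR" in line)  # transformation -> new, immutable RDD
    errors.persist()                                   # ask Spark to keep it in memory
    print(errors.count())                              # the first action materializes it
    # If a cached partition is lost (OOM, machine failure), Spark recomputes
    # only that partition from the lineage: textFile -> filter.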

Page 16: Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)

Resilient:

• Recover from errors, e.g. node failure, slow processes

• Track history of each partition, re-run

Page 17: Fault Tolerance

Fault Tolerance

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda pair: pair[1] > 10)  # keep types seen more than 10 times

[Diagram: lineage graph of the input file flowing through map → reduce → filter.]

RDDs track lineage info to rebuild lost data


Page 19: Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)

Distributed:

• Distributed across the cluster of machines

• Divided into partitions, atomic chunks of data

Page 20: Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)

Dataset:

• Created from data in storage: HDFS, S3, HBase, JSON, text files, a local hierarchy of folders

• Or created by transforming another RDD

Page 21: Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)

» Immutable collections of objects, spread across a cluster

» Statically typed: RDD[T] has objects of type T

val sc = new SparkContext()
val lines = sc.textFile("log.txt")               // RDD[String]
// Transform using standard collection operations
val errors = lines.filter(_.startsWith("ERROR")) // lazily evaluated
val messages = errors.map(_.split('\t')(2))      // lazily evaluated
messages.saveAsTextFile("errors.txt")            // action: kicks off a computation

Page 22: Spark Driver and Workers

Spark Driver and Workers

A Spark program is two programs:
» A driver program and a workers program

Worker programs run on cluster nodes or in local threads

DataFrames are distributed across workers

[Diagram: the driver program (SparkContext, sqlContext) talks to a cluster manager (or local threads); each Worker runs a Spark executor, reading from Amazon S3, HDFS, or other storage.]

Page 23: Spark Program

Spark Program

1) Create some input RDDs from external data, or parallelize a collection in your driver program.

2) Lazily transform them to define new RDDs using transformations like filter() or map().

3) Ask Spark to cache() any intermediate RDDs that will need to be reused.

4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.
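
A minimal PySpark sketch of the four steps (app name and data are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "FourSteps")

    data = sc.parallelize(range(1000))          # 1) create an input RDD
    evens = data.filter(lambda x: x % 2 == 0)   # 2) lazy transformation
    evens.cache()                               # 3) cache the intermediate RDD
    print(evens.count())                        # 4) an action triggers computation
    print(evens.collect()[:5])                  #    a second action reuses the cache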

Page 24: Operations on RDDs

Operations on RDDs

Transformations (lazy):

map
flatMap
filter
union/intersection
join
reduceByKey
groupByKey

Actions (actually trigger computations):

collect
saveAsTextFile/saveAsSequenceFile
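
A quick illustration of a few of these (assumes a SparkContext sc; the tiny datasets are made up, and result ordering may vary):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    other = sc.parallelize([("a", "x"), ("b", "y")])

    doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))   # transformation: nothing runs yet
    doubled.collect()                                    # action: [('a', 2), ('b', 4), ('a', 6)]
    pairs.reduceByKey(lambda x, y: x + y).collect()      # [('a', 4), ('b', 2)]
    pairs.join(other).collect()                          # [('a', (1, 'x')), ('a', (3, 'x')), ('b', (2, 'y'))]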

Page 25: Spark with Python

Spark with Python

Page 26: Deploying code to the cluster

Deploying code to the cluster

Page 27: Talking to Cluster Manager

Talking to Cluster Manager

Manager can be:

YARN

Mesos

Spark Standalone
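
In code, the manager is selected by the master URL passed when creating the SparkContext. A minimal sketch (host names, ports, and the app name are placeholders):

    from pyspark import SparkConf, SparkContext

    # "yarn"               -> YARN
    # "mesos://host:5050"  -> Mesos
    # "spark://host:7077"  -> Spark Standalone
    # "local[4]"           -> no cluster manager: 4 local threads, handy for testing
    conf = SparkConf().setAppName("MyApp").setMaster("spark://host:7077")
    sc = SparkContext(conf=conf)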

Page 28: RDDs, Stages, Tasks

RDDs, Stages, Tasks

rdd1.join(rdd2)
    .groupBy(…)
    .filter(…)

[Diagram: the scheduling pipeline, left to right:]

RDD Objects: build the operator DAG
DAG Scheduler: split the graph into stages of tasks; submit each stage as ready (DAG → TaskSet)
Task Scheduler: launch tasks via the cluster manager; retry failed or straggling tasks
Worker: execute tasks in threads; store and serve blocks (Block manager)

Page 29: Where does code run?

Where does code run?

Local or Distributed?
» Locally, in the driver
» Distributed, at the executors
» Both at the driver and the executors

The answer:
» Transformations run at the executors
» Actions run at the executors and the driver

Important points:
» Executors run in parallel
» Executors have much more memory

[Diagram: your application (driver program) alongside Workers, each running a Spark executor.]
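
In short (a sketch; rdd and f stand for any RDD and any function):

    rdd.map(f)     # transformation: f executes on the executors, in parallel
    rdd.take(10)   # action: executors compute, results are returned to the driver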

Page 30: Spark Components

Spark Components

Page 31: MLlib algorithms

MLlib algorithms

classification: logistic regression, linear SVM, naïve Bayes, classification tree

regression: generalized linear models (GLMs), regression tree

collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)

clustering: k-means||

decomposition: SVD, PCA

optimization: stochastic gradient descent, L-BFGS

Page 32: Machine Learning Library (MLlib)

Machine Learning Library (MLlib)

70+ contributors in the past year

points = context.sql("select latitude, longitude from tweets")

model = KMeans.train(points, 10)

Page 33: Spark Streaming

Spark Streaming

Run a streaming computation as a series of very small, deterministic batch jobs

• Chop up the live data stream into batches of X seconds

• Spark treats each batch of data as an RDD and processes it using RDD operations

• Finally, the processed results of the RDD operations are returned in batches

[Diagram: live data stream → Spark Streaming (batches of X seconds) → Spark → processed results.]

Page 34: Spark Streaming

Spark Streaming

Run a streaming computation as a series of very small, deterministic batch jobs

• Batch sizes as low as ½ second, latency ~ 1 second

• Potential for combining batch processing and streaming processing in the same system
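
A minimal sketch of the classic streaming word count in this micro-batch model (uses the pyspark.streaming DStream API; host and port are placeholders):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "NetworkWordCount")
    ssc = StreamingContext(sc, 1)                    # batches of 1 second

    lines = ssc.socketTextStream("localhost", 9999)  # live data stream
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda x, y: x + y)) # each batch processed as an RDD
    counts.pprint()                                  # processed results, batch by batch

    ssc.start()
    ssc.awaitTermination()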

Page 35: MLlib + SQL

MLlib + SQL

DataFrames in Spark 1.3 (as of March 2015)

Powerful when coupled with the new pipeline API:

df = context.sql("select latitude, longitude from tweets")

model = pipeline.fit(df)

Page 36: Spark SQL

Spark SQL

// Run SQL statements

val teenagers = context.sql( "SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are RDDs of Row objects

val names = teenagers.map(t => "Name: " + t(0)).collect()

Page 37: GraphX

GraphX

Page 38: GraphX

GraphX

General graph processing library

Build graph using RDDs of nodes and edges

Run standard algorithms such as PageRank

Page 39: MLlib + GraphX

MLlib + GraphX