Data science with Spark on Amazon EMR - Pop-up Loft Tel Aviv

Analytics with Spark on EMR Jonathan Fritz Sr. Product Manager, AWS

Upload: amazon-web-services

Post on 16-Apr-2017





TRANSCRIPT

Page 1: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Analytics with Spark on EMR

Jonathan Fritz

Sr. Product Manager, AWS

Page 2: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Page 3: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Spark moves at interactive speed

[Figure: DAG of RDDs A through F, split into Stage 1, Stage 2, and Stage 3 by map, filter, groupBy, and join operations. Legend: RDD, cached partition]

• Massively parallel

• Uses DAGs instead of map-reduce for execution

• Minimizes I/O by storing data in RDDs in memory

• Partitioning-aware to avoid network-intensive shuffle
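The last bullet can be illustrated without Spark at all. The following is a minimal pure-Python sketch, not Spark code, of why co-partitioning avoids a shuffle: when two keyed datasets are split with the same partitioning function, matching keys already sit in the same partition, so a join can run partition by partition. The function names `partition_by_key` and `local_join` and the sample data are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_by_key(pairs, num_partitions):
    """Place each (key, value) pair in a partition chosen by a stable hash of the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # crc32 is used instead of hash() so the placement is stable across runs.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def local_join(left_parts, right_parts):
    """Join two co-partitioned datasets one partition at a time, with no data movement."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        lookup = defaultdict(list)
        for key, value in right:
            lookup[key].append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                joined.append((key, (value, other)))
    return joined


# Hypothetical sample data: users and their clicks, both keyed by user ID.
users = [("u1", "Ana"), ("u2", "Ben"), ("u3", "Chen")]
clicks = [("u1", "/home"), ("u3", "/cart"), ("u1", "/checkout")]

user_parts = partition_by_key(users, num_partitions=4)
click_parts = partition_by_key(clicks, num_partitions=4)
result = sorted(local_join(user_parts, click_parts))
print(result)
```

Because both datasets use the same partitioner and partition count, partition i of one side only ever needs partition i of the other, which is the property Spark exploits to skip the network transfer.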

Page 4: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Spark components to match your use case

Page 5: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Spark speaks your language

Page 6: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Use DataFrames to easily interact with data

• Distributed collection of data organized in columns

• An extension of the existing RDD API

• Optimized for query execution

Page 7: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Easily create DataFrames from many formats

[Figure: supported source formats, including existing RDDs]

Additional libraries for Spark SQL Data Sources at spark-packages.org

Page 8: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Load data with the Spark SQL Data Sources API

Additional libraries at spark-packages.org

Page 9: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Sample DataFrame manipulations

Page 10: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Use DataFrames for machine learning

• Spark ML libraries (replacing MLlib) use DataFrames as input/output for models

• Create ML pipelines with a variety of distributed algorithms

Page 11: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Create DataFrames on streaming data

• Access data in Spark Streaming DStream

• Create SQLContext on the SparkContext used for the Spark Streaming application for ad hoc queries

• Incorporate DataFrame in Spark Streaming application

• Checkpointing streaming jobs

Page 12: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Spark Pipeline

Page 13: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Use R to interact with DataFrames

• SparkR package for using R to manipulate DataFrames

• Create SparkR applications or interactively use the SparkR shell (no Zeppelin support yet - ZEPPELIN-156)

• Comparable performance to Python and Scala DataFrames

Page 14: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Spark SQL

• Seamlessly mix SQL with Spark programs

• Uniform data access

• Hive compatibility: run Hive queries without modifications using HiveContext

• Connect through JDBC/ODBC using the Spark ThriftServer (coming soon natively in EMR)

Page 15: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Spark architecture

Page 16: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

• SparkContext runs as a library in your program, one instance per Spark app.

• Cluster managers: Standalone, Mesos or YARN

• Accesses storage via Hadoop InputFormat API, and can use S3 with EMRFS, HBase, HDFS, and more

[Figure: your application holds a SparkContext, which runs work on local threads or through a cluster manager; each worker runs a Spark executor and reads from HDFS or other storage]

Page 17: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Page 18: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Amazon EMR runs Spark on YARN

• Dynamically share and centrally configure the same pool of cluster resources across engines

• Schedulers for categorizing, isolating, and prioritizing workloads

• Choose the number of executors to use, or allow YARN to choose (dynamic allocation)

• Kerberos authentication

[Figure: EMR stack. Storage: S3, HDFS. Cluster resource management: YARN. Engines: batch (MapReduce), in-memory (Spark). Applications: Pig, Hive, Cascading, Spark Streaming, Spark SQL]

Page 19: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

RDDs (and now DataFrames) and Fault Tolerance

RDDs track the transformations used to build them (their lineage) to recompute lost data

E.g.:

messages = (textFile(...)
            .filter(lambda s: "ERROR" in s)
            .map(lambda s: s.split('\t')[2]))

Lineage: HadoopRDD (path = hdfs://…) -> FilteredRDD (func = contains(...)) -> MappedRDD (func = split(…))
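As a rough illustration of lineage-based recovery, here is a toy pure-Python sketch. It is not Spark's implementation: the class `LineageRDD`, its methods, and the sample log lines are all hypothetical. Each dataset stores only its parent and the function that derives it, so the records can be rebuilt at any time by replaying the chain from the source.

```python
class LineageRDD:
    """Toy stand-in for an RDD: stores how to build its data, not the data itself."""

    def __init__(self, source=None, parent=None, func=None, label="source"):
        self.source = source    # callable that re-reads the base data
        self.parent = parent    # upstream LineageRDD, if any
        self.func = func        # transformation applied to the parent's records
        self.label = label      # name of the step, used when printing the lineage

    def filter(self, predicate):
        return LineageRDD(parent=self, label="filter",
                          func=lambda rows: [r for r in rows if predicate(r)])

    def map(self, mapper):
        return LineageRDD(parent=self, label="map",
                          func=lambda rows: [mapper(r) for r in rows])

    def lineage(self):
        """Return the chain of steps from the base data to this dataset."""
        steps = [] if self.parent is None else self.parent.lineage()
        return steps + [self.label]

    def compute(self):
        """Build (or rebuild) the records by replaying the lineage from the source."""
        if self.parent is None:
            return self.source()
        return self.func(self.parent.compute())


def read_log():
    # Hypothetical tab-separated log lines: level, timestamp, message.
    return ["ERROR\t10:01\tdisk full", "INFO\t10:02\tok", "ERROR\t10:03\ttimeout"]


messages = (LineageRDD(source=read_log, label="textFile")
            .filter(lambda s: "ERROR" in s)
            .map(lambda s: s.split("\t")[2]))

print(messages.lineage())   # the recorded chain of transformations
print(messages.compute())   # recomputed on demand, e.g. after a partition is lost
```

Nothing is materialized until `compute()` runs, which mirrors the idea that a lost partition costs a recomputation rather than a restore from a replica.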

Page 20: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Caching RDDs can boost performance

Load error messages from a log into memory, then interactively search for patterns

lines = spark.textFile("hdfs://...")    # Base RDD

errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD

messages = errors.map(lambda s: s.split('\t')[2])

messages.cache()

messages.filter(lambda s: "foo" in s).count()    # Action

messages.filter(lambda s: "bar" in s).count()

. . .

[Figure: the driver sends tasks to three workers and collects results; each worker reads one block (Block 1 to 3) and keeps its partition in a local cache (Cache 1 to 3)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
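The effect of caching on repeated queries can be mimicked in plain Python. This is a toy sketch, not Spark: the class `CachedDataset`, the loader function, and the log lines are hypothetical. It counts how often the expensive load runs with and without `cache()`; as in Spark, the cache is filled lazily by the first query that needs the data.

```python
class CachedDataset:
    """Toy dataset that recomputes from its loader unless cache() was called."""

    def __init__(self, loader):
        self._loader = loader
        self._use_cache = False
        self._cached = None
        self.loads = 0   # how many times the expensive loader actually ran

    def cache(self):
        # Only marks the dataset as cacheable; data is stored on first use.
        self._use_cache = True
        return self

    def _rows(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.loads += 1
        rows = self._loader()
        if self._use_cache:
            self._cached = rows
        return rows

    def count_matching(self, word):
        return sum(1 for row in self._rows() if word in row)


def load_error_messages():
    # Stand-in for reading and filtering a large log; the lines are made up.
    lines = ["ERROR\tt1\tfoo failed", "INFO\tt2\tbar ok", "ERROR\tt3\tbar failed"]
    return [line.split("\t")[2] for line in lines if line.startswith("ERROR")]


uncached = CachedDataset(load_error_messages)
uncached.count_matching("foo")
uncached.count_matching("bar")

cached = CachedDataset(load_error_messages).cache()
foo_count = cached.count_matching("foo")
bar_count = cached.count_matching("bar")

print(uncached.loads, cached.loads)   # loader runs per dataset
print(foo_count, bar_count)
```

The uncached dataset reloads for every query, while the cached one loads once and serves later queries from memory, which is where the interactive speedup comes from.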

Page 21: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

RDD Persistence

• Caching or persisting a dataset in memory

• Methods

• cache()

• persist()

• Small RDD: MEMORY_ONLY

• Big RDD: MEMORY_ONLY_SER (CPU intensive)

• Don’t spill to disk

• Use replicated storage for faster recovery

Page 22: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Inside Spark Executor on YARN

Max Container size on node

YARN Container: controls the max sum of memory used by the container

yarn.nodemanager.resource.memory-mb

Default: 116 G

Config File: yarn-site.xml

Page 23: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Inside Spark Executor on YARN

Max Container size on node

Executor space: where the Spark executor runs

Executor Container

Page 24: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Inside Spark Executor on YARN

Max Container size on node

Executor Memory Overhead - Off heap memory (VM overheads, interned strings etc.)

spark.yarn.executor.memoryOverhead = executorMemory * 0.10

Executor Container

Memory Overhead

Config File: spark-defaults.conf

Page 25: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Inside Spark Executor on YARN

Max Container size on node

Spark executor memory - Amount of memory to use per executor process

spark.executor.memory

Executor Container

Memory Overhead

Spark Executor Memory

Config File: spark-defaults.conf
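The container layout described on these slides reduces to simple arithmetic. The sketch below is a back-of-the-envelope estimate under stated assumptions, not an official sizing tool: the function names are hypothetical, the 10g executor size is an invented example, the 384 MB minimum overhead reflects my understanding of Spark 1.x on YARN and is not on the slide, and YARN's rounding of requests to its allocation increment is ignored.

```python
def executor_container_mb(executor_memory_mb, overhead_percent=10, min_overhead_mb=384):
    """Memory YARN must grant for one executor: heap plus off-heap overhead."""
    # Integer math keeps the result exact; 10 percent matches the slide's 0.10 factor.
    overhead_mb = max(min_overhead_mb, executor_memory_mb * overhead_percent // 100)
    return executor_memory_mb + overhead_mb


def executors_per_node(node_memory_mb, executor_memory_mb):
    """How many executor containers fit under yarn.nodemanager.resource.memory-mb."""
    return node_memory_mb // executor_container_mb(executor_memory_mb)


node_memory_mb = 116 * 1024      # the 116 G figure from the slide, read as GiB here
executor_memory_mb = 10 * 1024   # hypothetical spark.executor.memory of 10g

container_mb = executor_container_mb(executor_memory_mb)
fit = executors_per_node(node_memory_mb, executor_memory_mb)
print(container_mb, fit)
```

The point of the calculation is that spark.executor.memory alone understates what each executor costs the node; the overhead has to fit inside the YARN limit as well.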

Page 26: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Inside Spark Executor on YARN

Max Container size on node

Shuffle Memory Fraction – pre-Spark 1.6

Executor Container

Memory Overhead

Spark Executor Memory

spark.shuffle.memoryFraction

Default: 0.2

Page 27: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Inside Spark Executor on YARN

Max Container size on node

Storage Memory Fraction - pre-Spark 1.6

Executor Container

Memory Overhead

Spark Executor Memory

spark.shuffle.memoryFraction

spark.storage.memoryFraction

Default: 0.6
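To see what the two pre-1.6 defaults mean in megabytes, here is a small sketch that splits an executor heap by the nominal 0.6 and 0.2 fractions. It is a simplification: the function name is hypothetical, the 10 GiB heap is an invented example, and the additional safety margins that Spark applied on top of these fractions are deliberately left out.

```python
def legacy_memory_regions_mb(executor_memory_mb, storage_percent=60, shuffle_percent=20):
    """Nominal split of the executor heap under the pre-1.6 static memory model."""
    storage_mb = executor_memory_mb * storage_percent // 100   # cached RDD blocks
    shuffle_mb = executor_memory_mb * shuffle_percent // 100   # shuffle buffers
    other_mb = executor_memory_mb - storage_mb - shuffle_mb    # user code and objects
    return {"storage": storage_mb, "shuffle": shuffle_mb, "other": other_mb}


regions = legacy_memory_regions_mb(10 * 1024)   # hypothetical 10 GiB executor heap
print(regions)
```

With static fractions, a job that caches nothing still leaves the storage share unused, which is the inefficiency the unified memory management in Spark 1.6+ addresses.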

Page 28: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Inside Spark Executor on YARN

Max Container size on node

In Spark 1.6+, Spark automatically balances the amount of memory for execution and cached data.

Executor Container

Memory Overhead

Spark Executor Memory

Execution / Cache

Default: 0.6

Page 29: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Dynamic Allocation on YARN

Scaling up on executors

- Request when you want the job to complete faster

- Idle resources on cluster

- Exponential increase in executors over time

New default in EMR 4.4 (coming soon!)
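The "exponential increase" bullet can be pictured with a short simulation. This is a simplified sketch, not Spark's scheduler: the function `executor_ramp_up` is hypothetical, and it assumes one executor is wanted per pending task, that each request round is granted in full, and that the per-round request grows 1, 2, 4, 8, and so on, as the Spark documentation describes.

```python
def executor_ramp_up(pending_tasks, max_executors, current=0):
    """Return the total executor count after each request round."""
    target = min(pending_tasks, max_executors)
    totals = []
    to_add = 1
    while current < target:
        # Each round asks for twice as many new executors as the previous round.
        current = min(current + to_add, target)
        totals.append(current)
        to_add *= 2
    return totals


ramp = executor_ramp_up(pending_tasks=100, max_executors=17)
print(ramp)
```

Starting small protects the cluster when a job needs only a few executors, while doubling lets a heavy job reach its ceiling in a handful of rounds.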

Page 30: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Dynamic allocation setup

Property                                                   Value
spark.dynamicAllocation.enabled                            true
spark.shuffle.service.enabled                              true
spark.dynamicAllocation.minExecutors                       5
spark.dynamicAllocation.maxExecutors                       17
spark.dynamicAllocation.initialExecutors                   0
spark.dynamicAllocation.executorIdleTimeout                60s
spark.dynamicAllocation.schedulerBacklogTimeout            5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout   5s

Optional: every property after the first two
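One way to apply settings like these to a single job is to pass them as `--conf` flags to spark-submit. The sketch below only builds the command line and does not run anything. The property names follow the spelling in the Spark configuration documentation, the values are copied from the table and should be checked against your Spark version, and the script name `my_job.py` is a placeholder.

```python
# Values copied from the table above; my_job.py is a hypothetical application.
dynamic_allocation = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "5",
    "spark.dynamicAllocation.maxExecutors": "17",
    "spark.dynamicAllocation.initialExecutors": "0",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "5s",
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "5s",
}


def to_submit_args(conf):
    """Turn a property dict into repeated '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


command = ["spark-submit"] + to_submit_args(dynamic_allocation) + ["my_job.py"]
print(" ".join(command))
```

The same key and value pairs could instead be placed in spark-defaults.conf to make them the cluster-wide default.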

Page 31: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Compress your input data set

• Always compress Data Files on Amazon S3

• Reduces storage cost

• Reduces bandwidth between Amazon S3 and Amazon EMR, which can speed up bandwidth-constrained jobs

Page 32: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Compressions

Compression Types:

– Some are fast BUT offer less space reduction

– Some are space efficient BUT slower

– Some are splittable and some are not

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s
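The trade-off in the table becomes easier to compare when applied to a concrete data size. The sketch below is plain arithmetic on the table's figures. It assumes the speeds are measured against uncompressed data, which the slide does not state, and the function name `estimate` and the 1000 MB input are illustrative choices.

```python
# (percent of space remaining, encode MB/s, decode MB/s), copied from the table above.
CODECS = {
    "GZIP": (13, 21, 118),
    "LZO": (20, 135, 410),
    "Snappy": (22, 172, 409),
}


def estimate(codec, raw_mb):
    """Estimate stored size and single-threaded encode/decode time for raw_mb of input."""
    remaining_percent, encode_speed, decode_speed = CODECS[codec]
    return {
        "stored_mb": raw_mb * remaining_percent / 100,
        "encode_s": raw_mb / encode_speed,
        "decode_s": raw_mb / decode_speed,
    }


for name in CODECS:
    result = estimate(name, raw_mb=1000)
    print(name, {key: round(value, 1) for key, value in result.items()})
```

Under these assumptions GZIP stores the least but takes several times longer to encode, while LZO and Snappy trade some extra storage for much faster processing.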

Page 33: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Data Serialization

• Data is serialized when cached or shuffled

Default: Java serializer

• Kryo serialization (up to 10x faster than Java serialization)

• Does not support all Serializable types

• Register the class in advance

Usage: Set in SparkConf

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

Page 34: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Running Spark on Amazon EMR

Page 35: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Focus on deriving insights from your data instead of manually configuring clusters

Easy to install and configure Spark

Secured

Spark submit, Oozie or use Zeppelin UI

Quickly add and remove capacity

Hourly, reserved, or EC2 Spot pricing

Use S3 to decouple compute and storage

Page 36: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Launch the latest Spark version

Spark 1.6.0 is the current version on EMR.

< 3 week cadence with latest open source release

Page 37: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Create a fully configured cluster in minutes

AWS Management Console

AWS Command Line Interface (CLI)

Or use an AWS SDK directly with the Amazon EMR API

Page 38: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Or easily change your settings

Page 39: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Many storage layers to choose from

Amazon DynamoDB: EMR-DynamoDB connector

Amazon RDS: JDBC Data Source w/ Spark SQL

Amazon Kinesis: streaming data connectors

ElasticSearch: ElasticSearch connector

Amazon Redshift: Spark-Redshift connector

Amazon S3: EMR File System (EMRFS)

[Figure: Amazon EMR at the center, connected to each of the storage layers above]

Page 40: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Decouple compute and storage by using S3 as your data layer

S3 is designed for 11 9’s of durability and is massively scalable

[Figure: several Amazon EMR clusters share Amazon S3 as the data layer; within a cluster, each EC2 instance has HDFS and memory]

Page 41: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Easy to run your Spark workloads

Submit a Spark application to Amazon EMR:

Amazon EMR Step API

SSH to master node and use Spark Submit, Oozie or Zeppelin
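For the Step API route, a Spark application is usually described as a step that runs spark-submit through command-runner.jar. The sketch below only builds that step definition as a Python dict. The helper `spark_step`, the step name, the S3 path, and the arguments are hypothetical; the dict layout follows my understanding of the EMR API for release 4.x and later, so treat it as a starting point rather than a reference.

```python
def spark_step(name, app_path, app_args=(), deploy_mode="cluster"):
    """Build one step definition in the shape the EMR Step API expects."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", deploy_mode, app_path, *app_args],
        },
    }


# Hypothetical job location and arguments.
step = spark_step("nightly-etl", "s3://my-bucket/jobs/etl.py", ["--date", "2016-03-01"])
print(step["HadoopJarStep"]["Args"])

# With boto3 installed and credentials configured, the step could be submitted as:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
```

Submitting through a step keeps the job definition outside the cluster, so the same dict can be reused against any cluster ID without an SSH session.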

Page 42: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Secured Spark clusters

Encryption At-Rest

• HDFS transparent encryption (AES 256)

• Local disk encryption for temporary files using LUKS encryption

• EMRFS support for Amazon S3 client-side and server-side encryption

Encryption In-Flight

• Secure communication with SSL from S3 to EC2 (nodes of cluster)

• HDFS blocks encrypted in-transit when using HDFS encryption

• SASL encryption for Spark Shuffle

Permissions

• IAM roles, Kerberos, and IAM Users

Access

• VPC and Security Groups

Auditing

• AWS CloudTrail

Page 43: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Customer use cases

Page 44: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Some of our customers running Spark on EMR

Page 45: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Page 46: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Integration Pattern – ETL with Spark

[Figure: ETL flow between Amazon S3 and Amazon EMR (HDFS). Labels: Extract, Read Unstructured Data, Write Structured, Load from HDFS, Store Output Data]

Page 47: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Integration Pattern – Tumbling Window Reporting

[Figure: Amazon Kinesis (streaming input) -> Amazon EMR (tumbling/fixed window aggregation, HDFS) -> periodic output -> Amazon Redshift (COPY from EMR)]

Or checkpoint to S3 and use the Lambda loader app
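The aggregation at the heart of this pattern is independent of the streaming engine. Here is a minimal pure-Python sketch, not Spark Streaming code, of a tumbling (fixed, non-overlapping) window count. The function name, the 60-second window, and the sample events are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) using fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window, identified by its start time.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical (timestamp in seconds, event type) records.
events = [(0, "click"), (12, "view"), (59, "click"), (60, "click"), (125, "view")]
report = tumbling_window_counts(events, window_seconds=60)
for (window_start, key), n in sorted(report.items()):
    print(window_start, key, n)
```

Because the windows never overlap, each window's totals can be written out once it closes, which fits the periodic output step in the diagram.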

Page 48: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Zeppelin demo

Page 49: Data science with spark on amazon EMR - Pop-up Loft Tel Aviv

Jonathan Fritz

Sr. Product Manager

[email protected]