Chicago Spark Meetup, April 2017 (public)


1© Cloudera, Inc. All rights reserved.

Building Efficient Pipelines in Apache Spark
Guru Medasani

2© Cloudera, Inc. All rights reserved.

Agenda

• Introduction
  • Myself
  • Cloudera

• Spark Pipeline Essentials
  • Using Spark UI
  • Resource Allocation
  • Tuning
  • Data Formats
  • Streaming

• Questions

3© Cloudera, Inc. All rights reserved.

Introduction: Myself

• Current: Senior Solutions Architect at Cloudera (Chicago, IL)
• Past: Big Data Engineer at Monsanto Research & Development (St. Louis, MO)

4© Cloudera, Inc. All rights reserved.

Introduction: Cloudera
The modern platform for data management, machine learning, and advanced analytics

• Founded 2008, by former employees of [company logos]
• Product: first commercial distribution of Hadoop, CDH, shipped 2009
• World-class support: 24x7 global staff and operations in 27 countries
• Proactive & predictive support programs using our EDH
• Mission-critical production deployments in run-the-business applications worldwide: Financial Services, Retail, Telecom, Media, Health Care, Energy, Government
• The largest ecosystem: 2,500+ partners
• Cloudera University: over 45,000 trained
• Open source leaders: Cloudera employees are leading developers and contributors to the complete Apache Hadoop ecosystem of projects

5© Cloudera, Inc. All rights reserved.

Spark Pipeline Essentials: Using Spark UI

6© Cloudera, Inc. All rights reserved.

UI: Event Timeline

7© Cloudera, Inc. All rights reserved.

UI: Job Details - DAG

8© Cloudera, Inc. All rights reserved.

UI: Stage Details

9© Cloudera, Inc. All rights reserved.

UI: Stage Metrics

10© Cloudera, Inc. All rights reserved.

UI: Skewed Data Metrics - Example

11© Cloudera, Inc. All rights reserved.

UI: Job Labels and Storage

12© Cloudera, Inc. All rights reserved.

UI: Job Labels and RDD Names
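
These views are much easier to navigate when jobs and cached RDDs carry readable labels. A minimal sketch (not from the slides) using the standard SparkContext and RDD methods; the group name, description, and path are made up:

// Label the jobs triggered below so they stand out in the Jobs tab
sc.setJobGroup("daily-etl", "Parse and aggregate click logs")

// Name the RDD so the Storage tab shows "clicks" instead of an anonymous id
val clicks = sc.textFile("hdfs:///data/clicks").setName("clicks").cache()
clicks.count()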

13© Cloudera, Inc. All rights reserved.

UI: DataFrame and Dataset Names

https://issues.apache.org/jira/browse/SPARK-8480

14© Cloudera, Inc. All rights reserved.

UI: Skipped Stages

http://stackoverflow.com/questions/34580662/what-does-stage-skipped-mean-in-apache-spark-web-ui

15© Cloudera, Inc. All rights reserved.

UI: Using Shuffle Metrics

16© Cloudera, Inc. All rights reserved.

Lots more in the UI

• SQL Queries
• Environment Variables
• Executor Aggregates

17© Cloudera, Inc. All rights reserved.

Spark Pipeline Essentials: Resource Allocation

18© Cloudera, Inc. All rights reserved.

Resources: Basics

• If running Spark on YARN
  • First step: set up proper YARN resource queues and dynamic resource pools

19© Cloudera, Inc. All rights reserved.

Resources: Dynamic Allocation

• Dynamic allocation allows Spark to dynamically scale the cluster resources allocated to your application based on the workload.
• Originally available only for Spark on YARN; now supported by all cluster managers

20© Cloudera, Inc. All rights reserved.

Static Allocation vs Dynamic Allocation

• Static Allocation
  • --num-executors NUM

• Dynamic Allocation
  • Enabled by default in CDH
  • A good starting point
  • Not the final solution
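
A minimal sketch of the two modes, assuming Spark on YARN with the external shuffle service available; the application names and executor counts are illustrative only:

import org.apache.spark.SparkConf

// Static allocation: fix the executor count up front
// (equivalent to spark-submit --num-executors 10)
val staticConf = new SparkConf()
  .setAppName("static-allocation-example")
  .set("spark.executor.instances", "10")

// Dynamic allocation: let Spark grow and shrink the executor count with the workload
val dynamicConf = new SparkConf()
  .setAppName("dynamic-allocation-example")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")       // required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")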

21© Cloudera, Inc. All rights reserved.

Dynamic Allocation in Spark Streaming

• Enabled by default in CDH
• Cloudera recommends disabling dynamic allocation for Spark Streaming

• Why?
  • Dynamic allocation removes executors when they are idle.
  • Data comes in every batch, and executors run whenever data is available.
  • If the executor idle timeout is less than the batch duration, executors are constantly being added and removed.
  • If the executor idle timeout is greater than the batch duration, executors are never removed.
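
If you follow that recommendation, the fix is simply to turn dynamic allocation off for the streaming job and size it statically; a sketch with illustrative values:

import org.apache.spark.SparkConf

// Disable dynamic allocation for streaming to avoid the add/remove churn described above
val streamingConf = new SparkConf()
  .setAppName("streaming-example")
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.executor.instances", "6")   // size statically for the steady-state batch load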

22© Cloudera, Inc. All rights reserved.

Resources: # Executors, cores, memory !?!

• 6 nodes
• 16 cores each
• 64 GB of RAM each

23© Cloudera, Inc. All rights reserved.

Decisions, decisions, decisions

• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)

• 6 nodes
• 16 cores each
• 64 GB of RAM

24© Cloudera, Inc. All rights reserved.

Spark Architecture recap

25© Cloudera, Inc. All rights reserved.

Answer #1 – Most granular

• Have the smallest-sized executors possible: 1 core each
• 64 GB/node ÷ 16 executors/node = 4 GB/executor
• Total of 16 cores × 6 nodes = 96 cores => 96 executors

(Diagram: a worker node hosting many small single-core executors)


27© Cloudera, Inc. All rights reserved.

Why?

• Not using the benefits of running multiple tasks in the same executor
• Missing the benefits of shared broadcast variables; more copies of the data are needed

28© Cloudera, Inc. All rights reserved.

Answer #2 – Least granular

• 6 executors in total => 1 executor per node
• 64 GB memory each
• 16 cores each

(Diagram: a worker node running a single large executor)


30© Cloudera, Inc. All rights reserved.

Why?

• Need to leave some memory overhead for OS/Hadoop daemons

31© Cloudera, Inc. All rights reserved.

Answer #3 – with overhead

• 6 executors: 1 executor per node
• 63 GB memory each
• 15 cores each

(Diagram: a worker node with a single large executor plus 1 GB and 1 core of overhead)


33© Cloudera, Inc. All rights reserved.

Let’s assume…

• You are running Spark on YARN from here on
• There are 4 other things to keep in mind

34© Cloudera, Inc. All rights reserved.

#1 – Memory overhead

• --executor-memory controls the heap size
• Some overhead (controlled by spark.yarn.executor.memoryOverhead) is needed for off-heap memory
• Default is max(384 MB, 0.10 × spark.executor.memory)
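
A small sketch of how the default works out and how to override it, assuming YARN; the 19 GB heap matches the sizing later in the deck and the override value is only an example:

import org.apache.spark.SparkConf

// Heap requested per executor (spark-submit --executor-memory 19g)
val executorMemoryGb = 19.0

// Default off-heap overhead: max(384 MB, 10% of executor memory) => ~1945 MB here
val defaultOverheadMb = math.max(384, (0.10 * executorMemoryGb * 1024).toInt)

// Override explicitly if the default is not enough (value is in MB)
val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "2048")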

35© Cloudera, Inc. All rights reserved.

#2 - YARN AM needs a core: Client mode

36© Cloudera, Inc. All rights reserved.

#2 YARN AM needs a core: Cluster mode

37© Cloudera, Inc. All rights reserved.

#3 HDFS Throughput

• 15 cores per executor can lead to bad HDFS I/O throughput.
• Best to keep it to about 5 cores per executor or fewer

38© Cloudera, Inc. All rights reserved.

#4 Garbage Collection

• Too much executor memory could cause excessive garbage collection delays.
• 64 GB is a rough guess at a good upper limit for a single executor.
• When you reach this level, you should start looking at GC tuning.
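
When you do hit GC trouble, a common first step (general JVM practice, not something specific to this deck) is to log GC activity and try the G1 collector via the executor JVM options; a hedged sketch:

import org.apache.spark.SparkConf

// Turn on GC logging and switch the executors to G1GC
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")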

39© Cloudera, Inc. All rights reserved.

Calculations

• 5 cores per executor, for max HDFS throughput

• Cluster has 6 × 15 = 90 cores in total (after taking out the Hadoop/YARN daemon cores)
• 90 cores ÷ 5 cores/executor = 18 executors
• Each node has 3 executors
• 63 GB / 3 = 21 GB; 21 × (1 - 0.07) ≈ 19 GB (leaving room for memory overhead)
• 1 executor for the AM => 17 executors

(Diagram: a worker node with 3 executors plus overhead)

40© Cloudera, Inc. All rights reserved.

Correct answer

• 17 executors in total
• 19 GB memory/executor
• 5 cores/executor

* Not etched in stone

(Diagram: a worker node with 3 executors plus overhead)
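
Expressed as configuration, the sizing above looks roughly like this; a sketch only, since, as the slide says, the numbers are not etched in stone:

import org.apache.spark.SparkConf

// Equivalent spark-submit flags:
//   --num-executors 17 --executor-cores 5 --executor-memory 19g
val conf = new SparkConf()
  .set("spark.executor.instances", "17")
  .set("spark.executor.cores", "5")
  .set("spark.executor.memory", "19g")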

41© Cloudera, Inc. All rights reserved.

Dynamic allocation helps with this though, right?

• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)

• 6 nodes
• 16 cores each
• 64 GB of RAM

42© Cloudera, Inc. All rights reserved.

Spark Pipeline Essentials: Tuning

43© Cloudera, Inc. All rights reserved.

Memory: Unified Memory Management

https://issues.apache.org/jira/browse/SPARK-10000

44© Cloudera, Inc. All rights reserved.

Memory: Example

• Let’s say you have a 64 GB executor.
  • Default spark.memory.fraction: 0.6 => 0.6 × 64 = 38.4 GB
  • Default spark.memory.storageFraction: 0.5 => 0.5 × 38.4 = 19.2 GB

• Based on how much data is being spilled, GC pauses, and OOMEs, you can take the following actions:
  1. Increase the number of executors (increasing parallelism)
  2. Tweak spark.yarn.executor.memoryOverhead (avoid OOMEs)
  3. Tweak spark.memory.fraction (reduces memory pressure and spilling)
  4. Tweak spark.memory.storageFraction (what you think is right, not excessive)
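
A sketch of how those knobs are set in code; the values are placeholders, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "24")               // 1. more executors => more parallelism
  .set("spark.yarn.executor.memoryOverhead", "3072")   // 2. extra off-heap headroom (MB) to avoid OOMEs
  .set("spark.memory.fraction", "0.7")                 // 3. more of the heap for execution and storage
  .set("spark.memory.storageFraction", "0.4")          // 4. how much of that is protected for caching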

45© Cloudera, Inc. All rights reserved.

Memory: Hidden Caches (GraphX)

org.apache.spark.graphx.lib.PageRank

46© Cloudera, Inc. All rights reserved.

Memory: Hidden Caches (MLlib)

47© Cloudera, Inc. All rights reserved.

Parallelism

• The number of tasks depends on the number of partitions
  • Too many partitions is usually better than too few
  • A very important parameter in determining performance
• Datasets read from HDFS rely on the number of HDFS blocks
  • Typically each HDFS block becomes a partition of the RDD

• The user can specify the number of partitions at input time or in transformations:

val rdd2 = rdd1.reduceByKey(_ + _, numPartitions = X)

• What should X be?
  • The most straightforward answer is experimentation
  • Look at the number of partitions in the parent RDD and keep multiplying that by 1.5 until performance stops improving
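
Beyond reduceByKey's numPartitions argument, the partition count can also be set when reading the data and reshaped afterwards; a small sketch (the path and counts are illustrative):

// Ask for more input partitions than the default HDFS-block-based split
val events = sc.textFile("hdfs:///data/events", minPartitions = 400)

// Reshape an existing RDD: repartition() shuffles, coalesce() only merges partitions
val wider  = events.repartition(800)
val narrow = events.coalesce(100)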

48© Cloudera, Inc. All rights reserved.

How about the cluster?

• The two main resources that Spark (and YARN) think about are CPU and memory
• Disk and network I/O, of course, play a part in Spark performance as well
• But neither Spark nor YARN currently does anything to actively manage them

49© Cloudera, Inc. All rights reserved.

Further Tuning

• Slimming down your data structures
  • The in-memory footprint of your data structures greatly impacts performance
  • Kryo serialization is preferred over the default Java serialization for custom objects
• Cache the data in memory to figure out the dataset size, then estimate record sizes
  • Example: (total cached RDD size) / (number of records in the RDD)
  • Gives a rough estimate of how much memory your records occupy
  • If you created a custom object after several transformations, this is the easiest way to get its size
• You can also use SizeEstimator's estimate method to find an object's size

50© Cloudera, Inc. All rights reserved.

Spark Pipeline Essentials: Data Formats

51© Cloudera, Inc. All rights reserved.

Data Formats

• Parquet
• Avro
• JSON
  • Avoid it if you can
  • Needless CPU cycles are spent parsing large text files again and again

52© Cloudera, Inc. All rights reserved.

Storage: Parquet

• Popular columnar format for analytical workloads
• Great performance
• Efficient compression
• Partition discovery & schema merging
• Writes files into HDFS
  • Small-files problem: needs monitoring and compaction management
  • Makes the ETL pipeline complex when handling updates
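
A minimal DataFrame read/write sketch for Parquet, assuming an existing SparkSession (spark) and DataFrame (df); the column and path names are made up:

// Write partitioned Parquet; frequent small appends are what creates the small-files problem
df.write
  .partitionBy("event_date")
  .mode("append")
  .parquet("hdfs:///warehouse/events_parquet")

// Partition discovery and schema merging on read
val events = spark.read
  .option("mergeSchema", "true")
  .parquet("hdfs:///warehouse/events_parquet")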

53© Cloudera, Inc. All rights reserved.

Storage: Kudu

• Open source distributed columnar data store
• Runs on the native Linux filesystem
• Currently GA and ships with CDH
• Similar performance to Parquet
• Handles updates; no need to worry about files anymore
• Scales well
• Spark integration via KuduContext

https://www.cloudera.com/products/open-source/apache-hadoop/apache-kudu.html
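
A hedged sketch of the Spark integration via KuduContext using the kudu-spark connector; the master address, table name, and DataFrame (df) are placeholders:

import org.apache.kudu.spark.kudu._

val kuduMaster = "kudu-master:7051"
val kuduContext = new KuduContext(kuduMaster, spark.sparkContext)

// Upserts are what make update-heavy pipelines simpler than managing Parquet files by hand
kuduContext.upsertRows(df, "impala::default.events")

// Read the table back as a DataFrame
val events = spark.read
  .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> "impala::default.events"))
  .kudu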

54© Cloudera, Inc. All rights reserved.

Spark Pipeline Essentials: Streaming

55© Cloudera, Inc. All rights reserved.

Streaming: Spark & Kafka Integration

• Use the direct approach
  • Simplified parallelism
  • Efficient and more reliable
  • Exactly-once semantics
  • Requires offset management
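
A sketch of the direct approach with the spark-streaming-kafka-0-10 integration, assuming an existing StreamingContext (ssc); broker addresses, group id, and topic are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092,broker2:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-meetup-demo",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// One Kafka partition maps to one Spark partition: the "simplified parallelism" above
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))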

56© Cloudera, Inc. All rights reserved.

Streaming: Kafka Offset Management

• Set the Kafka parameter ‘auto.offset.reset’
• Spark Streaming checkpoints
• Storing offsets in HBase
• Storing offsets in ZooKeeper
• Kafka itself
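
For the "Kafka itself" option, the 0-10 integration can commit the offsets of each processed batch back to Kafka; a hedged sketch continuing from the stream created above:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // Offsets covered by this batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch here ...

  // Commit only after the batch's work has succeeded
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}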

57© Cloudera, Inc. All rights reserved.

More Resources

• Top 5 Spark Mistakes
  • https://spark-summit.org/2016/events/top-5-mistakes-when-writing-spark-applications/

• Self-paced Spark workshop
  • https://github.com/deanwampler/spark-workshop

• Tips for better Spark jobs
  • http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
  • http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

• Tuning & Debugging Spark (with another explanation of internals)
  • http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spar

58© Cloudera, Inc. All rights reserved.

Questions?

59© Cloudera, Inc. All rights reserved.

Thank you
Email: gmedasani@cloudera.com
Twitter: @gurumedasani
