Learning Spark ch07 - Running on a Cluster

Chapter 07: Running on a Cluster, from Learning Spark by Holden Karau et al.

Upload: phanleson

Post on 26-Jan-2017


TRANSCRIPT

Page 1: Learning spark ch07 - Running on a Cluster

Chapter 07: Running on a Cluster

Learning Spark by Holden Karau et al.

Page 2:

Overview: Running on a Cluster

- Introduction
- Spark Runtime Architecture
  - The Driver
  - Executors
  - Cluster Manager
  - Launching a Program
  - Summary
- Deploying Applications with spark-submit
- Packaging Your Code and Dependencies
- Scheduling Within and Between Spark Applications
- Cluster Managers
  - Standalone Cluster Manager
  - Hadoop YARN
  - Apache Mesos
  - Amazon EC2
- Which Cluster Manager to Use?
- Conclusion

Page 3:

7.1 Introduction

This chapter explains the runtime architecture of a distributed Spark application, then discusses options for running Spark in distributed clusters:

- Hadoop YARN
- Apache Mesos
- Spark's own built-in Standalone cluster manager

It discusses the trade-offs and configurations required for running in each case, and covers the “nuts and bolts” of scheduling, deploying, and configuring a Spark application.

Page 4:

7.2 Spark Runtime Architecture

Page 5:

Edx and Coursera Courses

- Introduction to Big Data with Apache Spark
- Spark Fundamentals I
- Functional Programming Principles in Scala

Page 6:

7.2.1 The Driver

The driver is the process where the main() method of your program runs. It is the process running the user code that creates a SparkContext, creates RDDs, and performs transformations and actions.

When the driver runs, it performs two duties:
- Converting a user program into tasks
- Scheduling tasks on executors

Page 7:

7.2.2 Executors

Spark executors are worker processes responsible for running the individual tasks in a given Spark job.

Executors have two roles:
- Run the tasks that make up the application and return results to the driver.
- Provide in-memory storage for RDDs that are cached by user programs, through a service called the Block Manager that lives within each executor.

Page 8:

7.2.3 Cluster Manager

Spark depends on a cluster manager to launch executors and, in certain cases, to launch the driver.

The cluster manager is a pluggable component in Spark.

This allows Spark to run on top of different external managers, such as YARN and Mesos, as well as its built-in Standalone cluster manager.

Page 9:

7.2.4 Launching a Program

No matter which cluster manager you use, Spark provides a single script you can use to submit your program to it, called spark-submit.

Through various options, spark-submit can connect to different cluster managers and control how many resources your application gets.

For some cluster managers, spark-submit can run the driver within the cluster (for example, on a YARN worker node), while for others, it can run it only on your local machine.

Page 10:

7.2.5 Summary

The exact steps that occur when you run a Spark application on a cluster:

1. The user submits an application using spark-submit.
2. spark-submit launches the driver program and invokes the main() method specified by the user.
3. The driver program contacts the cluster manager to ask for resources to launch executors.
4. The cluster manager launches executors on behalf of the driver program.
5. The driver process runs through the user application. Based on the RDD actions and transformations in the program, the driver sends work to executors in the form of tasks.
6. Tasks are run on executor processes to compute and save results.
7. If the driver’s main() method exits or it calls SparkContext.stop(), it will terminate the executors and release resources from the cluster manager.

Page 11:

7.3 Deploying Applications with spark-submit

Spark provides a single tool for submitting jobs across all cluster managers, called spark-submit.

Page 12:

7.3 Deploying Applications with spark-submit (cont'd)

The general format for spark-submit is shown in Example 7-3.
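The example itself did not survive the transcript; the general shape of a spark-submit invocation is sketched below. The cluster URL, application name, and script name are illustrative placeholders, not values from the original slide:

```shell
# General format:
#   bin/spark-submit [options] <app jar | python file> [app options]
#
# A sketch with common options (placeholder values):
bin/spark-submit \
  --master spark://masternode:7077 \
  --deploy-mode client \
  --name "My App" \
  --executor-memory 2g \
  my_script.py some-app-argument
```

Options come before the application JAR or Python file; anything after the application itself is passed through as arguments to your program.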

Page 13:

7.3 Deploying Applications with spark-submit (cont'd)

Page 14:

Packaging Your Code and Dependencies

You need to ensure that all your dependencies are present at the runtime of your Spark application.

For Python users:
- You can install dependency libraries directly on the cluster machines using standard Python package managers.
- Submit individual libraries using the --py-files argument to spark-submit.

For Java and Scala users:
- Submit individual JAR files using the --jars flag to spark-submit.
- The most popular build tools for Java and Scala are Maven and sbt (Scala build tool).
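The two dependency flags above can be sketched as follows; all file and class names are hypothetical placeholders:

```shell
# Python: ship individual libraries alongside the application via --py-files:
bin/spark-submit --py-files helpers.py,deps.egg my_script.py

# Java/Scala: add extra JARs to the driver and executor classpaths via --jars:
bin/spark-submit --jars dep1.jar,dep2.jar --class com.example.MyApp my-app.jar
```

Both flags take comma-separated lists, and the files are distributed to every executor automatically.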

Page 15:

7.4 Scheduling Within and Between Spark Applications

Scheduling policies help ensure that resources are not overwhelmed and allow for prioritization of workloads.

For scheduling in multitenant clusters, Spark primarily relies on the cluster manager to share resources between Spark applications.

Page 16:

7.5 Cluster Managers

Spark can run over a variety of cluster managers to access the machines in a cluster:
- The built-in Standalone mode is the easiest way to deploy.
- Spark can also run over two popular cluster managers: Hadoop YARN and Apache Mesos.
- Finally, for deploying on Amazon EC2, Spark comes with built-in scripts that launch a Standalone cluster and various supporting services.

Page 17:

7.5.1 Standalone Cluster Manager

Spark’s Standalone manager offers a simple way to run applications on a cluster. It consists of a master and multiple workers, each with a configured amount of memory and CPU cores.

Submitting applications:
spark-submit --master spark://masternode:7077 yourapp

This cluster URL is also shown in the Standalone cluster manager’s web UI, at http://masternode:8080.

You can also launch spark-shell or pyspark against the cluster in the same way, by passing the --master parameter:
spark-shell --master spark://masternode:7077
pyspark --master spark://masternode:7077

Page 18:

7.5.1 Standalone Cluster Manager (cont'd)

Configuring resource usage

In the Standalone cluster manager, resource allocation is controlled by two settings:
- Executor memory: set via the --executor-memory argument to spark-submit.
- The maximum number of total cores: set via the --total-executor-cores argument to spark-submit, or by configuring spark.cores.max in your Spark configuration file.

High availability

The Standalone mode will gracefully support the failure of worker nodes. Spark supports using Apache ZooKeeper (a distributed coordination system) to keep multiple standby masters and switch to a new one when any of them fails.
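The two Standalone resource settings on this slide can be passed together on one command line; the values below are illustrative, not recommendations:

```shell
# Cap each executor at 1 GB of memory and the whole application at 8 cores:
bin/spark-submit \
  --master spark://masternode:7077 \
  --executor-memory 1g \
  --total-executor-cores 8 \
  yourapp.jar
```

The same core cap can instead be set persistently with spark.cores.max in the application's Spark configuration.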

Page 19:

7.5.2 Hadoop YARN

YARN is a cluster manager introduced in Hadoop 2.0 that allows diverse data processing frameworks to run on a shared resource pool, and is typically installed on the same nodes as the Hadoop filesystem (HDFS).

Running Spark on YARN in these environments is useful because it lets Spark access HDFS data quickly, on the same nodes where the data is stored.

Spark’s interactive shell and pyspark both work on YARN as well; simply set HADOOP_CONF_DIR and pass --master yarn to these applications:

export HADOOP_CONF_DIR="..."
spark-submit --master yarn yourapp

Page 20:

7.5.2 Hadoop YARN (cont'd)

Configuring resource usage:
- When running on YARN, Spark applications use a fixed number of executors, which you can set via the --num-executors flag to spark-submit.
- Set the memory used by each executor via --executor-memory and the number of cores it claims from YARN via --executor-cores.
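Combining the YARN flags above in one submission might look like the sketch below; the HADOOP_CONF_DIR path and the resource values are assumptions, not values from the slide:

```shell
# Assumed location of the Hadoop configuration directory:
export HADOOP_CONF_DIR=/etc/hadoop/conf

# 10 executors, each with 2 GB of memory and 2 cores claimed from YARN:
bin/spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-memory 2g \
  --executor-cores 2 \
  yourapp.jar
```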

Page 21:

7.5.3 Apache Mesos

Apache Mesos is a general-purpose cluster manager that can run both analytics workloads and long-running services (e.g., web applications or key/value stores) on a cluster:

spark-submit --master mesos://masternode:5050 yourapp

Mesos scheduling modes: client and cluster mode.

Configuring resource usage

You can control resource usage on Mesos through two parameters to spark-submit: --executor-memory, to set the memory for each executor, and --total-executor-cores, to set the maximum number of CPU cores for the application to claim (across all executors).
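Putting the two Mesos parameters together, a submission might look like this sketch (resource values are illustrative):

```shell
# 2 GB per executor, at most 20 cores across all executors:
bin/spark-submit \
  --master mesos://masternode:5050 \
  --executor-memory 2g \
  --total-executor-cores 20 \
  yourapp.jar
```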

Page 22:

7.5.4 Amazon EC2

Spark comes with a built-in script to launch clusters on Amazon EC2.

Launching a cluster:
# Launch a cluster with 5 slaves of type m3.xlarge
./spark-ec2 -k mykeypair -i mykeypair.pem -s 5 -t m3.xlarge launch mycluster

Logging in to a cluster:
./spark-ec2 -k mykeypair -i mykeypair.pem login mycluster

Destroying a cluster:
./spark-ec2 destroy mycluster

Pausing and restarting clusters:
./spark-ec2 stop mycluster
./spark-ec2 -k mykeypair -i mykeypair.pem start mycluster

Storage on the cluster:
Spark EC2 clusters come configured with two installations of the Hadoop filesystem that you can use for scratch space:
- An “ephemeral” HDFS installation using the ephemeral drives on the nodes.
- A “persistent” HDFS installation on the root volumes of the nodes.

Page 23:

7.6 Which Cluster Manager to Use?

The cluster managers supported in Spark offer a variety of options for deploying applications:
- Start with a Standalone cluster if this is a new deployment.
- If you would like to run Spark alongside other applications, or to use richer resource-scheduling capabilities (e.g., queues), both YARN and Mesos provide these features.
- One advantage of Mesos over both YARN and Standalone mode is its fine-grained sharing option.
- In all cases, it is best to run Spark on the same nodes as HDFS for fast access to storage.

Page 24:

Edx and Coursera Courses

- Introduction to Big Data with Apache Spark
- Spark Fundamentals I
- Functional Programming Principles in Scala

Page 25:

7.7 Conclusion

This chapter described the runtime architecture of a Spark application, composed of a driver process and a distributed set of executor processes.

We then covered how to build, package, and submit Spark applications, and the common deployment environments for Spark:
- Its built-in Standalone cluster manager
- Running Spark with YARN or Mesos
- Running Spark on Amazon’s EC2 cloud