
Spark Working Environment in Windows OS

Mohammed Zuhair Al-Taie - Big Data Centre - Universiti Teknologi Malaysia - 2016

Driver: Spark Driver is the process that contains the SparkContext

Spark Context: Spark operates in a basic client-server model. The SparkContext is responsible for issuing tasks to the cluster manager for execution. The Driver Program (which can be a Scala, IPython, or R shell) typically runs on a laptop or client machine, although this is not a requirement, and the SparkContext (usually referred to as sc) lives inside it. In my own experiments, the driver program always runs on my local computer, although it can run from any other machine, even one inside a cluster. One difference from the Hadoop MapReduce execution model is that the Driver Program manages a lot of the metadata about which tasks to execute and the results that come back from them, whereas in Hadoop the master node (which lives inside the cluster) is responsible for the metadata of tasks and data. In Hadoop, the master node executes batch jobs, while in Spark, where we use an interactive REPL, the Driver Program and the SparkContext often live on the same machine, whether a laptop or another machine.

Cluster Manager: In addition to the Driver Program, which issues commands, there is a Cluster Manager, which in Spark can be the built-in Standalone manager, Hadoop YARN, or Apache Mesos. The Standalone manager is usually a good choice; YARN and Mesos are best if we want to connect to other frameworks such as Hadoop, HBase, or Hive. A cluster manager cannot be effective without a cluster to manage: it connects to one or more worker nodes, each of which has an Executor and a cache (RAM) and tasks to execute.

Executor: process that executes one or more Spark tasks

Master: process that manages applications across the cluster

Spark Worker: process that manages executors on a particular node
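
As a minimal sketch of these roles (assuming a local PySpark installation; the names are illustrative only), the driver is simply the Python process that creates the SparkContext, and the worker threads act as the executors:

from pyspark import SparkContext

sc = SparkContext('local[2]', 'terminology-demo')   # this Python process is the Driver; sc is the SparkContext
rdd = sc.parallelize([1, 2, 3, 4])                   # work is split into tasks across 2 worker threads (the executors)
total = rdd.sum()                                    # the driver collects the result (10) back from the executors
sc.stop()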

Some Terminology of Spark (1)

Some Terminology of Spark (2)

In Summary:

The Spark driver sends transformations to the cluster manager, and the cluster manager sends the computation to the appropriate data. Spark intelligently pipelines tasks into batched computations to avoid sending data over the network; certain transformations, however, force a wide dependency.

An RDD can be defined as:

1. A set of partitions of the current RDD (the data)
2. A list of dependencies on parent RDDs
3. A function to compute the partitions (functional paradigm)
4. A partitioner, to optimize execution
5. A set of potential preferred locations for the partitions

The first 3 points are required for any RDD; the last two are optional (used for optimization).

Building RDDs can be done in several ways: sc.parallelize, from Hive, from an external file such as a text file, or from JDBC, Cassandra, HBase, JSON, CSV, sequence files, object files, or various compressed formats. An RDD can also be created from another RDD using any of the transformation methods.

To determine the parent(s) of a new child RDD: RDDname.toDebugString()

To find the exact number of partitions of any RDD, type: RDDname.getNumPartitions(). In the web UI at localhost:4040, the "Storage" section shows the number of partitions of each cached RDD, and in the "Jobs" section you will see a job that runs on 2 partitions reported as 2/2 rather than 1/1.
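
A quick, hedged illustration of these commands in the PySpark shell (the RDD names are just examples):

rdd = sc.parallelize(range(100), 2)     # explicitly ask for 2 partitions
rdd.getNumPartitions()                  # returns 2
mapped = rdd.map(lambda x: x * 2)       # a child RDD created by a transformation
print(mapped.toDebugString())           # prints the lineage back to the parent ParallelCollectionRDD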

SPARK RDDs

1. Python installation from Anaconda (the Python version should be 2.7 only!)

2. Spark binary. Select a package pre-built for Hadoop (e.g. pre-built for Hadoop 2.4 and later) and choose the latest Spark version.

Download the binary version, not the source version (to avoid having to compile it).

Download the version that is pre-built for Hadoop.

Download website: http://spark.apache.org/downloads.html

The file is downloaded in the tgz format, which is a common compression format in the Linux and Unix world (as opposed to zip on Windows). WinRAR is able to extract such files.

After downloading the file, keep it somewhere easy to access (e.g. the desktop). In addition, we will later add the folder path to the environment variables.

3. Java JDK 6/7.

To know which version of Java is required for the Spark installation, visit spark.apache.org/docs/latest/

4. Install scientific Python. IPython (the old name for Jupyter) is integrated with the Anaconda distribution.

5. Install py4j (to connect PySpark with Java) from the cmd command line: pip install py4j.

6. Optionally, install IRkernel (for Jupyter) to write R code in Jupyter.

NOTES:

1. Cloudera and Hortonworks provide Linux virtual machines with Spark pre-installed. This allows running Linux with Spark installed if you have VirtualBox or VMware.

2. Installing Spark with Homebrew (on OS X) or Cygwin (on Windows) is not recommended, although both are great at installing packages other than Spark.
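
As an optional sanity check after the installations above (a sketch only; it assumes java is on the PATH), the Python-side prerequisites can be verified from any Python shell:

import subprocess
import py4j                              # should import cleanly after 'pip install py4j'

print(py4j.__version__)                  # confirms py4j is visible to Python
subprocess.call(['java', '-version'])    # prints the installed JDK version (to stderr)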

Installation Requirements

We need to set some environment variables on our system. The goal is to let Windows know where to find Spark and the other components.

The following applies to setting environment variables on Windows. Linux and Mac OS have their own ways of setting environment variables.

1. First, do all the necessary installations (Python, Anaconda, PySpark, Java, py4j, IRkernel) as stated before.

2. Download Apache Spark from its official website and decompress it.

3. Set some environment variables on the system. Go to: Control Panel --> System and Security --> System --> Advanced system settings --> Environment Variables. There are two sections: the upper one holds user variables and the lower one holds system variables.

4. Create a new user variable called 'SPARK_HOME' in the 'user variables' section that contains the full path to the unzipped Spark folder (for example: C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6). This is important because we want to tell Windows where Spark is installed.

Setting Environment Variables (1)

5. Add the new 'SPARK_HOME' variable to the end of the 'Path' variable (user variables section) like this: ;%SPARK_HOME%\bin. This allows running Spark from any directory without having to type its full path.

6. Add the Python path to the end of 'Path' in the 'system variables' section (variable: Path, value: ...;C:\Python27). This is not always necessary, but it helps Python run if it is not responding.

7. Another important step: create a new folder in the C: drive and name it winutils. Inside that folder create another folder named bin, and inside the bin folder place winutils.exe, an executable that you can download from the internet. This file is important for Spark to work on Windows, because Spark expects Hadoop (or at least the part of it represented by winutils.exe) to be installed. Installing full Hadoop instead also works. Next, we need to tell Spark where that file is: in the user variables section, add a new environment variable called HADOOP_HOME whose value points to the winutils folder, in this case C:\winutils (Spark will look for bin\winutils.exe under it).

8. OPTIONAL: change the Spark configuration by going into the conf folder inside the Spark folder. The goal is to get rid of the many log messages that appear during execution. In the conf folder, open the file log4j.properties.template with WordPad and change log4j.rootCategory from INFO to WARN. After that, rename the file to log4j.properties (dropping the .template extension).
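
A quick way to confirm from any Python shell that the variables above were picked up (a sketch only; your paths will differ):

import os

print(os.environ.get('SPARK_HOME'))      # e.g. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6
print(os.environ.get('HADOOP_HOME'))     # e.g. C:\winutils
# If either prints None, re-check the environment variables and reopen the shell.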

Setting Environment Variables (2)

In Spark, there are 3 modes of programming: batch mode, interactive mode (using a shell), and streaming mode.

Only Python and Scala have shells for Spark (i.e. Java cannot be run interactively from the command line).

The Spark Python shell is a normal Python REPL that is also connected to the Spark cluster underneath.

To run the Python shell in the command prompt (cmd), cd (change directory) to where the Spark folder is unzipped, then cd into the 'bin' folder inside it and execute 'pyspark'. If the cmd environment is not responding, try launching it in administrator mode.

To run the Scala shell, type spark-shell from within the bin folder.

Spark Shell

Spark Shell Demonstration

1. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master local (for local mode). This is the mode used most of the time (a script-level equivalent is sketched after this list). Other valid local modes include:

1. pyspark --master local[k], where k = 2, 3, ... is the number of worker threads.

2. pyspark --master local[*], where * corresponds to the number of available cores.

3. [*] is the default for the pyspark shell.

2. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master mesos://host:port (for Mesos mode)

3. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master yarn (for YARN mode)

4. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master spark://host:port (for Spark standalone mode)
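
The same choice of master can also be made inside a script or notebook instead of on the command line; a minimal sketch (the app name is arbitrary):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster('local[*]').setAppName('master-demo')
sc = SparkContext(conf=conf)
print(sc.master)    # confirms which master this context is connected to
sc.stop()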

Running PySpark in Various Modes

The only mode that works out of the box is the local mode.

For small projects that don't require sharing clusters, the standalone scheduler (Spark's own cluster mode) is the best option. YARN or Mesos are preferred for advanced needs such as priorities, queues, and access control.

YARN or Mesos are necessary to connect Spark to HDFS in Hadoop.

Spark can also run on almost any hardware using virtual machines, e.g. on Amazon Web Services (EC2).

Spark has connectors to most of the popular data sources, such as Apache HBase, Apache Hive, Cassandra, and Tachyon (which was developed by AMPLab specifically for Spark).

A Spark cluster consists of one master node and one (or more) slave nodes.

To start the master: spark-1.6.0-bin-hadoop2.6\sbin>start-master.sh

To stop the master: spark-1.6.0-bin-hadoop2.6\sbin>stop-master.sh

To start a slave: spark-1.6.0-bin-hadoop2.6\sbin>start-slave.sh

Notes on Spark Running Modes (1)

1. To stop a slave: spark-1.6.0-bin-hadoop2.6\sbin>stop-slave.sh

2. To start all slaves: spark-1.6.0-bin-hadoop2.6\sbin>start-slaves.sh

3. The Spark master node has a URL like spark://alex-laptop.local:7077 that we will need.

4. Starting a slave node requires providing the URL of the master node.

5. Losing a slave node is not a problem in standalone mode, as it is resilient against data loss.

6. However, losing the master node is problematic, as it stops all new jobs from being scheduled. To work around this, you can use ZooKeeper or single-node recovery.

7. Type exit() to exit a Spark session.

8. To know which mode we are using, in pyspark type sc.master, or in the browser visit http://localhost:4040/environment/ and look at spark.master under Spark properties.

Notes on Spark Running Modes (2)

import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Make the PySpark libraries importable from this notebook
sys.path.insert(0, os.path.join(spark_home, 'python'))
# The py4j zip name may need to be adjusted depending on which Spark version you installed
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

# Launch the PySpark shell inside the notebook (Python 2 / Spark 1.6 style)
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

Notes:

1. Run this code after launching the IPython notebook.
2. The above code should start PySpark inside IPython (Jupyter) successfully.
3. This code should be run only once, at the beginning of the session.
4. It should be run only after completing all the installations.

Ipython Using Spark

IPython is a type of REPL (Read-Eval-Print Loop) program.

IPython is integrated with the Anaconda installation.

To run IPython inside the command prompt (instead of a web browser), just type: ipython. To exit the Spark shell, type: exit()

To run IPython in a web browser from the cmd command line, type: ipython notebook. To exit the shell, press: Ctrl + Break

To install IPython from the cmd command line: pip install ipython

One way to install IPython is inside a virtual environment (virtualenv ipython-env). This type of installation has the advantage of not affecting the main working environment.

To update IPython from the command line: conda update ipython, or pip install ipython --upgrade

It is possible to update the Anaconda distribution (packages) from the command line. Type: conda update conda... and then specify the packages you want to upgrade.

If you are unable to install some packages (like SciPy or sklearn.preprocessing), there is a problem with the Anaconda installation or its path. A better approach is to uninstall and reinstall Anaconda for the specific Python version (i.e. Python 2.7), then install packages by typing conda install numpy scipy matplotlib. The video tutorial "Install Python with NumPy SciPy Matplotlib on Windows" may help.

Jupyter is the new name of IPython; it supports other languages besides Python.

Ipython Settings (1)

Jupyter can be launched from the command prompt like this: jupyter notebook. To exit the shell, press: Ctrl + Break

Jupyter can also be launched as a program from the Program Files in Windows.

In addition to running PySpark in IPython (in the web browser), we can run PySpark in IPython inside the command prompt environment itself. This means importing the PySpark library into the IPython shell. After starting IPython in cmd, type: import pyspark as ps
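
For example (a sketch, assuming the SPARK_HOME and py4j paths were added to sys.path or PYTHONPATH as shown earlier):

import pyspark as ps

sc = ps.SparkContext('local[*]', 'cmd-ipython-demo')
sc.parallelize([1, 2, 3]).count()    # returns 3 if the context is working
sc.stop()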

R code can also be used in IPython (in the REPL) after installing IRkernel.

R can also be executed in the cmd environment by typing: sparkR

Spyder comes integrated with the Anaconda package.

To know the current working directory, type: pwd

To monitor Spark processes, we can use localhost:4040 or external tools such as Ganglia.

To know the default number of partitions on a PC: sc.defaultMinPartitions or sc.defaultParallelism (the result depends on the number of cores in that PC). The default number of partitions in Spark on a laptop or PC is 1.

Spark defaults to using one partition for each block of the input file. The default block size is different when reading from HDFS, where it is 64 MB, since the files stored there are big; for local files, and in most operating systems, the block size is on the order of kilobytes. In Scala, use rdd.partitions.size (in PySpark, rdd.getNumPartitions()).
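
A hedged illustration of these defaults (the file name is hypothetical, and the numbers depend on your machine):

print(sc.defaultParallelism)        # typically the number of cores on the machine
print(sc.defaultMinPartitions)
rdd = sc.textFile('data.txt')       # hypothetical local text file
print(rdd.getNumPartitions())       # roughly one partition per input block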

In Spark, code runs in two places: the driver (the shell, where sc lives) and the executors (worker threads).

In cluster mode, the number of executors can be specified (this is not available in standalone mode).

Spark runs on top of the JVM (Java Virtual Machine). However, it is platform independent in the sense that there are versions for Windows, Linux, etc.

Ipython Settings(2)

To monitor the performance of our application on Spark, we can use:

1. The Web UI of the Spark application/cluster, which defaults to port 4040
2. The History Server (application)
3. Ganglia (infrastructure)
4. jstack, jmap, jstat (JVM)

To optimize the performance of an application on Spark and to solve possible problems, try to:

1. Optimize the data in terms of serialization/deserialization and locality
2. Optimize the application itself by using more efficient data structures, caching, broadcasting, and shuffle tuning
3. Optimize the framework itself (parallelism, memory, garbage collection)

In the Spark Standalone cluster manager (or any other cluster manager), the master node launches its own web UI.

The usual localhost port (Web UI / history server) on Windows is 4040 (we use it to monitor Spark processes). It contains information about schedulers, RDDs, the Spark environment, and executors. In the "Executors" menu you will find only one executor if you are using the Windows environment; in cluster mode you will find many executors, depending on the number of nodes in the cluster. After calling the cache() function, the Storage menu becomes active and shows that a MapPartitionsRDD is now available in memory, along with its size.
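
For instance (a sketch; the numbers are arbitrary), caching an RDD and then running an action makes it appear under the Storage tab:

rdd = sc.parallelize(range(100000)).map(lambda x: x * x)
rdd.cache()     # marks the RDD for caching; nothing is stored yet
rdd.count()     # the action materializes the cache; a MapPartitionsRDD now shows up under Storage at localhost:4040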

Ipython Settings(3)

Typing ? alone provides extensive help documentation in the IPython terminal

Typing ?word gives information about the keyword

Typing ??word gives detailed information about the keyword

Typing sc.(+Tab) lists all functions available on the Spark context

Typing sc? gives the built-in help

Type help() for general help inside Spark code

Type help(function_name) to see details about how to use that function

IPython provides debugging. Type %debug, then use "h" for help, "w" for where (location), "q" to quit, ...

%pdb takes you to the debugger immediately after an exception is thrown

Placing print statements in several places within the code also serves as a debugging method

Help Within Ipython

Spark deployment is done in one of two modes: local mode (running on a laptop or a single machine) or cluster mode (which provides better performance than local mode).

Installing Spark on laptops or PCs uses the standalone version, while a cluster requires including Mesos or YARN.

In other words, there are two modes of Spark scalability: a single JVM or a managed cluster.

1. In single JVM mode, Spark runs on a single box (Linux or Windows) and all components (driver, executors) run within the same JVM. This is a simple Spark setup (i.e. intended for training, not for production).

2. In a managed cluster, Spark can scale from 2 to thousands of nodes. We can use any cluster manager (like Mesos or YARN) to manage the nodes. Data is distributed and processed across all nodes. This setup is suitable for production environments.

Spark Deployment

There are three ways to run Spark in local mode (a minimal sketch of all three follows this list):

1. Single-threaded: running the SparkContext with a single thread: SparkContext('local'). Execution is sequential, which allows easier debugging of program logic, since tasks are executed one after another. When debugging logic in multi-threaded mode, there is no guarantee about the sequence of task execution or which tasks are executed; in single-threaded mode, all tasks are executed sequentially. After building a program in single-threaded mode, we can move to the more advanced multi-threaded mode to test the application.

2. Multi-threaded: leverages multiple cores and multiple threads of the computer. For example, SparkContext('local[4]') will use four cores for the application. In this mode, concurrent execution leverages parallelism and allows debugging of coordination. It has the benefit of exploiting the parallelism available in the computer to make programs run faster, and it also allows debugging the coordination and communication of code executed in parallel. A program that passes this stage of testing and debugging correctly is very likely to work correctly in fully distributed mode.

3. Pseudo-distributed cluster: to a large degree similar to cluster mode. In this mode, distributed execution allows debugging of communication and I/O. It is similar to the multi-threaded mode, but goes one step further toward running a program in cluster mode with a number of physical or virtual machines. In this mode it is possible to debug the communications and input/output of each task and job, and the same interface available in cluster mode is present, including the ability to inspect individual workers and executors and to make sure the network communications (IPs and ports) work correctly for a specific application.
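
A minimal sketch of the three variants (only one SparkContext can exist at a time, so try them one by one; the pseudo-distributed master string may vary between Spark versions):

from pyspark import SparkContext

sc = SparkContext('local')                        # 1. single-threaded: tasks run sequentially
# sc = SparkContext('local[4]')                   # 2. multi-threaded: four worker threads
# sc = SparkContext('local-cluster[2,1,1024]')    # 3. pseudo-distributed local cluster (mainly for testing)
sc.parallelize(range(10)).map(lambda x: x + 1).collect()
sc.stop()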

Local Mode

Cluster Mode:

In cluster mode, we must decide the deployment mode and the machines (whether physical or virtual) that Spark will run on. The cluster manager comes in 3 flavors: Standalone (which comes pre-packaged with Spark), YARN (the default cluster manager for Hadoop), and Mesos (which came from the same research group at UC Berkeley, and the one that Matei Zaharia worked with in his early days at the AMPLab).

The scheduler is responsible for building stages to execute, submitting stages to the cluster manager, and resubmitting failed stages when output is lost. It sits between the driver and the worker nodes that run the tasks and threads. Results coming from the worker nodes go back to the cluster manager and then back to the driver program.

A Spark cluster cannot be set up in a Windows environment. Only a single client node can be run on Windows, to implement simple projects. Running a Spark cluster (potentially with thousands of nodes) requires a Linux environment, or Spark installed inside a Linux virtual machine on Windows.

Cluster Mode

In standalone mode (i.e. running Spark locally), both the driver program (e.g. the IPython shell) and the worker nodes (the processes inside the laptop) are located on the same physical infrastructure. In this case, there are no managers like Mesos or YARN. If we run Spark on Amazon Web Services, the worker nodes will be EC2 instances.

The standalone scheduler comes pre-packaged with Spark core. It is a great choice for a dedicated Spark cluster (i.e. a cluster running only Spark, without Hadoop or HBase alongside our Spark installation). If we want to run different applications on the same cluster that share resources and load, we should use either the YARN or Mesos scheduler.

The Standalone cluster manager has a high-availability mode that can leverage Apache ZooKeeper to enable standby nodes. This is useful when a master node fails, as we can promote one of the other nodes to take its place and continue working almost immediately.

Standalone Scheduler

Mesos is the most general cluster manager that we can run Spark on. It can be thought of as a general-purpose cluster and global resource manager when we want multiple applications such as Spark, Hadoop, MPI, and Cassandra to share the same resources, such as memory and CPU cycles. Mesos, in this case, schedules the various resources of our cluster.

Mesos is a popular open-source cluster manager. It allows sharing clusters between many users and apps, and it is easy to run Spark on top of Mesos.

It can be thought of as an operating system for a cluster, where multiple applications co-locate on the same cluster: Spark, Hadoop, Kafka, and others can all run in one Mesos cluster. So, if we are running Spark, Hadoop, and Cassandra on the same cluster, Mesos will find the best way to efficiently distribute resources such as memory, CPU, and network bandwidth between the various applications and the users of the cluster.

Mesos and Spark came out of the same research group at UC Berkeley, and Matei Zaharia (the creator of Apache Spark) built the first version of Spark to work with Mesos. This is why they work well together across various applications.

It is a global resource manager that facilitates multi-tenant and heterogeneous workloads.

It is useful for saving resources or running multiple Spark instances at the same time.

It is easy to run Spark on an existing Mesos cluster; in this case, spark-submit can be used.
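
As a hedged sketch (the Mesos host and port are placeholders), pointing PySpark at an existing Mesos master looks like this:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster('mesos://mesos-master.example.com:5050').setAppName('mesos-demo')
sc = SparkContext(conf=conf)    # assumes a reachable Mesos cluster with Spark available on the workers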

Compared to Mesos, YARN (discussed next) is more integrated with the Hadoop ecosystem.

Mesos

YARN stands for Yet Another Resource Negotiator, and it came out with the second version of Hadoop. It was abstracted out of the Hadoop MapReduce framework to exist as a standalone cluster manager. This is why YARN is better suited to stateless batch jobs with long runtimes.

YARN and Hadoop, much like Spark and Mesos, grew up together.

Compared to Mesos, YARN is a monolithic scheduler that manages cluster resources as well as scheduling the jobs executed on those resources.

YARN is not well suited to long-running processes that are always up (such as web servers), real-time workloads, or stateful/interactive services (like the Spark REPL or database queries).

YARN integrates well with existing Hadoop clusters and applications.

YARN was developed to separate MapReduce from the cluster manager. Hence, Spark can be run on YARN easily.

Mesos is more appropriate for Spark than YARN, as YARN requires a lot of configuration and maintenance. Given these complications, Mesos or the standalone version is preferable.

In any case, a Mesos or YARN cluster should be built before we run Spark on top of it.

If Hadoop is already installed, then YARN is already integrated and Spark can be installed next.

YARN

Amazon EC2 (Elastic Compute Cloud) is useful for running Spark on a cluster 24/7 and for fast prototyping and testing. It is elastic, so the number of machines Spark runs on can grow or shrink as needed.

EC2 is useful to deploy clusters or test prototypes. Spark works very nicely with EC2, as Spark was bundled with scripts that ease the process of setting up and installing an environment on each of the worker and master machines.

Virtual machines (including EC2) are elastic and ephemeral, even when we have our own physical devices.

EC2 is great at providing the machines required to run scripts or test prototypes.

Although there are other cloud services, EC2 offers elastic scalability and ease of setup, even when Mesos or YARN is installed. It can be leveraged to test the various aspects of Spark itself.

For many people, it is the only feasible way to scale up their analyses without making big capital investments in building their own clusters.

Amazon EC2

There is a difference between Client Mode and Cluster Mode:

1. In Client Mode (as in a laptop environment), the driver runs on the client, the master acquires resources and communicates back to the client, and the client communicates directly with the executors.

2. In Cluster Mode, the driver runs on the master inside the cluster, the master communicates with the executors entirely within the cluster, and the client exits as soon as it passes the application to the master.

We can run Spark in either 'local' mode or 'cluster' mode, and each has its own benefits:

1. Local mode is useful when we want to debug an application on sample data or small-scale data.

2. Cluster mode is useful when we want to scale the analysis up to entire datasets, or when we want to run things in a parallel and distributed fashion. Moving from one mode to the other requires only minimal changes to the application code.

Spark Deployment in Summary

The Hadoop MapReduce framework is similar to Spark in that it uses a master-slave paradigm. It has one master node (consisting of a job tracker, name node, and RAM) and worker nodes (each consisting of a task tracker, data node, and RAM). The task tracker in a worker node is analogous to an executor in the Spark environment.

Tasks are assigned by the master node, which is also responsible for coordinating work between the worker nodes. Spark adds abstractions, generalizations, and performance optimizations to achieve much better efficiency, especially on iterative workloads. However, Spark does not concern itself with being a distributed file system, whereas Hadoop has HDFS.

Spark can leverage existing distributed file systems (like HDFS), distributed databases (like HBase), traditional databases through JDBC or ODBC adaptors, and flat files in local file systems or in a file store such as S3 in the Amazon cloud.
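
For example (a sketch; all paths, hosts, and bucket names are placeholders), the same textFile call can read from the local file system, HDFS, or S3:

local_rdd = sc.textFile('C:/data/sample.txt')                     # local file system
hdfs_rdd = sc.textFile('hdfs://namenode:9000/data/sample.txt')    # HDFS (requires a running Hadoop cluster)
s3_rdd = sc.textFile('s3n://my-bucket/sample.txt')                # Amazon S3 (credentials and connector required)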

In summary:

1. Spark only replaces MapReduce (the computational engine of a distributed system)
2. We still need a data store: HDFS, HBase, Hive, etc.
3. Spark has a more flexible and general programming model than Hadoop
4. Spark is an ecosystem of higher-level libraries built on top of the Spark core framework
5. Spark is often faster for iterative computations

Comparison between Spark and Hadoop