TRANSCRIPT
The Data Scientist’s Guide to Apache Spark
Hands on with a practical case study
Jonathan Dinu, VP of Academic Excellence, Galvanize
@clearspandex
Data Science Immersive
Full Stack Immersive
Data Engineering Immersive
Weekend Workshops
+
Questions? tweet @clearspandex
Spark Fundamentals
5
What Is Spark?
6
What Is Spark?
• Framework for distributed processing
• In-memory, fault tolerant data structures
• Flexible APIs in Scala, Java, Python, SQL… and now R!
• Open Source
7
Data Pipeline: At Scale and Locally
Stages: Acquisition → Parse → Storage → Transform/Explore → Vectorization → Train → Model → Expose → Presentation
Locally: requests, BeautifulSoup4, pandas, pymongo, scikit-learn/NLTK, pickle, Flask
At scale: Spark Streaming, PySpark, HDFS, DataFrames/Spark SQL, MLlib/spark.ml, model.save()
A Unified Platform
8
Performance
• Very fast at iterative algorithms
• DAG scheduler supports complex, iterative flows (and graph computation)
• Intermediate results kept in memory when possible
• Bring computation to the data (data locality)
9
Rich API
map() reduce()
filter() sortBy()
join() groupByKey()
first() count()
… and more …
10
Spark Core
• Cluster managers: Standalone Scheduler, YARN, Mesos
• Libraries: Spark Streaming, DataFrames, MLlib, spark.ml, GraphX
• Language APIs: PySpark, SparkR, Scala, Java
11
Review
• Framework for distributed processing
• In-memory, fault tolerant data structures
• Flexible APIs in Scala, Java, Python, SQL… and now R!
• Open Source
12
Spark Programming Basics
13
Spark Execution Context
[Diagram] The Driver Program (e.g. on your laptop) creates a SparkContext, which connects through a Cluster Manager (Standalone, YARN, or Mesos) to the Worker Nodes in the cluster; each Worker Node runs an Executor with a cache, and each Executor runs tasks.
14
Terminology
• Driver: process that contains the SparkContext
• Executor: process that executes one or more Spark tasks
• Master: process that manages applications across the cluster
• Worker: process that manages executors on a particular node
15
What is an RDD?
• Resilient: if the data in memory (or on a node) is lost, it can be recreated
• Distributed: data is chunked into partitions and stored in memory across the cluster
• Dataset: initial data can come from a file or be created programmatically
Note: RDDs are read-only and immutable; we will come back to this later…
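A minimal sketch of these properties in PySpark (assuming the shell's preconfigured SparkContext `sc`; the data is illustrative):

rdd = sc.parallelize(range(1000))    # Dataset: created programmatically
doubled = rdd.map(lambda x: x * 2)   # Distributed: applied per partition
print doubled.getNumPartitions()     # how the data is chunked
print doubled.toDebugString()        # Resilient: the recorded lineage used to recreate lost partitions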
16
Functions Deconstructed

import random

flips = 1000000

# lazy eval
coins = xrange(flips)

# lazy eval, nothing executed
heads = sc.parallelize(coins) \
    .map(lambda i: random.random()) \
    .filter(lambda r: r < 0.51) \
    .count()

coins is a Python generator; sc.parallelize() creates the RDD; map() and filter() are transformations; count() is the action that materializes the result.
17
Functions Deconstructed

import random

flips = 1000000

# lazy eval
coins = xrange(flips)

# lazy eval, nothing executed
heads = sc.parallelize(coins) \
    .map(lambda i: random.random()) \
    .filter(lambda r: r < 0.51) \
    .count()

Closures: each lambda passed to map() and filter() creates a closure with the function, which is shipped to the workers and applied to the data.
18
Spark Functions
• Transformations: lazy evaluation (does not immediately evaluate); returns a new RDD
• Actions: materialize data (evaluates the RDD lineage); returns a final value (on the driver)
19
Transformations
# Every Spark application requires a SparkContext
# The Spark shell provides a preconfigured SparkContext called `sc`
nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squared = nums.map(lambda x: x * x)   # => [1, 4, 9]

# Keep elements passing a predicate
even = squared.filter(lambda x: x % 2 == 0)   # => [4]

# Map each element to zero or more others
nums.flatMap(lambda x: range(x))   # => [0, 0, 1, 0, 1, 2]
20
Actions

nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect()   # => [1, 2, 3]

# Return the first K elements
nums.take(2)   # => [1, 2]

# Count the number of elements
nums.count()   # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y)   # => 6

# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
21
Functions Revisited

import random

flips = 1000000

# lazy eval
coins = xrange(flips)

# lazy eval, nothing executed
heads_rdd = sc.parallelize(coins) \
    .map(lambda i: random.random()) \
    .filter(lambda r: r < 0.51)

head_count = heads_rdd.count()

Nothing runs until the last line: the transformations only execute when the count() action is called.
22
Functions Revisited

import random

flips = 1000000

# local sequence
coins = xrange(flips)

# distributed sequences
coin_rdd = sc.parallelize(coins)
flips_rdd = coin_rdd.map(lambda i: random.random())
heads_rdd = flips_rdd.filter(lambda r: r < 0.51)

# local value
head_count = heads_rdd.count()
23
RDD Lineage

coins → sc.parallelize() → coin_rdd → map() → flips_rdd → filter() → heads_rdd → count() → head_count
(coins lives on the driver; coin_rdd, flips_rdd, and heads_rdd are distributed across the workers; head_count comes back to the driver)
24
Key-Value Operations
pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)}
pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}
pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}
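Putting the pieces together, a minimal word-count sketch (assuming `sc`; "file.txt" is a placeholder path):

lines = sc.textFile("file.txt")
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)
counts.take(5)   # e.g. [('cat', 3), ('dog', 1), ...]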
25
Notebook
https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/pyspark.ipynb
26
Functional Programming Primer
• Functions are applied to data (RDDs)
• RDDs are Immutable: f(RDD) -> RDD2
• Function application necessitates creation of new data
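A quick illustration of immutability (a sketch, assuming `sc`): applying a function returns a new RDD and leaves the original untouched.

rdd = sc.parallelize([1, 2, 3])
rdd2 = rdd.map(lambda x: x + 1)   # f(RDD) -> RDD2
rdd.collect()    # => [1, 2, 3]  (unchanged)
rdd2.collect()   # => [2, 3, 4]  (new RDD)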
27
Review
• Client-Server execution model
• Spark leverages higher-order functions (map(), filter(), etc.)
• Transformations create new RDDs and are lazily evaluated
• Actions force materialization of RDD on driver
28
Spark Programming APIs
29
[Diagram] PySpark architecture — Local: the Python driver (PySpark) controls a JavaSparkContext through Py4J, alongside the local file system. Cluster: each Spark Worker launches Python worker processes and exchanges data with them over sockets and pipes.
30
Review
• PySpark enables developers to write driver programs in Python
• For both PySpark and SparkR, closures are serialized and sent to the workers
• Execution happens in the closure's native language (Python/R)
Data Science Applications with Spark
32
Data Pipeline: At Scale (We are Here)
Acquisition → Parse → Storage → Transform/Explore → Vectorization → Train → Model → Expose → Presentation
Tools: Spark Streaming, PySpark, HDFS, DataFrames/Spark SQL, MLlib/spark.ml, model.save()
33
What Is Exploratory Data Analysis?
• Developed at Bell Labs in the 1960s by John Tukey
• Techniques used to visualize and summarize data
• Five-number summary: describe()
• Distributions: box plots, stem and leaf, histogram, scatterplot
34
Goals of Exploratory Data Analysis
• Gain greater intuition
• Validate our data (consistency and completeness)
• Make comparisons between distributions
• Find outliers
• Treat missing data
• Summarize data (a statistic: one number that represents many)
36
http://data.donorschoose.org/open-data/overview/
37
Unique Values
• rdd.distinct()
• rdd.countApproxDistinct(relative_accuracy)
http://content.research.neustar.biz/blog/hll.html
http://dx.doi.org/10.1145/2452376.2452456
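A small sketch of the two approaches (assuming `sc`; the data is illustrative):

ids = sc.parallelize(["a", "b", "a", "c", "b", "a"])
ids.distinct().count()          # exact count, requires a shuffle => 3
ids.countApproxDistinct(0.05)   # HyperLogLog estimate, ~5% relative accuracy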
38
Missing Values
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions
• column.isNull()
• dataframe.fillna()
…
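A minimal sketch with the DataFrame NA functions (assuming a SQLContext `sqlContext`; the columns are illustrative):

df = sqlContext.createDataFrame(
    [("p1", 100.0), ("p2", None)], ["project", "amount"])
df.where(df.amount.isNull()).count()   # rows missing a value => 1
df.fillna(0.0, subset=["amount"])      # returns a new DataFrame with nulls filled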
39
Missing Values
40
Frequently Occurring Values
• dataframe.freqItems(columns, support) — support is the required minimum proportion of rows
http://dl.acm.org/citation.cfm?doid=762471.762473
Note: this is an approximate algorithm that always returns all the frequent items, but the result may contain false positives.
Summary Statistics
• dataframe.describe(column_name)
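A hedged sketch against an assumed DataFrame `df` of donations with "state" and "amount" columns:

df.freqItems(["state"], support=0.01).collect()   # items appearing in >= 1% of rows
df.describe("amount").show()                      # count, mean, stddev, min, max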
41
Interlude: Sometimes numbers aren’t enough!
42
Anscombe’s Quartet
[Figure] Four scatterplots of (x1, y1) through (x4, y4) that look very different but share nearly identical summary statistics:
• Mean (x): 9
• Sample variance (x): 11
• Mean (y): 7.50
• Sample variance (y): 4.127
• Correlation: 0.816
• Linear regression: y = 3.00 + 0.500x
43
44
45
46
Notebook
https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/donors_choose_eda.ipynb
Review
47
• The data science process is inherently interactive
• Spans many scales of data and computation
• Data pipelines require linking many diverse tasks (and data)
• Quick insights necessary for fast iteration
How Spark can help
• Interactive REPL
• Rapid computation (especially aggregates) on large amounts of data
• High level abstractions for efficient querying of data
• "Condense" data for easier local exploration and visualization (see the sketch below)
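For the last point, one possible pattern (a sketch, assuming a DataFrame `df` with a "state" column): aggregate or sample at scale, then pull the small result into pandas for plotting.

by_state = df.groupBy("state").count()               # aggregate in Spark
local = by_state.toPandas()                          # small result, explored locally
sample = df.sample(False, 0.01, seed=42).toPandas()  # or a 1% random sample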
48
Natural Language Processing
50
Data Pipeline: At Scale (We are Here)
Acquisition → Parse → Storage → Transform/Explore → Vectorization → Train → Model → Expose → Presentation
Tools: Spark Streaming, PySpark, HDFS, DataFrames/Spark SQL, MLlib/spark.ml, model.save()
51
Natural Language Processing
[1, 3, 1, 1, 2, 0, 1, 0]
[0, 1, 4, 0, 0, 1, 1, 1]
[3, 0, 1, 1, 2, 2, 3, 2]
[0, 1, 1, 1, 0, 3, 2, 3]
[1, 2, 1, 2, 2, 0, 0, 0]
[1, 0, 1, 1, 0, 1, 1, 1]
[0, 2, 0, 0, 2, 2, 0, 0]
[1, 1, 1, 1, 0, 1, 1, 1]
52
DonorsChoose: Project Essays
53
Bag of Words
• Document: Single row of data/corpus
• Corpus: Entire set of all documents
• Vocabulary: Set of all words in corpus
• Vector: Mathematical representation of document
(counts of word occurrences)
54
Bag of Words
original document: "The brown fox"
→ Tokenization → dictionary of word counts: { "the": 1, "brown": 1, "fox": 1 }
→ Vectorization → feature vector: [0, 0, 1, 0, 1, 0, ...]
55
Tokenization
56
Vectorization
57
With MLlib
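A minimal sketch of the bag-of-words flow with spark.ml feature transformers (assuming a DataFrame `essays` with a "text" column; HashingTF stands in for whatever vectorizer the notebook uses):

from pyspark.ml.feature import Tokenizer, HashingTF

tokenizer = Tokenizer(inputCol="text", outputCol="words")
words = tokenizer.transform(essays)       # tokenization

hashing_tf = HashingTF(inputCol="words", outputCol="features")
vectors = hashing_tf.transform(words)     # vectorization
vectors.select("features").show(3)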
58
Vector Space Model
By Riclas (Own work), CC BY 3.0, via Wikimedia Commons
Similarity is a measure of “distance”
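Cosine similarity is one such measure; a small NumPy sketch (the vectors are illustrative):

import numpy as np

def cosine_similarity(u, v):
    # 1.0 means identical direction, 0.0 means orthogonal documents
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cosine_similarity(np.array([1, 0, 2]), np.array([1, 1, 1]))   # => ~0.77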
59
Interlude: How to Scale
Start small (data) and fast (development) → testing → increase the size of the data set → testing → optimize and productionize → PROFIT! $$$
60
TF-IDF
• Measure of the discriminatory power of a word (feature)
• Highest when a term occurs many times in a small number of documents
• Lowest when a term occurs few times in a document, or many times across the corpus
• Useful for information retrieval (queries) and keyword extraction (among other things)

tf(t, d) = f_d(t) / |d|
idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
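A hedged sketch of tf-idf with MLlib (assuming an RDD `docs` of token lists, e.g. docs = essays_rdd.map(lambda text: text.split())):

from pyspark.mllib.feature import HashingTF, IDF

tf = HashingTF().transform(docs)   # term-frequency vectors
tf.cache()                         # IDF makes two passes over the data
idf = IDF().fit(tf)
tfidf = idf.transform(tf)          # re-weighted, sparse feature vectors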
61
TF-IDF
62
TF-IDF: Most Common vs. Least Common
63
Summarization
64
Scale Up
65
Notebook
https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/natural_language_processing.ipynb
66
Review
• We need to represent text as vectors to model documents
• The Bag-of-words model uses word counts (tf-idf improves on this)
• In vector space, we can compare documents using linear algebra
• Spark provides feature transformers to handle text input
Word2Vec
68
Vector Space Model
Source: deeplearning4j
69
Vector Space Model
Source: Mikolov T., et al.: Linguistic Regularities in Continuous Space Word Representations, NAACL 2013 (vector representations of words).
70
Data Pipeline: At Scale (We are Here)
Acquisition → Parse → Storage → Transform/Explore → Vectorization → Train → Model → Expose → Presentation
Tools: Spark Streaming, PySpark, HDFS, DataFrames/Spark SQL, MLlib/spark.ml, model.save()
71
Predict Context
Source: https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
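A minimal Word2Vec sketch with MLlib (again assuming an RDD `docs` of token lists; the query word is illustrative):

from pyspark.mllib.feature import Word2Vec

model = Word2Vec().setVectorSize(100).setSeed(42).fit(docs)
model.findSynonyms("student", 5)   # nearest words in the learned vector space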
72
doc2vec
73
Notebook
https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/word2vec_search.ipynb
Search and Results
75
Machine Learning Pipeline
Raw Text → Tokenization → Words → Vectorization (word2vec) → Contextualized Documents (doc2vec); a Query is run through the same pipeline and matched against the documents to produce Search Results.
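One way to wire these stages together is a spark.ml Pipeline; a sketch assuming a DataFrame `essays` with a "text" column (the stages and parameters are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec

tokenizer = Tokenizer(inputCol="text", outputCol="words")
word2vec = Word2Vec(inputCol="words", outputCol="features", vectorSize=100)

pipeline = Pipeline(stages=[tokenizer, word2vec])
model = pipeline.fit(essays)     # learns the word vectors
docs = model.transform(essays)   # averaged word vectors per document ("doc2vec"-style)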
76
Machine Learning Pipeline
77
Machine Learning Pipeline
78
Machine Learning Pipeline
79
Machine Learning Pipeline
Questions?
Thank You!
Jonathan Dinu, VP of Academic Excellence, Galvanize
@clearspandex
Appendix: Installing Spark
82
Installation: Requirements
• Spark binary (version 1.4.1)
• Java JDK 6/7
• Scientific Python (and Jupyter notebook)
• py4j
• (Optional) IRKernel (for Jupyter)
83
84
Installation: Requirements
NOTE: Please do not install Spark with:
• Homebrew on OSX
• Cygwin on Windows
85
Installation: Spark
• Find your OS here: http://spark.apache.org/downloads.html
• Select "Pre-built for Hadoop 2.4" (or earlier) under "Choose a package type"
• Download the tar package, e.g. spark-1.4.1-bin-hadoop1.tgz (if you are not sure, pick the latest version)
Make sure you are downloading the binary version, not the source version.
86
87
Installation: Configuration
• Unzip the file and place it in your home directory (e.g. /Users/jonathandinu/)
• Set PATH: include the following lines in your ~/.bash_profile (or ~/.bashrc):
export SPARK_HOME=/full/path/to/your/unzipped/spark/folder
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
88
Installation: Java JDK
• http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
• Find download for your OS
• Follow install instructions/wizard
• Make sure you get the JDK instead of the JRE
89
Installation: Requirements
• Spark binary
• Java JDK 6/7
• Scientific Python (and Jupyter notebook)
• py4j
• (Optional) IRKernel (for Jupyter)
91
Installation: Scientific Python
• http://continuum.io/downloads
• Find download for your OS (make sure it is Python 2.7)
• Follow install instructions/wizard
92
Installation: Scientific Python
• http://continuum.io/downloads
• Find download for your OS (make sure it is Python 2.7)
• Follow install instructions/wizard
To make sure it installed correctly:
ipython notebook
93
And finally: pip install py4j
94
Installation: Test It All Out
jonathan$ ipython
95
Installation: Requirements
• Spark binary
• Java JDK 6/7
• Scientific Python (and Jupyter notebook)
• py4j
• (Optional) IRKernel (for Jupyter)
96
Installation: IRKernel (Jupyter kernel for R)
• Make sure R is installed: https://cran.r-project.org/bin/
• Install kernel via R (get into an R shell):
install.packages(c('rzmq','repr','IRkernel','IRdisplay'),
repos = c('http://irkernel.github.io/', getOption('repos')))
IRkernel::installspec()
97
And in the notebook:
# Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/jonathandinu/spark")
# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
https://github.com/apache/spark/tree/master/R
Installation: IRKernel (Jupyter kernel for R)
98
Installation: IRKernel (Jupyter kernel for R)
And in the notebook: library(SparkR)
https://github.com/apache/spark/tree/master/R
99
Installation: IRKernel (Jupyter kernel for R)
https://github.com/apache/spark/tree/master/R
100
Note: If for any reason you cannot get Spark
installed on your OS following these instructions,
Cloudera and Hortonworks provide Linux VMs
with Spark installed.
• http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html
• http://hortonworks.com/products/hortonworks-sandbox/#install
101
Review
• Command-line Spark shell: ./bin/pyspark
• Spark module: import pyspark as ps
• Jupyter Notebook interface: ipython notebook
• Also R support in the notebook (or RStudio)!
Spark Deployment
Configuring a cluster
103
Spark Deployment
Local Mode
• Single threaded: SparkContext('local')
• Multi-threaded: SparkContext('local[4]') — see the sketch after this list
• Pseudo-distributed cluster

Cluster Mode
• Standalone
• Mesos
• YARN
• Amazon EC2
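A minimal sketch of constructing a local-mode SparkContext explicitly (the app name is illustrative):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("scratch")
sc = SparkContext(conf=conf)   # or simply: SparkContext("local[4]", "scratch")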
104
Spark Deployment: Local
• Single threaded: sequential execution allows easier debugging of program logic
• Multi-threaded: concurrent execution leverages parallelism and allows debugging of coordination
• Pseudo-distributed cluster: distributed execution allows debugging of communication and I/O
105
Standalone
• Packaged with Spark core
• Great if all you need is a dedicated Spark cluster
• Doesn’t support integration with any other applications on a cluster.
The Standalone cluster manager also has a high-availability mode that can leverage Apache ZooKeeper to enable standby master nodes.
106
Mesos
• General purpose cluster and global resource manager (Spark, Hadoop, MPI, Cassandra, etc.)
• Two-level scheduler: enables pluggable scheduler algorithms
• Multiple applications can co-locate (like an operating system for a cluster)
107
YARN
• Created to scale Hadoop, optimized for Hadoop (stateless batch jobs with long runtimes)
• Monolithic scheduler: manages cluster resources as well as schedules jobs
• Not well suited for long-running, real-time, or stateful/interactive services (like database queries)
108
EC2
• Launch scripts bundled with Spark
• Elastic and ephemeral cluster
• Sets up: Spark, HDFS, and Hadoop MapReduce
109
Spark Deployment: Cluster
• Standalone: encapsulated cluster manager isolates complexity
• Mesos: global resource manager facilitates multi-tenant and heterogeneous workloads
• YARN: integrates with an existing Hadoop cluster and applications
• EC2: elastic scalability and ease of setup
110
Pseudo-distributed local cluster
[Diagram] A Master (Standalone Scheduler) with a Web UI (for monitoring/logging) manages two Worker Nodes, each running an Executor with a cache that executes tasks — all on a single laptop.
111
Pseudo-distributed local cluster
Master:
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.master.Master \
  -h 127.0.0.1 \
  -p 7077 \
  --webui-port 8080

Workers (x 2):
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker \
  -c 1 \
  -m 1G \
  spark://127.0.0.1:7077
112
Pseudo-distributed local cluster
One Master + two Workers: run each process in a separate terminal window
113
EC2: Setup
1. Create AWS account: https://aws.amazon.com/account/
2. Get Access keys:
114
EC2: Setup
1. Include the following lines in your ~/.bash_profile (or ~/.bashrc):
export AWS_ACCESS_KEY_ID=xxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxx
2. Download EC2 keypair:
115
EC2: Launch
1. Launch the EC2 cluster with the script in $SPARK_HOME/ec2:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem --copy-aws-credentials \
  --instance-type=m1.large -m m1.large -s 19 launch spark
116
117
EC2: Scripts
Login to the Master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Stop the cluster:
./spark-ec2 stop spark

Terminate the cluster:
./spark-ec2 destroy spark

Restart the cluster (after it has been stopped):
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem start spark
118
Setup IPython/Jupyter
Login to the Master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Install the needed packages (on the master):
# Install all the necessary packages on Master
yum install -y tmux
yum install -y pssh
yum install -y python27 python27-devel
yum install -y freetype-devel libpng-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
easy_install-2.7 pip
easy_install py4j
pip2.7 install ipython==2.0.0
pip2.7 install pyzmq==14.6.0
pip2.7 install jinja2==2.7.3
pip2.7 install tornado==4.2
pip2.7 install numpy
pip2.7 install matplotlib
pip2.7 install nltk
119
Setup IPython/Jupyter
Login to the Master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Install the needed packages (on the workers):
# Install all the necessary packages on Workers
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pssh -t 10000 -h /root/spark-ec2/slaves pip2.7 install numpy
pssh -h /root/spark-ec2/slaves pip2.7 install nltk
120
Allow inbound requests so the IPython/Jupyter notebook can be reached (WARNING: this does create a security risk)
121
IPython/Jupyter Profile
Login to the Master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Set the notebook password:
ipython profile create default
python -c "from IPython.lib import passwd; print passwd()" > /root/.ipython/profile_default/nbpasswd.txt
cat /root/.ipython/profile_default/nbpasswd.txt
# sha1:128de302ca73:6b9a8bd5bhjde33d48cd65ad9cafb0770c13c9df
122
Configure IPython/Jupyter Settings
/root/.ipython/profile_default/ipython_notebook_config.py:
# Configuration file for ipython-notebook.
c = get_config()

# Notebook config
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 8888

PWDFILE = "/root/.ipython/profile_default/nbpasswd.txt"
c.NotebookApp.password = open(PWDFILE).read().strip()
123
Configure IPython/Jupyter Settings
/root/.ipython/profile_default/startup/pyspark.py:
# Configure the necessary Spark environment
import os
os.environ['SPARK_HOME'] = '/root/spark/'

# And the Python path
import sys
sys.path.insert(0, '/root/spark/python')

# Detect the PySpark URL
CLUSTER_URL = open('/root/spark-ec2/cluster-url').read().strip()
124
Configure IPython/Jupyter Settings
Add the following to /root/spark/conf/spark-env.sh:
export PYSPARK_PYTHON=python2.7
Sync across the workers:
~/spark-ec2/copy-dir /root/spark/conf

Make sure the master's environment is correct:
source /root/spark/conf/spark-env.sh
125
IPython/Jupyter initialization (on master)
Start a remote window manager (screen or tmux):
screen

Start the notebook server:
ipython notebook

Detach from the session:
Ctrl-a d
126
IPython/Jupyter login
On your laptop:
http://[YOUR MASTER IP/DNS HERE]:8888
127
IPython/Jupyter login
Test that it all works:
128
EC2: Data
• HDFS (ephemeral): /root/ephemeral-hdfs
• Amazon S3: s3n://bucket_name
• HDFS (persistent): /root/persistent-hdfs
129
Review
• Local mode and cluster mode each have their benefits
• Spark can be run on a variety of cluster managers
• Amazon EC2 enables elastic scaling and ease of development
• By leveraging IPython/Jupyter you can get the performance of a cluster with the ease of interactive development