Intro to Apache Spark
TRANSCRIPT
© Cloudera, Inc. All rights reserved.
Intro to Apache Spark
Anand Iyer, Senior Product Manager, Cloudera
Target Audience
• New to Spark, or have very rudimentary knowledge of Spark
• Have basic knowledge of MapReduce

If you are an advanced Spark developer, you are unlikely to get much out of this talk.
• No performance tuning or debugging tips
Spark: Easy and Fast Big Data
• Easy to Develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to Run
  • General execution graphs
  • In-memory caching
Easy-to-code API
RDD: Resilient Distributed Datasets
An abstraction representing the large, distributed sets of data being processed.

RDDs are:
• Broken up into partitions, which are distributed across nodes
  • In practice, RDDs usually have between 100 and 10K partitions
• Operated upon in parallel, partition by partition
• Immutable
• Fault-tolerant via the concept of lineage
Spark jobs are DAGs of operations on RDDs

Operations on RDDs:
• Transformations: create a new RDD from existing RDDs
• Actions: run a computation on an RDD and return values to the driver
[Figure: example DAG of RDDs A through G connected by map, groupBy, join, filter, and take operations]
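The transformation/action split above can be sketched in plain Python (a toy model, not Spark's actual API): transformations only record a plan, and the work happens when an action is called.

```python
# Toy sketch (not Spark's API): transformations build a plan lazily;
# only an action triggers actual computation.

class ToyRDD:
    def __init__(self, data, plan=None):
        self._data = data          # source data
        self._plan = plan or []    # deferred transformations

    # --- transformations: return a new ToyRDD, compute nothing ---
    def map(self, f):
        return ToyRDD(self._data, self._plan + [("map", f)])

    def filter(self, pred):
        return ToyRDD(self._data, self._plan + [("filter", pred)])

    # --- action: executes the whole plan and returns a value ---
    def collect(self):
        out = self._data
        for kind, f in self._plan:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

nums = ToyRDD(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)  # no work yet
print(evens_squared.collect())  # work happens here: [0, 4, 16, 36, 64]
```

Note how building `evens_squared` does nothing; only `collect()` walks the recorded plan.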
Rich Expressive API
• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
• sample, take, first, partitionBy, mapWith, pipe, save, ...
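To make two of the pair-RDD operations above concrete, here are plain-Python, single-machine analogues of reduceByKey and groupByKey (illustration only; Spark runs these in parallel across partitions):

```python
# Single-machine analogues of two pair-RDD operations (illustration only).
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# reduceByKey: combine the values sharing a key with a binary function
def reduce_by_key(kv, f):
    acc = {}
    for k, v in kv:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

# groupByKey: collect all values per key
def group_by_key(kv):
    groups = defaultdict(list)
    for k, v in kv:
        groups[k].append(v)
    return sorted(groups.items())

print(reduce_by_key(pairs, lambda x, y: x + y))  # [('a', 4), ('b', 6)]
print(group_by_key(pairs))                       # [('a', [1, 3]), ('b', [2, 4])]
```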
Example: Logistic Regression
sc = SparkContext(...)
rawData = sc.textFile("hdfs://...")
data = rawData.map(parserFunc).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient

print("Final w: %s" % w)
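What the map/reduce in this example computes can be checked by hand. Below is one gradient step in plain Python (no Spark, no NumPy) over two made-up data points, using the same per-point term as the lambda above:

```python
# One gradient step of the logistic-regression example, in plain Python.
# Each point contributes (1 / (1 + exp(-y * w.x)) - 1) * y * x to the gradient.
from math import exp

points = [((1.0, 2.0), 1.0), ((2.0, 1.0), -1.0)]  # (features x, label y)
w = [0.0, 0.0]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

gradient = [0.0, 0.0]
for x, y in points:
    scale = (1.0 / (1.0 + exp(-y * dot(w, x))) - 1.0) * y   # the "map"
    gradient = [g + scale * xi for g, xi in zip(gradient, x)]  # the "reduce"

w = [wi - gi for wi, gi in zip(w, gradient)]
print(w)  # [-0.5, 0.5]
```

With w = 0 the sigmoid term is 0.5 for both points, so each contributes ±0.5 times its features, which is easy to verify by hand.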
Execution model and Spark Internals
Driver & Executors
• Driver: the master node
  • One Driver per Spark app
  • Runs the main(...) function of your app
• Executors: the worker nodes
Logical graph to physical execution plan
[Figure: the same DAG of RDDs A through G, with cached partitions marked, broken into Stages at shuffle boundaries]
• The execution graph is broken into Stages
• Each Stage consists of multiple Tasks
  • A Task is the unit of computation that is scheduled on an Executor
• A Stage consists of multiple operations that can be pipelined
• Stages are split where data needs to be "shuffled"
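Pipelining within a Stage can be sketched as composing the narrow operations into a single per-partition task (a toy model; here a filter is represented by a function returning None to drop an element):

```python
# Sketch: within a Stage, narrow operations (map, filter) are pipelined --
# each element flows through the whole chain without materializing
# intermediate collections. A hypothetical 2-partition "RDD":
partitions = [[1, 2, 3], [4, 5, 6]]

def make_task(fns):
    # one Task = the composed pipeline, run over one partition
    def task(partition):
        out = []
        for x in partition:
            for f in fns:
                x = f(x)
                if x is None:      # a "filter" dropped the element
                    break
            else:
                out.append(x)
        return out
    return task

stage = make_task([
    lambda x: x * 10,                 # map
    lambda x: x if x > 20 else None,  # filter
])

# the scheduler would run one task per partition, in parallel on executors
print([stage(p) for p in partitions])  # [[30], [40, 50, 60]]
```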
Shuffle
• Redistributes data among partitions
  • Triggered by operations like reduce, groupBy, and join
• Hashes keys to buckets
  • Identical to the MapReduce shuffle
• A shuffle entails writes to disk
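The hash-to-bucket step can be sketched in a few lines (a toy model; `NUM_PARTITIONS` and `bucket_for` are illustrative names, standing in for Spark's hash partitioner over the shuffle's target partition count):

```python
# Sketch of hash partitioning during a shuffle: each map-side task hashes
# every record's key to decide which reduce-side partition ("bucket") it
# belongs to, just as in the MapReduce shuffle.
NUM_PARTITIONS = 3

def bucket_for(key):
    return hash(key) % NUM_PARTITIONS

records = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
buckets = {i: [] for i in range(NUM_PARTITIONS)}
for k, v in records:
    buckets[bucket_for(k)].append((k, v))

# all records with the same key land in the same bucket,
# so a downstream reduceByKey can work bucket-locally
assert buckets[bucket_for("apple")].count(("apple", 1)) == 2
```

(Python's string hash is randomized per process, so the specific bucket assignments vary between runs, but same-key records always co-locate within a run.)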
Spark WebUI lets you visualize DAG
Drivers & Executors revisited
• Driver
  • One Driver per Spark app
  • Runs the main(...) function of your app
  • Creates the logical DAG and the physical execution plan
  • Schedules Tasks
  • Receives and collects the results of Actions
• Executors
  • Hold RDD partitions
  • Execute Tasks as scheduled by the Driver
Spark runs on Cluster Managers
• Spark does not itself manage a cluster of machines
• It runs on YARN, Mesos, or Standalone (a cluster manager built specifically for Spark)
Why is Spark Fast?
Memory management leads to greater performance

Trends:
• ½ the price every 18 months
• 2x bandwidth every 3 years
• 128–384 GB RAM, 12–24 cores, 50 GB per sec

Memory can be an enabler for high-performance big data applications.
Persisting or Caching RDDs
• If an RDD will be re-used, persist it to prevent re-computation
  • Very common in iterative algorithms
• By default, cached RDDs are held in memory
• But memory may not suffice
  • MEMORY_AND_DISK persistence: spill the partitions that don't fit in memory to disk
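The effect of caching can be illustrated with a counter (plain Python, not Spark; `expensive_parse` stands in for a costly transformation chain that each action would otherwise re-run):

```python
# Toy illustration of why caching matters: without persisting, every action
# re-runs the (expensive) transformation chain from the source.
compute_count = 0

def expensive_parse(x):
    global compute_count
    compute_count += 1
    return x * 2

source = [1, 2, 3]

# uncached: each "action" recomputes the chain
uncached = lambda: [expensive_parse(x) for x in source]
uncached()
uncached()
print(compute_count)  # 6: every element parsed twice

# "cached": materialize once, then both actions reuse the result
compute_count = 0
cached = [expensive_parse(x) for x in source]
_ = list(cached)
_ = list(cached)
print(compute_count)  # 3: every element parsed once
```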
Lineage for Fault-Tolerance
[Figure: the example DAG of RDDs A through G, showing each RDD's lineage through map, groupBy, join, filter, and take]
Lineage Truncation

Lineage gets truncated at an RDD when:
• The RDD is persisted to memory or disk
• The RDD is already materialized on disk due to a shuffle

[Figure: the DAG again (RDDs A through H), with lineage truncated at a persisted or shuffle-materialized RDD]
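Recovery via lineage can be sketched as replaying the recorded chain of transformations from the nearest still-available ancestor (a toy model; `recompute` and the two-step lineage are illustrative):

```python
# Sketch of recovery via lineage: if a partition is lost, Spark re-derives
# it by re-running the recorded transformations from the nearest available
# ancestor (source data, a persisted RDD, or shuffle output on disk).
lineage = [
    ("map", lambda x: x + 1),
    ("filter", lambda x: x % 2 == 0),
]

source_partition = [1, 2, 3, 4]   # still available (e.g., a block on HDFS)

def recompute(partition, lineage):
    for kind, f in lineage:
        if kind == "map":
            partition = [f(x) for x in partition]
        else:  # filter
            partition = [x for x in partition if f(x)]
    return partition

lost_partition = recompute(source_partition, lineage)
print(lost_partition)  # [2, 4]
```

If the RDD had been persisted (or materialized by a shuffle), replay would start there instead, which is exactly the truncation described above.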
Summary of what makes Spark fast
• Maximize use of memory
  • Re-used RDDs can be explicitly cached to prevent re-computation
• Leverage lineage & pipelining to minimize writing intermediate data to disk
• Efficient Task scheduler
  • Ensures worker nodes are kept busy via quick scheduling of Tasks
• More optimizations coming in Spark SQL
  • Compact binary in-memory data representation, etc.
  • More details in subsequent slides
Spark will replace MapReduce
To become the standard execution engine for Hadoop
Spark Streaming
Spark Streaming
• Incoming data is represented as DStreams (Discretized Streams)
• Data is commonly read from streaming data channels like Kafka or Flume
• A Spark Streaming application is a DAG of Transformations and Actions on DStreams (and RDDs)
Discretized Stream
• The incoming data stream is broken down into micro-batches
  • Micro-batch size is user defined, usually 0.3 to 1 second
  • Micro-batches are disjoint
• Each micro-batch is an RDD
  • Effectively, a DStream is a sequence of RDDs, one per micro-batch
• Spark Streaming is known for high throughput
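Discretization can be sketched as bucketing timestamped events by batch interval (plain Python; the events and `BATCH_SECONDS` are made up):

```python
# Sketch of discretization: timestamped events chopped into disjoint
# 1-second micro-batches; each batch would become one RDD.
BATCH_SECONDS = 1.0

events = [(0.1, "a"), (0.5, "b"), (1.2, "c"), (2.7, "d")]  # (timestamp, payload)

batches = {}
for ts, payload in events:
    batch_id = int(ts // BATCH_SECONDS)
    batches.setdefault(batch_id, []).append(payload)

print(batches)  # {0: ['a', 'b'], 1: ['c'], 2: ['d']}
```

Each bucket is disjoint, and the stream is simply the sequence of buckets in order.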
Windowed DStreams
• Defined by specifying a window size and a step size
  • Both are multiples of the micro-batch size
• Operations are invoked on each window's data
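The window/step relationship can be sketched over a list of micro-batches (a toy model with a window of 3 batches and a step of 2; real Spark Streaming emits windows continuously as batches arrive):

```python
# Sketch of a windowed stream: window and step sizes are whole multiples
# of the micro-batch size. With window=3 batches and step=2, each emitted
# window covers the last 3 batches, emitted every 2 batches.
WINDOW = 3  # in micro-batches
STEP = 2    # in micro-batches

micro_batches = [[1], [2], [3], [4], [5], [6]]

windows = []
for end in range(WINDOW, len(micro_batches) + 1, STEP):
    window = [x for batch in micro_batches[end - WINDOW:end] for x in batch]
    windows.append(window)

print(windows)  # [[1, 2, 3], [3, 4, 5]] -- consecutive windows overlap by 1 batch
```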
Maintain and update arbitrary state: updateStateByKey(...)
• Define an initial state
• Provide a state update function
• Continuously update with new information
• State is maintained as an RDD and updated via a Transformation

Examples:
• Running count of words seen in a text stream
• Per-user session state from an activity stream

Note: requires periodic checkpointing to fault-tolerant storage, every N (~10-15) micro-batches.
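The updateStateByKey pattern, sketched for a running word count (plain Python standing in for Spark's per-key state RDD; `update_fn` mirrors the shape of a Spark update function, which receives the batch's new values plus the previous state):

```python
# Sketch of the updateStateByKey pattern: a running word count updated
# with each micro-batch. Spark would hold this state as an RDD and apply
# the update function as a Transformation.
def update_fn(new_values, previous_count):
    # previous_count is None for keys never seen before
    return (previous_count or 0) + sum(new_values)

state = {}  # key -> running count

def apply_batch(batch_words):
    per_key = {}
    for w in batch_words:
        per_key.setdefault(w, []).append(1)
    for key, new_values in per_key.items():
        state[key] = update_fn(new_values, state.get(key))

apply_batch(["spark", "rdd", "spark"])
apply_batch(["spark"])
print(state)  # {'spark': 3, 'rdd': 1}
```

Because the state grows unboundedly through the lineage, periodic checkpointing (as noted above) keeps recovery times bounded.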
Spark SQL & DataFrames
DataFrames

• A distributed collection of data organized as named, typed columns
• Like RDDs, they consist of partitions, can be cached, and get fault tolerance via lineage
• Can be constructed from:
  • Structured data files: JSON, Avro, Parquet, etc.
  • Tables in Hive
  • Tables in an RDBMS
  • Existing RDDs, by programmatically applying a schema
Spark SQL
• SQL statements to process DataFrames
• Embed SQL statements in your Scala, Java, or Python Spark application
• Queries can also be issued via JDBC/ODBC
Why Spark SQL? Ease of programming
• Easy to code against schema'd records
• SQL is often an easier alternative to code for non-complex operations on relational data
• Embed SQL in your Scala, Java, or Python applications to seamlessly mix "regular" Spark for complex operations with SQL
Why Spark SQL? Performance
SQL is processed by a query optimizer, enabling automatic optimizations:
• Compressed in-memory format (as opposed to Java serialized objects in RDDs)
• Predicate pushdown (read less data to reduce I/O)
• Optimal pipelining of operations
• Cost-based optimizer
• ...
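Predicate pushdown can be illustrated with a toy scan (plain Python, not Spark's optimizer; the table and predicate are made up):

```python
# Toy illustration of predicate pushdown: evaluating the filter at scan
# time means non-matching rows are never materialized downstream, so less
# data is read and held in memory.
table = [{"year": y, "payload": "x" * 100} for y in range(2000, 2010)]
predicate = lambda row: row["year"] >= 2008

# naive plan: materialize every row, then filter
loaded = [dict(r) for r in table]
rows_materialized_naive = len(loaded)
result_naive = [r for r in loaded if predicate(r)]

# pushdown plan: evaluate the predicate during the scan itself
result_pushed = [dict(r) for r in table if predicate(r)]
rows_materialized_pushed = len(result_pushed)

print(rows_materialized_naive, rows_materialized_pushed)  # 10 2
```

With a columnar format like Parquet, the same idea lets whole row groups be skipped using column statistics, which is where the I/O savings come from.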
MLlib
A collection of popular machine learning algorithms:
• Classifiers: logistic regression, boosted trees, random forests, etc.
• Clustering: k-means, LDA
• Recommender systems: ALS
• Dimensionality reduction: PCA and SVD
• Feature engineering: TF-IDF, Word2Vec, etc.
• Statistical functions: Chi-squared test, Pearson correlation, etc.

Pipelines API: chain together feature engineering, training, and model validation into one pipeline.
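The Pipeline idea can be sketched as stages with fit/transform methods chained together (a toy model loosely following the spirit of the Pipelines API; `Lowercase` and `Tokenize` are made-up stages, not MLlib classes):

```python
# Sketch of the Pipeline idea: stages expose fit/transform, and a Pipeline
# chains them so the whole feature-engineering + model sequence trains and
# applies as one unit.
class Lowercase:
    def fit(self, data): return self
    def transform(self, data): return [s.lower() for s in data]

class Tokenize:
    def fit(self, data): return self
    def transform(self, data): return [s.split() for s in data]

class Pipeline:
    def __init__(self, stages): self.stages = stages
    def fit(self, data):
        for stage in self.stages:
            stage.fit(data)
            data = stage.transform(data)  # feed each stage's output forward
        return self
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

pipe = Pipeline([Lowercase(), Tokenize()]).fit(["Hello Spark"])
print(pipe.transform(["MLlib Pipelines API"]))  # [['mllib', 'pipelines', 'api']]
```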
Thank You
And of course… we are hiring!!!