Intro to Apache Spark
TRANSCRIPT
© Cloudera, Inc. All rights reserved.
Intro to Apache Spark
Anand Iyer, Senior Product Manager, Cloudera
Target Audience
• New to Spark, or have very rudimentary knowledge of Spark
• Have basic knowledge of MapReduce

If you are an advanced Spark developer, you are unlikely to get much out of this talk.
• No performance tuning or debugging tips
Spark: Easy and Fast Big Data
• Easy to Develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to Run
  • General execution graphs
  • In-memory caching
Easy-to-code API
RDD: Resilient Distributed Datasets
An abstraction representing the large, distributed sets of data being processed.

RDDs are:
• Broken up into partitions, which are distributed across nodes
  • In practice, RDDs usually have between 100 and 10K partitions
• Operated upon in parallel, partition by partition
• Immutable
• Fault-tolerant via the concept of lineage
Spark jobs are DAGs of operations on RDDs

Operations on RDDs:
• Transformations: create a new RDD from existing RDDs
• Actions: run a computation on an RDD and return values to the driver
[Figure: example DAG of RDDs A through G connected by map, groupBy, join, filter, and take operations]
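The transformation/action split above can be sketched in plain Python (a toy model, not Spark's actual API): transformations only record a plan, and the work happens when an action is called.

```python
# Toy sketch (not Spark's API): transformations build a plan lazily;
# only an action triggers actual computation.

class ToyRDD:
    def __init__(self, data, plan=None):
        self._data = data          # source data
        self._plan = plan or []    # deferred transformations

    # --- transformations: return a new ToyRDD, compute nothing ---
    def map(self, f):
        return ToyRDD(self._data, self._plan + [("map", f)])

    def filter(self, pred):
        return ToyRDD(self._data, self._plan + [("filter", pred)])

    # --- action: executes the whole plan and returns a value ---
    def collect(self):
        out = self._data
        for kind, f in self._plan:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

nums = ToyRDD(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)  # no work yet
print(evens_squared.collect())  # work happens here: [0, 4, 16, 36, 64]
```

Note how building `evens_squared` does nothing; only `collect()` walks the recorded plan.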
Rich Expressive API
• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
• sample, take, first, partitionBy, mapWith, pipe, save, ...
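To make two of the pair-RDD operations above concrete, here are plain-Python, single-machine analogues of reduceByKey and groupByKey (illustration only; Spark runs these in parallel across partitions):

```python
# Single-machine analogues of two pair-RDD operations (illustration only).
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# reduceByKey: combine the values sharing a key with a binary function
def reduce_by_key(kv, f):
    acc = {}
    for k, v in kv:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

# groupByKey: collect all values per key
def group_by_key(kv):
    groups = defaultdict(list)
    for k, v in kv:
        groups[k].append(v)
    return sorted(groups.items())

print(reduce_by_key(pairs, lambda x, y: x + y))  # [('a', 4), ('b', 6)]
print(group_by_key(pairs))                       # [('a', [1, 3]), ('b', [2, 4])]
```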
Example: Logistic Regression
sc = SparkContext(...)
rawData = sc.textFile("hdfs://...")
data = rawData.map(parserFunc).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient

print("Final w: %s" % w)
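What the map/reduce in this example computes can be checked by hand. Below is one gradient step in plain Python (no Spark, no NumPy) over two made-up data points, using the same per-point term as the lambda above:

```python
# One gradient step of the logistic-regression example, in plain Python.
# Each point contributes (1 / (1 + exp(-y * w.x)) - 1) * y * x to the gradient.
from math import exp

points = [((1.0, 2.0), 1.0), ((2.0, 1.0), -1.0)]  # (features x, label y)
w = [0.0, 0.0]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

gradient = [0.0, 0.0]
for x, y in points:
    scale = (1.0 / (1.0 + exp(-y * dot(w, x))) - 1.0) * y   # the "map"
    gradient = [g + scale * xi for g, xi in zip(gradient, x)]  # the "reduce"

w = [wi - gi for wi, gi in zip(w, gradient)]
print(w)  # [-0.5, 0.5]
```

With w = 0 the sigmoid term is 0.5 for both points, so each contributes ±0.5 times its features, which is easy to verify by hand.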
Execution model and Spark Internals
Driver & Executors
• Driver: the master node
  • One Driver per Spark app
  • Runs the main(...) function of your app
• Executors: the worker nodes
Logical graph to physical execution plan
[Figure: the same DAG of RDDs A through G, with cached partitions marked, broken into Stages at shuffle boundaries]
• The execution graph is broken into Stages
• Each Stage consists of multiple Tasks
  • A Task is the unit of computation that is scheduled on an Executor
• A Stage consists of multiple operations that can be pipelined
• Stages are split where data needs to be "shuffled"
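Pipelining within a Stage can be sketched as composing the narrow operations into a single per-partition task (a toy model; here a filter is represented by a function returning None to drop an element):

```python
# Sketch: within a Stage, narrow operations (map, filter) are pipelined --
# each element flows through the whole chain without materializing
# intermediate collections. A hypothetical 2-partition "RDD":
partitions = [[1, 2, 3], [4, 5, 6]]

def make_task(fns):
    # one Task = the composed pipeline, run over one partition
    def task(partition):
        out = []
        for x in partition:
            for f in fns:
                x = f(x)
                if x is None:      # a "filter" dropped the element
                    break
            else:
                out.append(x)
        return out
    return task

stage = make_task([
    lambda x: x * 10,                 # map
    lambda x: x if x > 20 else None,  # filter
])

# the scheduler would run one task per partition, in parallel on executors
print([stage(p) for p in partitions])  # [[30], [40, 50, 60]]
```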
Shuffle
• Redistributes data among partitions
  • Triggered by operations like reduce, groupBy, and join
• Hashes keys to buckets
  • Identical to the MapReduce shuffle
• A shuffle entails writes to disk
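The hash-to-bucket step can be sketched in a few lines (a toy model; `NUM_PARTITIONS` and `bucket_for` are illustrative names, standing in for Spark's hash partitioner over the shuffle's target partition count):

```python
# Sketch of hash partitioning during a shuffle: each map-side task hashes
# every record's key to decide which reduce-side partition ("bucket") it
# belongs to, just as in the MapReduce shuffle.
NUM_PARTITIONS = 3

def bucket_for(key):
    return hash(key) % NUM_PARTITIONS

records = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
buckets = {i: [] for i in range(NUM_PARTITIONS)}
for k, v in records:
    buckets[bucket_for(k)].append((k, v))

# all records with the same key land in the same bucket,
# so a downstream reduceByKey can work bucket-locally
assert buckets[bucket_for("apple")].count(("apple", 1)) == 2
```

(Python's string hash is randomized per process, so the specific bucket assignments vary between runs, but same-key records always co-locate within a run.)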
Spark WebUI lets you visualize DAG
Drivers & Executors revisited
• Driver
  • One Driver per Spark app
  • Runs the main(...) function of your app
  • Creates the logical DAG and the physical execution plan
  • Schedules Tasks
  • Receives and collects the results of Actions
• Executors
  • Hold RDD partitions
  • Execute Tasks as scheduled by the Driver
Spark runs on Cluster Managers
• Spark does not itself manage a cluster of machines
• It runs on YARN, Mesos, or Standalone (a cluster manager built specifically for Spark)
Why is Spark Fast?
Memory management leads to greater performance

Trends:
• ½ the price every 18 months
• 2x bandwidth every 3 years
• 128–384 GB RAM, 12–24 cores, 50 GB per sec

Memory can be an enabler for high-performance big data applications.
Persisting or Caching RDDs
• If an RDD will be re-used, persist it to prevent re-computation
  • Very common in iterative algorithms
• By default, cached RDDs are held in memory
• But memory may not suffice
  • MEMORY_AND_DISK persistence: spill the partitions that don't fit in memory to disk
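The effect of caching can be illustrated with a counter (plain Python, not Spark; `expensive_parse` stands in for a costly transformation chain that each action would otherwise re-run):

```python
# Toy illustration of why caching matters: without persisting, every action
# re-runs the (expensive) transformation chain from the source.
compute_count = 0

def expensive_parse(x):
    global compute_count
    compute_count += 1
    return x * 2

source = [1, 2, 3]

# uncached: each "action" recomputes the chain
uncached = lambda: [expensive_parse(x) for x in source]
uncached()
uncached()
print(compute_count)  # 6: every element parsed twice

# "cached": materialize once, then both actions reuse the result
compute_count = 0
cached = [expensive_parse(x) for x in source]
_ = list(cached)
_ = list(cached)
print(compute_count)  # 3: every element parsed once
```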
Lineage for Fault-Tolerance
[Figure: the example DAG of RDDs A through G, showing each RDD's lineage through map, groupBy, join, filter, and take]
Lineage Truncation

Lineage gets truncated at an RDD when:
• The RDD is persisted to memory or disk
• The RDD is already materialized on disk due to a shuffle

[Figure: the DAG again (RDDs A through H), with lineage truncated at a persisted or shuffle-materialized RDD]
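Recovery via lineage can be sketched as replaying the recorded chain of transformations from the nearest still-available ancestor (a toy model; `recompute` and the two-step lineage are illustrative):

```python
# Sketch of recovery via lineage: if a partition is lost, Spark re-derives
# it by re-running the recorded transformations from the nearest available
# ancestor (source data, a persisted RDD, or shuffle output on disk).
lineage = [
    ("map", lambda x: x + 1),
    ("filter", lambda x: x % 2 == 0),
]

source_partition = [1, 2, 3, 4]   # still available (e.g., a block on HDFS)

def recompute(partition, lineage):
    for kind, f in lineage:
        if kind == "map":
            partition = [f(x) for x in partition]
        else:  # filter
            partition = [x for x in partition if f(x)]
    return partition

lost_partition = recompute(source_partition, lineage)
print(lost_partition)  # [2, 4]
```

If the RDD had been persisted (or materialized by a shuffle), replay would start there instead, which is exactly the truncation described above.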
Summary of what makes Spark fast
• Maximize use of memory
  • Re-used RDDs can be explicitly cached to prevent re-computation
• Leverage lineage & pipelining to minimize writing intermediate data to disk
• Efficient Task scheduler
  • Ensures worker nodes are kept busy via quick scheduling of Tasks
• More optimizations coming in Spark SQL
  • Compact binary in-memory data representation, etc.
  • More details in subsequent slides
Spark will replace MapReduce
To become the standard execution engine for Hadoop
Spark Streaming
Spark Streaming
• Incoming data is represented as DStreams (Discretized Streams)
• Data is commonly read from streaming data channels like Kafka or Flume
• A Spark Streaming application is a DAG of Transformations and Actions on DStreams (and RDDs)
Discretized Stream
• The incoming data stream is broken down into micro-batches
  • Micro-batch size is user defined, usually 0.3 to 1 second
  • Micro-batches are disjoint
• Each micro-batch is an RDD
  • Effectively, a DStream is a sequence of RDDs, one per micro-batch
• Spark Streaming is known for high throughput
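Discretization can be sketched as bucketing timestamped events by batch interval (plain Python; the events and `BATCH_SECONDS` are made up):

```python
# Sketch of discretization: timestamped events chopped into disjoint
# 1-second micro-batches; each batch would become one RDD.
BATCH_SECONDS = 1.0

events = [(0.1, "a"), (0.5, "b"), (1.2, "c"), (2.7, "d")]  # (timestamp, payload)

batches = {}
for ts, payload in events:
    batch_id = int(ts // BATCH_SECONDS)
    batches.setdefault(batch_id, []).append(payload)

print(batches)  # {0: ['a', 'b'], 1: ['c'], 2: ['d']}
```

Each bucket is disjoint, and the stream is simply the sequence of buckets in order.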
Windowed DStreams
• Defined by specifying a window size and a step size
  • Both are multiples of the micro-batch size
• Operations are invoked on each window's data
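The window/step relationship can be sketched over a list of micro-batches (a toy model with a window of 3 batches and a step of 2; real Spark Streaming emits windows continuously as batches arrive):

```python
# Sketch of a windowed stream: window and step sizes are whole multiples
# of the micro-batch size. With window=3 batches and step=2, each emitted
# window covers the last 3 batches, emitted every 2 batches.
WINDOW = 3  # in micro-batches
STEP = 2    # in micro-batches

micro_batches = [[1], [2], [3], [4], [5], [6]]

windows = []
for end in range(WINDOW, len(micro_batches) + 1, STEP):
    window = [x for batch in micro_batches[end - WINDOW:end] for x in batch]
    windows.append(window)

print(windows)  # [[1, 2, 3], [3, 4, 5]] -- consecutive windows overlap by 1 batch
```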
Maintain and update arbitrary state: updateStateByKey(...)
• Define an initial state
• Provide a state update function
• Continuously update with new information
• State is maintained as an RDD and updated via a Transformation

Examples:
• Running count of words seen in a text stream
• Per-user session state from an activity stream

Note: requires periodic checkpointing to fault-tolerant storage, every N (~10-15) micro-batches.
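The updateStateByKey pattern, sketched for a running word count (plain Python standing in for Spark's per-key state RDD; `update_fn` mirrors the shape of a Spark update function, which receives the batch's new values plus the previous state):

```python
# Sketch of the updateStateByKey pattern: a running word count updated
# with each micro-batch. Spark would hold this state as an RDD and apply
# the update function as a Transformation.
def update_fn(new_values, previous_count):
    # previous_count is None for keys never seen before
    return (previous_count or 0) + sum(new_values)

state = {}  # key -> running count

def apply_batch(batch_words):
    per_key = {}
    for w in batch_words:
        per_key.setdefault(w, []).append(1)
    for key, new_values in per_key.items():
        state[key] = update_fn(new_values, state.get(key))

apply_batch(["spark", "rdd", "spark"])
apply_batch(["spark"])
print(state)  # {'spark': 3, 'rdd': 1}
```

Because the state grows unboundedly through the lineage, periodic checkpointing (as noted above) keeps recovery times bounded.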
Spark SQL & DataFrames
DataFrames

• A distributed collection of data organized as named, typed columns
• Like RDDs, they consist of partitions, can be cached, and get fault tolerance via lineage
• Can be constructed from:
  • Structured data files: JSON, Avro, Parquet, etc.
  • Tables in Hive
  • Tables in an RDBMS
  • Existing RDDs, by programmatically applying a schema
Spark SQL
• SQL statements to process DataFrames
• Embed SQL statements in your Scala, Java, or Python Spark application
• Queries can also be issued via JDBC/ODBC
Why Spark SQL? Ease of programming
• Easy to code against schema'd records
• SQL is often an easier alternative to code for non-complex operations on relational data
• Embed SQL in your Scala, Java, or Python applications to seamlessly mix "regular" Spark for complex operations with SQL
Why Spark SQL? Performance
SQL is processed by a query optimizer, enabling automatic optimizations:
• Compressed in-memory format (as opposed to Java serialized objects in RDDs)
• Predicate pushdown (read less data to reduce I/O)
• Optimal pipelining of operations
• Cost-based optimizer
• ...
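Predicate pushdown can be illustrated with a toy scan (plain Python, not Spark's optimizer; the table and predicate are made up):

```python
# Toy illustration of predicate pushdown: evaluating the filter at scan
# time means non-matching rows are never materialized downstream, so less
# data is read and held in memory.
table = [{"year": y, "payload": "x" * 100} for y in range(2000, 2010)]
predicate = lambda row: row["year"] >= 2008

# naive plan: materialize every row, then filter
loaded = [dict(r) for r in table]
rows_materialized_naive = len(loaded)
result_naive = [r for r in loaded if predicate(r)]

# pushdown plan: evaluate the predicate during the scan itself
result_pushed = [dict(r) for r in table if predicate(r)]
rows_materialized_pushed = len(result_pushed)

print(rows_materialized_naive, rows_materialized_pushed)  # 10 2
```

With a columnar format like Parquet, the same idea lets whole row groups be skipped using column statistics, which is where the I/O savings come from.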
MLlib
A collection of popular machine learning algorithms:
• Classifiers: logistic regression, boosted trees, random forests, etc.
• Clustering: k-means, LDA
• Recommender systems: ALS
• Dimensionality reduction: PCA and SVD
• Feature engineering: TF-IDF, Word2Vec, etc.
• Statistical functions: Chi-squared test, Pearson correlation, etc.

Pipelines API: chain together feature engineering, training, and model validation into one pipeline.
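The Pipeline idea can be sketched as stages with fit/transform methods chained together (a toy model loosely following the spirit of the Pipelines API; `Lowercase` and `Tokenize` are made-up stages, not MLlib classes):

```python
# Sketch of the Pipeline idea: stages expose fit/transform, and a Pipeline
# chains them so the whole feature-engineering + model sequence trains and
# applies as one unit.
class Lowercase:
    def fit(self, data): return self
    def transform(self, data): return [s.lower() for s in data]

class Tokenize:
    def fit(self, data): return self
    def transform(self, data): return [s.split() for s in data]

class Pipeline:
    def __init__(self, stages): self.stages = stages
    def fit(self, data):
        for stage in self.stages:
            stage.fit(data)
            data = stage.transform(data)  # feed each stage's output forward
        return self
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

pipe = Pipeline([Lowercase(), Tokenize()]).fit(["Hello Spark"])
print(pipe.transform(["MLlib Pipelines API"]))  # [['mllib', 'pipelines', 'api']]
```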
Thank You
And of course… we are hiring!!!