TRANSCRIPT
1
An Introduction to Spark
By Ryan Fishel
With thanks to Ted Malaska
2
Intro
• Solutions Consultant at Cloudera Government Solutions
• Worked primarily with government clients
• Experience in:
• Current: Video processing, System integration, Performance tuning
• Past: Solutions consulting (outside of Hadoop), OS research/development
3
Agenda
• Quick Look at the Past
• Introduction to Spark
• Spark in CDH
4
A brief review of MapReduce
[Diagram: many parallel Map tasks feeding a smaller set of Reduce tasks]
Key advances by MapReduce:
• Data locality: automatic split computation and launch of mappers close to the data
• Fault tolerance: writing out intermediate results and restartable mappers meant the ability to run on commodity hardware
• Linear scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions to problems
• Flexibility: it soon became an easy tool for distributed computing, going well beyond just map and reduce
5
BUT… Can we do better?
Two approaches to doing better:
1. Special-purpose systems that solve one problem domain well, e.g. Giraph / GraphLab (graph processing), Storm (stream processing)
2. Generalizing the capabilities of MapReduce to provide a richer foundation for solving problems, e.g. MPI, Hama (BSP), Dryad (arbitrary DAGs)
Both are viable strategies depending on the problem!
6
Agenda
• Quick Look at the Past
• Introduction to Spark
• Spark in CDH
7
What is Spark?
Spark is a general-purpose computational framework.
Key properties:
• Leverages distributed memory
• Adds distributed tools: accumulators and broadcast variables (shared variables)
• Full directed-graph expressions for data-parallel computations
• Improved developer experience
Yet it retains:
• Linear scalability
• Fault tolerance
• Data-locality-based computation
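A minimal sketch of the two shared variables named above, assuming a SparkContext sc (the lookup table and input data are made-up examples, using the 0.9-era accumulator API):

// Broadcast: a read-only value shipped once to each worker.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
// Accumulator: workers add to it, only the driver reads it.
val misses = sc.accumulator(0)

val resolved = sc.parallelize(Seq("a", "b", "c")).map { key =>
  if (!lookup.value.contains(key)) misses += 1
  lookup.value.getOrElse(key, -1)
}
resolved.collect()      // Array(1, 2, -1)
println(misses.value)   // 1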
8
Easy: Get Started Immediately
• Multi-language support
• Interactive Shell
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();
9
Spark from a High Level
[Diagram: the Spark client (app master) turns an RDD expression such as rdd1.join(rdd2).groupBy(...).filter(...) into a graph of RDD objects; the scheduler splits the graph into tasks and hands them to the cluster manager, which assigns them to workers. On each Spark worker, task threads read and write data through a block manager (BlockInfo, MemoryStore, DiskStore, ShuffleBlockManager) and report back through trackers.]
10
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure via lineage
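A small sketch of those properties in code, assuming a SparkContext sc (the HDFS path is a placeholder):

// Build an RDD through parallel transformations, pin it in RAM, and act on it.
val lines  = sc.textFile("hdfs://...")            // placeholder path
val errors = lines.filter(_.contains("ERROR"))    // transformation: no work yet
errors.cache()                                    // store partitions in cluster RAM
errors.count()                                    // if a node dies, lost partitions
                                                  // are recomputed from lineage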
11
Resilient Distributed Datasets
• Building a story:
• HadoopRDD
• JdbcRDD
• MappedRDD
• FlatMappedRDD
• FilteredRDD
• OrderedRDD
• PairRDD
• ShuffledRDD
• UnionRDD
• …
[Diagram: an RDD lineage graph built from map, join, groupBy, filter, and take operations across stages A-G]
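For example, in 0.9-era Spark each transformation returns one of these concrete subclasses (a sketch; the path is a placeholder and the class names are the internal ones listed above):

val text     = sc.textFile("hdfs://...")    // HadoopRDD wrapped in a MappedRDD
val nonEmpty = text.filter(_.nonEmpty)      // FilteredRDD
val pairs    = nonEmpty.map(w => (w, 1))    // MappedRDD of pairs
val counts   = pairs.reduceByKey(_ + _)     // ShuffledRDD
val both     = counts.union(counts)         // UnionRDD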
12
Resilient Distributed Datasets
• Data interface:
• Partitions
• Dependencies
• Function to compute a partition
• Optional:
• Preferred locations
• Partitioning info (Partitioner)
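A simplified sketch of that five-part interface; Spark's real org.apache.spark.rdd.RDD carries the same pieces with more machinery:

// Simplified sketch of the RDD data interface described above.
trait Partition { def index: Int }

abstract class SimpleRDD[T] {
  // Required:
  def partitions: Array[Partition]             // how the dataset is split up
  def dependencies: Seq[SimpleRDD[_]]          // parent RDDs: the lineage
  def compute(split: Partition): Iterator[T]   // materialize one partition
  // Optional:
  def preferredLocations(split: Partition): Seq[String] = Nil  // locality hints
  def partitioner: Option[AnyRef] = None       // e.g. a hash partitioner for pairs
}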
13
Easy: Expressive API
• Transformations
• Lazy
• Return RDDs
• Actions
• Trigger computation
• Return a value
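In code the split looks like this (sc is a SparkContext; the path is a placeholder):

// Transformations are lazy and just return new RDDs; nothing runs yet:
val words = sc.textFile("hdfs://...")
              .flatMap(_.split(" "))           // still no work
// Actions demand a result, which triggers the whole pipeline:
val sample = words.take(5)                     // returns an Array[String]
val total  = words.count()                     // returns a Long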
14
Easy: Expressive API
• Actions
• collect
• count
• first
• take(n)
• saveAsTextFile
• foreach(func)
• reduce(func)
• …
• Transformations
• map
• filter
• flatMap
• union
• reduceByKey
• sortByKey
• join
• …
15
Easy: Example – Word Count
• Hadoop MapReduce
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

• Spark
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
17
Easy: Out of the Box Functionality
• Hadoop Integration
• Works natively with Hadoop data
• Runs with YARN
• Libraries
• MLlib
• Spark Streaming
• GraphX (alpha)
• Roadmap
• Language support:
• Improved Python support
• SparkR
• Java 8
• Better ML:
• Sparse data support
• Model evaluation framework
• Performance testing
18
Memory management leads to greater performance
Trends:
• RAM drops to half price every 18 months
• RAM bandwidth doubles every 3 years
Current specs:
• A 100-node Hadoop cluster contains 10+ TB of RAM today, and that will double next year
• 1 GB of RAM ~ $10-$20
Typical node: 64-128 GB RAM, 16 cores, 50 GB per sec memory bandwidth
Memory can be an enabler for high-performance big data applications
19
Fast: Using RAM, Operator Graphs
• In-memory caching
• Data partitions read from RAM instead of disk
• Operator graphs
• Scheduling optimizations
• Fault tolerance
[Diagram: the same RDD lineage graph (map, join, groupBy, filter, take across stages A-G), with cached partitions marked]
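That caching is what drives the iteration numbers on the next slide: the input is scanned from disk once, and every later pass reads cached partitions from RAM. A sketch with a placeholder path and a toy update rule:

// Iterative pattern: read once, cache, then iterate against RAM.
val nums = sc.textFile("hdfs://...")
             .map(_.trim.toDouble)
             .cache()                  // first action populates the cache
var w = 0.0
for (i <- 1 to 30) {
  // toy update; each pass scans `nums` from memory, not from disk
  w += nums.map(x => x * 0.01 - w).reduce(_ + _)
}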
20
Logistic Regression Performance(data fits in memory)
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30). Hadoop: ~110 s per iteration. Spark: ~80 s for the first iteration, ~1 s for each further iteration.]
21
Spark Streaming
• Runs continuous processing of data using Spark's core API
• Extends Spark's concept of RDDs to DStreams (discretized streams): fault-tolerant, transformable streams. Users can reuse existing code for batch/offline processing
• Adds "rolling window" operations, e.g. compute rolling averages or counts for data over the last five minutes
• Example use cases:
• "On-the-fly" ETL as data is ingested into Hadoop/HDFS
• Detecting anomalous behavior and triggering alerts
• Continuous reporting of summary metrics for incoming data
22
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
[Diagram: "micro-batch" architecture. The tweets DStream is a stream composed of small (1-10 s) batches (batch @ t, t+1, t+2); each batch is transformed with flatMap into the hashTags DStream and saved.]
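Building on the hashTags stream above, the "rolling window" operations from the previous slide are one-liners against the DStream API (a sketch; the window sizes are examples):

// Count each hashtag over a sliding 5-minute window, recomputed every 10 seconds.
import org.apache.spark.streaming.{Minutes, Seconds}
val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(5), Seconds(10))
tagCounts.print()   // prints a sample of each window's counts as batches complete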
23
Sample use cases:

Conviva: Optimize end users' online video experience by analyzing traffic patterns in real time, enabling fine-grained traffic-shaping control.
• Rapid development and prototyping
• Shared business logic for offline and online computation
• Open machine learning algorithms

Yahoo!: Speed up the model-training pipeline for ad targeting, making the feature-extraction phase over 3x faster; content recommendation using collaborative filtering.
• Reduced latency in the broader data pipeline
• Use of iterative ML
• Efficient P2P broadcast

Anonymous (large tech company): "Semi real-time" log aggregation and analysis for monitoring, alerting, etc.
• Low-latency, frequently running "mini" batch jobs processing the latest data sets

Technicolor: Real-time analytics for their customers (i.e., telcos); need the ability to provide streaming results and to query them in real time.
• Easy to develop; just need Spark & Spark Streaming
• Arbitrary queries on live data
24
Flume to Spark Streaming to HBase
[Diagram: an Avro client sends events to a Flume agent; the agent's Avro source passes events through an interceptor into two memory channels. One channel feeds an HDFS sink writing to HDFS; the other feeds an Avro sink streaming events to Spark Streaming, whose micro jobs write results to HBase for HBase clients.]
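A minimal sketch of the Flume-to-Spark-Streaming leg of this pipeline, using the spark-streaming-flume connector (the host, port, and the println standing in for an HBase write are placeholders; ssc is a StreamingContext):

// Receive Flume events as a DStream of micro-batches.
import org.apache.spark.streaming.flume.FlumeUtils
val flumeStream = FlumeUtils.createStream(ssc, "localhost", 41414)
flumeStream.foreachRDD { rdd =>
  // each micro-batch arrives as an RDD of Flume events;
  // an HBase client write would replace this println
  rdd.foreach(e => println(new String(e.event.getBody.array())))
}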
25
Building on Spark Today
• What kind of Apps?
• ETL
• Machine Learning
• Streaming
• Dashboards
Growing number of successful use cases & 3rd-party applications
26
Current project status
• Spark was promoted to an Apache Top-Level Project in mid-February 2014
• 100+ contributors and 25+ companies contributing
• Contributors include Intel/Cloudera, Yahoo!, Microsoft, etc.
27
Agenda
• Quick Look at the Past
• Introduction to Spark
• Spark in CDH
28
Timelines for Spark in CDH (Proposed)
Phase 0 (available Jan 2014, Spark 0.9):
• Separate Spark parcel
• CDH 4.4+
• Manual start/stop
• Stand-alone mode Spark server
• No Cloudera Manager integration

Phase 1 (available Mar 2014, Spark 1.0):
• Bundled into CDH 5.0
• CSD in Cloudera Manager for installation and configuration
• Supports both CDH 4 and CDH 5
• Stand-alone mode with resource management
• YARN mode with Kerberos
• Spark SQL alpha

Phase 2 (available July 2014, Spark 1.x.x):
• Bundled into CDH 5.1
• Improved monitoring using the Spark History Server
• Deeper integration into Cloudera Manager
• Optimized Flume integration for Spark Streaming
• Hue support
29
Spark key roadmap items – Production requirements
• Security:
• Full Kerberos support in stand-alone mode
• Sentry integration
• High availability:
• Spark Master HA (available now)
• YARN Application Master availability (available now)
• Operations support:
• Improved performance monitoring, health checks, and alerts through Cloudera Manager
• Reporting:
• Usage history for chargeback/showback
• Audit and lineage capabilities through Cloudera Navigator
• Advanced resource management:
• Dynamic resource management capabilities for long-running Spark contexts
30
Spark key roadmap items – System capabilities
• Memory
• Tachyon integration with HDFS caching
• Off-heap memory optimization for Spark
• Enable multiple systems to share memory with Spark (stand-alone mode)
• Compatibility
• MapReduce 2 migration support for Spark
• Crunch on Spark (Pipelines)
• Pig on Spark (Spork)
• Oozie connector for Spark
• Integration
• Impala <-> Spark data migration
31
Recap
• Quick Look at the Past
• Introduction to Spark
• Spark in CDH