Why Hadoop MapReduce Needs Scala: An Introduction to Scoobi and Scalding
TRANSCRIPT
@agemooij
A Look at Scoobi and Scalding: Scala DSLs for Hadoop
Why Hadoop MapReduce Needs Scala
Obligatory “About Me” Slide
Hadoop rocks! But programming Hadoop kinda sucks!
Hello World: Word Count using Hadoop MapReduce
• Split lines into words
• Turn each word into a pair (word, 1)
• Group by word
• For each word, sum the 1s to get the total
• Low-level glue code
• Lots of small, unintuitive Mapper and Reducer classes
• Lots of Hadoop intrusiveness (Context, Writables, Exceptions, etc.)
• Actually runs the code on the cluster
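The code from this slide isn’t in the transcript. Against the raw Hadoop API, the classic word count needs roughly this much Mapper/Reducer glue, even when written in Scala (a sketch of the standard example, not the slide’s exact code):

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Splits lines into words and emits (word, 1) for each occurrence.
// Note all the Hadoop intrusiveness: Context, Writables, mutable buffers.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("""\s+""").foreach { w =>
      word.set(w)
      context.write(word, one)   // emit (word, 1)
    }
}

// For each word, sums the 1s to get the total count.
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}
```

And this is before the driver class that configures and submits the job to the cluster.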
This does not make me a happy Hadoop developer!
Especially for things that are a little bit more complicated than counting words:
• Unintuitive, invasive programming model
• Hard to compose/chain jobs into real, more complicated programs
• Lots of low-level boilerplate code
• Branching, Joins, CoGroups, etc. hard to implement
What Are the Alternatives?
Counting Words using Apache Pig
Already a lot better, but anything more complex gets hard pretty fast.
Handy for quick exploration of data!
Pig is hard to customize/extend
Nice!
And the same goes for Hive
package cascadingtutorial.wordcount;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

/**
 * Word count example in Cascading
 */
public class Main
{
  public static void main( String[] args )
  {
    String inputPath  = args[0];
    String outputPath = args[1];

    Scheme inputScheme  = new TextLine( new Fields( "offset", "line" ) );
    Scheme outputScheme = new TextLine();

    // Use an HDFS tap for URI-style paths, a local filesystem tap otherwise
    Tap sourceTap = inputPath.matches( "^[^:]+://.*" )
        ? new Hfs( inputScheme, inputPath )
        : new Lfs( inputScheme, inputPath );

    Tap sinkTap = outputPath.matches( "^[^:]+://.*" )
        ? new Hfs( outputScheme, outputPath )
        : new Lfs( outputScheme, outputPath );

    // Split each line into words...
    Pipe wcPipe = new Each( "wordcount",
                            new Fields( "line" ),
                            new RegexSplitGenerator( new Fields( "word" ), "\\s+" ),
                            new Fields( "word" ) );

    // ...then group by word and count the occurrences of each one
    wcPipe = new GroupBy( wcPipe, new Fields( "word" ) );
    wcPipe = new Every( wcPipe, new Count(), new Fields( "count", "word" ) );

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, Main.class );

    Flow parsedLogFlow = new FlowConnector( properties )
        .connect( sourceTap, sinkTap, wcPipe );
    parsedLogFlow.start();
    parsedLogFlow.complete();
  }
}
Pipes & Filters: very powerful!
But not very intuitive, and lots of boilerplate code
Record Model: a strange new abstraction
Joins & CoGroups
Meh... I’m lazy. I want more power with less work!
How would we count words in plain Scala?
(My current language of choice)
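The code from this slide isn’t in the transcript; counting words with plain Scala collections would look roughly like this:

```scala
// Word count over an in-memory collection of lines,
// using nothing but the standard Scala collections API.
val lines = List("hello world", "hello scala")

val counts: Map[String, Int] =
  lines
    .flatMap(_.split("""\s+"""))   // split lines into words
    .map(word => (word, 1))        // turn each word into a pair (word, 1)
    .groupBy(_._1)                 // group by word
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the 1s

// counts == Map("hello" -> 2, "world" -> 1, "scala" -> 1)
```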
Nice! Familiar, intuitive. What if...?
But that code doesn’t scale to my cluster!
Or does it?
Meanwhile at Google...
Introducing Scoobi & Scalding: Scala DSLs for Hadoop MapReduce
NOTE: My relative familiarity with either platform: Scoobi 95%, Scalding 5%
http://github.com/nicta/scoobi
A Scala library that implements a higher-level programming model for Hadoop MapReduce
Counting Words using Scoobi
• Split lines into words
• Turn each word into a pair (word, 1)
• Group by word
• For each word, sum the 1s to get the total
• Actually runs the code on the cluster
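The slide’s code isn’t in the transcript; the canonical Scoobi word count (as in the Scoobi README of that era; API details may differ across versions) looks roughly like this:

```scala
import com.nicta.scoobi.Scoobi._

object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))

    val counts: DList[(String, Int)] =
      lines
        .flatMap(_.split(" "))     // split lines into words
        .map(word => (word, 1))    // turn each word into a pair (word, 1)
        .groupByKey                // group by word
        .combine(_ + _)            // for each word, sum the 1s

    // Persisting actually compiles the graph to MR jobs and runs them on the cluster
    persist(toTextFile(counts, args(1)))
  }
}
```

Note how closely this mirrors the plain Scala collections version: the same flatMap/map/group/sum shape, just on a DList instead of a List.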
Scoobi is...
• A distributed collections abstraction:
  • Distributed collection objects abstract data in HDFS
  • Methods on these objects abstract map/reduce operations
  • Programs manipulate distributed collection objects
  • Scoobi turns these manipulations into MapReduce jobs
• Based on Google’s FlumeJava / Cascades
• A source code generator (it generates Java code!)
• A job plan optimizer
• Open sourced by NICTA
• Written in Scala (W00t!)
DList[T]
• Abstracts storage of data and files on HDFS
• Calling methods on DList objects to transform and manipulate them abstracts the mapper, combiner, sort-and-shuffle, and reducer phases of MapReduce
• Persisting a DList triggers compilation of the graph into one or more MR jobs and their execution
• Very familiar: like standard Scala Lists
• Strongly typed
• Parameterized with rich types and Tuples
• Easy list manipulation using typical higher-order functions like map, flatMap, filter, etc.
DList[T] IO
• Can read/write text files, Sequence files, and Avro files
• Can influence sorting (raw, secondary)
Serialization
• Serialization of custom types through Scala type classes and WireFormat[T]
• Scoobi implements WireFormat[T] for primitive types, strings, tuples, Option[T], Either[T], Iterable[T], etc.
• Out-of-the-box support for serialization of Scala case classes
IO/Serialization I
IO/Serialization II
For normal (i.e. non-case) classes
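The code from these slides isn’t in the transcript. Because Scoobi serializes case classes out of the box, working with custom record types is a one-liner (a sketch; the LogEntry type and file path are hypothetical, not from the talk):

```scala
import com.nicta.scoobi.Scoobi._

// Hypothetical record type: Scoobi derives a WireFormat for case classes
// automatically, so no extra serialization code is needed.
case class LogEntry(ip: String, status: Int)

val entries: DList[LogEntry] =
  fromTextFile("hdfs://namenode/logs/access.log")
    .map { line =>
      val fields = line.split(" ")
      LogEntry(fields(0), fields(1).toInt)
    }

// A strongly typed distributed filter over the custom type
val errors: DList[LogEntry] = entries.filter(_.status >= 500)
```

For normal (i.e. non-case) classes you would instead supply a WireFormat[T] instance yourself.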
Further Info
http://nicta.github.com/scoobi/
[email protected]@googlegroups.com
Version 0.4 released today (!)
• Avro, Sequence Files
• Materialized DObjects
• DList reduction methods (product, min, etc.)
• Vastly improved testing support
• Less overhead
• Much more
Scalding!
http://github.com/twitter/scalding
A Scala library that implements a higher-level programming model for Hadoop MapReduce, built on top of Cascading
Counting Words using Scalding
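The slide’s code isn’t in the transcript; the canonical Scalding word count (the fields-based API from the Scalding README of that era) looks roughly like this:

```scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                       // read lines
    .flatMap('line -> 'word) { line: String =>
      line.split("""\s+""")                                     // split lines into words
    }
    .groupBy('word) { _.size }                                  // group by word and count
    .write(Tsv(args("output")))                                 // write (word, count) pairs
}
```

The 'line and 'word symbols are Cascading field names, which is where the named-record model shows through.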
Scalding is...
• A distributed collections abstraction
• A wrapper around Cascading (i.e. no source code generation)
• Based on the same record model (i.e. named fields)
• Less strongly typed
• Uses Kryo serialization
• Used by Twitter in production
• Written in Scala (W00t!)
Further Info
@scalding
http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
https://github.com/twitter/scalding/wiki
Current version: 0.5.4
http://github.com/twitter/scalding
How do they compare?
• Different approaches, similar power
• Small feature differences, which will even out over time
• Scoobi gets a little closer to idiomatic Scala
• Twitter is definitely a bigger fish than NICTA, so Scalding gets all the attention
• Both open sourced (last year)
• Scoobi has better docs!
Which one should I use? Ehm...
...I’m extremely prejudiced!
Questions?