Scalding by Adform Research, Alex Gryzlov


Wordcount in MapReduce
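The code from this slide is not in the transcript. As a rough illustration of what the raw MapReduce version involves (a sketch written in Scala against the Hadoop API for consistency with the rest of the deck; class names and structure are assumptions, not the original slide's code):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// Map phase: emit (word, 1) for every token of every line
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
}

// Reduce phase: sum the counts emitted for each word
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    ctx.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

// Driver: wire the mapper, reducer and input/output paths together
object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "wordcount")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}

Even a simple word count needs a mapper class, a reducer class and a driver, which is the contrast the later Scalding slides build on.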

Cascading

Tap / Pipe / Sink abstraction over Map / Reduce in Java

Wordcount in Cascading
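The Cascading code is also missing from the transcript. A sketch of a typical Cascading word count expressed through the Tap/Pipe/Sink abstraction (written here in Scala against Cascading's Java API; the exact classes and version details are assumptions):

import java.util.Properties
import cascading.flow.hadoop.HadoopFlowConnector
import cascading.operation.aggregator.Count
import cascading.operation.regex.RegexSplitGenerator
import cascading.pipe.{Each, Every, GroupBy, Pipe}
import cascading.scheme.hadoop.TextLine
import cascading.tap.hadoop.Hfs
import cascading.tuple.Fields

object CascadingWordCount {
  def main(args: Array[String]): Unit = {
    // Source tap: read raw text lines; sink tap: write the resulting tuples
    val source = new Hfs(new TextLine(new Fields("line")), args(0))
    val sink   = new Hfs(new TextLine(), args(1))

    // Pipe assembly: split lines into words, group by word, count each group
    var pipe: Pipe = new Pipe("wordcount")
    pipe = new Each(pipe, new Fields("line"),
                    new RegexSplitGenerator(new Fields("word"), "\\s+"))
    pipe = new GroupBy(pipe, new Fields("word"))
    pipe = new Every(pipe, new Count(new Fields("count")))

    // Connect source -> assembly -> sink into a Flow and run it
    new HadoopFlowConnector(new Properties()).connect(source, sink, pipe).complete()
  }
}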

Scalding

• Scala wrapper for Cascading

• Just like working with in-memory collections (map/filter/sort…)

• Built-in parsers for {T|C}SV, date annotations, etc.

• Helper algorithms, e.g.:

  • approximations (Algebird library)

  • matrix API

Wordcount in Scalding
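The code on the slide is not in the transcript; the canonical Scalding word count (essentially the example from the Scalding tutorial) looks like this:

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  // Read lines, split them into words, count occurrences per word, write TSV
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}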

Run the WordCountJob in local mode with the given input and output.
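One way to do that, assuming the job is bundled into job.jar in the same style as the EMR commands below (file names are illustrative):

hadoop jar job.jar com.twitter.scalding.Tool WordCountJob --local --input input.txt --output output.tsv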

Building and Deploying

• Get sbt

• sbt assembly produces the jar file in target/scala-2.10

• sbt s3-upload produces the jar and uploads it to S3 (plugin setup sketched below)
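Both tasks come from sbt plugins; a minimal project/plugins.sbt might look roughly like this (the exact plugin coordinates and versions are assumptions, the deck does not say which plugins were used):

// sbt-assembly adds the assembly task that builds the fat jar
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// an S3 plugin such as sbt-s3 adds the s3-upload task
addSbtPlugin("com.typesafe.sbt" % "sbt-s3" % "0.8")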

Running on EMR

• hadoop fs -get s3://dev-adform-test/madeup-job.jar job.jar

• hadoop jar job.jar \
      com.twitter.scalding.Tool \               (entry class)
      com.adform.dspr.MadeupJob \               (Scalding job class)
      --hdfs \                                  (run in HDFS mode)
      --logs s3://dev-adform-test/logs \        (parameter)
      --meta s3://dev-adform-test/metadata \    (parameter)
      --output s3://dev-adform-test/output      (parameter)

For more complicated workflows you would have to use applications like Oozie or Pentaho, or write a custom runner app; check out https://gitz.adform.com/dco/dco-amazon-runner

Development

• Two APIs:

  • Fields – everything is a string

  • Typed – working with classes, e.g. Request/Transaction

Development

• Fields:

  • No need to parse columns

  • Redundancy

  • No IDE support such as auto-completion

• Typed:

  • All the benefits of types, especially compile-time checking

  • More manual work with parsing

  • Sometimes the API can be confusing (TypedPipe/Grouped/CoGrouped…); a short sketch of both APIs follows below
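To make the difference concrete, here is a small sketch of the same aggregation in both APIs (the Request class, field names and paths are made up for illustration):

import com.twitter.scalding._

// Hypothetical record type, standing in for Request/Transaction
case class Request(userId: String, price: Double)

class ComparisonJob(args: Args) extends Job(args) {

  // Fields API: columns addressed by symbols, values are effectively untyped
  Tsv(args("input"), ('userId, 'price))
    .groupBy('userId) { _.sum[Double]('price -> 'total) }
    .write(Tsv(args("fieldsOutput")))

  // Typed API: parse into case classes, get compile-time checking from there on
  TypedPipe.from(TypedTsv[(String, Double)](args("input")))
    .map { case (userId, price) => Request(userId, price) }
    .groupBy(_.userId)
    .mapValues(_.price)
    .sum
    .toTypedPipe
    .write(TypedTsv[(String, Double)](args("typedOutput")))
}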

Downsides

• A lot of configuring and googling of random issues

• Scarce documentation; you often have to read the source code or Stack Overflow

• IntelliJ is slow

• Boilerplate code for parsing data

Some tips

• In local mode you specify files as input/output; in HDFS mode, folders

• You can use the Hadoop API to read files from HDFS directly, but only on the submitting node, not inside the pipeline (see the FileSystem sketch below)

• As a workaround for the previous problem you can use the distributed cache mechanism, but AFAIK that only works on Hadoop 1

• The default memory limit per mapper/reducer is ~200 MB; it can be raised by overriding Job.config and adding "mapred.child.java.opts" -> "-Xmx<NUMBER>m" (see the Job.config sketch below)
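For the "read from HDFS on the submitting node" tip, the idea is to use the plain Hadoop FileSystem API before the pipeline runs, e.g. while the Job is being constructed (a sketch; the path and usage are illustrative):

import com.twitter.scalding._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

class MetadataAwareJob(args: Args) extends Job(args) {
  // Executed on the submitting node when the Job is constructed,
  // not inside map/reduce tasks on the cluster
  val metadata: List[String] = {
    val fs = FileSystem.get(new Configuration())
    val in = fs.open(new Path("hdfs:///some/metadata.tsv"))  // illustrative path
    try Source.fromInputStream(in).getLines().toList
    finally in.close()
  }

  // ... metadata can then be used while building the pipeline ...
}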
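For the memory tip, overriding Job.config looks roughly like this (the 2048 MB value is just an example, and the exact signature of config may differ between Scalding versions):

import com.twitter.scalding._

class MemoryHungryJob(args: Args) extends Job(args) {
  // Raise the per-task JVM heap above the ~200 MB default
  override def config: Map[AnyRef, AnyRef] =
    super.config + ("mapred.child.java.opts" -> "-Xmx2048m")

  // ... pipeline definition ...
}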

Resources

• https://github.com/twitter/scalding/wiki (wiki)

• https://github.com/twitter/scalding/tree/develop/tutorial (basic stuff)

• https://github.com/twitter/scalding/tree/develop/scalding-core/src/main/scala/com/twitter/scalding/examples (advanced examples, e.g. iterative jobs)

• http://www.slideshare.net/AntwnisChalkiopoulos/scalding-presentation

• http://polyglotprogramming.com/papers/ScaldingForHadoop.pdf

• http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014
