
SURVIVING HADOOP ON AWS IN PRODUCTION

DISCLAIMER: I AM A BAD PERSON.

ABOUT ME
Chief Data Scientist at Yieldbot, Co-Founder at StockTwits.
@sorenmacbeth

YIELDBOT
“Yieldbot's technology creates a marketplace where search advertisers buy real-time consumer intent on premium publishers.”

WHERE WE ARE TODAY
MapR M3 on EMR
All data read from and written to S3

CLOJURE FOR DATA PROCESSING
All of our MapReduce jobs are written in Cascalog.

This gives us speed, flexibility, and testability.

More importantly, Clojure and Cascalog are fun to write.

CASCALOG EXAMPLE

(ns lucene-cascalog.core
  (:gen-class)
  (:use cascalog.api)
  (:import org.apache.lucene.analysis.standard.StandardAnalyzer
           org.apache.lucene.analysis.TokenStream
           org.apache.lucene.util.Version
           org.apache.lucene.analysis.tokenattributes.TermAttribute))

(defn tokenizer-seq
  "Build a lazy-seq out of a tokenizer with TermAttribute"
  [^TokenStream tokenizer ^TermAttribute term-att]
  (lazy-seq
    (when (.incrementToken tokenizer)
      (cons (.term term-att)
            (tokenizer-seq tokenizer term-att)))))
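
Not part of the original deck: a minimal sketch of how the tokenizer above might feed an actual Cascalog query, here a word count over lines of text. It assumes Lucene 3.x and a (:require [cascalog.ops :as c]) in addition to the ns form shown earlier; the tap paths and field names are illustrative.

(defmapcatop tokenize [line]
  ;; Emit one token per tuple, using the Lucene StandardAnalyzer and the
  ;; tokenizer-seq helper above. Assumes Lucene 3.x (Version/LUCENE_30).
  (let [analyzer (StandardAnalyzer. Version/LUCENE_30)
        stream   (.tokenStream analyzer "line" (java.io.StringReader. line))
        term-att (.addAttribute stream TermAttribute)]
    (doall (tokenizer-seq stream term-att))))

(defn -main [in out & args]
  ;; Word count: read text lines from `in`, write (token, count) pairs to `out`.
  (?<- (hfs-textline out)
       [?token ?count]
       ((hfs-textline in) ?line)
       (tokenize ?line :> ?token)
       (c/count ?count)))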

HADOOP IS COMPLEX

“Fact: There are more Hadoop configuration options than there are stars in our galaxy.”

EVEN IN THE BEST CASE SCENARIO, IT TAKES A LOT OF TUNING TO GET A HADOOP CLUSTER RUNNING WELL.

There are large companies that make money solely by configuring and supporting Hadoop clusters for enterprise customers.
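
To make that concrete (this sketch is not from the original deck): a few of the Hadoop 1.x-era knobs you typically end up turning per job, expressed as a job-conf map you can pass to Cascalog's with-job-conf macro. The values are illustrative, not recommendations.

(def tuned-conf
  {"mapred.reduce.tasks"    "64"          ; reducer parallelism
   "mapred.child.java.opts" "-Xmx1024m"   ; per-task JVM heap
   "io.sort.mb"             "256"         ; map-side sort buffer, in MB
   "mapred.task.timeout"    "1200000"})   ; ms before a hung task is killed

;; Wrap any query execution in (with-job-conf tuned-conf ...) to apply these.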

RUNNING HADOOP ON AWS

SO WHY RUN ON AWS?
$$$

HADOOP ON AWS: A PERSONAL HISTORY

PIG AND ELASTICMAPREDUCE
Slow development cycle; writing Java sucks.

CASCALOG AND ELASTICMAPREDUCE
Learning Emacs, Clojure, and Cascalog was hard, but it was worth it.
The way our jobs were designed sucked and didn't work well with ElasticMapReduce.

CASCALOG AND SELF-MANAGED HADOOP CLUSTER

We used a hacked-up version of a Cloudera Python script to launch and bootstrap a cluster.

We ran on spot instances.

Cluster boot-up time SUCKED and often failed. We paid for instances during bootstrap and configuration.

Our jobs weren't designed to tolerate things like spot instances going away in the middle of a job.

Drinking heavily dulled the pain a little.

CASCALOG AND ELASTICMAPREDUCE, AGAIN

Rebuilt the data processing pipeline from scratch (only took nine months!).
Data pipelines were broken out into a handful of fault-tolerant jobflow steps; each step writes its output to S3 (sketched below).
EMR supported spot instances at this point.
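
A rough sketch of what those jobflow steps could look like; this is illustrative (query, field, and bucket names are made up), assuming the ns above plus (:require [cascalog.ops :as c]). Each step reads its input from S3 and checkpoints its output back to S3, so a failed step can be re-run without redoing earlier stages.

(defmapop parse-line [line]
  ;; Hypothetical parser: tab-separated user, url, timestamp.
  (let [[user url ts] (.split line "\t")]
    [user url ts]))

(defn parse-step [in out]
  ;; Step 1: raw text lines in S3 -> parsed tuples back in S3.
  (?- (hfs-seqfile out)
      (<- [?user ?url ?ts]
          ((hfs-textline in) ?line)
          (parse-line ?line :> ?user ?url ?ts))))

(defn count-step [in out]
  ;; Step 2: parsed tuples in S3 -> per-url visit counts back in S3.
  (?- (hfs-textline out)
      (<- [?url ?visits]
          ((hfs-seqfile in) ?user ?url ?ts)
          (c/count ?visits))))

;; Run as two independent EMR jobflow steps, e.g.:
;; (parse-step "s3n://bucket/raw/2015-06-23/"    "s3n://bucket/parsed/2015-06-23/")
;; (count-step "s3n://bucket/parsed/2015-06-23/" "s3n://bucket/counts/2015-06-23/")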

WEIRD BUGS THAT WE'VE HIT
Bootstrap script errors
Random cluster fuckedupedness
AMI version changes
Vendor issues
My personal favourite: invisible S3 write failures.

IF YOU MUST RUN ON AWS
Break your processing pipelines into stages; write out to S3 after each stage.
Bake a lot of variability into your expected jobflow run times.
Compress the data you are reading from and writing to S3 as much as possible (see the sketch below).
Drinking helps.
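
For the compression point, a rough sketch (not from the deck) of turning on output compression around a Cascalog query via with-job-conf; the keys are standard Hadoop 1.x settings and the S3 paths are made up, so adjust for your Hadoop/EMR version.

(with-job-conf
  {"mapred.compress.map.output"      "true"   ; compress intermediate map output
   "mapred.output.compress"          "true"   ; compress final job output
   "mapred.output.compression.type"  "BLOCK"  ; block compression for sequence files
   "mapred.output.compression.codec" "org.apache.hadoop.io.compress.GzipCodec"}
  ;; Identity query just to show where the conf applies: copy lines from
  ;; one S3 stage prefix to the next, compressed.
  (?- (hfs-seqfile "s3n://bucket/stage-2/")
      (<- [?line]
          ((hfs-textline "s3n://bucket/stage-1/") ?line))))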

QUESTIONS?

YIELDBOT IS HIRING!
http://yieldbot.com/jobs
