Python in the Hadoop Ecosystem (Rock Health presentation)
DESCRIPTION
A presentation on using Python frameworks in the Hadoop ecosystem. It covers, in particular, Hadoop Streaming, mrjob, Luigi, PySpark, and using Numba with Impala.
TRANSCRIPT
![Page 2: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/2.jpg)
2
Goals for today
1. Easy to jump into Hadoop with Python
2. Describe 5 ways to use Python with Hadoop, batch and interactive
3. Guidelines for choosing Python framework
![Page 3: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/3.jpg)
3
Code: https://github.com/laserson/rock-health-python
Blog post: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Slides: http://www.slideshare.net/urilaserson/
![Page 4: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/4.jpg)
4
About the speaker
• Joined Cloudera late 2012
• Focus on life sciences/medical
• PhD in BME/computational biology at MIT/Harvard (2005-2012)
• Focused on genomics
• Cofounded Good Start Genetics (2007-)
• Applying next-gen DNA sequencing to genetic carrier screening
![Page 5: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/5.jpg)
5
About the speaker
• No formal training in computer science
• Never touched Java
• Almost all work using Python
![Page 6: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/6.jpg)
6
![Page 7: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/7.jpg)
7
Python frameworks for Hadoop
• Hadoop Streaming
• mrjob (Yelp)
• dumbo
• Luigi (Spotify)
• hadoopy
• pydoop
• PySpark
• happy
• Disco
• octopy
• Mortar Data
• Pig UDF/Jython
• hipy
• Impala + Numba
![Page 8: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/8.jpg)
8
Goals for Python framework
1. “Pseudocodiness”/simplicity
2. Flexibility/generality
3. Ease of use/installation
4. Performance
![Page 9: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/9.jpg)
9
![Page 10: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/10.jpg)
10
Python frameworks for Hadoop
• Hadoop Streaming ✓
• mrjob (Yelp) ✓
• dumbo
• Luigi (Spotify) ✓
• hadoopy
• pydoop
• PySpark ✓
• happy: abandoned? Jython-based
• Disco: not Hadoop
• octopy: not serious/not Hadoop
• Mortar Data: HaaS; supports numpy, scipy, nltk, pip-installable packages in UDFs
• Pig UDF/Jython: Pig is another talk; Jython limited
• hipy: Python syntactic sugar to construct Hive queries
• Impala + Numba ✓
![Page 11: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/11.jpg)
11
An n-gram is a tuple of n words.
Problem: aggregating the Google n-gram data
http://books.google.com/ngrams
![Page 12: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/12.jpg)
12
[Figure: eight words, numbered 1 to 8 and enclosed in parentheses, labeled as an 8-gram.]
![Page 13: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/13.jpg)
13
"A partial differential equation is an equation that contains partial derivatives."
![Page 14: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/14.jpg)
14
A partial differential equation is an equation that contains partial derivatives.
1-grams (with counts):
A 1
partial 2
differential 1
equation 2
is 1
an 1
that 1
contains 1
derivatives. 1
![Page 15: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/15.jpg)
15
A partial differential equation is an equation that contains partial derivatives.
2-grams (with counts):
A partial 1
partial differential 1
differential equation 1
equation is 1
is an 1
an equation 1
equation that 1
that contains 1
contains partial 1
partial derivatives. 1
![Page 16: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/16.jpg)
16
A partial differential equation is an equation that contains partial derivatives.
5-grams (with counts):
A partial differential equation is 1
partial differential equation is an 1
differential equation is an equation 1
equation is an equation that 1
is an equation that contains 1
an equation that contains partial 1
equation that contains partial derivatives. 1
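The counts on these slides can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization with punctuation left attached to words (as on the slides); the function name `ngram_counts` is made up for illustration and does not come from the talk's repository.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive whitespace-separated words."""
    words = text.split()
    # Slide a window of width n across the word list.
    grams = (tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


sentence = ("A partial differential equation is an equation "
            "that contains partial derivatives.")

for n in (1, 2, 5):
    counts = ngram_counts(sentence, n)
    print("%d-grams: %d distinct" % (n, len(counts)))
```

Each key is a tuple of n words, which matches the definition of an n-gram given earlier in the talk.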
![Page 17: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/17.jpg)
17
![Page 18: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/18.jpg)
18
goto code
![Page 19: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/19.jpg)
19
2-gram          year  matches  pages  volumes
flourished in   1993  2        2      2
flourished in   1998  2        2      1
flourished in   1999  6        6      4
flourished in   2000  5        5      5
flourished in   2001  1        1      1
flourished in   2002  7        7      3
flourished in   2003  9        9      4
flourished in   2004  22       21     13
flourished in   2005  37       37     22
flourished in   2006  55       55     38
flourished in   2007  99       98     76
flourished in   2008  220      215    118
fluid of        1899  2        2      1
fluid of        2000  3        3      1
fluid of        2002  2        1      1
fluid of        2003  3        3      1
fluid of        2004  3        3      3
![Page 20: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/20.jpg)
20
Compute how often two words are near each other in a given year.
Two words are “near” if they are both present in a 2-, 3-, 4-, or 5-gram.
![Page 21: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/21.jpg)
21
Raw data:
...2-grams...
(cat, the) 1999 14
(the, cat) 1999 7002
...3-grams...
(the, cheshire, cat) 1999 563
...4-grams...
...5-grams...
(the, cat, in, the, hat) 1999 1023
(the, dog, chased, the, cat) 1999 403
(cat, is, one, of, the) 1999 24
Aggregated results (word pairs in lexicographic ordering):
(cat, the) 1999 8006
(hat, the) 1999 1023
Internal n-grams are counted by smaller n-grams:
• avoids double-counting
• increases sensitivity (observed at least 40 times)
![Page 22: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/22.jpg)
22
What is Hadoop?
• Ecosystem of tools
• Core is the HDFS file system
• Downloadable set of jars that can be run on any machine
![Page 23: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/23.jpg)
23
HDFS design assumptions
• Based on Google File System
• Files are large (GBs to TBs)
• Failures are common
  • Massive scale means failures very likely
  • Disk, node, or network failures
• Accesses are large and sequential
• Files are append-only
![Page 24: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/24.jpg)
24
HDFS properties
• Fault-tolerant
  • Gracefully responds to node/disk/network failures
• Horizontally scalable
  • Low marginal cost
• High-bandwidth
[Figure: HDFS storage distribution. An input file is split into blocks 1 to 5, and each block is stored on three of the five nodes (Node A through Node E).]
![Page 25: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/25.jpg)
25
MapReduce computation
![Page 26: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/26.jpg)
26
MapReduce computation
• Structured as
  1. Embarrassingly parallel “map stage”
  2. Cluster-wide distributed sort (“shuffle”)
  3. Aggregation “reduce stage”
• Data-locality: process the data where it is stored
• Fault-tolerance: failed tasks automatically detected and restarted
• Schema-on-read: data need not be stored conforming to a rigid schema
![Page 27: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/27.jpg)
27
Pseudocode for MapReduce
    def map(record):
        (ngram, year, count) = unpack(record)
        # ensure word1 is the lexicographically first word
        (word1, word2) = sorted([ngram[first], ngram[last]])
        key = (word1, word2, year)
        emit(key, count)

    def reduce(key, values):
        emit(key, sum(values))

All source code available on GitHub: https://github.com/laserson/rock-health-python
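The pseudocode can be exercised locally without a cluster. The following is a minimal sketch, assuming records are already parsed into `(ngram, year, count)` tuples; `run_mapreduce`, `map_record`, and `reduce_counts` are illustrative names, and the in-memory sort merely stands in for Hadoop's distributed shuffle. The input is the raw data from the word-pair aggregation slide.

```python
from itertools import groupby
from operator import itemgetter


def map_record(record):
    """Map stage: emit ((word1, word2, year), count) for one n-gram record."""
    ngram, year, count = record
    # Only the first and last words are paired; sorting them puts the
    # lexicographically smaller word first.
    word1, word2 = sorted([ngram[0], ngram[-1]])
    yield (word1, word2, year), count


def reduce_counts(key, values):
    """Reduce stage: sum all counts that share a key."""
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, then a sort standing in for the shuffle, then reduce."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results


raw = [
    (("cat", "the"), 1999, 14),
    (("the", "cat"), 1999, 7002),
    (("the", "cheshire", "cat"), 1999, 563),
    (("the", "cat", "in", "the", "hat"), 1999, 1023),
    (("the", "dog", "chased", "the", "cat"), 1999, 403),
    (("cat", "is", "one", "of", "the"), 1999, 24),
]

totals = dict(run_mapreduce(raw, map_record, reduce_counts))
print(totals[("cat", "the", 1999)])  # 8006
print(totals[("hat", "the", 1999)])  # 1023
```

Because the reducer only ever sees one key at a time, the same two functions would work unchanged no matter how the keys were spread across machines.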
![Page 28: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/28.jpg)
28
Native Java
    // NgramsDriver.java
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class NgramsDriver extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            Job job = new Job(getConf());
            job.setJarByClass(getClass());
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(NgramsMapper.class);
            job.setCombinerClass(NgramsReducer.class);
            job.setReducerClass(NgramsReducer.class);
            job.setOutputKeyClass(TextTriple.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(10);
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int exitCode = ToolRunner.run(new NgramsDriver(), args);
            System.exit(exitCode);
        }
    }
    // NgramsMapper.java
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.log4j.Logger;

    public class NgramsMapper extends Mapper<LongWritable, Text, TextTriple, IntWritable> {

        private Logger LOG = Logger.getLogger(getClass());
        private int expectedTokens;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
            LOG.info("inputFile: " + inputFile);
            Pattern c = Pattern.compile("([\\d]+)gram");
            Matcher m = c.matcher(inputFile);
            m.find();
            expectedTokens = Integer.parseInt(m.group(1));
            return;
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] data = value.toString().split("\\t");
            if (data.length < 3) {
                return;
            }
            String[] ngram = data[0].split("\\s+");
            String year = data[1];
            IntWritable count = new IntWritable(Integer.parseInt(data[2]));
            if (ngram.length != this.expectedTokens) {
                return;
            }
            // build keyOut
            List<String> triple = new ArrayList<String>(3);
            triple.add(ngram[0]);
            triple.add(ngram[expectedTokens - 1]);
            Collections.sort(triple);
            triple.add(year);
            TextTriple keyOut = new TextTriple(triple);
            context.write(keyOut, count);
        }
    }
    // NgramsReducer.java
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    public class NgramsReducer extends Reducer<TextTriple, IntWritable, TextTriple, IntWritable> {

        @Override
        protected void reduce(TextTriple key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
    // TextTriple.java
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class TextTriple implements WritableComparable<TextTriple> {

        private Text first;
        private Text second;
        private Text third;

        public TextTriple() {
            set(new Text(), new Text(), new Text());
        }

        public TextTriple(List<String> list) {
            set(new Text(list.get(0)), new Text(list.get(1)), new Text(list.get(2)));
        }

        public void set(Text first, Text second, Text third) {
            this.first = first;
            this.second = second;
            this.third = third;
        }

        public void write(DataOutput out) throws IOException {
            first.write(out);
            second.write(out);
            third.write(out);
        }

        public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second.readFields(in);
            third.readFields(in);
        }

        @Override
        public int hashCode() {
            return first.hashCode() * 163 + second.hashCode() * 31 + third.hashCode();
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof TextTriple) {
                TextTriple tt = (TextTriple) obj;
                return first.equals(tt.first)
                    && second.equals(tt.second)
                    && third.equals(tt.third);
            }
            return false;
        }

        @Override
        public String toString() {
            return first + "\t" + second + "\t" + third;
        }

        public int compareTo(TextTriple other) {
            int comp = first.compareTo(other.first);
            if (comp != 0) {
                return comp;
            }
            comp = second.compareTo(other.second);
            if (comp != 0) {
                return comp;
            }
            return third.compareTo(other.third);
        }
    }
![Page 29: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/29.jpg)
29
Native Java
• Maximum flexibility
• Fastest performance
• Native to Hadoop
• Most difficult to write
![Page 30: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/30.jpg)
30
Hadoop Streaming
    hadoop jar hadoop-streaming-*.jar \
        -input path/to/input \
        -output path/to/output \
        -mapper "grep WARN"
![Page 31: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/31.jpg)
31
Hadoop Streaming: features
• Canonical method for using any executable as mapper/reducer
  • Includes shell commands, like grep
• Transparent communication with Hadoop through stdin/stdout
• Key boundaries manually detected in reducer
• Built-in with Hadoop: should require no additional framework installation
• Developer must decide how to encode more complicated objects (e.g., JSON) or binary data
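To make the stdin/stdout contract and the manual key-boundary detection concrete, here is a minimal sketch of a single script that can serve as either the Streaming mapper or the reducer. It assumes tab-separated input lines of the form `ngram<TAB>year<TAB>matches<TAB>...` and encodes the key as space-joined text, since Streaming by default treats everything up to the first tab as the key. The function names and the `map`/`reduce` command-line argument are invented for this example and are not taken from the talk's repository.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: turn raw n-gram lines into 'word1 word2 year<TAB>count' lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        words = fields[0].split()
        if len(words) < 2:
            continue  # 1-grams carry no word pair
        word1, word2 = sorted([words[0], words[-1]])
        yield "%s %s %s\t%s" % (word1, word2, fields[1], fields[2])


def reduce_lines(lines):
    """Reducer: sum counts per key, relying on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # The reducer receives one line per value, so a change of key in the
    # sorted stream is the only signal that a new group has started.
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield "%s\t%d" % (key, total)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: the first argument selects the stage.
    stage = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in stage(sys.stdin):
        print(out)
```

Saved under a hypothetical name such as `ngrams.py`, the script could be passed to the streaming jar as `-mapper "python ngrams.py map"` and `-reducer "python ngrams.py reduce"`, and tested locally by piping a file through the mapper, `sort`, and the reducer.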
![Page 32: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/32.jpg)
32
Hadoop Streaming
goto code
![Page 33: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/33.jpg)
33
mrjob
    from mrjob.job import MRJob
    from mrjob.protocol import RawProtocol

    class NgramNeighbors(MRJob):
        # specify input/intermed/output serialization
        # default output protocol is JSON; here we set it to text
        OUTPUT_PROTOCOL = RawProtocol

        def mapper(self, key, line):
            pass

        def combiner(self, key, counts):
            pass

        def reducer(self, key, counts):
            pass

    if __name__ == '__main__':
        # sets up a runner, based on command line options
        NgramNeighbors.run()
![Page 34: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/34.jpg)
34
mrjob: features
• Abstracted MapReduce interface
• Handles complex Python objects
• Multi-step MapReduce workflows
• Extremely tight AWS integration
• Easily choose to run locally, on Hadoop cluster, or on EMR
• Actively developed; great documentation
![Page 35: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/35.jpg)
35
mrjob
goto code
![Page 36: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/36.jpg)
36
mrjob: serialization
Defaults:

    import mrjob.job
    import mrjob.protocol

    class MyMRJob(mrjob.job.MRJob):
        INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
        INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
        OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol

Available:
• RawProtocol / RawValueProtocol
• JSONProtocol / JSONValueProtocol
• PickleProtocol / PickleValueProtocol
• ReprProtocol / ReprValueProtocol

Custom protocols can be written.
No current support for binary serialization schemes.
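As an illustration of what a custom protocol might look like: mrjob's documentation describes a protocol as a class with a `read(line)` method returning a `(key, value)` pair and a `write(key, value)` method returning a line. The class below is a hypothetical sketch along those lines; it assumes lines are text strings (whether they are `str` or bytes depends on the mrjob and Python versions), and the class name is made up.

```python
class TabSeparatedProtocol(object):
    """Hypothetical protocol: key is a tuple of strings, value is an int."""

    def read(self, line):
        # The last tab-separated field is the value; the rest form the key.
        fields = line.rstrip("\n").split("\t")
        return tuple(fields[:-1]), int(fields[-1])

    def write(self, key, value):
        # Flatten the key tuple and the value into one tab-separated line.
        return "\t".join(list(key) + [str(value)])


protocol = TabSeparatedProtocol()
line = protocol.write(("cat", "the", "1999"), 8006)
print(repr(line))
print(protocol.read(line))
```

A class like this would be assigned to `OUTPUT_PROTOCOL` (or one of the other protocol attributes) in place of the built-in protocols listed on this slide.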
![Page 37: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/37.jpg)
37
luigi
• Full-fledged workflow management, task scheduling, dependency resolution tool in Python (similar to Apache Oozie)
• Built-in support for Hadoop by wrapping Streaming
• Not as fully-featured as mrjob for Hadoop, but easily customizable
• Internal serialization through repr/eval
• Actively developed at Spotify
• README is good but documentation is lacking
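Serializing through repr/eval means each key and value is written out as its Python literal text and parsed back on the other side. The round trip can be sketched as follows. This is not Luigi's actual code: the helper names are invented, and `ast.literal_eval` is used as a safer stand-in for plain `eval`.

```python
import ast


def serialize(key, value):
    """Encode a key/value pair as one tab-separated line of Python literals."""
    return "%s\t%s" % (repr(key), repr(value))


def deserialize(line):
    """Parse a line produced by serialize() back into Python objects."""
    key_text, value_text = line.rstrip("\n").split("\t", 1)
    return ast.literal_eval(key_text), ast.literal_eval(value_text)


line = serialize(("cat", "the", 1999), 8006)
print(line)
print(deserialize(line))
```

The appeal of this scheme is that tuples, lists, dicts, and numbers pass between mapper and reducer without any schema; a tab inside a string is escaped by `repr`, so it cannot be confused with the key/value separator.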
![Page 38: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/38.jpg)
38
luigi
goto code
![Page 39: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/39.jpg)
39
The cluster used for benchmarking
• 5 virtual machines
  • 4 CPUs
  • 10 GB RAM
  • 100 GB disk
  • CentOS 6.2
• CDH4 (Hadoop 2)
  • 20 map tasks
  • 10 reduce tasks
• Python 2.6
![Page 40: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/40.jpg)
40
(Unscientific) performance comparison
![Page 41: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/41.jpg)
41
(Unscientific) performance comparison
Streaming has lowest overhead
![Page 42: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/42.jpg)
42
(Unscientific) performance comparison
JSON SerDe
![Page 43: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/43.jpg)
43
Feature comparison
![Page 44: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/44.jpg)
44
Feature comparison
![Page 45: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/45.jpg)
45
Questions?
![Page 46: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/46.jpg)
46
Cloudera Hadoop Stack
[Figure: Cloudera Hadoop stack. Layers shown: unified scale-out storage for any type of data, batch processing, and workload management, with system management and data management alongside.]
![Page 47: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/47.jpg)
47
Cloudera Hadoop Stack
[Figure: Cloudera Hadoop stack, extended. In addition to unified scale-out storage for any type of data, batch processing, workload management, system management, and data management, the stack now shows online NoSQL, analytic SQL, search, machine learning and streaming, and 3rd-party apps.]
![Page 48: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/48.jpg)
48
What is Spark?
• Started in 2009 as an academic project at the AMPLab at UC Berkeley; now an ASF project with >100 contributors
• In-memory distributed execution engine
• Operates on Resilient Distributed Datasets (RDDs)
• Provides richer distributed computing primitives for various problems
• Can support SQL, stream processing, ML, graph computation
• Supports Scala, Java, and Python
![Page 49: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/49.jpg)
Spark uses a general DAG scheduler
• Application aware scheduler
• Uses locality for both disk and memory
• Partitioning-aware to avoid shuffles
• Can rewrite and optimize graph based on analysis
[Figure: example DAG of RDDs A through G, connected by map, union, groupBy, and join operations and divided into Stage 1, Stage 2, and Stage 3; a legend marks cached data partitions.]
![Page 50: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/50.jpg)
50
Operations on RDDs
Zaharia 2011
![Page 51: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/51.jpg)
51
Apache Spark
Log filtering (Python):

    file = spark.textFile("hdfs://...")
    errors = file.filter(lambda line: "ERROR" in line)
    # Count all the errors
    errors.count()
    # Count errors mentioning MySQL
    errors.filter(lambda line: "MySQL" in line).count()
    # Fetch the MySQL errors as an array of strings
    errors.filter(lambda line: "MySQL" in line).collect()

Logistic regression (Scala):

    val points = spark.textFile(...).map(parsePoint).cache()
    var w = Vector.random(D) // current separating plane
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final separating plane: " + w)
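The n-gram neighbor count from earlier in the talk maps naturally onto the same RDD operations. The following is a hedged sketch rather than the code from the talk's repository: it assumes tab-separated input lines (`ngram<TAB>year<TAB>count...`), the helper names are invented, and `sc` stands for an already-created SparkContext (called `spark` in the slide's example). The parsing helpers are plain Python, so they can be checked without a cluster.

```python
from operator import add


def is_valid(line):
    """Keep only well-formed records whose n-gram has at least two words."""
    fields = line.split("\t")
    return len(fields) >= 3 and len(fields[0].split()) >= 2


def parse_record(line):
    """Turn 'ngram<TAB>year<TAB>count...' into ((word1, word2, year), count)."""
    fields = line.split("\t")
    words = fields[0].split()
    # Pair the first and last words, lexicographically smaller word first.
    word1, word2 = sorted([words[0], words[-1]])
    return (word1, word2, fields[1]), int(fields[2])


def neighbor_counts(sc, path):
    """Build the RDD of summed counts; `sc` is an existing SparkContext."""
    return (sc.textFile(path)
              .filter(is_valid)
              .map(parse_record)
              .reduceByKey(add))


print(parse_record("the cat\t1999\t7002"))
```

Nothing is computed until an action such as `count()` or `collect()` is called on the RDD returned by `neighbor_counts`, which is what makes interactive exploration cheap.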
![Page 52: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/52.jpg)
52
Apache Spark
goto code
![Page 53: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/53.jpg)
53
What’s Impala?
• Interactive SQL
  • Typically 4-65x faster than the latest Hive (observed up to 100x faster)
  • Responses in seconds instead of minutes (sometimes sub-second)
• ANSI-92 standard SQL queries with HiveQL
  • Compatible SQL interface for existing Hadoop/CDH applications
  • Based on industry standard SQL
• Natively on Hadoop/HBase storage and metadata
  • Flexibility, scale, and cost advantages of Hadoop
  • No duplication/synchronization of data and metadata
  • Local processing to avoid network bottlenecks
• Separate runtime from batch processing
  • Hive, Pig, MapReduce are designed and great for batch
  • Impala is purpose-built for low-latency SQL queries on Hadoop
Cloudera Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
![Page 54: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/54.jpg)
54
Cloudera Impala
    SELECT cosmic as snp_id,
           vcf_chrom as chr,
           vcf_pos as pos,
           sample_id as sample,
           vcf_call_gt as genotype,
           sample_affection as phenotype
    FROM hg19_parquet_snappy_join_cached_partitioned
    WHERE COSMIC IS NOT NULL
      AND dbSNP IS NULL
      AND sample_study = "breast_cancer"
      AND VCF_CHROM = "16";
![Page 55: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/55.jpg)
55
Impala Architecture: Planner
• Example: query with join and aggregation

    SELECT state, SUM(revenue)
    FROM HdfsTbl h JOIN HbaseTbl b ON (...)
    GROUP BY 1 ORDER BY 2 desc LIMIT 10

[Figure: query plan built from HdfsScan, HbaseScan, HashJoin, Agg, TopN, and Exch nodes, with plan fragments executing at the coordinator, at DataNodes, and at region servers.]
![Page 56: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/56.jpg)
56
Impala User-defined Functions (UDFs)
• Tuple => Scalar value
  • Substring
  • sin, cos, pow, …
  • Machine-learning models
• Supports Hive UDFs (Java)
  • Highly unpleasurable
• Impala (native) UDFs
  • C++ interface designed for efficiency
  • Similar to Postgres UDFs
  • Runs any LLVM-compiled code
![Page 57: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/57.jpg)
57
LLVM compiler infrastructure
![Page 58: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/58.jpg)
58
LLVM: C++ example
    bool StringEq(FunctionContext* context,
                  const StringVal& arg1,
                  const StringVal& arg2) {
      if (arg1.is_null != arg2.is_null) return false;
      if (arg1.is_null) return true;
      if (arg1.len != arg2.len) return false;
      return (arg1.ptr == arg2.ptr) || memcmp(arg1.ptr, arg2.ptr, arg1.len) == 0;
    }
![Page 59: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/59.jpg)
59
LLVM: IR output

    ; ModuleID = '<stdin>'
    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
    target triple = "x86_64-apple-macosx10.7.0"

    %"class.impala_udf::FunctionContext" = type { %"class.impala::FunctionContextImpl"* }
    %"class.impala::FunctionContextImpl" = type opaque
    %"struct.impala_udf::StringVal" = type { %"struct.impala_udf::AnyVal", i32, i8* }
    %"struct.impala_udf::AnyVal" = type { i8 }

    ; Function Attrs: nounwind readonly ssp uwtable
    define zeroext i1 @_Z8StringEqPN10impala_udf15FunctionContextERKNS_9StringValES4_(%"class.impala_udf::FunctionContext"* nocapture %context, %"struct.impala_udf::StringVal"* nocapture %arg1, %"struct.impala_udf::StringVal"* nocapture %arg2) #0 {
    entry:
      %is_null = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 0, i32 0
      %0 = load i8* %is_null, align 1, !tbaa !0, !range !3
      %is_null1 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 0, i32 0
      %1 = load i8* %is_null1, align 1, !tbaa !0, !range !3
      %cmp = icmp eq i8 %0, %1
      br i1 %cmp, label %if.end, label %return

    if.end:                                           ; preds = %entry
      %tobool = icmp eq i8 %0, 0
      br i1 %tobool, label %if.end7, label %return

    if.end7:                                          ; preds = %if.end
      %len = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 1
      %2 = load i32* %len, align 4, !tbaa !4
      %len8 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 1
      %3 = load i32* %len8, align 4, !tbaa !4
      %cmp9 = icmp eq i32 %2, %3
      br i1 %cmp9, label %if.end11, label %return

    if.end11:                                         ; preds = %if.end7
      %ptr = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 2
      %4 = load i8** %ptr, align 8, !tbaa !5
      %ptr12 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 2
      %5 = load i8** %ptr12, align 8, !tbaa !5
      %cmp13 = icmp eq i8* %4, %5
      br i1 %cmp13, label %return, label %lor.rhs

    lor.rhs:                                          ; preds = %if.end11
      %conv17 = sext i32 %2 to i64
      %call = tail call i32 @memcmp(i8* %4, i8* %5, i64 %conv17)
      %cmp18 = icmp eq i32 %call, 0
      br label %return

    return:                                           ; preds = %lor.rhs, %if.end11, %if.end7, %if.end, %entry
      %retval.0 = phi i1 [ false, %entry ], [ true, %if.end ], [ false, %if.end7 ], [ true, %if.end11 ], [ %cmp18, %lor.rhs ]
      ret i1 %retval.0
    }

    ; Function Attrs: nounwind readonly
    declare i32 @memcmp(i8* nocapture, i8* nocapture, i64) #1

    attributes #0 = { nounwind readonly ssp uwtable "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"="true" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "unsafe-fp-math"="false" "use-soft-float"="false" }
    attributes #1 = { nounwind readonly "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"="true" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "unsafe-fp-math"="false" "use-soft-float"="false" }

    !0 = metadata !{metadata !"bool", metadata !1}
    !1 = metadata !{metadata !"omnipotent char", metadata !2}
    !2 = metadata !{metadata !"Simple C/C++ TBAA"}
    !3 = metadata !{i8 0, i8 2}
    !4 = metadata !{metadata !"int", metadata !1}
    !5 = metadata !{metadata !"any pointer", metadata !1}
![Page 60: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/60.jpg)
60
LLVM compiler infrastructure
[Figure: the LLVM compiler infrastructure diagram with Python added as a source language, compiled via Numba.]
![Page 61: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/61.jpg)
61
Iris data and BigML

    def predict_species_orig(sepal_width=None,
                             petal_length=None,
                             petal_width=None):
        """ Predictor for species from model/52952081035d07727e01d836

            Predictive model by BigML - Machine Learning Made Easy
        """
        if (petal_width is None):
            return u'Iris-virginica'
        if (petal_width > 0.8):
            if (petal_width <= 1.75):
                if (petal_length is None):
                    return u'Iris-versicolor'
                if (petal_length > 4.95):
                    if (petal_width <= 1.55):
                        return u'Iris-virginica'
                    if (petal_width > 1.55):
                        if (petal_length > 5.45):
                            return u'Iris-virginica'
                        if (petal_length <= 5.45):
                            return u'Iris-versicolor'
                if (petal_length <= 4.95):
                    if (petal_width <= 1.65):
                        return u'Iris-versicolor'
                    if (petal_width > 1.65):
                        return u'Iris-virginica'
            if (petal_width > 1.75):
                if (petal_length is None):
                    return u'Iris-virginica'
                if (petal_length > 4.85):
                    return u'Iris-virginica'
                if (petal_length <= 4.85):
                    if (sepal_width is None):
                        return u'Iris-virginica'
                    if (sepal_width <= 3.1):
                        return u'Iris-virginica'
                    if (sepal_width > 3.1):
                        return u'Iris-versicolor'
        if (petal_width <= 0.8):
            return u'Iris-setosa'
![Page 62: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/62.jpg)
62
Impala + Numba
goto code
![Page 63: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/63.jpg)
63
Impala + Numba
• Still pre-alpha
• Significantly faster execution thanks to native LLVM
• Significantly easier to write UDFs
![Page 64: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/64.jpg)
64
Conclusions
![Page 65: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/65.jpg)
65
If you have access to a Hadoop cluster and you want a one-off quick-and-dirty job…
Hadoop Streaming
![Page 66: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/66.jpg)
66
If you want an expressive Pythonic interface to build complex, regular ETL workflows…
Luigi
![Page 67: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/67.jpg)
67
If you want to integrate Hadoop with other regular processes…
Luigi
![Page 68: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/68.jpg)
68
If you don’t have access to Hadoop and want to try stuff out…
mrjob
![Page 69: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/69.jpg)
69
If you’re heavily using AWS…
mrjob
![Page 70: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/70.jpg)
70
If you want to work interactively…
PySpark
![Page 71: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/71.jpg)
71
If you want to do in-memory analytics…
PySpark
![Page 72: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/72.jpg)
72
If you want to do anything…*
PySpark
![Page 73: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/73.jpg)
73
If you want ease of Python with high performance
Impala + Numba
![Page 74: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/74.jpg)
74
If you want to write Python UDFs for SQL queries…
Impala + Numba
![Page 75: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/75.jpg)
75
Code: https://github.com/laserson/rock-health-python
Blog post: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Slides: http://www.slideshare.net/urilaserson/
![Page 76: Python in the Hadoop Ecosystem (Rock Health presentation)](https://reader036.vdocuments.net/reader036/viewer/2022062511/54c658ac4a795965328b4606/html5/thumbnails/76.jpg)
76