
Hadoop MapReduce Types Spring 2015, X. Zhang Fordham Univ.


Page 1:

Hadoop MapReduce Types

Spring 2015, X. Zhang

Fordham Univ.

Page 2:

Outline

• MapReduce Types
• default types
• partition class and number of reducer tasks
• control: choosing number of reducers, or how to partition keys …
• Default streaming jobs
• Input Splits and Records
• Serialization/Deserialization in Hadoop

Page 3:

MapReduce Programming

[Diagram: MapReduce data flow]
• Input: a set of [key,value] pairs, read from a file (text, binary) or a database; the InputFormat divides the input into splits
• Map: each split is processed by a map task, producing intermediate [key,value] pairs
• Partition: hash key to reducer task
• Shuffle: intermediate values are grouped by key: [k,(v1,v2,…)] [k’,(v1,v2,…)] …
• Output: a set of [key,value] pairs, written by the OutputFormat

Page 4:

What happens?

• Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

• Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

• Framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit. Finally cleanup(Context) is called.

• All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine final output.

• Users can control sorting and grouping by specifying two key RawComparator classes.

• Mapper outputs are partitioned per Reducer.

• Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

• Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
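Putting the lifecycle together: a user-written mapper plugs in by overriding the three hooks. A minimal sketch (the class and its logic are hypothetical, for illustration only):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper showing the setup -> map (per record) -> cleanup lifecycle
public class LineLengthMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  protected void setup(Context context) {
    // called once per task, before any map() call (e.g., read configuration)
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // one input record may produce zero or many output pairs
    context.write(new Text(value), new IntWritable(value.getLength()));
  }

  @Override
  protected void cleanup(Context context) {
    // called once per task, after the last map() call
  }
}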

Page 5:

MapReduce Programming


Map:
• input key/value type (can be int, long, text, …)
• output key/value type (can be int, long, text, …)
Reduce:
• input key/value type (same as mapper’s output key/value type)
• output key/value type (can be int, long, text, …)
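In code, this constraint shows up in the generic parameters. A sketch with the classic word-count types (class names hypothetical):

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { /* … */ }

// Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
// its KEYIN/VALUEIN must equal the mapper’s KEYOUT/VALUEOUT
public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { /* … */ }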

Page 6:

Map and Reduce Function

• map task: call mapper class’s map function to map input key/value pair to a set of intermediate key/value pairs. (k1, v1) -> list (k2, v2)
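A standard example (word count, not from the slides): the map function turns (byte offset, line) into a list of (word, 1) pairs.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);   // emit (k2, v2) = (word, 1)
    }
  }
}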


Page 7:

Reduce Function

• reduce task: call reducer class’s reduce function to process: (k2, list(v2)) -> list (k3,v3)
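Continuing the word-count example: the reduce function collapses (word, list of counts) into (word, total).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {   // (k2, list(v2))
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));   // emit (k3, v3)
  }
}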


Page 8:

Partition Function

• partition: (k2, v2) -> integer
• works on the intermediate key k2, and returns the partition index
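A custom Partitioner implements exactly this (k2, v2) -> integer mapping. A hypothetical sketch that routes keys to reducers by their first character; it would be registered with job.setPartitionerClass(FirstCharPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical: all keys starting with the same character go to the same reducer
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (key.getLength() == 0) {
      return 0;
    }
    // must return a value in [0, numReduceTasks)
    return (key.charAt(0) & Integer.MAX_VALUE) % numReduceTasks;
  }
}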


Page 9:

a MapReduce Job

Class Job: org.apache.hadoop.mapreduce.Job
All Implemented Interfaces: JobContext, org.apache.hadoop.mapreduce.MRJobConfig

public class Job extends org.apache.hadoop.mapreduce.task.JobContextImpl implements JobContext

The job submitter's view of the Job.

It allows the user to

• configure the job: the set*() methods only work until the job is submitted; afterwards they will throw an IllegalStateException.

• submit it,

• control its execution,

• and query the state.

Normally the user creates the application, describes various facets of the job via Job, and then submits the job and monitors its progress.


Page 10:

a MapReduce Job

Here is an example of how to submit a job:

// Create a new Job
Job job = Job.getInstance();
job.setJarByClass(MyJob.class);

// Specify various job-specific parameters
job.setJobName("myjob");

// Input/output paths are set via the file input/output format classes
// (Job itself has no setInputPath/setOutputPath methods)
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));

job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);

// Submit the job, then poll for progress until the job is complete
job.waitForCompletion(true);


Page 11:


Mapper class's KEYIN must be consistent with the job's input format class (mapreduce.job.inputformat.class)
Mapper class's KEYOUT must be consistent with the map output key class (mapreduce.map.output.key.class) …
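In code, these constraints are what the Job setters pin down. A sketch continuing the page-10 example, assuming word-count types:

job.setInputFormatClass(TextInputFormat.class); // produces (LongWritable, Text): mapper's KEYIN/VALUEIN
job.setMapOutputKeyClass(Text.class);           // must equal the mapper's KEYOUT
job.setMapOutputValueClass(IntWritable.class);  // must equal the mapper's VALUEOUT
job.setOutputKeyClass(Text.class);              // must equal the reducer's KEYOUT
job.setOutputValueClass(IntWritable.class);     // must equal the reducer's VALUEOUT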

Page 12:

a minimal MapReduce Job

• Try to run the minimal MapReduce job

• Compare it with the …WithDefaults job


Page 13:

Job: default settings
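Written out explicitly, the default settings amount to job configuration like the following sketch (this mirrors the …WithDefaults example referenced on page 12; the individual defaults are covered on the next pages):

// The implicit defaults of a minimal job, made explicit
job.setInputFormatClass(TextInputFormat.class);   // (LongWritable offset, Text line)
job.setMapperClass(Mapper.class);                 // identity mapper
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setPartitionerClass(HashPartitioner.class);   // hash key -> reducer
job.setNumReduceTasks(1);                         // a single reducer
job.setReducerClass(Reducer.class);               // identity reducer
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class); // tab-separated text, one record per line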


Page 14:

Default InputFormatClass

• Default input format is TextInputFormat, which produces one (key,value) pair for each line in the text file

• keys of type LongWritable, the offset of the beginning of the line in the file

• values of type Text, the line of text.

• Other InputFormat classes? (a few alternatives are sketched below)
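Some built-in alternatives, all in org.apache.hadoop.mapreduce.lib.input (only one would be set on a given job):

job.setInputFormatClass(KeyValueTextInputFormat.class);
// KeyValueTextInputFormat: splits each line into (Text key, Text value) at a separator (tab by default)
// SequenceFileInputFormat.class: reads binary sequence files of (key, value) records
// NLineInputFormat.class: gives each map task exactly N lines of input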


Page 15:

Default Mapper

Full name: org.apache.hadoop.mapreduce.Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

Mapper is a generic type, which allows it to work with any key or value types.


an identity function: simply write input key, value pair to output
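Concretely, the default map() in Mapper is essentially the identity function:

protected void map(KEYIN key, VALUEIN value, Context context)
    throws IOException, InterruptedException {
  // identity: pass each input pair straight through to the output
  context.write((KEYOUT) key, (VALUEOUT) value);
}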

Page 16:

Mapper: run

• Expert users can override this method for more complete control over the execution of the Mapper (when they write their own mapper class, e.g., MaximumTemperatureMapper.java)


public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
  } finally {
    cleanup(context);
  }
}

Page 17:

Default Partitioner

Partitioner controls the partitioning of the keys of the intermediate map-outputs.

• The key (or a subset of the key) is used to derive the partition, typically by a hash function.

• The total number of partitions is the same as the number of reduce tasks for the job.

• Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.

Default Partitioner: HashPartitioner, which hashes a record’s key to determine which partition the record belongs in. Each partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job:

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

The key’s hash code is turned into a nonnegative integer by bitwise ANDing it with the largest integer value. It is then reduced modulo the number of partitions to find the index of the partition that the record belongs in.
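The AND matters because hashCode() may be negative, and Java’s % operator preserves the sign of its left operand. A small plain-Java demonstration (not Hadoop code):

public class HashDemo {
  public static void main(String[] args) {
    int hash = "example-key".hashCode();   // may be any int, possibly negative
    int numReduceTasks = 7;
    int unsafe = hash % numReduceTasks;                      // can be negative
    int safe = (hash & Integer.MAX_VALUE) % numReduceTasks;  // always in [0, 7)
    System.out.println(hash + " -> unsafe=" + unsafe + ", safe=" + safe);
  }
}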


Page 18:

Default Reducer

• Default reducer is Reducer, again a generic type, which simply writes all its input to its output:
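The default reduce() in Reducer is essentially the identity, forwarding every value under its key:

protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
    throws IOException, InterruptedException {
  // identity: write each input value out under the same key
  for (VALUEIN value : values) {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}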


Page 19:

Default OutputFormat

• Default output format TextOutputFormat writes out records, one per line

• converting keys and values to strings
• separating them with a tab character
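The separator is configurable. A sketch, using the Hadoop 2 property name:

// Change TextOutputFormat's key/value separator from a tab to a comma
job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");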


Page 20:

Exercises

• If we want to see the output of the mapper class …

• If we want the input to be parsed differently to generate the input <key,value> pairs?
• for example, if the input data is formatted as below (a sketch of one approach follows):

1901, 234
1902, 340
…
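One way to approach both exercises (a sketch, not the slides' solution): run with zero reduce tasks to see the mapper's raw output, and use KeyValueTextInputFormat with a comma separator to parse each line into a (Text key, Text value) pair:

// Exercise 1: with zero reducers, map output is written directly to the output files
job.setNumReduceTasks(0);

// Exercise 2: parse "1901, 234" into key "1901" and value " 234" (note the leading space)
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.getConfiguration().set(
    "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");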


Page 21:

Outline

• MapReduce Types
• default types
• Serialization/Deserialization in Hadoop
• partition class and number of reducer tasks
• control: choosing number of reducers, or how to partition keys …
• Reading input: InputFormat
• Input Splits and Records
• Writing output: OutputFormat
• TextOutputFormat
• Default streaming jobs

Page 22:

Default Streaming Job

• A more stripped-down streaming job (see the sketch below)

• Equivalently, with the defaults spelled out
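The commands themselves are not in the transcript; a minimal streaming invocation of the kind described might look like this (jar location and input/output paths are assumptions):

% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input/sample.txt \
    -output output \
    -mapper /bin/cat

And the equivalent command with the defaults made explicit (these are the streaming defaults, which use the older org.apache.hadoop.mapred API):

% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input/sample.txt \
    -output output \
    -inputformat org.apache.hadoop.mapred.TextInputFormat \
    -mapper /bin/cat \
    -partitioner org.apache.hadoop.mapred.lib.HashPartitioner \
    -numReduceTasks 1 \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -outputformat org.apache.hadoop.mapred.TextOutputFormat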


Page 23:

Usage of streaming separators


Page 24:


Page 25:

Input Splits

• input split: a chunk of input that is processed by a single map
• length: largest split processed first …
• a set of storage locations (to help schedule map tasks close to data)
• record: each split is divided into records
• mapper class is called to process each record, a key-value pair
• mapper does not have global state about the whole input
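In the API, these two properties are exactly what org.apache.hadoop.mapreduce.InputSplit exposes:

public abstract class InputSplit {
  /** Size of the split in bytes; used to order splits so the largest run first. */
  public abstract long getLength() throws IOException, InterruptedException;

  /** Hostnames where the split's data is stored; used for data-local scheduling. */
  public abstract String[] getLocations() throws IOException, InterruptedException;
}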