MapReduce Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015

Post on 21-Jan-2016

MapReduce

Computer Engineering Department
Distributed Systems Course

Assoc. Prof. Dr. Ahmet Sayar
Kocaeli University - Fall 2015

What does Scalable Mean?

• Operationally
– In the past: works even if data does not fit in main memory
– Now: can make use of 1000s of cheap computers

• Algorithmically
– In the past: if you have N data items, you must do no more than N^m operations (polynomial time algorithms)
– Now: if you have N data items, you should do no more than N^m / k operations, for some large k

• Polynomial time algorithms must be parallelized

Example: Find matching DNA Sequences

• Given a set of sequences
• Find all sequences equal to “GATTACGATTATTA”

Sequential (Linear) search

• Time = 0
• Match? NO

Sequential (Linear) search

• 40 records, 40 comparisons
• N records, N comparisons
• The algorithmic complexity is order N: O(N)

What if Sorted Sequences?

• GATATTTTAAGC < GATTACGATTATTA
• No match: keep searching in the other half…
• O(log N)
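The halving step above can be sketched with Python's standard `bisect` module. This is a minimal illustration, assuming the sequences are already held in a sorted in-memory list; the helper name `find_sequence` and the sample data are hypothetical.

```python
from bisect import bisect_left

def find_sequence(sorted_seqs, target):
    """Return the index of target in sorted_seqs, or -1 if absent (O(log N))."""
    i = bisect_left(sorted_seqs, target)  # binary search for the insertion point
    if i < len(sorted_seqs) and sorted_seqs[i] == target:
        return i
    return -1

# Hypothetical sample data; the list must be sorted before searching.
sequences = sorted(["GATATTTTAAGC", "GATTACGATTATTA", "CCGTA", "TTAGGC"])
hit = find_sequence(sequences, "GATTACGATTATTA")
miss = find_sequence(sequences, "AAAA")
```

Sorting costs O(N log N) once, so the index only pays off when many lookups follow.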

New Task: Read Trimming

• Given a set of DNA sequences
• Trim the final n bps of each sequence
• Generate a new dataset

• Can we use an index?
– No, we have to touch every record no matter what.
– O(N)

• Can we do any better?

Parallelization

O(?)
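One way to read the open question: since each read is trimmed independently, N reads can be split across k workers, giving roughly O(N / k) work per worker. The sketch below is only an illustration under that assumption; the names `trim_read`, `trim_chunk`, and `parallel_trim` are hypothetical, and a thread pool stands in for what would be separate machines in a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def trim_read(read, n=3):
    """Drop the final n base pairs of one read."""
    return read[:-n] if n > 0 else read

def trim_chunk(chunk):
    """Process one worker's share of the reads."""
    return [trim_read(r) for r in chunk]

def parallel_trim(reads, k):
    # Split N reads into at most k roughly equal chunks, one per worker.
    size = max(1, -(-len(reads) // k))  # ceiling division
    chunks = [reads[i:i + size] for i in range(0, len(reads), size)]
    with ThreadPoolExecutor(max_workers=k) as pool:
        results = pool.map(trim_chunk, chunks)  # results keep chunk order
        return [r for chunk in results for r in chunk]

reads = ["GATTACGATTATTA", "GATATTTTAAGC", "CCGTAAC", "TTAGGCA"]  # sample data
trimmed = parallel_trim(reads, k=2)
```

In CPython, threads show the structure rather than a real speedup for CPU-bound work; the point is that no chunk depends on any other.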

New Task: Convert 405K TIFF Images to PNG

Another Example: Computing Word Frequency of Every Word in a Single document

There is a pattern here …

• A function that maps a read to a trimmed read
• A function that maps a TIFF image to a PNG image
• A function that maps a document to its most common word
• A function that maps a document to a histogram of word frequencies.

Compute Word Frequency Across all Documents

(word, count)

• How to split things into pieces
– How to write map and reduce

MAP REDUCE

Map Reduce

• Map-reduce: high-level programming model and implementation for large-scale data processing.

• Programming model is based on functional programming
– Every record is assumed to be in the form of <key, value> pairs.

• Google: paper published 2004
• Free variant: Hadoop (Java, Apache)

Example: Upper-case Mapper in ML
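The slide's ML code did not survive in this transcript. As a stand-in, here is a minimal Python sketch of what an upper-case mapper might look like, assuming it upper-cases both key and value; the function name is hypothetical.

```python
def upper_case_mapper(key, value):
    """Emit exactly one pair, with key and value upper-cased."""
    yield (key.upper(), value.upper())

result = list(upper_case_mapper("foo", "bar"))
```

A mapper is written as a generator here because, in general, it may emit zero, one, or many pairs per input.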

Example: Explode Mapper
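The original code for this slide is missing from the transcript. A plausible reading of "explode" is a mapper that emits one pair per character of the value; the sketch below assumes that interpretation and uses a hypothetical function name.

```python
def explode_mapper(key, value):
    """Emit one (key, character) pair for each character in value."""
    for ch in value:
        yield (key, ch)

result = list(explode_mapper("A", "cats"))
```

This shows that a single input record can fan out into many intermediate pairs.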

Example: Filter Mapper
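The slide's code is not in the transcript. As an illustration, a filter mapper emits its input pair only when a predicate holds; the sketch below assumes a primality test as the predicate, which is a choice made for this example and not confirmed by the source.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small integers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def filter_mapper(key, value):
    """Pass the pair through only if value is prime; otherwise emit nothing."""
    if is_prime(value):
        yield (key, value)

kept = list(filter_mapper("foo", 7))
dropped = list(filter_mapper("test", 10))
```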

Example: Chaining Keyspaces

• Output key is int
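The slide only states that the output key is an int. One way to picture a mapper whose output keyspace differs from its input keyspace is to key each value by its length; that choice, and the function name, are assumptions made for this sketch.

```python
def length_keyed_mapper(key, value):
    """Input key is a string; output key is an int (the value's length)."""
    yield (len(value), value)

result = list(length_keyed_mapper("hi", "test"))
```

Because the output key type differs from the input key type, the output of one job can feed a second job that groups on the new key.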

Data Model

• Files
• A File = a bag of (key, value) pairs

• A map-reduce program:
– Input: a bag of (inputkey, value) pairs
– Output: a bag of (outputkey, value) pairs

Step 1: Map Phase

• User provides the Map function:
– Input: (input key, value)
– Output: bag of (intermediate key, value)

• System applies the map function in parallel to all (input key, value) pairs in the input file

Step 2: Reduce Phase

• User provides the Reduce function:
– Input: (intermediate key, bag of values)
– Output: bag of output (values)

• The system groups all pairs with the same intermediate key and passes the bag of values to the reduce function
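The grouping step that the system performs between map and reduce can be sketched in a few lines. This is a single-process illustration, assuming all intermediate pairs fit in memory; the function name `shuffle` is hypothetical.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group intermediate (key, value) pairs into {key: [values...]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

grouped = shuffle([("a", 1), ("b", 1), ("a", 1)])
```

Each entry of `grouped` is what one call to the reduce function would receive.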

Reduce

• After the map phase is over, all the intermediate values for a given output key are combined together into a list

• Reduce() combines those intermediate values into one or more final values for that same output key

• In practice, usually only one final value per key

Example: Sum Reducer
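The code on this slide is missing from the transcript. A sum reducer, as the name suggests, adds up all values that share a key; the following is a minimal sketch with a hypothetical function name.

```python
def sum_reducer(key, values):
    """Emit one pair: the key and the total of its intermediate values."""
    yield (key, sum(values))

result = list(sum_reducer("A", [42, 100, 312]))
```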

In summary

• Input and output: each a set of key/value pairs
• Programmer specifies two functions

• Map(in_key, in_value) -> list(out_key, intermediate_value)
– Processes an input key/value pair
– Produces a set of intermediate pairs

• Reduce(out_key, list(intermediate_value)) -> list(out_value)
– Combines all intermediate values for a particular key
– Produces a set of merged output values (usually just one)

• Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell

Example: What does this do?

• Word count application of map reduce
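The code the slide asks about is not in the transcript. A conventional word count in the map-reduce style looks roughly like the sketch below, which runs map, group, and reduce in a single process; `run_mapreduce` is a hypothetical stand-in for the framework, not Hadoop's API, and the documents are sample data.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit (word, 1) for every word in the document."""
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Sum the ones emitted for a single word."""
    yield (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: apply map, group by intermediate key, apply reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

docs = [("d1", "the cat sat"), ("d2", "the dog sat")]
counts = dict(run_mapreduce(docs, map_fn, reduce_fn))
```

The input key (`doc_id`) is ignored by the mapper, which is common: the intermediate key is what drives the grouping.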

Example: Word Length Histogram

Example: Word Length Histogram

• Big = Yellow = 10+letters

• Medium = Red = 5..9 letters

• Small = Blue = 2..4 letters

• Tiny = Pink = 1 letter
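The four size classes above translate directly into a mapper that emits (class, 1) and a reducer that sums per class. The sketch below is one possible rendering; the class labels, function names, and sample sentence are assumptions, and a `Counter` plays the role of the group-and-sum step.

```python
from collections import Counter

def size_class(word):
    """Bucket a word by length using the thresholds from the slide."""
    n = len(word)
    if n >= 10:
        return "big"
    if n >= 5:
        return "medium"
    if n >= 2:
        return "small"
    return "tiny"

def map_fn(doc_id, text):
    """Emit (size class, 1) for each word."""
    for word in text.split():
        yield (size_class(word), 1)

# Shuffle and sum-reduce collapsed into one Counter for brevity.
histogram = Counter()
for cls, one in map_fn("d1", "a distributed filesystem stores very large files"):
    histogram[cls] += one
```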

More Examples: Building an Inverted Index

• Input

• Tweet1, (“I love pancakes for breakfast”)

• Tweet2, (“I dislike pancakes”)

• Tweet3, (“what should I eat for breakfast”)

• Tweet4, (“I love to eat”)

• Desired output

• “pancakes”, (tweet1, tweet2)

• “breakfast”, (tweet1, tweet3)

• “eat”, (tweet3, tweet4)

• “love”, (tweet1, tweet4)
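The inverted index above can be produced with a mapper that emits (word, tweet id) and a reducer that collects the ids per word. This is a minimal single-process sketch using the slide's four tweets; the function names are hypothetical and tokenization is a plain whitespace split.

```python
from collections import defaultdict

tweets = [
    ("tweet1", "I love pancakes for breakfast"),
    ("tweet2", "I dislike pancakes"),
    ("tweet3", "what should I eat for breakfast"),
    ("tweet4", "I love to eat"),
]

def map_fn(tweet_id, text):
    """Emit (word, tweet_id) once per distinct word in the tweet."""
    for word in set(text.split()):
        yield (word, tweet_id)

def reduce_fn(word, tweet_ids):
    """Emit the word with the sorted list of tweets containing it."""
    yield (word, sorted(tweet_ids))

# Group intermediate pairs by word (the shuffle step).
groups = defaultdict(list)
for tweet_id, text in tweets:
    for word, tid in map_fn(tweet_id, text):
        groups[word].append(tid)

index = dict(pair for word, ids in groups.items() for pair in reduce_fn(word, ids))
```

The full index also contains words the slide omits, such as "I", which maps to all four tweets.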

More Examples: Relational Joins

Relational Join MapReduce: Before Map Phase

Relational Join MapReduce: Map Phase

Relational Join MapReduce: Reduce Phase
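The figures for the join slides are missing from the transcript. A common way to express a relational join in map-reduce is a reduce-side join: the mapper keys each tuple by the join attribute and tags it with its source relation, and the reducer pairs up tuples from the two relations. The tables, column names, and function names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical relations: Employee(name, ssn) and Department(ssn, dept).
employees = [("Sue", "999"), ("Tony", "777")]
departments = [("999", "Accounts"), ("777", "Sales"), ("777", "Marketing")]

def map_employee(row):
    name, ssn = row
    return (ssn, ("Employee", name))    # key on the join attribute, tag the source

def map_department(row):
    ssn, dept = row
    return (ssn, ("Department", dept))

def reduce_join(ssn, tagged):
    """Pair every Employee tuple with every Department tuple for this key."""
    names = [v for tag, v in tagged if tag == "Employee"]
    depts = [v for tag, v in tagged if tag == "Department"]
    for name in names:
        for dept in depts:
            yield (name, ssn, dept)

mapped = [map_employee(r) for r in employees] + [map_department(r) for r in departments]
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

joined = sorted(row for ssn, vals in groups.items() for row in reduce_join(ssn, vals))
```

Keys that appear in only one relation produce no output, which matches inner-join semantics.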

Relational Join in MapReduce, Another Example

MAP:REDUCE:

Simple Social Network Analysis: Count Friends

MAP SHUFFLE
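The diagram for this slide is missing. Counting friends fits the same pattern as word count: assuming the input is a list of directed (person, friend) edges, the mapper emits (person, 1) and the reducer sums. The edge list and names below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical friendship edges, stored once per direction.
edges = [
    ("Jim", "Sue"), ("Sue", "Jim"),
    ("Lin", "Joe"), ("Joe", "Lin"),
    ("Jim", "Kai"), ("Kai", "Jim"),
    ("Jim", "Lin"), ("Lin", "Jim"),
]

def map_fn(person, friend):
    """Emit (person, 1) for each friendship edge leaving person."""
    yield (person, 1)

def reduce_fn(person, ones):
    """Sum the ones to get the friend count."""
    yield (person, sum(ones))

groups = defaultdict(list)          # shuffle: group the ones by person
for person, friend in edges:
    for key, one in map_fn(person, friend):
        groups[key].append(one)

friend_counts = dict(pair for p, ones in groups.items() for pair in reduce_fn(p, ones))
```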

Taxonomy of Parallel Architectures

Cluster Computing

• Large number of commodity servers, connected by a high-speed commodity network

• Rack: holds a small number of servers
• Data center: holds many racks
• Massive parallelism
– 100s or 1000s of servers
– Jobs run for many hours

• Failure
– If the mean time between failures (MTBF) is 1 year per server
– Then 1000 servers see a failure roughly every 9 hours (10,000 servers: about one per hour)
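The failure arithmetic can be checked directly. Assuming servers fail independently and each has an MTBF of one year, the expected interval between failures somewhere in the cluster is the per-server MTBF divided by the number of servers. The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def hours_between_failures(n_servers, mtbf_hours=HOURS_PER_YEAR):
    """Expected hours between failures across the cluster (independent failures assumed)."""
    return mtbf_hours / n_servers

h_1000 = hours_between_failures(1000)
h_10000 = hours_between_failures(10000)
```

With these assumptions, 1000 servers give about 8.76 hours between failures, and it takes roughly 10,000 servers to reach about one failure per hour.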

Distributed File System (DFS)

• For very large files: TBs, PBs
• Each file is partitioned into chunks, typically 64MB
• Each chunk is replicated several times (>2), on different racks, for fault tolerance
• Implementations:
– Google’s DFS: GFS, proprietary
– Hadoop’s DFS: HDFS, open source
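To get a feel for the numbers, the sketch below computes how many chunks a file occupies and how many chunk copies the cluster stores. It assumes 64 MiB chunks and a replication factor of 3; the function name and the 1 TiB example file are hypothetical.

```python
def chunk_plan(file_bytes, chunk_bytes=64 * 1024**2, replicas=3):
    """Return (number of chunks, number of stored chunk copies)."""
    chunks = -(-file_bytes // chunk_bytes)  # ceiling division, integer-only
    return chunks, chunks * replicas

chunks, stored_copies = chunk_plan(1024**4)  # a 1 TiB file
```

A file that is not an exact multiple of the chunk size still needs one final, partially filled chunk, which is why the division rounds up.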

HDFS: Motivation

• Based on Google’s GFS
• Redundant storage of massive amounts of data on cheap and unreliable computers
• Why not use an existing file system?
– Different workload and design priorities
– Handles much bigger dataset sizes than other file systems

Assumptions

• High component failure rates
– Inexpensive commodity components fail all the time

• Modest number of HUGE files
– Just a few million
– Each is 100MB or larger; multi-GB files typical

• Files are write-once, mostly appended to
– Perhaps concurrently

• Large streaming reads
• High sustained throughput favored over low latency

HDFS Design Decisions

• Files are stored as blocks
– Much larger size than most filesystems (default is 64MB)

• Reliability through replication
– Each block replicated across 3+ DataNodes

• Single master (NameNode) coordinates access, metadata
– Simple centralized management

• No data caching
– Little benefit due to large data sets, streaming reads

• Familiar interface, but customized API
– Simplify the problem; focus on distributed apps

Based on GFS Architecture

References

• https://class.coursera.org/datasci-001/lecture
• https://www.youtube.com/watch?v=xWgdny19yQ4