mapreduce - bowdoin

5
Sean Barker MapReduce 1 Sean Barker Programming Model map (in_key, in_value) -> (out_key, intermediate_value) list reduce (out_key, intermediate_value list) -> out_value list 2

Upload: others

Post on 21-Apr-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: MapReduce - Bowdoin

Sean Barker

MapReduce

1

Sean Barker

Programming Model

map (in_key, in_value) -> (out_key, intermediate_value) list

reduce (out_key, intermediate_value list) -> out_value list

2

Page 2: MapReduce - Bowdoin

Sean Barker

MapReduce Control Flow

3

3/20/14

3

13"

Data store 1 Data store nmap

(key 1, values ...)

(key 2, values ...)

(key 3, values ...)

map

(key 1, values...)

(key 2, values ...)

(key 3, values ...)

Input key *value pairs

Input key*value pairs

== Barrier == : Aggregates intermediate values by output key

reduce reduce reduce

key 1, intermediate

values

key 2, intermediate

values

key 3, intermediate

values

final key 1 values

final key 2 values

final key 3 values

...

14"

Parallelism"

•  map() functions run in parallel, creating different intermediate values from different input data sets"

•  reduce() functions also run in parallel, each working on a different output key"

•  All values are processed independently!•  Bottleneck: reduce phase can’t start until map

phase is completely finished."

15"

Example: Count word occurrences"map(String input_key, String input_value):

// input_key: document name // input_value: document contents

for each word w in input_value:

EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):

// output_key: a word

// output_values: a list of counts

int result = 0; for each v in intermediate_values:

result += ParseInt(v);

Emit(AsString(result)); 16"

Example vs. Actual Source Code"

•  Example is written in pseudo-code"•  Actual implementation is in C++, using a

MapReduce library"•  Bindings for Python and Java exist via

interfaces"

•  True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.)"

17"

Locality"

•  Master program divvies up tasks based on location of data: tries to have map() tasks on same machine as physical file data, or at least same rack"

•  map() task inputs are divided into 64 MB blocks: same size as Google File System chunks""

18"

Fault Tolerance"

•  Master detects worker failures"•  Re-executes completed & in-progress map() tasks"•  Re-executes in-progress reduce() tasks"

•  Master notices particular input key/values cause crashes in map(), and skips those values on re-execution."•  Effect: Can work around bugs in third-party

libraries!"

Sean Barker

Word Count Example

4

3/20/14

3

13"

Data store 1 Data store nmap

(key 1, values ...)

(key 2, values ...)

(key 3, values ...)

map

(key 1, values...)

(key 2, values ...)

(key 3, values ...)

Input key *value pairs

Input key*value pairs

== Barrier == : Aggregates intermediate values by output key

reduce reduce reduce

key 1, intermediate

values

key 2, intermediate

values

key 3, intermediate

values

final key 1 values

final key 2 values

final key 3 values

...

14"

Parallelism"

•  map() functions run in parallel, creating different intermediate values from different input data sets"

•  reduce() functions also run in parallel, each working on a different output key"

•  All values are processed independently!•  Bottleneck: reduce phase can’t start until map

phase is completely finished."

15"

Example: Count word occurrences"map(String input_key, String input_value):

// input_key: document name // input_value: document contents

for each word w in input_value:

EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):

// output_key: a word

// output_values: a list of counts

int result = 0; for each v in intermediate_values:

result += ParseInt(v);

Emit(AsString(result)); 16"

Example vs. Actual Source Code"

•  Example is written in pseudo-code"•  Actual implementation is in C++, using a

MapReduce library"•  Bindings for Python and Java exist via

interfaces"

•  True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.)"

17"

Locality"

•  Master program divvies up tasks based on location of data: tries to have map() tasks on same machine as physical file data, or at least same rack"

•  map() task inputs are divided into 64 MB blocks: same size as Google File System chunks""

18"

Fault Tolerance"

•  Master detects worker failures"•  Re-executes completed & in-progress map() tasks"•  Re-executes in-progress reduce() tasks"

•  Master notices particular input key/values cause crashes in map(), and skips those values on re-execution."•  Effect: Can work around bugs in third-party

libraries!"

Page 3: MapReduce - Bowdoin

Sean Barker

Stragglers

5

3/20/14

6

31"

Bullet"

•  Overlay based file distribution"•  30+ seconds for 50/90/130 hosts to connect"

0"

20"

40"

60"

80"

100"

120"

0" 5" 10" 15" 20" 25" 30"

Nu

mb

er

of

Ho

sts!

Elapsed time (sec)!

50 hosts"90 hosts"

130 hosts"

32"

EMAN"

•  Electron Micrograph Analysis"•  2700+ seconds to complete 98 tasks on 98 hosts"

0"

10"

20"

30"

40"

50"

60"

70"

80"

90"

0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800"

Tas

ks!

Elapsed time (sec)!

Completed tasks"

33"

MapReduce"

•  Application-specific data processing"•  2500+ seconds to complete 480 map tasks on 30 hosts"

0"

50"

100"

150"

200"

250"

300"

350"

400"

450"

0" 50" 100" 150" 200" 250" 300" 350" 400" 450" 500"

Tas

ks!

Elapsed time (sec)!

Completed tasks"

34"

Dealing with Stragglers"•  In MapReduce, application developers explicitly dealt with

stragglers"•  Application-specific algorithm detected slow hosts"

•  Reallocated work from slow hosts to fast hosts"

0"

50"

100"

150"

200"

250"

300"

350"

400"

450"

0" 100" 200" 300" 400" 500"

Tas

ks!

Elapsed time (sec)!

Completed tasks"

Work reallocated"

35"

0"

20"

40"

60"

80"

100"

120"

0" 5" 10" 15" 20" 25" 30"

Nu

mb

er

of

Ho

sts!

Elapsed time (sec)!

50 hosts"90 hosts"

130 hosts" 0"

10" 20" 30" 40" 50" 60" 70" 80" 90"

0" 400" 800" 1200" 1600"

Tas

ks!

Elapsed time (sec)!

Completed tasks"

Dealing with Stragglers"•  Need a general technique for detecting stragglers in

distributed applications"•  Ease developers of burden of handling stragglers separately for each

application"•  Detect “knee” of curve and adjust application"

Bullet" EMAN"

36"

Application Characteristics"

•  Bullet, EMAN, and MapReduce belong to a specific class of applications "•  Support mid-computation reconfiguration"

•  Support dynamically degrading computation"

•  Some applications do not support reconfiguration"•  Degrading computation may reduce accuracy"

•  Require specific number of hosts"

•  For applications that support reconfiguration, we can improve performance by decreasing completion time using partial barriers"

Sean Barker

Stragglers

6

3/20/14

6

31"

Bullet"

•  Overlay based file distribution"•  30+ seconds for 50/90/130 hosts to connect"

0"

20"

40"

60"

80"

100"

120"

0" 5" 10" 15" 20" 25" 30"

Nu

mb

er

of

Ho

sts!

Elapsed time (sec)!

50 hosts"90 hosts"

130 hosts"

32"

EMAN"

•  Electron Micrograph Analysis"•  2700+ seconds to complete 98 tasks on 98 hosts"

0"

10"

20"

30"

40"

50"

60"

70"

80"

90"

0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800"

Tas

ks!

Elapsed time (sec)!

Completed tasks"

33"

MapReduce"

•  Application-specific data processing"•  2500+ seconds to complete 480 map tasks on 30 hosts"

0"

50"

100"

150"

200"

250"

300"

350"

400"

450"

0" 50" 100" 150" 200" 250" 300" 350" 400" 450" 500"

Tas

ks!

Elapsed time (sec)!

Completed tasks"

34"

Dealing with Stragglers"•  In MapReduce, application developers explicitly dealt with

stragglers"•  Application-specific algorithm detected slow hosts"

•  Reallocated work from slow hosts to fast hosts"

0"

50"

100"

150"

200"

250"

300"

350"

400"

450"

0" 100" 200" 300" 400" 500"

Tas

ks!

Elapsed time (sec)!

Completed tasks"

Work reallocated"

35"

0"

20"

40"

60"

80"

100"

120"

0" 5" 10" 15" 20" 25" 30"

Nu

mb

er

of

Ho

sts!

Elapsed time (sec)!

50 hosts"90 hosts"

130 hosts" 0"

10" 20" 30" 40" 50" 60" 70" 80" 90"

0" 400" 800" 1200" 1600"

Tas

ks!

Elapsed time (sec)!

Completed tasks"

Dealing with Stragglers"•  Need a general technique for detecting stragglers in

distributed applications"•  Ease developers of burden of handling stragglers separately for each

application"•  Detect “knee” of curve and adjust application"

Bullet" EMAN"

36"

Application Characteristics"

•  Bullet, EMAN, and MapReduce belong to a specific class of applications "•  Support mid-computation reconfiguration"

•  Support dynamically degrading computation"

•  Some applications do not support reconfiguration"•  Degrading computation may reduce accuracy"

•  Require specific number of hosts"

•  For applications that support reconfiguration, we can improve performance by decreasing completion time using partial barriers"

Page 4: MapReduce - Bowdoin

Sean Barker

Apache Hadoop

7

– 17 –!

Hadoop Project"File system with files distributed across nodes""

""

""

!  Store multiple (typically 3 copies of each file)""  If one node fails, data still available!

!  Logically, any node has access to any file""  May need to fetch across network!

Map / Reduce programming environment"!  Software manages execution of tasks on nodes"

Local Network

CPU"

Node 1

CPU"

Node 2

CPU"

Node n

• • •

Sean Barker

Programming Model

map (in_key, in_value) -> (out_key, intermediate_value) list

reduce (out_key, intermediate_value list) -> out_value list

8

Page 5: MapReduce - Bowdoin

Sean Barker

Web Log Parsing

9

C B B C

C A 3 C

1 A

Result File 2 File 1 ResultFile 2File 1

“Count the number of times pages matching A|C were fetched”

(e.g., or matching “bowdoin.edu”)

Sean Barker

Web Log Parsing

10

Map tasks:(f1, C) -> [(C, 1)](f1, B) -> [](f1, B) -> [](f1, C) -> [(C, 1)](f2, C) -> [(C, 1)](f2, A) -> [(A, 1)]

Reduce tasks:(A, [1]) -> (A, 1)(C, [1, 1, 1]) -> (C, 3)

C B B C

C A 3 C

1 A

Result File 2 File 1 ResultFile 2File 1