map reduce algorithms › cs590d › lectures › presentation8.pdf · map reduce algorithms david...

20
Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna Saha

Upload: others

Post on 05-Jul-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Map Reduce Algorithms

David Wemhoener

Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna Saha

Page 2: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna
Page 3: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna
Page 4: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Example

Page 5: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Example

Page 6: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Word Counting

map(String key, String value): // key: document name, // value: document contents

for each word w in value: EmitIntermediate(w, "1");

reduce(String key, Iterator values):// key: a word // values: a list of counts

int word_count = 0;for each v in values:word_count += ParseInt(v);

Emit(key, AsString(word_count));

6

MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

Page 7: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Word Counting

• Have a very large document

7

Page 8: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

What else can we do?

• Reverse Web-Link Graph

8

MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

Page 9: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

What else can we do?

• Reverse Web-Link Graph

– Map Function: source → <target, source> pairs

9

MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

Page 10: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

What else can we do?

• Reverse Web-Link Graph

– Map Function: source → <target, source> pairs

– Reduce Function: <target, source> pairs → <target, list(source)>

10

MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

Page 11: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

What else can we do?

• Reverse Web-Link Graph

– Map Function: source → <target, source> pairs

– Reduce Function: <target, source> pairs → <target, list(source)>

• Inverted Index

11

MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

Page 12: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

What else can we do?

• Reverse Web-Link Graph

– Map Function: source → <target, source> pairs

– Reduce Function: <target, source> pairs → <target, list(source)>

• Inverted Index

– Map Function: document → <word, doc ID> pairs

12

MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

Page 13: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

What else can we do?

• Reverse Web-Link Graph

– Map Function: source → <target, source> pairs

– Reduce Function: <target, source> pairs → <target, list(source)>

• Inverted Index

– Map Function: document → <word, doc ID> pairs

– Reduce Function: <word, doc ID> pairs → <word, list(doc ID)>

13

MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

Page 14: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Example: Host size

• Suppose we have a large web corpus

• Look at the metadata file

– Lines of the form: (URL, size, date, …)

• For each host, find the total number of bytes

– That is, the sum of the page sizes for all URLs from that particular host

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

14

Page 15: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Example: Join By Map-Reduce• Compute the natural join R(A,B) ⋈ S(B,C)

• R and S are each stored in files

• Tuples are pairs (a,b) or (b,c)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

http://www.mmds.org15

A B

a1 b1

a2 b1

a3 b2

a4 b3

B C

b2 c1

b2 c2

b3 c3

⋈A C

a3 c1

a3 c2

a4 c3

=

R

S

Page 16: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Map-Reduce Join

• A Map process turns:

– Each input tuple R(a,b) into key-value pair (b,(a,R))

– Each input tuple S(b,c) into (b,(c,S))

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

http://www.mmds.org16

Page 17: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Map-Reduce Join

• A Map process turns:

– Each input tuple R(a,b) into key-value pair (b,(a,R))

– Each input tuple S(b,c) into (b,(c,S))

• Map processes send each key-value pair with key b to Reduce process h(b)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

http://www.mmds.org17

Page 18: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Map-Reduce Join

• A Map process turns:

– Each input tuple R(a,b) into key-value pair (b,(a,R))

– Each input tuple S(b,c) into (b,(c,S))

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

http://www.mmds.org18

Page 19: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Example

Page 20: Map Reduce Algorithms › cs590d › lectures › Presentation8.pdf · Map Reduce Algorithms David Wemhoener Acknowledgement: Some slides are taken from Sergei Vassilivski, Barna

Example