![Page 1: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu](https://reader034.vdocuments.net/reader034/viewer/2022050417/5f8d80feff950450d478455e/html5/thumbnails/1.jpg)
MapReduce: Simplified Data Processing on Large Clusters
J. Dean, S. Ghemawat, OSDI, 2004.
Review by Mariana Marasoiu for R212
Motivation: Large scale data processing
We want to:
Extract data from large datasets
Run on big clusters of computers
Be easy to program
Solution: MapReduce
A new programming model: Map & Reduce
Provides:
Automatic parallelization and distribution
Fault tolerance
I/O scheduling
Status and monitoring
Map
map (in_key, in_value) → list(out_key, intermediate_value)
Input:
(1, you are in Cambridge)
(2, I like Cambridge)
(3, we live in Cambridge)
Output:
(you, 1)(are, 1)(in, 1)(Cambridge, 1)
(I, 1)(like, 1)(Cambridge, 1)
(we, 1)(live, 1)(in, 1)(Cambridge, 1)
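The map step of the word-count example on this slide can be sketched in Python. This is only an illustration of the signature above, not the paper's interface: the name `map_word_count` and the choice to split lines on whitespace are assumptions.

```python
from typing import Iterator, Tuple


def map_word_count(in_key: int, in_value: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word in one input record.

    in_key is the line number (unused here); in_value is the line text.
    """
    for word in in_value.split():
        yield (word, 1)


# The three input records from the slide.
records = [
    (1, "you are in Cambridge"),
    (2, "I like Cambridge"),
    (3, "we live in Cambridge"),
]

# Each record is mapped independently of the others, which is what lets the
# map phase run in parallel.
map_outputs = [list(map_word_count(key, value)) for key, value in records]
for pairs in map_outputs:
    print(pairs)
```

Because no call depends on another record, the three calls could run on three different machines without coordination.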
Partition and Reduce
Map output:
(you, 1)(are, 1)(in, 1)(Cambridge, 1)
(I, 1)(like, 1)(Cambridge, 1)
(we, 1)(live, 1)(in, 1)(Cambridge, 1)
Partition (pairs grouped by key):
(we, 1)
(you, 1)
(live, 1)
(are, 1)
(Cambridge, 1)(Cambridge, 1)(Cambridge, 1)
(in, 1)(in, 1)
(I, 1)
(like, 1)
Reduce output:
(you, 1)
(are, 1)
(in, 2)
(Cambridge, 3)
(I, 1)
(like, 1)
(we, 1)
(live, 1)
reduce (out_key, list(intermediate_value)) → list(out_value)
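The partition and reduce steps for the same example might look like the following Python sketch. The helper names `group_by_key` and `reduce_word_count` are hypothetical, and the grouping is done in one process here rather than across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Map output for the three example lines (copied from the slide).
intermediate: List[Tuple[str, int]] = [
    ("you", 1), ("are", 1), ("in", 1), ("Cambridge", 1),
    ("I", 1), ("like", 1), ("Cambridge", 1),
    ("we", 1), ("live", 1), ("in", 1), ("Cambridge", 1),
]


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect all intermediate values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_word_count(out_key: str, values: List[int]) -> List[int]:
    """Sum the counts for one word; returns a one-element list(out_value)."""
    return [sum(values)]


grouped = group_by_key(intermediate)
counts = {key: reduce_word_count(key, values)[0] for key, values in grouped.items()}
print(counts)
```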
[Diagram: execution flow]
User Program forks the Master and the workers
Master assigns map tasks and reduce tasks to workers
Input files (File 1, File 2, File 3) are split into M splits (split 0 to split 4)
Map phase: workers read the splits and write intermediate files to their local disks (local write)
Reduce phase: workers fetch the intermediate files (remote read) and write the R output files (Output File 1, Output File 2)
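The flow in the diagram can be imitated in a single Python process. This is a toy sketch under stated assumptions, not the real system: `run_job` and `partition` are hypothetical names, "local disks" are plain lists, and the partitioning function is a CRC32 of the key modulo R, chosen only because it is stable across runs.

```python
import zlib
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def partition(key: str, r: int) -> int:
    """Pick the reduce task for a key: a stable hash of the key, mod R."""
    return zlib.crc32(key.encode("utf-8")) % r


def run_job(splits: List[List[str]], r: int,
            map_fn: Callable[[str], Iterable[Pair]],
            reduce_fn: Callable[[str, List[int]], int]) -> List[Dict[str, int]]:
    # Map phase: one task per split; each task writes R "local" buckets.
    intermediate: List[List[List[Pair]]] = []
    for split in splits:
        buckets: List[List[Pair]] = [[] for _ in range(r)]
        for record in split:
            for key, value in map_fn(record):
                buckets[partition(key, r)].append((key, value))
        intermediate.append(buckets)

    # Reduce phase: task i "remotely reads" bucket i from every map task,
    # groups values by key, and writes one output "file" (here a dict).
    outputs: List[Dict[str, int]] = []
    for i in range(r):
        groups: Dict[str, List[int]] = defaultdict(list)
        for buckets in intermediate:
            for key, value in buckets[i]:
                groups[key].append(value)
        outputs.append({k: reduce_fn(k, v) for k, v in sorted(groups.items())})
    return outputs


# M = 3 splits (one line each), R = 2 output files.
splits = [["you are in Cambridge"], ["I like Cambridge"], ["we live in Cambridge"]]
outputs = run_job(
    splits, r=2,
    map_fn=lambda line: ((word, 1) for word in line.split()),
    reduce_fn=lambda key, values: sum(values),
)
for i, out in enumerate(outputs):
    print(f"output file {i}: {out}")
```

Every occurrence of a key lands in the same bucket index, so each word appears in exactly one of the R output files.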
Fine task granularity
M is chosen so that each input split is between 16MB and 64MB
R is a small multiple of the number of workers
E.g. M = 200,000, R = 5,000 on 2,000 workers
Advantages:
dynamic load balancing
fault tolerance
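The arithmetic behind these figures can be checked with a few lines of Python. The helper `num_map_tasks`, the 1 TiB input size and the use of binary megabytes are assumptions made for illustration; only M, R and the worker count come from the slide.

```python
MB = 2**20  # assumption: binary megabytes


def num_map_tasks(input_bytes: int, split_bytes: int) -> int:
    """M = number of input splits of at most split_bytes each (ceiling division)."""
    return (input_bytes + split_bytes - 1) // split_bytes


# Figures quoted on the slide.
M, R, workers = 200_000, 5_000, 2_000
map_tasks_per_worker = M / workers      # 100.0
reduce_tasks_per_worker = R / workers   # 2.5

# Hypothetical 1 TiB input, split at either end of the 16MB to 64MB range.
one_tib = 2**40
m_small_splits = num_map_tasks(one_tib, 16 * MB)  # 65536
m_large_splits = num_map_tasks(one_tib, 64 * MB)  # 16384

print(map_tasks_per_worker, reduce_tasks_per_worker, m_small_splits, m_large_splits)
```

With around 100 map tasks per worker, a fast machine simply picks up more tasks than a slow one, and the work of a failed machine can be spread over many others.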
Fault tolerance
Workers:
Detect failure via periodic heartbeat
Re-execute completed and in-progress map tasks (their output is on the failed machine's local disk)
Re-execute in-progress reduce tasks
Task completion committed through master
Master:
Not handled - failure unlikely
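The worker-failure rules above could be expressed as the following bookkeeping sketch for the master. The `Task` record, the `State` enum and `handle_worker_failure` are hypothetical names, not the paper's data structures. Completed reduce tasks are left alone because, per the paper, their output is already in the global file system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # worker currently or last assigned


def handle_worker_failure(tasks: List[Task], failed: str) -> List[Task]:
    """Reset tasks that must be re-executed after `failed` stops responding."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed:
            continue
        # Map output lives on the failed machine, so even completed maps are lost.
        lost_map = task.kind == "map" and task.state in (State.IN_PROGRESS, State.COMPLETED)
        # Only unfinished reduce tasks need to be redone.
        lost_reduce = task.kind == "reduce" and task.state is State.IN_PROGRESS
        if lost_map or lost_reduce:
            task.state, task.worker = State.IDLE, None
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", State.COMPLETED, "w1"),
    Task("map", State.IN_PROGRESS, "w1"),
    Task("reduce", State.IN_PROGRESS, "w1"),
    Task("reduce", State.COMPLETED, "w1"),
    Task("map", State.COMPLETED, "w2"),
]
rescheduled = handle_worker_failure(tasks, "w1")
print(len(rescheduled), [t.state.value for t in tasks])
```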
Refinements
Locality optimization
Backup tasks
Ordering guarantees
Combiner function
Skipping bad records
Local execution
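Of the refinements listed, the combiner function is the easiest to show in code. The sketch below assumes a word-count job: the combiner partially sums counts on the map side, which is valid here because addition is commutative and associative. The names `map_words` and `combine` and the sample sentence are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, int]


def map_words(line: str) -> List[Pair]:
    """Plain word-count map: one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]


def combine(pairs: List[Pair]) -> List[Pair]:
    """Partially sum counts on the map side, before anything crosses the network."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


line = "to be or not to be"
raw = map_words(line)      # one pair per word occurrence
combined = combine(raw)    # one pair per distinct word
print(len(raw), len(combined), combined)
```

The reduce function still has to sum the values it receives, since different map tasks can each send a partial count for the same word.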
Performance
Tests run on 1800 machines:
Dual 2GHz Intel Xeon processors with Hyper-Threading enabled
4GB of memory
Two 160GB IDE disks
Gigabit Ethernet link
2 Benchmarks:
MR_Grep: 10^10 x 100-byte entries, 92k matches
MR_Sort: 10^10 x 100-byte entries
MR_Grep
Run takes ~150 seconds (including ~60 seconds of startup overhead)
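A back-of-the-envelope calculation puts these numbers in perspective. It assumes the input is 10^10 entries of 100 bytes each, as in the paper's benchmark setup, and it computes only averages derived from the figures on the slides. These are not measured rates.

```python
entries = 10**10            # entries scanned (assumed from the benchmark setup)
entry_bytes = 100
total_bytes = entries * entry_bytes   # 10**12 bytes, i.e. 1 TB (decimal)

run_seconds = 150           # whole run, from the slide
startup_seconds = 60        # approximate startup overhead, from the slide
machines = 1800

avg_rate = total_bytes / run_seconds                                # bytes/s over the whole run
rate_after_startup = total_bytes / (run_seconds - startup_seconds)  # bytes/s once running
per_machine = total_bytes / machines                                # bytes scanned per machine

print(f"input size: {total_bytes / 1e12:.1f} TB")
print(f"average rate over the whole run: {avg_rate / 1e9:.2f} GB/s")
print(f"average rate excluding startup: {rate_after_startup / 1e9:.2f} GB/s")
print(f"input per machine: {per_machine / 1e6:.0f} MB")
```

Startup accounts for roughly 40% of the run, which is why the overhead matters so much for small jobs.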
MR_Sort
[Figure: three runs compared: normal execution, no backup tasks, 200 tasks killed]
Experience
Rewrite of the indexing system for Google web search
Large scale machine learning
Clustering for Google News
Data extraction for Google Zeitgeist
Large scale graph computations
Conclusions
MapReduce:
useful abstraction
simplifies large-scale computations
easy to use
However:
expensive for small applications
long startup time (~1 min)
chaining of map-reduce phases?