Optimizing MapReduce for GPUs with Effective Shared Memory Usage

Linchuan Chen and Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University
Outline
- Introduction
- Background
- System Design
- Experiment Results
- Related Work
- Conclusions and Future Work
Introduction
Motivations

- GPUs
  - Suitable for extreme-scale computing
  - Cost-effective and power-efficient
- MapReduce programming model
  - Emerged with the development of data-intensive computing
  - GPUs have been shown to be suitable for implementing MapReduce
- Utilizing the fast but small shared memory for MapReduce is challenging
  - Storing (key, value) pairs leads to high memory overhead, prohibiting the use of shared memory
Introduction
Our approach: a reduction-based method

- Reduce each (key, value) pair into the reduction object immediately after it is generated by the map function
- Especially suitable for reduction-intensive applications

A general and efficient MapReduce framework

- Dynamic memory allocation within a reduction object
- Maintaining a memory hierarchy
- A multi-group mechanism
- Overflow handling
Outline
- Introduction
- Background
- System Design
- Experiment Results
- Related Work
- Conclusions and Future Work
MapReduce
[Diagram: MapReduce dataflow. Map tasks (M) emit (key, value) pairs such as K1:v, K2:v, ...; the runtime groups the pairs by key (e.g. K1: v, v, v, v; K2: v; K3: v, v; K4: v, v, v; K5: v); reduce tasks (R) then process each key's list of values.]
MapReduce
Programming model

- Map(): generates a large number of (key, value) pairs
- Reduce(): merges the values associated with the same key

Efficient runtime system

- Parallelization
- Concurrency control
- Resource management
- Fault tolerance
- ...
GPUs
[Diagram: CUDA execution and memory model. The host launches kernels (Kernel 1, Kernel 2) on the device; each kernel executes as a grid of thread blocks (e.g. Grid 1 with Blocks (0,0) through (2,1)), and each block contains a 2-D array of threads (e.g. Threads (0,0) through (4,2) in Block (1,1)). Each thread has its own registers and local memory; each block has a shared memory accessible by all of its threads; device, constant, and texture memory are accessible by all threads on the device and by the host. Legend: processing components vs. memory components.]
Outline
- Introduction
- Background
- System Design
- Experiment Results
- Related Work
- Conclusions and Future Work
System Design

Traditional MapReduce:

    map(input) {
        (key, value) = process(input);
        emit(key, value);
    }

    // grouping of the key-value pairs (done by the runtime system)

    reduce(key, iterator) {
        for each value in iterator
            result = operation(result, value);
        emit(key, result);
    }
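The traditional emit/group/reduce pipeline can be sketched on the host side as follows, using word count as the example. This is a minimal illustrative sketch, not the paper's GPU code; `word_count` is a hypothetical helper name.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Traditional MapReduce for word count: all (key, value) pairs are
// buffered, then grouped by key, then reduced.
std::map<std::string, int> word_count(const std::string& input) {
    // map(): emit one (word, 1) pair per input token
    std::vector<std::pair<std::string, int>> pairs;
    std::istringstream in(input);
    for (std::string w; in >> w; )
        pairs.push_back(std::make_pair(w, 1));

    // runtime system: group the pairs by key
    std::multimap<std::string, int> grouped(pairs.begin(), pairs.end());

    // reduce(): fold all values associated with the same key
    std::map<std::string, int> result;
    for (auto it = grouped.begin(); it != grouped.end(); ++it)
        result[it->first] += it->second;
    return result;
}
```

Note that every emitted pair is stored before reduction begins; this buffering is exactly the memory overhead that makes shared memory hard to use on a GPU.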
System Design

Reduction-based approach:

    map(input) {
        (key, value) = process(input);
        reductionobject->insert(key, value);
    }

    reduce(value1, value2) {
        value1 = operation(value1, value2);
    }

- Reduces the memory overhead of storing key-value pairs
- Makes it possible to effectively utilize shared memory on a GPU
- Eliminates the need for grouping
- Especially suitable for reduction-intensive applications
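The reduction-based approach can be sketched on the host side as follows. As an assumption of this sketch, the reduction object is modeled as a hash map; the paper's RO is a custom structure in GPU memory.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>

// Sketch of a reduction object: insert() applies the user's reduce()
// immediately, so no key-value pair is ever buffered and no grouping
// pass is needed.
struct ReductionObject {
    std::unordered_map<std::string, int> buckets;

    void insert(const std::string& key, int value) {
        auto it = buckets.find(key);
        if (it == buckets.end())
            buckets.emplace(key, value);   // first value for this key
        else
            it->second = it->second + value;  // reduce(value1, value2)
    }
};

// map() for word count: each token is reduced into the RO as generated.
ReductionObject map_words(const std::string& input) {
    ReductionObject ro;
    std::istringstream in(input);
    for (std::string w; in >> w; )
        ro.insert(w, 1);
    return ro;
}
```

The RO holds one bucket per distinct key rather than one entry per emitted pair, which is why it can fit in the small shared memory for reduction-intensive workloads.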
Challenges

- Result collection and overflow handling
  - Maintain a memory hierarchy
- Trading off space requirement against locking overhead
  - A multi-group scheme
- Keeping the framework general and efficient
  - A well-defined data structure for the reduction object
Memory Hierarchy
[Diagram: the memory hierarchy maintained by the framework. On the GPU, each thread block owns reduction objects (Reduction Object 0, Reduction Object 1, ...) in its shared memory; full shared memory ROs are merged into a single device memory reduction object; results are finally copied to a result array in host memory on the CPU.]
Reduction Object
Updating the reduction object

- Locks are used to synchronize concurrent updates

Memory allocation in the reduction object

- Dynamic memory allocation
- Multiple offsets in the device memory reduction object
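The dynamic allocation step can be sketched as an atomic bump allocator. This is a sketch under assumptions: a single flat buffer and a single offset; on the GPU the `fetch_add` would be an `atomicAdd` on a shared or device memory counter, and the device memory RO would keep multiple such offsets.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Bump allocator inside a reduction object: one atomic offset into a
// fixed-size buffer hands out variable-sized chunks for keys and values.
struct BumpAllocator {
    std::vector<char> buffer;
    std::atomic<std::size_t> offset{0};

    explicit BumpAllocator(std::size_t bytes) : buffer(bytes) {}

    // Returns the start index of the new chunk, or -1 on overflow,
    // signaling that this RO is full and must be swapped out.
    long long alloc(std::size_t bytes) {
        std::size_t start = offset.fetch_add(bytes);
        if (start + bytes > buffer.size())
            return -1;  // overflow: caller triggers overflow handling
        return static_cast<long long>(start);
    }
};
```

A single atomic add is much cheaper than a general-purpose allocator, which matters when thousands of threads allocate buckets concurrently.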
Reduction Object
[Diagram: reduction object layout. An index area holds (KeyIdx[i], ValIdx[i]) entries; a memory allocator hands out variable-sized buckets, each storing its key size, value size, key data, and value data.]
Multi-group Scheme

- Locks are used for synchronization
- The large number of threads in each thread block leads to severe contention on the shared memory RO
- One solution: full replication, where every thread owns a shared memory RO
  - Leads to memory overhead and combination overhead
- Trade-off: the multi-group scheme
  - Divide the threads in each thread block into multiple sub-groups
  - Each sub-group owns a shared memory RO
- The choice of the number of groups balances
  - Contention overhead
  - Combination overhead
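The thread-to-group mapping can be sketched as below. `group_of` is a hypothetical helper for illustration, assuming the block size is divisible by the number of groups; on the GPU, `tid` would be `threadIdx.x`.

```cpp
#include <cassert>

// Multi-group mapping: thread tid in a block of block_size threads is
// assigned to one of num_groups sub-groups, each owning its own
// shared memory reduction object. A larger num_groups means fewer
// threads contend on each RO's locks, but more ROs must be combined
// at the end.
int group_of(int tid, int block_size, int num_groups) {
    int group_size = block_size / num_groups;  // threads per sub-group
    return tid / group_size;
}
```

With 256 threads and 4 groups, threads 0..63 share one RO, 64..127 the next, and so on, cutting lock contention by 4x at the cost of combining 4 ROs.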
Overflow Handling
Swapping

- Merge the full shared memory ROs into the device memory RO
- Empty the full shared memory ROs

In-object sorting

- Sort the buckets in the reduction object and delete the useless data
- Users define the way two buckets are compared
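The swapping step can be sketched as follows, with ROs again modeled as hash maps (an assumption of this sketch, not the paper's data structure).

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// ROs modeled as hash maps for this sketch.
using RO = std::unordered_map<std::string, int>;

// Swapping: reduce every bucket of a full shared memory RO into the
// device memory RO, then empty the shared RO so map() can continue
// inserting into fast shared memory.
void swap_out(RO& shared_ro, RO& device_ro) {
    for (auto it = shared_ro.begin(); it != shared_ro.end(); ++it)
        device_ro[it->first] += it->second;  // user-defined reduce()
    shared_ro.clear();
}
```

Because buckets are merged with the user's reduce operation rather than copied, a key that already exists in the device memory RO costs no extra space.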
Discussion
- Reduction-intensive applications
  - Our framework has a significant advantage
- Applications with little or no reduction
  - No need to use shared memory
- Users need to set system parameters
  - Auto-tuning techniques are future work
Extension for Multi-GPU
- Shared memory usage speeds up single-node execution
  - Potentially benefits overall performance
- Reduction objects avoid global shuffling overhead
  - Can also reduce communication overhead
Outline
- Introduction
- Background
- System Design
- Experiment Results
- Related Work
- Conclusions and Future Work
Experiment Results

Applications used

- 5 reduction-intensive, 2 map computation-intensive
- Tested with small, medium, and large datasets

Evaluation of the multi-group scheme

- 1, 2, and 4 groups

Comparison with other implementations

- Sequential implementations
- MapCG
- Ji et al.'s work

Evaluation of the swapping mechanism

- Tested with a large number of distinct keys
Evaluation of the Multi-group Scheme
Comparison with Sequential Implementations
Comparison with MapCG
With reduction-intensive applications
Comparison with MapCG
With other applications
Comparison with Ji et al.'s work
Evaluation of the Swapping Mechanism

vs. MapCG and Ji et al.'s work
Evaluation of the Swapping Mechanism

vs. MapCG
Evaluation of the Swapping Mechanism
swap_frequency = num_swaps / num_tasks
Outline
- Introduction
- Background
- System Design
- Experiment Results
- Related Work
- Conclusions and Future Work
Related Work
- MapReduce for multi-core systems: Phoenix, Phoenix Rebirth
- MapReduce on GPUs: Mars, MapCG
- MapReduce-like framework on GPUs for SVM: Catanzaro et al.
- MapReduce in heterogeneous environments: MITHRA, IDAV
- Utilizing GPU shared memory for specific applications: Nyland et al., Gutierrez et al.
- Compiler optimizations for utilizing shared memory: Baskaran et al. (PPoPP '08), Moazeni et al. (SASP '09)
Conclusions and Future Work
- Reduction-based MapReduce
- Storing the reduction object in the memory hierarchy of the GPU
- A multi-group scheme
- Improved performance compared with previous implementations
- Future work: extend our framework to support new architectures
Thank you!