Utilization of GPU’s for General Computing
Presenter: Charlene DiMeglio
Paper: Aspects of GPU for General Purpose High Performance Computing
Suda, Reiji, et al.
Overview
Problem:
Want to use the GPU for things other than graphics; however, the costs can be high
Solution:
Improve the CUDA drivers
Results:
As compared to a node of a supercomputer, it is worth it
Conclusion
These improvements make using GPGPUs more feasible
Problem: Need for computation power
Why GPUs?
GPUs are not being fully realized as a resource, often sitting idle when not being used for graphics
Better performance for less power as compared to CPUs
What’s the issue? Cost.
Efficient scheduling – timing data loads with its uses
Memory management – using the small amount of memory available effectively
Loads and stores – waiting for memory transfers, taking hundreds of cycles
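A minimal sketch of why this matters, assuming a CUDA kernel (the kernel name, tile size, and the three-point averaging are illustrative, not from the paper): data is staged once in fast on-chip shared memory so neighboring threads reuse it instead of each paying the hundreds-of-cycles global-memory cost again.

```cuda
#include <cuda_runtime.h>

#define TILE 256  // threads per block; also the shared-memory tile size

// Each block copies its tile from slow global memory into fast shared memory
// once, then every thread reads its neighbors from the tile instead of issuing
// additional global loads.
__global__ void smooth(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];              // on-chip memory (16 KB per MP on the C1060)
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];            // one coalesced global load per thread
    __syncthreads();                          // wait until the whole tile is resident

    if (i < n) {
        // Neighbor values come from shared memory; block edges simply reuse
        // the center value to keep the sketch short.
        float c = tile[threadIdx.x];
        float l = (threadIdx.x > 0)                     ? tile[threadIdx.x - 1] : c;
        float r = (threadIdx.x < TILE - 1 && i + 1 < n) ? tile[threadIdx.x + 1] : c;
        out[i] = (l + c + r) / 3.0f;
    }
}
```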
Solutions
Brook+ by AMD, Larrabee by Intel
CUDA by NVIDIA
Greatest technological maturity at the time
Paper investigates the existing technology and suggests improvements
Tesla C1060 architecture: 30 multiprocessors, each with 8 streaming processors and 16 KB of shared memory
NVIDIA’s Tesla C1060 GPU vs. Hitachi HA8000-tc/RS425 (T2K) supercomputer
T2K – fastest supercomputer in Japan at the time
T2K vs. C1060:
Cores / MPs: 16 vs. 30
Clock frequency: 2.3 GHz vs. 1.3 GHz
Single SIMD vector length: 4 vs. 32
Single-precision peak: 294 Gflops vs. 933 Gflops
Main memory: 32 GB vs. 4 GB
Memory / single peak: 0.109 vs. 0.004
Cost: ~$40,000 vs. ~$2,500
Power: 300 W vs. 200 W
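The "memory / single peak" row carries no units on the slide; reading it as main memory divided by single-precision peak reproduces both numbers:

```latex
\frac{32\ \text{GB}}{294\ \text{Gflops}} \approx 0.109
\qquad
\frac{4\ \text{GB}}{933\ \text{Gflops}} \approx 0.004
```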
Issues to Overcome
High SIMD vector length
Small main memory size
High register spill cost
No L2 cache but rather read-only texture caches
Methods to Hide Away Latency
CUDA compiler option limits the number of registers used per thread, and thus per warp (sketch below)
1 warp = a group of 32 threads in a block, executed together in SIMD fashion
Maximizes number of warps that can run at a time
Could cause spills
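As a sketch of how such a register cap is commonly expressed with the CUDA toolchain (the cap of 32 registers, the launch bounds, and the kernel are illustrative values, not taken from the paper):

```cuda
#include <cuda_runtime.h>

// Two standard ways to cap register usage so that more warps fit on a
// multiprocessor at once (higher occupancy), at the risk of register spills:
//
//   1. Compiler flag applied to every kernel in the file:
//        nvcc --maxrregcount=32 kernel.cu
//
//   2. Per-kernel launch bounds: the compiler limits registers so that
//      256 threads/block with at least 4 resident blocks per MP is achievable.
__global__ void __launch_bounds__(256, 4)
saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```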
Variable-sized multi-round data transfer scheduling with PCI Express
PCI Express allows data transfer, GPU computation, and CPU computation to occur in parallel
Allows for a constant flow of information
Achieves overhead of up to O(log x / x), as compared to uniform scheduling
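A minimal sketch of the overlap mechanism this scheduling relies on, using CUDA streams and asynchronous copies; for brevity it uses equal-sized rounds rather than the variable-sized rounds described above, and the round count and the scaling kernel are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 22, ROUNDS = 4, CHUNK = N / ROUNDS;

    float *h, *d;
    cudaMallocHost((void**)&h, N * sizeof(float));   // pinned host memory, needed for async copies
    cudaMalloc((void**)&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t s[ROUNDS];
    for (int r = 0; r < ROUNDS; ++r) cudaStreamCreate(&s[r]);

    // Each round gets its own stream: while round r computes on the GPU,
    // the PCI Express copy for round r+1 can already be in flight.
    for (int r = 0; r < ROUNDS; ++r) {
        int off = r * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[r]);
        scale<<<(CHUNK + 255) / 256, 256, 0, s[r]>>>(d + off, CHUNK);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[r]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %.1f (expect 2.0)\n", h[0]);

    for (int r = 0; r < ROUNDS; ++r) cudaStreamDestroy(s[r]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```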
Methods to Hide Away Latency
Computation time between communications > communication latency
Worth sending the data over to the GPU
Increasing bandwidth and message size makes the constant term of the overhead latency relatively smaller
Efficient use of registers to prevent spills
Deciding what work to do where, GPU vs. CPU, work sharing
Minimizing divergent warps using atomic operations found in CUDA
Divergent warps occur when threads in the same warp take different branches, forcing the warp to execute both paths
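One common pattern along these lines (illustrative, not one of the paper's kernels): compacting the values that pass a test by reserving output slots with a single atomicAdd, so the only divergent work a warp does is that one short block rather than a longer per-thread branch.

```cuda
#include <cuda_runtime.h>

// Threads whose element passes the test grab a unique output index with one
// atomicAdd on a global counter; threads that fail do nothing. The divergent
// region of each warp is kept to this short block.
__global__ void compact_positive(const float* in, float* out, int* count, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {
        int slot = atomicAdd(count, 1);   // supported on the C1060 (compute capability 1.3)
        out[slot] = in[i];
    }
}
```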
Results
Variable-sized multi-round data transfer scheduling
(chart: performance vs. number of rounds)
Results
Use of atomic instructions in CUDA to minimize latency
Conclusion
CUDA gives programmers the ability to harness the power of the GPU for general uses.
The improvements presented allow this option to be more feasible.
Strategic use of GPGPUs as a resource will improve speed and efficiency.
However, the presented material is mainly theoretical, without much strong data to back it up
More suggestions than implementations, promoting GPGPU use