GPU Power Model
Nandhini Sudarsanan
Nathan Vanderby
Neeraj Mishra
Vinodh
Chi Xu
Outline
• Introduction and Motivation
• Analytical Model Description
• Experiment Setup
• Results
• Conclusion and Further Work
Introduction
Motivation
Outline
• Introduction and Motivation
• Analytical Model Description
  o Parser
  o Power Model
• Experiment Setup
• Results
• Conclusion and Further Work
Parser
Power Model
• PTX Level
Power Model
• Assembly Level
Experiment Setup - Hardware
• Measure Power Consumption and Temperature
  o Current Clamp for PCIe & GPU Power Cable
  o Data Acquisition Card @ 100 Hz
  o GPU Performance Counter
  o Sample Temperature @ 10 Hz, GPU sensor
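As a rough illustration of how the clamp readings turn into power figures, the sketch below converts evenly spaced current samples into average power and energy. It is only a sketch under stated assumptions: the function name, the constant 12 V rail voltage, and the rectangle-rule integration are not from the slides; only the 100 Hz sample rate is.

```python
def average_power_and_energy(current_samples_a, rail_voltage_v=12.0,
                             sample_rate_hz=100.0):
    """Return (average power in W, energy in J) for evenly spaced samples.

    Assumptions (not from the slides): the rail voltage is constant at
    rail_voltage_v, and every sample covers one full sampling interval.
    """
    if not current_samples_a:
        raise ValueError("need at least one current sample")
    dt = 1.0 / sample_rate_hz                      # seconds per sample
    power_w = [rail_voltage_v * amps for amps in current_samples_a]
    energy_j = sum(power_w) * dt                   # rectangle-rule integral
    avg_power_w = sum(power_w) / len(power_w)
    return avg_power_w, energy_j
```

For a real card the PCIe slot and the auxiliary power cable would be measured separately and their powers summed, and the slot's lower-voltage rail would need its own voltage value.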
Experiment Setup - Software
• Driver API
• Generate and Modify PTX code
  o Minimize control loops
• CUDA 4.0
  o Built in Binary -> Assembly Converter (cuobjdump)
• MATLAB to build model
• Remote login
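Working with generated PTX usually starts with knowing which instructions a kernel contains. The sketch below counts PTX opcodes in a text listing. It is a hypothetical helper, not the parser used in this project: the function name, the regular expression, and the choice to group by base opcode (for example `ld`, `add`) are all assumptions.

```python
import re
from collections import Counter

# Optional predicate guard (e.g. "@%p1" or "@!%p1"), then the base opcode,
# which must be followed by a type suffix, whitespace, ';' or end of line.
_INSTRUCTION = re.compile(r"^(?:@!?%\w+\s+)?([a-z][a-z0-9]*)(?=[.\s;]|$)")


def count_ptx_opcodes(ptx_text):
    """Return a Counter mapping base PTX opcodes to occurrence counts."""
    counts = Counter()
    for raw_line in ptx_text.splitlines():
        line = raw_line.split("//", 1)[0].strip()   # drop trailing comments
        if not line or line in ("{", "}"):
            continue
        if line.startswith(".") or line.endswith(":"):
            continue                                 # directive or label
        match = _INSTRUCTION.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Grouping by base opcode is the simplest choice; a power model would more likely need finer classes (for example, separating `ld.global` from `ld.shared`), which would mean keeping the suffixes instead of discarding them.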
CUDA - Fermi Architecture
• Third Generation Streaming Multiprocessor (SM)
  o 32 CUDA cores per SM, 4x over GT200
  o 1024 thread block size, 2x over GT200
  o Unified address space enables full C++ support
  o Improved Memory Subsystem
Benchmarks
• Small number of overhead operations (loop counters, initialization, etc.).
• Computationally intensive work, so that each experiment runs long enough for an accurate current measurement.
• High utilization of the CUDA cores, with as few data hazards as possible.
• Grid and block sizes chosen so that all SMs are used, since idle SMs still leak power.
• Accordingly, 7 benchmarks were selected from the CUDA SDK.
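One simple way to satisfy the "all SMs are used" criterion is to round the grid size up to a whole multiple of the SM count. The sketch below shows that arithmetic; the function name and the `num_sms` parameter are illustrative assumptions, and the slides do not say how grid sizes were actually chosen. The 1024-thread block limit is the Fermi figure quoted above.

```python
import math


def choose_grid_size(num_elements, block_size, num_sms):
    """Return a block count that covers num_elements and is a multiple of num_sms.

    num_sms is a hypothetical parameter: the SM count of the target GPU.
    """
    if not 1 <= block_size <= 1024:          # Fermi thread-block size limit
        raise ValueError("block_size must be between 1 and 1024")
    if num_elements < 1 or num_sms < 1:
        raise ValueError("num_elements and num_sms must be positive")
    blocks_needed = math.ceil(num_elements / block_size)
    waves = math.ceil(blocks_needed / num_sms)   # full rounds across all SMs
    return waves * num_sms
```

Because the grid may now contain more threads than elements, a kernel launched this way needs a bounds check on its global index.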
Benchmarks
For this project we tested the following benchmarks:
• 2D Convolution
• Matrix Multiplication
• Vector Addition
• Vector Reduction
• Scalar Product
• DCT 8x8
• 3DFD
Limitations of PTX
• Higher level than assembly
  o Divide & Sqrt: 1 PTX line, library in assembly
• Compiler optimizations from PTX -> assembly
• Doesn't reflect RAW dependencies
• Performance counters use assembly
Results
Conclusion and Further Work
• Conclusion
• Further Work
  o Take into account context switches
  o Consider multiple kernels running simultaneously
The End
Thanks
Q&A