GPU - An Introduction
GPU, GPU Architecture, CUDA, TLP
Graphics Processing Unit
DHAN V SAGAR, CB.EN.P2CSE13007
Introduction
A GPU is a processor optimized for 2D/3D graphics, video, visual computing, and display.
It is a highly parallel, highly multithreaded multiprocessor optimized for visual computing.
It provides real-time visual interaction with computed objects via graphics, images, and video.
History
● Up to the late 90's
– No GPUs
– Much simpler VGA controllers
● Consisted of
– A memory controller
– Display generator + DRAM
● DRAM was either shared with the CPU or private
History
● By 1997
– More complex VGA controllers
● Incorporated 3D acceleration functions in hardware
– Triangle setup and rasterization
– Texture mapping and shading
● Rasterization combines shapes (lines, polygons, letters, ...) into an image consisting of individual pixels
History
● By 2000
– Single-chip graphics processors incorporated nearly all functions of the graphics pipeline of high-end workstations
● Beginning of the end of the high-end workstation market
– The VGA controller was renamed the Graphics Processing Unit (GPU)
Current Trends
● Well-defined APIs
– OpenGL: open standard for 3D graphics programming
– WebGL: OpenGL extension for the web
– DirectX: set of Microsoft multimedia programming interfaces (Direct3D for 3D graphics)
● Can implement novel graphics algorithms
● Use GPUs for non-conventional applications
Current Trends
● Combining the powers of the CPU and GPU – heterogeneous architectures
● GPUs are becoming scalable parallel processors
● Moving from hardware-defined pipeline architectures to more flexible programmable architectures
Architecture Evolution
[Diagram: CPU and memory connected through a graphics card to the display]
● Floating-point co-processors attached to microprocessors
● Interest in providing hardware support for displays
● Led to graphics processing units (GPUs)
GPUs with dedicated pipelines
[Diagram: pipeline of Input stage → Vertex shader stage → Geometry shader stage → Rasterizer stage → Pixel shading stage → Frame buffer, with each stage connected to graphics memory]
Graphics chips generally had a pipeline structure, with individual stages performing specialized operations, finally leading to loading the frame buffer for display.
Individual stages may have access to graphics memory for storing intermediate computed data.
PROGRAMMING GPUS
• Will focus on parallel computing applications
• Must decompose the problem into a set of parallel computations
• Ideally two-level, to match the GPU organization (see the sketch after the example below)
Example
[Diagram: data in one big array, decomposed into small arrays, each further decomposed into tiny pieces – a two-level decomposition]
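As a minimal sketch of this two-level decomposition (illustrative only, not from the original slides; the kernel name scale and the factor parameter are assumptions), the grid splits the big array into per-block chunks and each thread handles one element:

__global__ void scale(float *data, int n, float factor)
{
    // Level 1: blockIdx.x selects the chunk ("small array");
    // level 2: threadIdx.x selects the element ("tiny") within it.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Launch with enough 256-thread blocks to cover all n elements:
// scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);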
GPGPU and CUDA
GPGPU
● General-Purpose computing on GPUs
● Uses the traditional graphics API and graphics pipeline
CUDA
● Compute Unified Device Architecture
● Parallel computing platform and programming model
● Invented by NVIDIA
● Single Program Multiple Data approach
CUDA
➢ CUDA programs are written in C
➢ Within C programs, one calls SIMT “kernel” routines that are executed on the GPU
➢ Provides three abstractions (sketched below)
➢ Hierarchy of thread groups
➢ Shared memory
➢ Barrier synchronization
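A minimal kernel sketch showing all three abstractions together (illustrative, not from the original slides; the name blockSum and the fixed 256-thread block size are assumptions):

__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float buf[256];              // shared memory, one copy per thread block

    int tid = threadIdx.x;                  // position in the thread-group hierarchy
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // barrier: wait until all loads finish

    // Tree reduction within the block (assumes blockDim.x == 256).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();                    // barrier after each reduction step
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];           // one partial sum per block
}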
Cont...
CUDA
● Lowest level of parallelism – the CUDA thread
● The compiler and hardware can gang thousands of CUDA threads together, yielding various levels of parallelism within the GPU
● MIMD, SIMD, and instruction-level parallelism
Single Instruction, Multiple Thread (SIMT)
Conventional C Code
// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C: y = a*x + y
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
Corresponding CUDA Code
// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
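For context, a complete host-side driver for this kernel might look like the following (an illustrative sketch, not part of the original slides; the d_x/d_y names and initialization values are assumptions):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    double *x = new double[n], *y = new double[n];
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    // Allocate device copies and move the inputs to GPU memory.
    double *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_y, n * sizeof(double));
    cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(double), cudaMemcpyHostToDevice);

    // Launch with 256 threads per Thread Block, as on the slide.
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
    cudaDeviceSynchronize();

    // Copy the result back and spot-check one element.
    cudaMemcpy(y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);   // expect 2.0*1.0 + 2.0 = 4.0

    cudaFree(d_x); cudaFree(d_y);
    delete[] x; delete[] y;
    return 0;
}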
Cont...
● __device__ (or __global__) marks functions that run on the GPU (see the sketch below)
● __host__ marks functions that run on the system (host) processor
● CUDA variables declared with the __device__ qualifier are allocated in GPU memory, which is accessible by all the multithreaded SIMD processors
● The function-call syntax for a function that uses the GPU is
name<<<dimGrid,dimBlock>>>(..parameterlist..)
● GPU hardware handles thread management
● Threads are grouped into Thread Blocks and executed in groups of 32 threads (warps)
● The hardware that executes a whole block of threads is called a multithreaded SIMD processor
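A short sketch of how these qualifiers fit together (illustrative, not from the original slides; the helper names axpy1 and run are assumptions):

__device__ double axpy1(double a, double x, double y)   // GPU-only helper,
{                                                       // callable from kernels
    return a * x + y;
}

__global__ void daxpy(int n, double a, double *x, double *y)   // kernel, launched from host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = axpy1(a, x[i], y[i]);
}

__host__ void run(int n, double a, double *d_x, double *d_y)   // runs on the CPU
{
    dim3 dimGrid((n + 255) / 256);   // number of Thread Blocks
    dim3 dimBlock(256);              // threads per block
    daxpy<<<dimGrid, dimBlock>>>(n, a, d_x, d_y);
}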
References
http://en.wikipedia.org/wiki/Graphics_processing_unit
http://www.nvidia.com/object/cuda_home_new.html
http://computershopper.com/feature/200704_the_right_gpu_for_you
http://www.cs.virginia.edu/~gfx/papers/pdfs/59_HowThingsWork.pdf
http://en.wikipedia.org/wiki/Larrabee_(GPU)#cite_note-siggraph-9
http://www.nvidia.com/geforce
“Larrabee: A Many-Core x86 Architecture for Visual Computing”, Seiler et al., ACM SIGGRAPH, 2008
“An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness”, Sunpyo Hong and Hyesoon Kim, ISCA 2009
Thank You..