
Page 1: Parallel Computing on the GPU

Parallel Computing on the GPU

Tilani Gunawardena

Page 2: Parallel Computing on the GPU

Goals
• How to program heterogeneous parallel computing systems and achieve
– High performance and energy efficiency
– Functionality and maintainability
– Scalability across future generations
• Technical subjects
– Principles and patterns of parallel algorithms
– Programming APIs, tools, and techniques

Page 3: Parallel Computing on the GPU

Tentative Schedule
– Introduction
– GPU Computing and CUDA Intro
– CUDA threading model
– CUDA memory model
– CUDA performance
– Floating-Point Considerations
– Application Case Study

Page 4: Parallel Computing on the GPU

Recommended Textbook/Notes
• D. Kirk and W. Hwu, “Programming Massively Parallel Processors – A Hands-on Approach”
• http://www.nvidia.com/ (Communities, CUDA Zone)

Page 5: Parallel Computing on the GPU

• Would you rather plow a field with two strong oxen or 1024 chickens?

Page 6: Parallel Computing on the GPU

How to Dig a Hole Faster?

1. Dig faster
2. Buy a more productive shovel
3. Hire more diggers (the best approach)

Problems:
1. How do we manage them?
2. Will they get in each other's way?
3. Will more diggers help dig the hole deeper, instead of just wider?

In more detail:
1. Dig faster: run the processor at a faster clock so it spends less time on each step of a computation (limit: power consumption on a chip; increasing clock speed increases power consumption).
2. Buy a more productive shovel: have the processor do more work on each clock cycle (how much instruction-level parallelism per clock cycle).
3. Hire more diggers: the best approach.

Page 7: Parallel Computing on the GPU

Parallelism
• Solve large problems by breaking them into small pieces
• Then run the smaller pieces at the same time

Modern GPU
• 1000s of ALUs
• 100s of processors
• Tens of thousands of concurrent threads
• Example: GeForce GTX Titan X
– CUDA cores: 3072
– 8000 million transistors
– 12 GB GDDR5 memory
– Memory bandwidth: 336 GB/s
– ~65,000 concurrent threads
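As a concrete sketch of this idea (the kernel name, sizes, and use of unified memory are illustrative assumptions, not from the slides), each of a million CUDA threads below handles exactly one element of the problem:

```cuda
#include <cstdio>

// Each thread adds one pair of elements: the large problem is split into
// n independent small pieces that the GPU runs at the same time.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                    // one million elements
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);             // unified memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);  // thousands of threads in flight
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);              // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The launch configuration (blocks × threads per block) is how CUDA expresses the "many small pieces" idea: each piece is one thread's work.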

Page 8: Parallel Computing on the GPU

Feature Size of Processors over Time

As feature size decreases, transistors:
• get smaller
• run faster
• use less power
• and more of them fit on a chip

Page 9: Parallel Computing on the GPU

• As transistors improved, processor designers increased the clock rates of processors, running them faster and faster every year.

Page 10: Parallel Computing on the GPU

• Why don't we keep increasing clock speed? Have transistors stopped getting smaller and faster?
– Problem: heat
• Even though transistors continue to get smaller, faster, and more energy efficient per transistor, running billions of them generates a lot of heat, and we cannot keep all of those processors cool.
• So we cannot keep making a single processor faster and faster (we end up with processors we cannot keep cool).
• Processor designers therefore build:
– Smaller processors that are more efficient in terms of power
– A larger number of efficient processors (rather than faster, less efficient processors)

• What kind of processors do we build?
• CPU
– Complex control hardware
– Flexibility in performance
– Expensive in terms of power
• GPU
– Simpler control hardware
– More hardware for computation
– Potentially more power efficient
– More restrictive programming model

Page 11: Parallel Computing on the GPU

Latency vs Throughput
• Latency: the amount of time to complete a task (time, e.g., seconds)
• Throughput: tasks completed per unit time (e.g., jobs/hour)

At the post office, your goals are not aligned with the post office's goals:
• Your goal: optimize for latency (you want to spend as little time as possible)
• The post office's goal: optimize for throughput (the number of customers it serves per day)

CPU: optimized for latency (minimize the elapsed time of one particular task)
GPU: chooses to optimize for throughput

Page 12: Parallel Computing on the GPU

Bandwidth
• How fast a device can send data over a single cable

Page 13: Parallel Computing on the GPU

Bandwidth vs Throughput vs Latency
– Bandwidth is the maximum amount of data that can travel through a 'channel'.
– Throughput is how much data actually does travel through the 'channel' successfully.
– Latency is how long it takes data to travel all the way from the start point to the end point.

For example, a link rated at 1 Gbit/s (bandwidth) might actually deliver 600 Mbit/s (throughput), while each packet takes 30 ms to arrive (latency).

Page 14: Parallel Computing on the GPU

Latency vs Bandwidth

• Drive from Colombo to Kandy (100 km)
– Car (5 people, 60 km/h)
– Bus (60 people, 20 km/h)
• Calculate (a worked answer follows):
– Latency?
– Throughput?
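A worked answer, assuming the numbers on the slide and measuring throughput in people delivered per hour:

```latex
\text{Car:}\quad L = \frac{100\ \text{km}}{60\ \text{km/h}} \approx 1.67\ \text{h},
\qquad T = \frac{5\ \text{people}}{1.67\ \text{h}} \approx 3\ \text{people/h}

\text{Bus:}\quad L = \frac{100\ \text{km}}{20\ \text{km/h}} = 5\ \text{h},
\qquad T = \frac{60\ \text{people}}{5\ \text{h}} = 12\ \text{people/h}
```

The car wins on latency and the bus wins on throughput: the CPU-versus-GPU trade-off from the earlier slide in miniature.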

Page 15: Parallel Computing on the GPU

GPUs from the point of view of the software developer?

• The importance of programming in parallel:
– 8-core Ivy Bridge processor (Intel)
– 8-wide AVX vector operations per core
– 2 threads per core (hyper-threading)

8 cores × 8 vector lanes × 2 threads = 128-way parallelism

On this processor, if you run a completely serial C program with no parallelism at all, you are going to use less than 1% of the capability of the machine (1/128 ≈ 0.8%).

Page 16: Parallel Computing on the GPU

Introduction
• CPU-based microprocessors drove rapid performance increases and cost reductions in computer applications for more than two decades.
– Users demand even more improvements once they become accustomed to them, creating a positive cycle for the computer industry.
• This drive has slowed since 2003 due to power-consumption issues that limit the increase of the clock frequency and the amount of productive activity that can be performed in each clock period within a single CPU.
– All microprocessor vendors have switched to multi-core and many-core models, where multiple processing units are used in each chip to increase processing power.

Page 17: Parallel Computing on the GPU

• The vast majority of software applications are written as sequential programs.
– The expectation was that programs run faster with each new generation of microprocessors. That expectation is no longer valid.
– No performance improvement means reduced growth opportunities for the computer industry.
• Software applications will continue to enjoy performance improvements as parallel programs, in which multiple threads of execution cooperate to achieve the functionality faster.

Page 18: Parallel Computing on the GPU


• Parallel programming is by no means new.
– The HPC community has been developing parallel programs for decades.
– But those programs ran on large-scale, expensive computers, and only a few elite applications justified the cost, in practice limiting parallel programming to a small number of application developers.
• Now that all new microprocessors are parallel computers, the number of applications that need to be developed as parallel programs has increased.

Page 19: Parallel Computing on the GPU

GPUs as Parallel Computers

• Since 2003, a class of many-core processors called GPUs has led the race for floating-point performance.
• While the performance improvement of general-purpose microprocessors has slowed, GPUs have continued to improve.
• Many application developers are motivated to move the computationally intensive parts of their software to the GPU for execution.

Page 20: Parallel Computing on the GPU

Why Is There Such a Large Gap?

• The answer lies in the differences in the fundamental design philosophies of the two types of processors:
– CPU: latency-oriented cores
– GPU: throughput-oriented cores

Page 21: Parallel Computing on the GPU

CPU: Latency-Oriented Design
• The CPU is optimized for sequential code performance
• Large caches
– Convert long-latency memory accesses to short-latency cache accesses
• Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
• Powerful ALU
– Reduced operation latency

Page 22: Parallel Computing on the GPU

GPU: Throughput-Oriented Design
• The GPU is optimized for the execution of a massive number of threads
• Small caches
– To boost memory throughput
• Simple control
– No branch prediction
– No data forwarding
• Energy-efficient ALUs
– Many, long-latency but heavily pipelined for high throughput
• Requires a massive number of threads to tolerate latencies (see the sketch below)
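A minimal sketch of what "a massive number of threads to tolerate latencies" can look like in CUDA (the kernel, data size, and launch configuration are illustrative assumptions, not from the slides): the launch creates far more threads than there are physical cores, so the scheduler can run ready threads while others wait on long-latency memory accesses.

```cuda
#include <cstdio>

// Grid-stride loop: every thread processes many elements. Because far more
// threads are resident than there are cores, the GPU swaps in ready threads
// while others stall on long-latency memory loads (latency hiding).
__global__ void scale(float *x, int n, float s) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] = s * x[i];   // the load of x[i] is the long-latency step
}

int main() {
    const int n = 1 << 24;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    scale<<<256, 256>>>(x, n, 2.0f);   // 256 x 256 = 65,536 resident threads
    cudaDeviceSynchronize();
    printf("x[0] = %f\n", x[0]);       // expect 2.0
    cudaFree(x);
    return 0;
}
```

The 65,536-thread launch is in the same ballpark as the concurrency figure quoted earlier for the Titan X: the point is oversubscription, not matching the core count.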

Page 23: Parallel Computing on the GPU

Winning Applications Use Both CPU and GPU

• CPUs for sequential parts where latency matters
– CPUs can be 10+× faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
– GPUs can be 10+× faster than CPUs for parallel code

(A sketch of this division of labor follows.)
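A minimal sketch of the CPU/GPU split (the kernel name squareAll and the sizes are hypothetical): the CPU runs the sequential parts (setup, control, wrap-up) while a GPU kernel runs the data-parallel part.

```cuda
#include <cstdio>
#include <vector>

// GPU: the data-parallel part, where throughput wins.
__global__ void squareAll(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main() {
    // CPU: the sequential part, where latency matters:
    // prepare input, choose sizes, and drive the GPU.
    const int n = 4096;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    squareAll<<<(n + 255) / 256, 256>>>(d, n);

    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[10] = %f\n", h[10]);   // CPU again for wrap-up: expect 100.0
    return 0;
}
```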

Page 24: Parallel Computing on the GPU

Applications