Graphics Processing Units (GPUs)

Róisín Howard

Bachelor of Engineering in Computer Engineering, Limerick

 Abstract-This paper will discuss the use of GPUs to carry out

general purpose computing as well as graphic acceleration. The

support for GPUs for general purpose computing will be

discussed along with details of internal architectures of GPUs.

The challenges and opportunities presented by this architecture

for high-performance computing will be outlined. The evolution

of GPUs and GPU languages is prompted by the need for

graphics processing in games. These languages will be outlined

along with their similarities and differences. GPUs and multi-

core CPUs are also coming to the fore in mobile devices.

 Keywords-graphics processing unit; GPU; CUDA; OpenCL;

 DirectCompute; OpenGL; Cg; NVidia; Khronos Group; Apple;

 Intel; AMD; Microsoft; Tegra; HLSL; GLSL; GPGPU


User-programmable Graphics Processing Units (GPUs) for

mainstream computing and scientific use are a hot topic in

computer architecture. These GPUs are specialized processor

systems to accelerate the processing of graphics on both

desktops and laptops. OpenCL and CUDA are the main

contender languages for GPU programming available.

Shading languages such as Cg, HLSL and GLSL are available for programming the GPU programmable rendering pipeline.

NVidia is one of the main companies behind the GPU

and the programming of the GPU. The GeForce nVidia GPU

card is compatible with many graphics APIs. OpenGL and

Microsoft’s DirectX are among the compatible APIs. NVidia

is also branching into the mobile space with the Tegra chips for smartphones and tablets. Multi-core CPUs are important for

multi-tasking and lowering the power consumption.

GPGPU, general-purpose computing on GPUs, is a new

concept. Here the GPU is utilized to exploit the data

parallelism which is available in some applications and to perform non-graphics processing. The GPU takes on some of

the mathematically intensive tasks leaving the CPU free to deal

with other user tasks.


 A.  The history of GPUs

Intel made the iSBX 275 Video Graphics Controller Multimode Board in 1983. This was for industrial systems

based on the Multibus standard. The card accelerated the

drawing of lines, arcs, rectangles and character bitmaps. It was

based on the 82720 Graphics Display Controller. Direct

memory access (DMA) was used to load the framebuffer,

which sped up loading. It was intended that this board would be

used with Intel’s line of Multibus industrial single board

computer plug-in cards.[1, 2]

Texas Instruments released the first microprocessor

with on-chip graphics capabilities, the TMS34010, in 1986. This had a very graphics-oriented instruction set and could also run

general-purpose code. The IBM 8514 graphics system was

released as one of the first video cards in 1987 to implement

fixed-function 2D primitives in electronic hardware for IBM

PC compatibles.[1]

 B.  The purpose of a Graphics Processing Unit

A GPU (Graphics Processing Unit) manipulates and alters

memory so as to accelerate the building of images. A GPU is

primarily used for the computation of three dimensional (3D)

functions. Lighting effects, transformations and 3D motion are some of the computations required. These are mathematically

intensive tasks which would put a strain on the CPU.[1-3]

Embedded systems, mobile phones, personal

computers, workstations and game consoles are some devices

in which GPUs are used. Computer graphics can be

manipulated very efficiently by modern GPUs. Due to their

highly parallel structure they are more effective than general-purpose CPUs for algorithms where the processing of large

blocks of data is done in parallel. More CPU time is freed up

for other tasks by using GPUs.[1, 3]

In 1999 the term GPU was popularized by nVidia

who marketed “the world’s first ‘GPU’, or Graphics Processing

Unit, a single-chip processor with integrated transform,

lighting, triangle setup/clipping, and rendering engines that is

capable of processing a minimum of 10 million polygons per second”[4], the GeForce 256. This GPU is capable of billions

of calculations per second. It has over 22 million transistors,

compared to the 9 million found on the Pentium III. The Quadro

is the workstation version, which is designed for CAD

applications. The Quadro can process over 200

billion operations a second and deliver up to 17 million

triangles per second.[1, 3]

C.  The benefits of GPUs

GPUs process large blocks of data in parallel because of their

highly parallel structure. The processing of large blocks of

data could be in the form of fast sort algorithms of large lists,

or 2D fast wavelet transformations. This makes them more effective than general-purpose CPUs. They are used alongside

CPUs for this purpose; by performing those

mathematically intensive tasks they relieve the strain that would have been put on the CPU, which is freed up to perform other

tasks.[1, 5]


NVidia developed Compute Unified Device Architecture,

CUDA, for graphics processing. CUDA is the computing engine in nVidia GPUs. By harnessing the power of the GPU

an increase in computing performance is facilitated. CUDA

shares a range of computational interfaces with two

competitors, the Khronos Group and Microsoft, whose

architectures are OpenCL and DirectCompute.[5, 6]

Access to the virtual instruction set and memory of

the parallel computational elements in CUDA GPUs is given

to developers through CUDA. Computations like those on

CPUs are accessible using CUDA for the latest nVidia GPUs.

However, unlike CPUs, which execute a single thread very

quickly, GPUs have a parallel throughput architecture that

emphasises executing many threads concurrently but slowly. Solving these

general purpose problems on GPUs is known as GPGPU.[5, 7]

 A.  Strengths of CUDA

There are several advantages of CUDA over traditional

GPGPU using graphics APIs. CUDA offers full support for

integer and bitwise operations. This includes integer texture lookups. Scattered reads are also implemented, meaning that

code can read from arbitrary addresses in memory. CUDA

has the advantage of faster downloads and readbacks to and

from the GPU. A shared memory region is also offered;

memory can be shared amongst threads. As a result a user-

managed cache can be availed of, which enables higher bandwidth than is possible using texture lookups. CUDA also

supports a wide range of libraries and tools which are of use to

developers, Figure 1.[5, 8]

Figure 1. CUDA Libraries and Tools[8]
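
The benefit of the user-managed cache mentioned above can be sketched without a GPU. The following plain Python emulation is a minimal sketch, not CUDA code: it counts reads from "global" memory for a 3-point sum, first with every thread reading its own inputs and then with each block staging its inputs once in a local tile that stands in for shared memory. The function names and the block size are hypothetical, chosen only for illustration.

```python
def stencil_naive(data):
    """3-point sum; every output reads its three inputs from 'global' memory."""
    n, reads, out = len(data), 0, []
    for i in range(n):
        total = 0
        for j in (i - 1, i, i + 1):
            if 0 <= j < n:
                total += data[j]
                reads += 1
        out.append(total)
    return out, reads


def stencil_tiled(data, block_dim):
    """Same result, but each block first stages its inputs in a local tile."""
    n, reads, out = len(data), 0, []
    for start in range(0, n, block_dim):
        # The tile covers the block plus one neighbour on each side.
        lo, hi = max(start - 1, 0), min(start + block_dim + 1, n)
        tile = data[lo:hi]  # one 'global' read per staged element
        reads += len(tile)
        for i in range(start, min(start + block_dim, n)):
            # Neighbours now come from the tile, the emulated shared memory.
            left, right = max(i - 1, 0), min(i + 2, n)
            out.append(sum(tile[left - lo:right - lo]))
    return out, reads


data = list(range(16))
naive_out, naive_reads = stencil_naive(data)
tiled_out, tiled_reads = stencil_tiled(data, block_dim=8)
print(naive_out == tiled_out, naive_reads, tiled_reads)
```

Both versions produce the same output, but the tiled version touches the emulated global memory far less often, which is the effect a user-managed cache is intended to achieve.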

 B.  CUDA’s weaknesses

However there are also some limitations. CUDA-enabled

GPUs are only available from nVidia, unlike OpenCL. A performance hit due to system bus bandwidth and latency may

be incurred by copying between host and device memory. Asynchronous memory transfers handled by the GPU’s DMA

engine could partially alleviate this. Due to the optimisation

techniques the compiler is required to employ to use limited

resources, valid C/C++ may sometimes be flagged and prevent compilation.


C.  CUDA programming model

The CPU is known as the host and the GPU is the compute

device; the GPU acts as a coprocessor to the CPU in the CUDA programming model. Data will need to be shared between

both devices as they each have their own memory. The kernel

is an application or a program that runs on the GPU and when

it is launched it is executed as an array of parallel threads.

This execution is shown in Figure 2. A block can only contain

a certain number of threads, so blocks are grouped together

to form a grid of thread blocks.[9, 10]

Figure 2. CUDA kernel threads[10]
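
The thread hierarchy described above can be sketched in plain Python, with no GPU required. This is a minimal sketch under stated assumptions: the grid is one-dimensional, a sequential loop stands in for threads that would run in parallel on the device, and the names `launch`, `vector_add`, `block_idx`, `block_dim` and `thread_idx` are hypothetical, chosen to mirror the `blockIdx`, `blockDim` and `threadIdx` built-ins of CUDA C.

```python
import math


def launch(kernel, n_elements, block_dim, *args):
    """Emulate a 1D kernel launch: run `kernel` once per (block, thread) pair."""
    # Enough blocks to cover every element, rounding up.
    grid_dim = math.ceil(n_elements / block_dim)
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, n_elements, *args)
    return grid_dim


def vector_add(block_idx, block_dim, thread_idx, n, a, b, out):
    # Global index, as in blockIdx.x * blockDim.x + threadIdx.x in CUDA C.
    i = block_idx * block_dim + thread_idx
    if i < n:  # threads past the end of the data do nothing
        out[i] = a[i] + b[i]


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(a)
blocks = launch(vector_add, len(a), 2, a, b, out)
print(blocks, out)
```

With five elements and two threads per block, three blocks are needed and the last thread of the last block is idle, which is why the bounds check inside the kernel matters.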

 D.  CUDA architecture

Figure 3 shows an example of the CUDA architecture. Here it

can be seen that OpenCL and DirectCompute are supported

applications on the CUDA platform for nVidia hardware;

along with CUDA they are the device-level API support

offered by nVidia. Language integration is also possible. The CUDA run-time application can be in C, C++, Fortran,

Python, or Java, etc. The CUDA Architecture consists of

parallel compute engines inside nVidia GPUs (1). It also

contains OS kernel-level support for hardware initialization

and configuration (2). The user-mode driver, which provides

a device-level API for developers (3) and a PTX instruction set architecture (4) for parallel computing kernels and

functions, is also shown.[11]

Figure 3. CUDA Architecture[11]


Open Computing Language, OpenCL, is a cross-platform,

parallel programming framework. It is the first truly open and

royalty-free programming standard for general-purpose

computations on heterogeneous systems. OpenCL provides a

uniform programming environment for software developers to

write efficient, portable code for devices using a diverse mix

of multi-core CPUs, GPUs, and other parallel processors such as DSPs. OpenCL includes a language for writing kernels and

APIs (Application Programming Interfaces) that are used to

identify and then control the platforms. Using task-based and

data-based parallelism OpenCL provides parallel


OpenCL is maintained by the non-profit technology

consortium Khronos Group[13]. It has been adopted by

Intel[14], Advanced Micro Devices (AMD)[15], nVidia[16],

ARM Holdings[17] and IBM[18]. OpenCL gives any

application access to the graphics processing unit for non-

graphical computing, extending the power of the GPU beyond


Apple Inc. initially developed OpenCL and holds the trademark[19]. OpenCL was refined into an initial proposal in

collaboration with technical teams at nVidia, Intel, AMD and

IBM. The proposal was submitted to the Khronos Group in

2008. The goal was to have a cross platform environment for

general purpose computing on GPUs. Representatives from

CPU, GPU, embedded-processor and software companies

 joined together to form the Khronos Compute Working Group

to finish the technical details on the specification for

OpenCL 1.0. Once the specification was reviewed by Khronos

members and approved it was released to the public by the end

of 2008. The world’s first conformant GPU implementation

of OpenCL for both Windows and Linux was shipped in June

2009.[12, 16]

 A.  Strengths of OpenCL

OpenCL is an open and royalty-free language, and the fact that

code is portable across devices is a big advantage. It is a C-like

language for heterogeneous devices. It can be used on

parallel CPU architectures and it is not vendor specific.

OpenCL provides a common language for writing

computational “kernels”, and a common API for managing execution on target devices. OpenCL implementations

already exist for nVidia and AMD GPUs and for x86 CPUs.

 B.  Weaknesses of OpenCL

OpenCL has some limitations. It is a low-level API, which means that developers are responsible for a lot of plumbing,

with lots of objects/handles to keep track of. They are also

responsible for thread safety; certain types of multi-accelerator

codes are much more difficult to write than in CUDA. There

is a need for OpenCL middleware and libraries, such as the

libraries and tools available for CUDA. OpenCL code must

deal with hardware diversity. Many features are optional and

are not supported by certain devices. Due to the diversity of

hardware on which OpenCL must operate, a single kernel is unlikely to achieve peak performance on all device


C.  OpenCL architecture

Passing data to and from the processing

environment and compiling the OpenCL code are the

main parts of the OpenCL framework. The main stages of

execution are shown below in Figure 4. Setting up and

coordinating the host environment (with n processors -

including multiple GPUs) so that it can then distribute the data

and compile the code efficiently are the key parts OpenCL

handles. OpenCL then has control of each process. This means

it can store the main progress of the code until the completion

of all the desired operations, when either more operations can

be performed or the data from the GPU can then be handed

back to main memory on the CPU. OpenCL depends on the

driver provided by the hardware vendor in order for it to be supported.[21]

Figure 4. OpenCL Architecture[21]

 D.  OpenCL programming model

Similarly to CUDA, OpenCL has kernels. One or more kernels make up an application or a program that runs on the

GPU and when it is launched it is executed as an array of parallel work items. Work groups contain the array of parallel

work items. Kernels run over a global dimension index range

which is known as an NDRange, shown in Figure 5.[22-24]

Figure 5. OpenCL NDRange[24]
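
The relationship between work groups, work items and the NDRange can be sketched in plain Python. This is a minimal sketch assuming a zero global offset; the function name `ndrange_work_items` is hypothetical, and the three identifiers it yields correspond to what the OpenCL C built-ins `get_group_id`, `get_local_id` and `get_global_id` would return for each work item.

```python
from itertools import product


def ndrange_work_items(global_size, local_size):
    """Yield (group_id, local_id, global_id) for every work item in an NDRange."""
    # OpenCL 1.x requires the global size to be a multiple of the local size.
    assert all(g % l == 0 for g, l in zip(global_size, local_size))
    num_groups = tuple(g // l for g, l in zip(global_size, local_size))
    for group_id in product(*(range(n) for n in num_groups)):
        for local_id in product(*(range(l) for l in local_size)):
            # Per dimension: global id = group id * local size + local id.
            global_id = tuple(
                g * l + i for g, l, i in zip(group_id, local_size, local_id)
            )
            yield group_id, local_id, global_id


items = list(ndrange_work_items(global_size=(4, 4), local_size=(2, 2)))
print(len(items), items[0], items[-1])
```

A 4x4 NDRange split into 2x2 work groups gives four work groups of four work items each, and every global index is covered exactly once.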


Since the inception of OpenCL there have been many comparisons between it and CUDA. A correct implementation

of OpenCL for the target architecture performs “no worse” than CUDA. Portability is the key feature of OpenCL. It is

not vendor specific like CUDA, which only runs on nVidia

devices. This has both advantages and disadvantages

associated with it.[5, 12]

CUDA is limited to nVidia hardware, thus it is more acutely aware of the platform upon which it will be executing.

More mature compiler optimizations and execution techniques

are provided as a result. This gives CUDA the upper hand as

OpenCL code needs to be prepared to deal with much greater

hardware diversity. GPU-specific technologies cannot be

directly used by the programmer.[5, 12]

CUDA has a much larger userbase and codebase than

OpenCL due to the maturity of CUDA. The developer can add in optimizations manually to the kernel code. OpenCL

has less mature compilation techniques. As the OpenCL

toolkit matures the gap between it and the CUDA toolkit will

narrow.[5, 12]

The Scalable Heterogeneous Computing Benchmark

Suite (SHOC) was used to compare CUDA and OpenCL

kernels on nVidia GPUs. CUDA performs better on NVIDIA

GPUs than OpenCL according to the tests. The test measures the number of floating point operations per second achieved by

the kernels, in GFLOPS. The graph of results is shown in Figure 6.[25]

Figure 6. CUDA vs OpenCL on nVidia GPU[25]
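
The GFLOPS metric used in such comparisons can be illustrated with a short calculation. This is a minimal sketch, not part of SHOC: the function name `gflops`, the kernel and the runtime below are hypothetical values chosen only to show how an operation count and a measured time combine into the reported figure.

```python
def gflops(flop_count, seconds):
    """Convert a count of floating point operations and a runtime into GFLOPS."""
    if seconds <= 0:
        raise ValueError("runtime must be positive")
    return flop_count / seconds / 1e9


# Hypothetical measurement: y = a*x + y over one million elements costs
# two floating point operations (a multiply and an add) per element.
n_elements = 1_000_000
flops = 2 * n_elements
elapsed = 0.0005  # seconds, an assumed kernel runtime
print(f"{gflops(flops, elapsed):.2f} GFLOPS")
```

Because the operation count is fixed by the algorithm, two implementations of the same kernel can be ranked by this figure directly from their measured runtimes.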

 A.  Similarities between CUDA and OpenCL

The programming model used by CUDA is similar to that

used by OpenCL. Figure 7 shows a comparison of terms for

the data parallelism models. The CPU is the host for both models, and kernels launched from the application

execute as parallel thread blocks in a grid, in CUDA terms.[23]

Figure 7. Mapping of terms for data parallelism models - OpenCL to CUDA[23]

 B.   Differences between CUDA and OpenCL

CUDA is hardware specific whereas OpenCL is not vendor

specific. Due to this fact CUDA knows the hardware on

which it runs and can be optimized for it. OpenCL has to be

adapted for each different hardware vendor and may not

perform as well as CUDA as a result.[5, 12]

OpenCL is an open language, very portable and

maintained by the Khronos Group; CUDA is not an open

language. CUDA has been around for longer than OpenCL, thus it has a large code and user-base; it is a more mature language. OpenCL’s compilation techniques are less mature

and the programmer needs to do a lot more low level

programming than with CUDA.[5, 12]


 A.   An overview of DirectCompute

Microsoft developed DirectCompute. This is an API that

supports GPGPU on Microsoft Windows Vista and Windows

7. DirectCompute is part of the DirectX collection of APIs.

Although it was initially released with the DirectX 11 API, it

runs on both DirectX 10 and DirectX 11 GPUs.[26-28]

According to nVidia’s DirectCompute programming

guide, DirectCompute is a new type of shader which exposes

the compute functionality of the GPU. This compute shader

has much more general purpose processing capabilities than

the normal shader.[29]

There doesn’t have to be a fixed mapping between

the data being processed and the threads doing the processing

with DirectCompute. This means that one thread can process

one or many data elements, and the number of threads being

used to perform the computation is controlled by the

application directly.[29]
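
The flexible mapping between threads and data can be sketched in plain Python. This is a minimal sketch, not HLSL: a sequential loop emulates the threads, the strided assignment of elements to threads is one possible mapping chosen for illustration, and the names `run_threads`, `num_threads` and `owner` are hypothetical.

```python
def run_threads(data, num_threads, fn):
    """Apply `fn` to every element using `num_threads` emulated threads."""
    out = [None] * len(data)
    owner = [None] * len(data)  # records which thread handled each element
    for thread_id in range(num_threads):
        # Thread t handles elements t, t + num_threads, t + 2*num_threads, ...
        for i in range(thread_id, len(data), num_threads):
            out[i] = fn(data[i])
            owner[i] = thread_id
    return out, owner


data = list(range(10))
out, owner = run_threads(data, num_threads=4, fn=lambda x: x * x)
print(out)
print(owner)
```

The application picks `num_threads` independently of the data size; with ten elements and four threads, some threads process three elements and others two.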

DirectCompute has thread group shared memory

which allows groups of threads to share data, and can reduce

bandwidth requirements significantly. Similarly to other

Compute APIs, Compute Shaders do not directly support any

fixed-function graphics features with the exception of


 B.   Advantages of DirectCompute

There are several advantages of DirectCompute over other

GPU computing solutions. Direct3D is integrated with

DirectCompute which means it has efficient interoperability

with D3D graphics resources. All texture features are

included but LOD must be specified explicitly. The HLSL

shading language is used by DirectCompute. A single API is

provided across all graphics hardware vendors on Windows

platforms; as a result there is some level of guarantee of

consistent results across the different hardware.[29]

C.   An overview of OpenGL

The Open Graphics Library, OpenGL, is an API for GPUs.

The procedures and functions used to specify the objects and

operations needed to produce 3D images are contained in this

interface. Silicon Graphics Incorporated designed OpenGL in

1992. The Khronos Group manages OpenGL.[30-33]

OpenGL is designed to be window-system and operating-system independent; it is also network-transparent.

High-performance, visually compelling graphics software

applications can be created using OpenGL on PCs,

workstations or supercomputers. It is used in applications

such as CAD and video games.

All the features of the latest graphics hardware are

exposed by OpenGL. Shown in Figure 8 is the OpenGL

client-server model. Once the hardware and software

configuration are compliant this model guarantees consistent

presentation on any compliant hardware and software


Figure 8. OpenGL client-server model[34]

 D.   Advantages of OpenGL

Due to the fact that OpenGL is a C-based API it is extremely

portable and widely supported. OpenGL provides functions

for an application to generate 2D or 3D images and allows these rendered images to be copied to its own memory or

displayed on the screen. The OpenGL specification is adhered

to for every implementation of OpenGL. A set of

conformance tests must be passed, thus implementation is

reliable. Similarly to OpenCL, OpenGL’s specification is

controlled by the Khronos Group. This guarantees industry

acceptance as the members of this industry consortium are many of the major companies in the computer graphics



A computer program that is used to calculate rendering effects on graphics hardware is a shader. A shader is used to program

the GPU programmable rendering pipeline. Programming

languages adapted to map on shader programming are known

as shading languages. Instructions are sent to the GPU by the

CPU in the form of a compiled shading language program.[35,


The geometry is transformed and lighting calculations are performed within the vertex shader. Some changes to the geometry in the scene are performed if a

geometry shader is in the GPU. The calculated geometry is

subdivided into triangles which are then broken down into

pixel quads. Transformation of 3D data into useful 2D data

for display by the frame buffer is done by the graphics

pipeline using the above steps from the shader program.[35,


The GPU is allowed to function as a stream processor

since all fragments can be thought of as independent, thus

making the graphics pipeline well suited to the rendering process. All stages of the pipeline can be used simultaneously

for different vertices or fragments; this independence allows

the graphics processor to use parallel processing units. By

using parallel processing units multiple vertices or fragments

can be processed in a single stage of the pipeline at the same

time.[35, 36]

 A.  OpenGL shading language

The OpenGL shading language is known as GLSL. It is a

high-level shading language. It has been designed to allow

application programmers to express the processing that occurs at the programmable points of the OpenGL pipeline. Vertex

and fragment processing is unified by GLSL in a single

instruction set. This allows branches and conditional loops.

The GLSL has five shader stages: vertex, geometry, fragment,

tessellation control, and tessellation evaluation.[37, 38]

OpenGL has the benefit of having cross-platform

compatibility on multiple operating systems. Shaders that are

written can be used on any hardware vendor’s graphics card

once GLSL is supported. Each hardware vendor can create

code optimised for their particular graphics card’s architecture

because the GLSL compiler is included in their driver.[37, 38]

 B.  Cg programming language

Nvidia developed Cg (C for graphics), which is a high-level

shading language. It was developed in close collaboration

with Microsoft for programming pixel and vertex shaders. This is not a general programming language; it is only suitable

for GPU programming. Microsoft has a similar shading

language called HLSL.[39]

Cg features API independence, and a variety of free tools to improve asset management are available. It was

designed for easy and efficient production pipeline integration.

Connectors are special data structures used in Cg to link the

various stages of processing. They define the input from

application to vertex processing stage and the attributes to be

used as inputs to the fragment processing.[39]

C.   DirectX High-Level Shader Language

HLSL is the high-level shader language developed by

Microsoft for DirectX and Xbox. It is a C-type shader

language supported by DirectX and Xbox game consoles.

Shaders for the Direct3D pipeline can be created using HLSL.

There are three shader stages in the HLSL; the vertex shader,

the geometry shader and the pixel shader.[40, 41]

GPUs are extensively used in the computer games market.

This is a booming market and it drives the sale of the GPU.

This means that the future of the GPU is greater than that of

the general-purpose CPU. The CPU will still remain as the

main processor but there is much more potential for expanding

the computing experience using the GPU. The GPU is much

better at parallelism than the CPU, thus complex problems can be easily solved by the GPU which can be both graphical and


Due to the high volumes of GPUs being sold to PC

gamers, GPUs are

relatively inexpensive. The trade-off of having high-cost special purpose hardware is thus less of a factor. According to

Moore’s Law the CPU growth doubles every 18 months and

the GPU growth doubles every 6 months. This makes it

impossible for CPU manufacturers to keep up with the rapid

growth of GPU advancement. It would prove too expensive to

remanufacture a new CPU every time a new GPU chip is

released. Figure 9 shows how GPUs are obeying Moore’s Law and CPUs are being left behind. “The graphical

processing unit is visually and visibly changing the course of

general purpose computing”[43].[42, 44]

Figure 9. Comparison of GPUs and CPUs[44]
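
Taking the doubling periods quoted above at face value, the widening gap can be made concrete with a short calculation. This is a minimal sketch: the three-year horizon is an arbitrary assumption and the function name `growth_factor` is hypothetical.

```python
def growth_factor(months, doubling_period_months):
    """Growth after `months` if capability doubles every `doubling_period_months`."""
    return 2 ** (months / doubling_period_months)


horizon = 36  # months, an assumed comparison window
cpu = growth_factor(horizon, 18)  # doubling every 18 months
gpu = growth_factor(horizon, 6)   # doubling every 6 months
print(cpu, gpu, gpu / cpu)
```

Over three years the quoted rates give a 4x improvement for the CPU against 64x for the GPU, a sixteen-fold difference in relative growth.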

GPU hardware architecture is moving from a single-core hardware pipeline implementation for graphics

processing to highly parallel and programmable cores for more

general purpose computing. By adding more programmability

and parallelism to a GPU core architecture it is evolving towards a general purpose CPU-like core.[45]


GeForce is a brand of GPUs designed by nVidia. The

GeForce logo is shown below in Figure 10. There are over 10

generations of the GeForce design. The original release of the

design was the GeForce 256 in 1999. The first GeForce

products were intended for the high-margin PC gaming market. They were discrete GPUs, designed to be used on add-on

graphics boards. All tiers of the PC

gaming market were covered in subsequent designs. NVidia’s

embedded application processors include the most recent

GeForce technology. These are designed for mobile

phones.[46, 47]

Figure 10. GeForce logo[47]

 A.  The GeForce 6 Series

The sixth generation of GeForce is the GeForce 6 Series. It

was released in 2004. This series can have a 4, 8, 12, or 16

pixel-pipeline GPU architecture. It contains an on-chip video

processor with full MPEG-2 encoding and decoding, and

advanced adaptive de-interlacing called PureVideo. This

design also has High Precision Dynamic Range technology

and 8 times more shading performance than previous designs.

There is DirectX 9 Shader Model 3.0 support and OpenGL 2.0

optimizations and support also.[47-49]

 B.   Architecture of the GeForce 6 Series

The GPU memory interface has an available bandwidth of

35GBps. The CPU memory interface has 6.4GBps available

bandwidth and the PCI Express bus has 8GBps. This shows that there is a vast amount of internal bandwidth available on

the GPU. More dramatic performance improvements can be

made by making sure that algorithms running on the GPU take

advantage of this bandwidth.[50]
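
The practical meaning of these bandwidth figures can be shown with a best-case transfer time calculation. This is a minimal sketch: the three bandwidths are the ones quoted above, 1 GB is taken as 10^9 bytes, the 64 MB payload is a hypothetical data set, and the function name `transfer_time_ms` is chosen for illustration. Real transfers would also incur latency and protocol overhead, which are ignored here.

```python
# Figures quoted in the text, in GB/s (taken here as 10**9 bytes per second).
BANDWIDTH_GBPS = {
    "GPU memory interface": 35.0,
    "PCI Express bus": 8.0,
    "CPU memory interface": 6.4,
}


def transfer_time_ms(num_bytes, gb_per_s):
    """Best-case time in milliseconds to move `num_bytes` at peak bandwidth."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


payload = 64 * 10**6  # hypothetical 64 MB data set
for name, bandwidth in BANDWIDTH_GBPS.items():
    print(f"{name}: {transfer_time_ms(payload, bandwidth):.2f} ms")
```

Data that stays in GPU memory can be streamed several times in the time it takes to cross the PCI Express bus once, which is why algorithms benefit from keeping intermediate results on the device.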

Figure 11 shows the block diagram of the GeForce 6 Series Architecture. It shows the graphics process by

which input arrives from the CPU (host) and is output as pixels drawn to the frame buffer. The CPU writes a command

stream which sets and modifies the state, references the vertex

and texture data, and sends rendering commands. These

states, commands and vertices flow down through the block diagram where they will be used in subsequent pipelines.[50]

Figure 11. GeForce 6 Series Architecture[51]

The vertex shaders/processors, shown in Figure 12, allow a program to be applied to each vertex in the object.

Transformations, skinning and other per-vertex operations are

performed here. All operations in this processor are done in

32-bit floating-point (fp32) precision. There can be up to six

vertex units on high-end models and there may be two on low-

end models.[50]

The vertex programs can fetch texture data. The

texture cache is shared between the fragment processor and

the vertex processor due to the fact that the vertex processor

can perform texture access. There is also a vertex cache to

store all data before and after the vertex processor.[50]

Primitives are points, lines or triangles. The vertices

are grouped into these primitives. Three blocks, cull,

clip and setup, perform per-primitive operations. Primitives

that aren’t visible are removed (cull), primitives that intersect

the view frustum are clipped (clip) and edge and plane

equation set up on the data is performed for the rasterization

process (setup).[50]

Figure 12. GeForce 6 Series Vertex Processor[50]

The calculation of pixels which are covered by each primitive is done in the rasterization block. It uses the z-cull

block to discard pixels. A fragment will then pass through the

fragment processor where there will be tests performed on it.

Once it passes the tests it will carry depth and colour

information to a pixel on the frame buffer.[50]

The fragment processor and texel pipeline is also

known as the pixel shader, Figure 13. This unit applies a

shader program to each fragment independently. There can be

a varying number of fragment pipelines on the GeForce 6

Series GPUs. Texture data is cached on-chip, similarly to the

vertex processor. This reduces bandwidth requirements and

increases performance.[50]

Figure 13. GeForce 6 Series Fragment Processor and Texel Pipeline[50]

Quads are squares of four pixels. The texture and

fragment-processing units operate on quads. This allows

direct computation of derivatives for calculating texture level

of detail. The texture unit fetches the data from memory for

the fragment processor and returns it in fp16 or fp32 format. The texture unit can read a 2D or 3D array of data. 16-bit

floating-point precision filtering is supported by this


There are two fp32 shader units per pipeline in the

fragment processor. Before the fragments re-circulate through the pipeline to execute the next set of instructions they are

passed through both shader units and the branch processor.

This happens once every clock cycle.[50]

Once the fragments have passed through the

fragment-processing unit they are sent to the z-compare and blend units in the order in which they were rasterized. Stencil

operations, alpha blending, depth testing and the final colour

write to the target surface are performed in these units.[50]

The memory system is divided into four DRAM

partitions, all of which are independent. The memory subsystem

can operate efficiently by having smaller, independent memory partitions. This is regardless of whether small or

large blocks of data are transferred. The streaming of 32-byte

memory access near the physical limit of 35GBps is possible

due to the four independent memory partitions giving the GPU

a wide flexible memory subsystem of roughly 256 bits.[50]

C.  Challenges and Opportunities for High-Performance Computing


To achieve optimal performance of the devices there are some

actions that could be carried out. The z-cull block, shown in

Figure 11, is used to discard pixels. It avoids work that

doesn’t contribute to the final result. By concluding early that a computation doesn’t contribute, work is saved: the z-values for all objects

can be rendered first, before shading. For example, with

general purpose computing the z-cull can be used to select

which parts are still active in the computation. It will cull the

computational threads that have already been resolved.[50]
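
The idea of culling resolved computations can be emulated in software. This is a minimal sketch in plain Python, not a description of the z-cull hardware: a list of flags plays the role of the depth mask, and the function name `iterate_with_cull`, the halving step and the termination test are hypothetical choices made for illustration.

```python
def iterate_with_cull(values, step, is_done, max_passes):
    """Repeatedly apply `step`, skipping elements already marked as resolved."""
    values = list(values)
    active = [not is_done(v) for v in values]  # plays the role of the depth mask
    work = 0  # number of element updates actually performed
    for _ in range(max_passes):
        if not any(active):
            break
        for i, alive in enumerate(active):
            if not alive:
                continue  # culled: no further work is spent on this element
            values[i] = step(values[i])
            work += 1
            if is_done(values[i]):
                active[i] = False
    return values, work


values, work = iterate_with_cull(
    [1, 8, 64, 3], step=lambda v: v // 2, is_done=lambda v: v <= 1, max_passes=10
)
print(values, work)
```

Here the slowest element needs six passes, so updating all four elements on every pass would cost 24 updates, whereas culling resolved elements reduces this to 10.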

The texture math can be exploited when loading data. This unit filters data before it is returned to the fragment

processor. The total data needed by the shader is thus

reduced. The total work done by the shader can be reduced if

the texture unit’s bilinear filter is used more frequently.

Similarly, when performing compares, work can be offloaded from the

processor by using the filtering support in

shadow buffering; the result can then be filtered.[50]

Branching can be very beneficial, provided the work avoided by branching

outweighs the cost of branching. The

fragment processor operates on many fragments simultaneously.

Fragments in a group may take different branches; in this case

both branches have to be taken by the fragment processor.

This could reduce the performance of branching in programs.

However, if branching is not an effective choice, conditional

writes can be used.[50]
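
The cost of divergent branches can be illustrated with a simplified cost model. This is a minimal sketch under stated assumptions: fragments are processed in fixed-size groups, a group in which fragments disagree pays for both sides of the branch, and the group size and cycle costs are hypothetical numbers rather than GeForce 6 Series figures. The function names `group_cost` and `total_cost` are also chosen for illustration.

```python
def group_cost(conditions, cost_true, cost_false):
    """Cycles for one group of fragments evaluating an if/else together."""
    takes_true = any(conditions)
    takes_false = not all(conditions)
    # A divergent group pays for both sides; a coherent group pays for one.
    return (cost_true if takes_true else 0) + (cost_false if takes_false else 0)


def total_cost(conditions, group_size, cost_true, cost_false):
    """Sum the cost over consecutive groups of `group_size` fragments."""
    groups = [
        conditions[i:i + group_size] for i in range(0, len(conditions), group_size)
    ]
    return sum(group_cost(g, cost_true, cost_false) for g in groups)


coherent = [True] * 4 + [False] * 4  # neighbouring fragments agree
divergent = [True, False] * 4        # neighbouring fragments disagree
print(total_cost(coherent, 4, cost_true=10, cost_false=2))
print(total_cost(divergent, 4, cost_true=10, cost_false=2))
```

Both inputs contain the same number of fragments taking each branch, yet in this model the divergent arrangement costs twice as much, which is why branch coherence matters when deciding whether branching pays off.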

A full-speed fp16 normalize instruction, executed in parallel, is supported by this design. By having fp16 intermediate values

the internal storage and datapath requirements are reduced.

Instead of using fp32 intermediate values, these could be

saved for cases where the precision is needed; the performance

will be increased by using fp16 intermediate values.[50]

To keep hundreds of fragments in flight, the shader pipeline has a fixed amount of register space per fragment.

Fewer fragments will remain in flight if this register

space is exceeded. This will reduce the latency tolerance for

texture fetches, thus adversely affecting performance. If the

register file uses fp32x4 values exclusively it may run out of

read and write bandwidth to feed all units. There is enough bandwidth to keep all units busy when reading fp16x4 values.[50]

Extraordinary new levels of performance are delivered by

this new design, which streamlines the creation of stunning

effects in games and other 3D real-time applications. The

hardware power that is needed to create such detailed and

vibrant images won’t be too intense on the PC due to this new architecture.[52]

The new superscalar shader architecture in this

design doubles the number of operations expected per cycle.

There is a significant performance increase as a result. A full

32-bit floating point precision is provided to deliver higher

quality images. Developers can implement stunning visual

effects. There is no compromise of speed for quality in this




 A.  The difference between CPU and GPU

The CPU is the central processing unit; it is the brain of the

computer system. The GPU is a graphics processing unit.

The GPU is a complementary processing unit which handles

the computation of intensive graphics processing. The rest of the application still runs on the CPU. The application runs

faster from a user’s perspective because the processing power of

the GPU boosts performance. Hybrid computing is using the

GPU as a co-processor to the CPU. Graphics processing is inherently parallel, therefore it can be easily parallelized and

accelerated.[2, 53]

 B.   How a multicore system differs from a GPU

A CPU is designed with a few cores; it can consist of 4 to 8

cores. These cores can handle a few software threads which

can be exploited in an application program. Figure 14 shows

an example of a CPU with multiple cores. Compared to single-core predecessors, multi-core CPUs can operate lower

first mobile super chip…with the first mobile dual-core

CPU”[58]. The new Tegra 3 which has quad core processing

has the 4-PLUS-1™ battery-saver technology which provides great mobile performance.[57, 58]

LG, Motorola and Samsung are among the makers of

phones which are powered by Tegra 2[59]. There is a long

list of tablets powered by Tegra 2, the most popular ones

among these are the Samsung Galaxy Tablet, Sony Tablet and

Toshiba Thrive[60].

The challenges of HD video playback, streaming

videos and 3D gaming etc for power consumption and

performance have been faced previously by desktop and

notebook CPUs. Now mobile application processors are

facing this challenge which stretches the capabilities of current

single core mobile processors. To increase their performance and stay within mobile power budgets, mobile processors need

to have multiple cores.[54, 61]

The Tegra 2 was designed to harness the power of

Symmetrical Multiprocessing which delivers a higher

performance and lowers power consumption. It offers faster

Web page loading times, a higher quality of game play with faster multitasking features and tremendous battery life

improvements.[58, 61]


GPUs are more effective than general-purpose CPUs due to

the processing of large blocks of data in parallel. They are

used alongside CPUs for this purpose; by performing those

mathematically intensive tasks they relieve the strain that would

have been put on the CPU, which is freed up to perform other

tasks.[1, 5]

CUDA and OpenCL are the main contender

languages for GPU programming along with DirectCompute

from Microsoft. By implementing OpenCL correctly for the

target architecture it performs “no worse” than CUDA.

CUDA is a more mature language due to its code and user-

base. OpenCL is missing the relevant middleware tools and libraries that CUDA has. As the OpenCL toolkit matures the

gap between it and the CUDA toolkit will narrow.[5, 12, 26]

The GPU has begun to evolve from a single core,

fixed function hardware pipeline implementation just for

graphics rendering, to a set of highly parallel and

programmable cores for more general purpose computation.

The architecture of many-core GPUs is starting to look more

and more like multi-core, general purpose CPUs.[45]

A single core CPU runs at higher clock frequencies

and voltages than a multi-core CPU and it takes longer periods

of time to complete a given task. Distributing the workload

across multiple CPU cores is known as workload sharing. On

multi-core CPUs each CPU core can run at a lower frequency

and voltage to complete multi-threaded tasks. Significantly

less power is consumed by each core, and they offer a higher performance per watt than single core CPUs due to the lower

operating frequencies and voltages.[61]
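
The power saving from lower frequencies and voltages can be illustrated with the common first-order approximation for dynamic power in CMOS logic, P = C x V^2 x f. This formula is not stated in the sources cited above and is introduced here as an assumption; the normalised capacitance, voltage and frequency values are hypothetical, as is the function name `dynamic_power`. The comparison also assumes a workload that divides perfectly across two cores.

```python
def dynamic_power(capacitance, voltage, frequency, cores=1):
    """First-order dynamic power estimate, P = cores * C * V^2 * f."""
    return cores * capacitance * voltage ** 2 * frequency


# Normalised units: one core at full voltage and frequency is the baseline.
single_core = dynamic_power(capacitance=1.0, voltage=1.0, frequency=1.0)

# Two cores at half the frequency give the same aggregate throughput for a
# perfectly parallel task; the lower frequency permits an assumed 0.8x voltage.
dual_core = dynamic_power(capacitance=1.0, voltage=0.8, frequency=0.5, cores=2)

print(single_core, round(dual_core, 2))
```

Because power scales with the square of the voltage, the two slower cores in this sketch consume roughly two thirds of the power of the single fast core for the same nominal throughput.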

Nvidia developed the Tegra to harness the power of

multi-core CPUs to deliver a higher performance and lower

power consumption on mobile devices. There are tremendous

battery life improvements as a result along with extreme multitasking features, a better game playing experience and

faster Web browsing.[54, 61]


