Cross-Platform Heterogeneous Runtime Environment
A Dissertation Presented
by
Enqiang Sun
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in
Computer Engineering
Northeastern University
Boston, Massachusetts
April 2016
To my family.
Contents
List of Figures
List of Tables
Acknowledgments
Abstract of the Dissertation
1 Introduction
   1.1 A Brief History of Heterogeneous Computing
   1.2 Heterogeneous Computing with OpenCL
   1.3 Task-level Parallelism across Platforms with Multiple Computing Devices
   1.4 Scope and Contribution of This Thesis
   1.5 Organization of This Thesis
2 Background and Related Work
   2.1 From Serial to Parallel
   2.2 Many-Core Architecture
   2.3 Programming Paradigms for Many Core Architecture
      2.3.1 Pthreads
      2.3.2 OpenMP
      2.3.3 MPI
      2.3.4 Hadoop MapReduce
   2.4 Computing with Graphic Processing Units
      2.4.1 The Emergence of Programmable Graphics Hardware
      2.4.2 General Purpose GPUs
   2.5 OpenCL
      2.5.1 An OpenCL Platform
      2.5.2 OpenCL Execution Model
      2.5.3 OpenCL Memory Model
   2.6 Heterogeneous Computing
      2.6.1 Discussion
   2.7 SURF in OpenCL
   2.8 Monte Carlo Extreme in OpenCL
3 Cross-platform Heterogeneous Runtime Environment
   3.1 Limitations of the OpenCL Command-Queue Approach
      3.1.1 Working with Multiple Devices
   3.2 The Task Queuing Execution System
      3.2.1 Work Units
      3.2.2 Work Pools
      3.2.3 Common Runtime Layer
      3.2.4 Resource Manager
      3.2.5 Scheduler
      3.2.6 Task-Queuing API
4 Experimental Results
   4.1 Experimental Environment
   4.2 Static Workload Balancing
      4.2.1 Performance Opportunities on a Single GPU Device
      4.2.2 Heterogeneous Platform with Multiple GPU Devices
      4.2.3 Heterogeneous Platform with CPU and GPU (APU) Device
   4.3 Design Space Exploration for Flexible Workload Balancing
      4.3.1 Synthetic Workload Generator
      4.3.2 Dynamic Workload Balancing
      4.3.3 Workload Balancing with Irregular Work Units
   4.4 Cross-Platform Heterogeneous Execution of clSURF and MCXCL
   4.5 Productivity
5 Summary and Conclusions
   5.1 Portable Execution across Platforms
   5.2 Dynamic Workload Balancing
   5.3 APIs to Expose Both Task-level and Data-level Parallelism
   5.4 Future Research Directions
      5.4.1 Including Flexible Workload Balancing Schemes
      5.4.2 Running specific kernels on the best computing devices
      5.4.3 Prediction of data locality
Bibliography
List of Figures
1.1 Multi-core CPU, GPU, and Heterogeneous System-on-Chip CPU and GPU. At present, designers are able to make decisions among diverse architecture choices: homogeneous multi-core with cores of various sizes and complexity, or heterogeneous system-on-chip architectures.
2.1 Intel Processors Introduction Trends
2.2 Multi-core Processors with Shared Memory
2.3 Intel's TeraFlops Architecture
2.4 Intel's Xeon Phi Architecture
2.5 IBM's Cell
2.6 High Level Block Diagram of a GPU
2.7 OpenCL Architecture
2.8 An OpenCL Platform
2.9 The OpenCL Execution Model
2.10 OpenCL work-items mapping to GPU devices.
2.11 OpenCL work-items mapping to CPU devices.
2.12 The OpenCL Memory Hierarchy.
2.13 Qilin Software Architecture
2.14 The OpenCL environment with the IBM OpenCL common runtime.
2.15 Maestro's Optimization Flow
2.16 Symphony Overview
2.17 The Program Flow of clSURF.
2.18 Block Diagram of the Parallel Monte Carlo simulation for photon migration.
3.1 Distributing work units from work pools to multiple devices.
3.2 CPU and GPU execution
3.3 An example execution of vector addition on multiple devices with different processing capabilities.
4.1 The performance of our work pool implementation on a single device: One Work Pool.
4.2 The performance of our work pool implementation on a single device: Two Work Pools.
4.3 Load balancing on dual devices: V9800P and HD6970.
4.4 Load balancing on dual devices: V9800P and GTX 285.
4.5 Performance assuming different device fission configurations and load balancing schemes between CPU and Fused HD6550D GPU.
4.6 The Load Balancing on dual devices: HD6550D and CPU.
4.7 Performance of different workload balancing schemes on all 3 CPU and GPU devices, an A8-8350 CPU, a V7800 GPU and a HD6550D GPU, as compared to a V7800 GPU device alone.
4.8 Performance of different workload balancing schemes on 1 CPU and 2 GPU devices, an NVS 5400M GPU, a Core i5-3360M CPU and an Intel HD Graphics 4000 GPU, as compared to the NVS 5400M GPU device alone.
4.9 Performance Comparison of clSURF implemented with various workload balancing schemes on the platform with V7800 and HD6550D GPUs.
4.10 Performance Comparison of MCXCL implemented with various workload balancing schemes.
4.11 Number of lines of the source code using our runtime API versus a baseline OpenCL implementation.
List of Tables
3.1 Typical memory bandwidth between different processing units for reads.
3.2 Typical memory bandwidth between different processing units for writes.
3.3 The Classes and Methods
4.1 Input sets emphasizing different phases of the SURF algorithm.
Acknowledgments
First, I would like to thank my advisor, Prof. David Kaeli, for his insightful and inspiring guidance during the course of my graduate study. I have always enjoyed talking with him on various research topics, and his computer architecture class is one of the best classes I have ever taken.
The enlightening suggestions from my committee of Prof. Norman Rubin and Prof. Ningfang Mi have been a great help to this thesis. Norm was my mentor when I was doing a 6-month internship at AMD, and that's where this thesis essentially started.
I would also like to thank Dr. Xinping Zhu, who gave me valuable guidance for my early graduate study. My fellow NUCAR colleagues, Dana Schaa, Byunghyun Jang, Perhaad Mistry, etc., also helped me so much through technical discussions and feedback. If life is a train ride, I cherish every moment and every scene outside of the window we share together.
My deepest appreciation goes to my family, as it is always where I can put myself together with their endless love. I would like to thank my mom and dad for their consistent support and motivation, and my brother for his advice and encouragement. And finally, but most importantly, I would like to thank my wife and mother of my two kids, Liwei, for her understanding, patience, and faith in me. I couldn't have finished this thesis without her love.
Abstract of the Dissertation
Cross-Platform Heterogeneous Runtime Environment
by
Enqiang Sun
Doctor of Philosophy in Computer Engineering
Northeastern University, April 2016
Dr. David Kaeli, Adviser
Heterogeneous platforms are becoming widely adopted thanks to the support from new programming languages and models. Among these languages/models, OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units on a system. OpenCL is one of the first standards that focuses on portability, allowing programs to be written once and run unmodified on multiple, heterogeneous devices, regardless of vendor.
While OpenCL has been widely adopted, there still remains a lack of support for automatic workload balancing and data consistency when multiple devices are present in the system. To address this need, we have designed a cross-platform heterogeneous runtime environment which provides a high-level, unified execution model that is coupled with an intelligent resource management facility. The main motivation for developing this runtime environment is to provide OpenCL programmers with a convenient programming paradigm to fully utilize all possible devices in a system and incorporate flexible workload balancing schemes without compromising the user's ability to assign tasks according to data affinity. Our work removes much of the cumbersome initialization of the platform, and now devices and related OpenCL objects are hidden under the hood.
Equipped with this new runtime environment and associated programming interface, the programmer can focus on designing the application and worry less about customization to the target platform. Further, the programmer can now take advantage of multiple devices using a dynamic workload balancing algorithm to reap the benefits of task-level parallelism.
To demonstrate the value of this cross-platform heterogeneous runtime environment, we have evaluated it running both micro-benchmarks and popular OpenCL benchmark applications. With minimal overhead of managing data objects across devices, we show that we can achieve scalable performance and application speedup as we increase the number of computing devices, without any changes to the program source code.
Chapter 1
Introduction
Moore's law describes technology advances that double transistor density on integrated circuits every 12 to 18 months [1]. However, with the size of transistors approaching the size of individual atoms, and as power density outpaces current cooling techniques, the end of Moore's law has appeared on the horizon. This has encouraged the research community to look at new solutions
in system architecture, including heterogeneous computing architectures.
Since 2003, the semiconductor industry has settled on three main trends for microprocessor design. The first trend is to continue improving sequential execution speed while increasing the number of cores [2]. These microprocessors are called multicore processors. An example of a multicore CPU is Intel's widely used Core 2 Duo processor. It has dual processor cores, each of which is an out-of-order, multiple-instruction-issue processor implementing the full x86 instruction set, supporting hyperthreading with two hardware threads, and designed to maximize the execution speed of sequential programs. The second trend focuses more on the execution of parallel applications with as many threads as possible. These processors are called many-thread processors. Most current popular GPUs use a many-thread architecture. For example, with full occupancy NVIDIA's GTX 970 can host 26,624 threads, executing in a large number of simple, in-order pipelines. The third trend is the combination of both multicore and many-thread architectures. These processors are represented by most current desktop processors with integrated graphics processing units. For example, the 6th-generation Intel Core i7-6567U processor has a dual-core CPU with an integrated Iris Graphics 550 GPU, which has 72 execution units [3]. AMD's A8-3850 Fusion processor has four x86-64 CPU cores integrated together with a Radeon HD6550D GPU, which has 5 SIMD engines (16-wide) and a total of 400 streaming processors [4].
With its long evolving history, the design philosophy of a CPU is to minimize the execution
latency of a single thread. Large on-chip caches are integrated to store frequently accessed data and convert some of the long-latency memory accesses into short-latency cache accesses. There is also prediction logic, such as branch prediction and data prefetching, designed to minimize the effective latency of operations at the cost of increased chip area and power. With all of these hardware logic components, the CPU greatly reduces the execution latency of each individual thread. However, the large cache memory, low-latency arithmetic units, and sophisticated prediction logic consume chip area and power that could otherwise be used to provide more arithmetic execution units and memory access channels. This design style emphasizes minimizing latency and is known as latency-oriented design.
The GPUs, either standalone or integrated, on the other hand, are designed as parallel,
throughput-oriented computing engines. The application software is expected to be organized with
much more data parallelism. The hardware takes advantage of the large number of arithmetic
execution units, and pipelines the execution when some of them are waiting for long-latency memory
accesses or arithmetic operations. Only a limited amount of cache memory is supplied to help meet the memory bandwidth requirements of these applications and to facilitate data synchronization between multiple threads that access the same memory data. This design style strives
to maximize the total execution throughput of a large amount of data parallelism while allowing
individual threads to take a potentially much longer time to execute.
GPUs have been leading the race in floating-point performance since 2013. With enough data parallelism and proper memory arrangement, the performance gap can be more than ten times. These are not necessarily the achievable application speeds, but only the raw speed that the execution resources can potentially support. For applications that have only one or a few threads, CPUs can achieve much higher performance than GPUs. Therefore, heterogeneous architectures combining CPUs and GPUs are a natural choice for applications that can execute their sequential parts on the CPU and their numerically intensive parallel parts on the GPU.
Figure 1.1 is a high-level illustration of a multi-core CPU, a many-thread GPU accelerator, and a heterogeneous system-on-chip architecture with a CPU and GPU on the same die. High-performance computing might emphasize single-threaded latency, whereas commercial transaction processing might emphasize aggregate throughput. Designers began to put both of these devices with very different characteristics together, expecting a performance gain by leveraging proper workload distribution and balancing.
Graphics processing units used to be very difficult to program since programmers had to
use the corresponding graphics application programming interface. OpenGL and Direct3D are the
Figure 1.1: Multi-core CPU, GPU, and Heterogeneous System-on-Chip CPU and GPU. At present, designers are able to make decisions among diverse architecture choices: homogeneous multi-core with cores of various sizes and complexity, or heterogeneous system-on-chip architectures. (Panels: homogeneous multi-core CPU, homogeneous multi-core GPU, and a heterogeneous system-on-chip with CPU and GPU.)
most widely used graphics API specifications. More precisely, a computation had to be mapped to a graphical function programming a pixel processing engine so that it could be executed on the early GPUs. These APIs require extensive knowledge of graphics processing and also limited the kinds of applications that one could actually write during early general purpose GPU programming. To meet the increasing demand, new GPU programming paradigms became more and more popular, such as CUDA [5], OpenCL [6], OpenACC [7], and C++ AMP [8]. Many runtime and execution systems have also been designed to help developers manage heterogeneous platforms with multiple computing devices with dramatically different characteristics.
In this thesis, we present a cross-platform heterogeneous runtime environment, providing a
convenient programming interface to fully utilize all possible devices in a heterogeneous system. Our
framework incorporates flexible workload balancing schemes without compromising the user's ability
to assign tasks according to the data affinity. Our framework provides significant enhancements to
the state-of-the-art in OpenCL programming practice in terms of workload balancing and distribution.
Furthermore, the details of programming the specific platform are hidden from the programmer,
enabling the programmer to focus more on high-level design of the algorithms.
In this chapter, we present the reader with an introduction to some basic concepts of
heterogeneous computing. This includes a very brief history of heterogeneous computing with
CPUs and GPUs, the potential benefits that heterogeneous computing provides, and the ability of
our runtime framework to adapt applications to heterogeneous computing platforms. Finally, we
highlight the contributions of this thesis and outline the organization of the remainder of this thesis.
1.1 A Brief History of Heterogeneous Computing
Over the last decade, developers have witnessed the field of computer architecture transi-
tioning from single-core compute devices to a wide range of parallel architectures. The change in
architecture has also produced new challenges with the underlying parallel programming paradigms.
Existing algorithms designed to scale with single-core systems had to be redesigned to reap the
performance benefits of new parallel architectures. Multi-core is the path chosen by industry to quench the thirst for performance while, at the same time, respecting thermal and power design limits.
While multi-core processors have ushered in a new era of concurrency, there has also
been work on exploiting existing parallel platforms such as GPUs. Since the early 1990s, software
architects have explored how best to run general-purpose applications on computer graphics hardware
(i.e., GPUs). GPUs were originally designed to execute a set of predefined functions as a graphics
rendering pipeline. Even today, GPUs are mainly designed to calculate the color of pixels on the
screen to support complex graphics processing functions. GPUs provide deterministic performance
when rendering frames. In the beginning of this revolution, GPU programming was done using a
graphics Application Programming Interface (API) such as OpenGL [9] or DirectX [10]. This model
required general purpose application developers to have intimate knowledge of graphics hardware
and graphics APIs. These restrictions severely impacted the implementation of many algorithms on
GPUs.
General purpose GPU (GPGPU) programming was not widely accepted until new GPU
architectures unified vertex and pixel processors (first available in the R600 family from AMD and
the G80 family from NVIDIA). New general purpose programming languages such as CUDA [5]
and Brook+ [11] were introduced in 2006. The introduction of fully programmable hardware and
new programming languages lifted many of the restrictions and greatly increased the interest in using
GPUs for general purpose computing. Heterogeneous platforms that include GPUs as a powerful data-
parallel co-processor have been adopted for many scientific and engineering environments [12] [13]
[14] [15]. On current systems, discrete GPUs are connected to the rest of the system through a PCI
express bus. All data transfers between the CPU and the GPU are limited by the speed of the PCI Express protocol.
Recently, industry leaders have recognized that scalar processing on the CPU, combined
with parallel processing on the GPU, could be a powerful model for application throughput. More
recently, the Heterogeneous System Architecture (HSA) Foundation [16] was founded in 2012 by
many vendors. HSA has provided industry with standards to further support heterogeneity across
systems and devices. We have also seen solutions with a CPU and a GPU on the same die, such as AMD's APU [4] series, Intel's Ivy Bridge [17] series, and Qualcomm's Snapdragon [18], which have demonstrated potential power/performance savings. Current state-of-the-art supercomputers utilize a
heterogeneous solution.
Heterogeneous systems can be found in every domain of computing, ranging from high-
performance computing servers to low-power embedded processors in mobile phones and tablets.
Industry and academia are investing huge amounts of effort and budget to improve every aspect of
heterogeneous computing [19] [20] [21] [15] [22] [23].
1.2 Heterogeneous Computing with OpenCL
The emerging software framework for programming heterogeneous devices is the Open
Computing Language (OpenCL) [6]. OpenCL is an open industry standard managed by the non-
profit technology consortium Khronos Group. Support for OpenCL has been increasing from major
companies such as Qualcomm, AMD, Intel and Imagination.
The aim of OpenCL is to serve as a universal language for programming heterogeneous
platforms such as GPUs, CPUs, DSPs, and FPGAs. In order to support such a wide variety of
heterogeneous devices, some elements of the OpenCL API are necessarily low-level. As with the
CUDA/C language [5], OpenCL does not provide support for automatic workload balancing, nor
guarantee global data consistency; it is up to the programmer to explicitly define tasks and enqueue them on devices, and to move data between devices as required. Furthermore, when different implementations of OpenCL produced by different vendors are used, OpenCL objects from vendor A's implementation may not run on vendor B's hardware. Given these limitations, there still remain barriers to achieving straightforward heterogeneous computing.
1.3 Task-level Parallelism across Platforms with Multiple Computing
Devices
Platform agnosticism is a quality that is taken for granted for many existing programming
languages such as C/C++, Java, etc. Programmers rely on compilers or run-time systems to automati-
cally generate executables for different processing units. Until recently, there did not exist a set of
API functions that would enable the programmer to automatically exploit all computing resources
when the characteristics of the underlying platform change (e.g., number of processing units and
accelerators).
To help illustrate some of the challenges with heterogeneous computing, we consider the
OpenCL open-source implementation of OpenSURF (Open source Speeded Up Robust Feature)[24]
to demonstrate a typical use of the OpenCL programming model. In OpenSURF, the degree of
data-parallelism in a single kernel can vary when executing on different computing devices. Execution
dynamics are also dependent on the characteristics of the input images or video frames, such as the
size and image complexity. Furthermore, without proper runtime management, when mapping to another platform with a different number of devices, we usually have to redesign the kernel bindings and associated data transfers. Without runtime workload balancing, the additional processing units
available on the targeted accelerator may remain idle unless the application is redesigned. Even
with the range of parallelism present in OpenSURF, an application has no inherent ability to exploit
the extra computing resources, and is not able to improve performance if we upgrade our hardware
platform.
In this thesis we present a cross-platform heterogeneous runtime environment that helps
ameliorate many of the burdens faced when performing heterogeneous programming. New pro-
gramming models such as OpenCL and CUDA provide the ability to dynamically initialize the
platforms and objects, and acquire the processing capability of each device, such as the number of
compute units, core frequency, etc. The presented runtime environment augments this ability, and
incorporates a central task queuing/scheduling system. This central task queuing system is based on
the concepts of work pools and work units, and cooperates with workload balancing algorithms to
execute applications on heterogeneous hardware platforms. Using the runtime API, programmers
can easily develop and tune flexible workload balancing schemes across different platforms.
In the proposed runtime environment, data-parallel kernels in an application are wrapped
with metadata into work units. These work units are then enqueued into a work pool and assigned to
computing devices according to a selectable workload balancing policy. A resource management
system is seamlessly integrated in the central task-queuing system to provide for migration of kernels
between devices and platforms. We demonstrate the utility of this class of task queuing runtime
system by implementing selected benchmark applications from OpenCL benchmark suites. We also
benchmark the performance trade-offs by implementing real-world applications such as clSURF [25],
an OpenCL open-source implementation of OpenSURF (Open source Speeded Up Robust Feature)
framework, and Monte Carlo Extreme in OpenCL[26], a Monte Carlo simulation for time-resolved
photon transport in 3D turbid media.
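To make the work-unit and work-pool concepts concrete before the detailed design in Chapter 3, the following is a minimal, hypothetical C sketch. The type and function names (work_unit_t, work_pool_enqueue, and so on) are illustrative only and do not reflect the actual API of our runtime.

/* Hypothetical illustration of wrapping an OpenCL kernel and its metadata
 * into a work unit and handing it to a work pool; all names are invented
 * for this sketch. */
#include <CL/cl.h>

typedef struct {
    cl_kernel kernel;         /* data-parallel kernel to execute            */
    size_t    global[3];      /* NDRange describing the data-level work     */
    size_t    local[3];
    cl_mem   *buffers;        /* buffers the kernel reads or writes         */
    int       num_buffers;
    int       depends_on;     /* id of a work unit that must finish first   */
    int       preferred_dev;  /* data-affinity hint, -1 = any device        */
} work_unit_t;

/* The work pool collects independent work units; the scheduler later
 * assigns them to devices according to the selected balancing policy. */
void work_pool_enqueue(work_unit_t *wu);   /* hypothetical */
void work_pool_wait_all(void);             /* hypothetical */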
1.4 Scope and Contribution of This Thesis
Unlike the ubiquity of the x86 architecture and the long life cycle of CPU designs, GPUs
often have much shorter release cycles and ever-changing ISAs and hardware features. Platforms
incorporating GPUs as accelerators can have very different configurations in terms of processing
capabilities and number of devices. As such, the need has arisen for a programming interface and
runtime execution system that allows a single program to be portable across different platforms, and
can automatically use all devices supported by an associated workload balancing scheme.
The key contribution of this thesis is the development of a cross-platform heterogeneous
runtime environment, which enables flexible task-level workload balancing on heterogeneous plat-
forms with multiple computing devices. Together with the application programming interface, this
extension layer is designed in the form of a library. Different levels of this runtime environment are
considered. We study the following aspects of our runtime environment:
We enable portable execution of applications across platforms. Our runtime environment provides a unified abstraction for all processing units, including CPU, GPU and many existing
OpenCL devices. With this unified abstraction, tasks are able to be distributed on all devices.
An application is portable across different platforms with a variable number of processing
units.
We provide APIs to expose both task-level and data-level parallelism. The program designer is in the best position to identify all levels of parallelism present in his/her application.
We provide API functions and a dependency description mechanism so that the programmer
can expose task-level parallelism. When combined with the data-level parallelism present in
OpenCL kernels, the run-time and/or the compiler can effectively adapt to any type of parallel machine without modification of the source code.
We balance task execution dynamically based on static and run-time profiling information. The optimal static mapping of task execution on the underlying platform requires a significant
amount of analysis of all of the devices on the platform, and it is impossible for programmers
to perform such analysis and remap whenever new hardware is used. A dynamic workload
balancing scheme makes it possible for the same source code to obtain portable performance.
We support the management of data locality at runtime. Due to the data transfer overhead and its impact on the overall performance of OpenCL applications, data locality is an important
issue for the portable execution of tasks. In our OpenCL support, data management is tightly
integrated with the workload migrating decisions. The runtime layer ensures data availability
and data coherency throughout the whole system.
We simplify the initialization of platforms. Scientific programmers usually are not familiar with the best way to initialize platforms across different types of OpenCL devices. With a new
API designed for our runtime environment, we shift this burden to the underlying execution
system, so that the programmer can focus on the development of his/her algorithms.
1.5 Organization of This Thesis
The rest of this thesis is organized as follows. Chapter 2 provides necessary background
information and related work on heterogeneous computing. It also presents a summary of related work
on previously proposed runtime environments targeting heterogeneous platforms. In Chapter 3, we
describe the structure and components in our cross-platform heterogeneous runtime environment and
discuss how it can facilitate more effective use of the resources present on heterogeneous platforms.
In Chapter 4, we explore the design space by using our heterogeneous runtime environment equipped
with different scheduling schemes when running synthetic workloads. We then demonstrate the
true value of our proposed runtime environment by evaluating the performance of benchmark
applications run on multiple cross-vendor heterogeneous platforms. We present a detailed analysis
on the performance components and demonstrate the programming efficiency. In Chapter 5, we
conclude this thesis, summarizing the major contributions embodied in this work, and describe
potential directions for future work.
Chapter 2
Background and Related Work
2.1 From Serial to Parallel
In July 2004, Intel released a statement that its 4GHz chip, originally targeted for the fourth quarter, would be delayed until the first quarter of the next year. Company spokesman Howard High said the delay would help ensure that the company could deliver high chip quantities when the product launched. Later, in mid-October, in a surprising announcement, Intel officially abandoned its plans to release the 4GHz version of the processor and moved its engineers onto other projects. This marked an abrupt end to 34 years of CPU frequency scaling, in which CPU frequency had grown exponentially over time.
Figure 2.1 illustrates a brief history of Intel processors, plotting the number of transistors
per chip and the associated clock speed [27]. As the total number of transistors continued to climb,
the clock speed did not keep up. The reason behind this major change in CPU development is power and cooling challenges, more specifically power density. The power density in processors has already exceeded that of a hot plate. Continuing to increase the frequency would require either new cooling technologies or new materials to relax the physical limits of what a processor can withstand. Processor design has hit the power wall. Our ability to improve performance automatically by increasing the frequency of the processor is gone. To further improve application throughput, major
silicon vendors elected to provide multi-core designs, providing higher performance within the
constraints of thermal limits and power density thresholds.
Figure 2.1: Intel Processors Introduction Trends. (Plot of transistors (x1000), clock speed (MHz), power (W), and performance/clock (ILP) for Intel CPUs from 1970 to 2010, from the 386 through the Pentium, Pentium 4, Dual-Core Itanium 2, and Quad-Core Ivy Bridge; sources: Intel, Wikipedia, K. Olukotun.)
2.2 Many-Core Architecture
In recent years, multi-core processors have become the norm. Figure 2.2 shows an example of a multi-core processor. A multi-core processor has two or more processing cores on a single chip, each core with its own level-1 cache. A common global memory is shared among the different processing cores while multiple tasks are executed on the multi-core processor.
Intel's TeraFlops architecture [28] was designed to demonstrate a prototype of a many-core processor, as shown in Figure 2.3. Developed by Intel Corporation's Tera-Scale Computing Research Program, this research processor contains 80 core tiles, and can yield 1.8 teraflops at 5.6GHz.
Figure 2.2: Multi-core Processors with Shared Memory. (Four CPU cores, each with private L1 and L2 caches, sharing an L3 cache and system memory.)
While data transfers can occur between any pair of cores, no cache coherency is enforced across cores,
and all memory transfers are explicit. Therefore, the biggest hurdle to fully take advantage of the
power of these 80 cores is parallel programming. As shown in Figure 2.3, another interesting point is
that some dedicated hardware engines could be integrated with some of the cores for multimedia,
networking, security, and other tasks.
The Intel Xeon Phi coprocessor [29] inherited many design elements from the Larrabee project [30], which is another high performance co-processor based on the TeraFlops architecture.
The Intel Xeon Phi coprocessor is primarily composed of processing cores, caches, PCIe client
logic, and a very high bandwidth, bidirectional ring interconnect, as illustrated in Figure 2.4. Intel is
using Xeon Phi as the primary element for its family of Many Integrated Core architectures. Intel
revealed its second generation Many Integrated Core architecture in November 2013, with the codename Knights Landing [31]. Knights Landing contains up to 72 cores in 36 tiles manufactured in 14nm technology, with each core running 4 threads. The Knights Landing chip also has a 2MB coherent cache shared between the 2 cores in a tile, which indicates the effort to make this architecture as programmable as possible. Knights Landing is ISA compatible with the Intel Xeon processors, with support for Intel's Advanced Vector Extensions 512, and supports most of today's parallel
optimizations. One interesting feature of the Knights Landing is that it can either be the main
Figure 2.3: Intel's TeraFlops Architecture. (A grid of 80 processing-engine tiles, each with a local cache; some tiles are paired with dedicated engines for HD video, crypto, DSP, graphics, and physics.)
processor on a compute node, or as a coprocessor in a PCIe slot. Intel is exploring different
heterogeneous computing organizations.
Another example of a heterogeneous many-core architecture is IBM's Cell processor [32]. It includes a general purpose PowerPC core with 8 very simple SIMD coprocessors, which are specially designed for accelerating vector or multimedia operations. An operating system runs on the main core, which is called the Power Processing Unit (PPU). It functions as a master device controlling the 8 coprocessors, which are called Synergistic Processing Elements (SPEs). Each SPE
is a dual issue in-order processor composed of a Synergistic Processing Unit (SPU) and a Memory
Flow Controller (MFC). The Element Interconnect Bus (EIB) is the internal communication bus
Figure 2.4: Intel's Xeon Phi Architecture. (Cores with private L2 caches and tag directories placed on a bidirectional ring interconnect, together with GDDR5 memory controllers and PCIe client logic.)
connecting various on-chip system elements.
The Cell processor was used as the processor for Sony's PlayStation 3 game console
and some high performance computing servers, such as the IBM Roadrunner supercomputer and
Mercury System servers with Cell accelerator boards [33].
By November 2009, IBM discontinued the development of the Cell processor. The Cell
processor benefits from very high internal memory bandwidth, but all transfers must be explicitly
programmed by using low-level asynchronous DMA transfers. It requires significant expertise to
write efficient code for this architecture, especially with the limited size of the local storage on each
SPU (256 KB). Load balancing is another challenging issue on the Cell. The application programmer
is responsible for evenly mapping the different pieces of computation on the SPUs.
Besides the novelty in each of these many-core hardware designs, industry realized that programmability cannot be overlooked any longer. When the hardware design of these processors reaches an unprecedented complexity, it is impossible for software designers to manage all the processing elements manually. Suitable programming models are desperately needed to exploit
the computing power on these architectures.
Figure 2.5: IBM's Cell. (A PPU with cache and eight SPUs with local stores, connected to RAM through the Element Interconnect Bus (EIB).)
2.3 Programming Paradigms for Many Core Architecture
Given the development of a number of many-core architectures, many parallel program-
ming models have been developed to facilitate the usage of these architectures.
2.3.1 Pthreads
Pthreads or Portable Operating System Interface (POSIX) Threads is a set of C program-
ming language types, functions and variables [34]. Pthreads is implemented as a header (pthread.h)
and a library, which creates and manages multiple threads. When using Pthreads, the programmer
has to explicitly create and destroy threads by making use of pthread API functions.
The Pthreads library provides mechanisms to synchronize different threads, resolve race
conditions, avoid deadlock conditions, and protect critical sections. However, the programmer has
the responsibility to manage threads explicitly. Therefore, it is usually very challenging to design
a scalable multithreaded application on modern many-core architectures, especially systems with
hundreds of cores on a single machine.
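As a minimal sketch of this explicit thread management, the following C program creates and joins a handful of POSIX threads:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Each thread executes this function; the argument carries its index. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];

    /* The programmer explicitly creates every thread... */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    /* ...and explicitly waits for (joins) every thread. */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}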
2.3.2 OpenMP
OpenMP is an open specification for shared memory parallelism [35] [36]. It comprises
compiler directives, callable runtime library routines and environment variables which extend
FORTRAN, C and C++ programs. OpenMP is portable across a shared memory architecture. The
thread management is implicit, and the programmer has to use special directives to specify which sections of code are to be run in parallel. The number of threads to be used is specified by environment variables. OpenMP has also been extended as a parallel programming model for clusters.
OpenMP uses several constructs to support implicit synchronization, so that the programmer is relieved from worrying about the actual synchronization mechanism.
As with Pthreads, scalability is still an issue for OpenMP, as it is a thread-based mechanism.
Furthermore, since OpenMP is using implicit thread management, there is no fine-grained way to do
thread-to-processor mapping.
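For comparison, a minimal OpenMP sketch expresses the parallelism with a directive rather than explicit thread calls; the thread count can be controlled through the OMP_NUM_THREADS environment variable:

#include <omp.h>
#include <stdio.h>

int main(void) {
    const int n = 1000000;
    static double a[1000000];
    double sum = 0.0;

    /* The directive, not explicit thread management, expresses parallelism. */
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("sum = %f (max threads = %d)\n", sum, omp_get_max_threads());
    return 0;
}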
2.3.3 MPI
The Message Passing Interface (MPI) [37] provides a virtual topology, synchronization,
and communication functionality between nodes in clusters. It is a natural candidate for accelerating
applications in distributed systems. MPI is currently the most widely used standard for developing
High Performance Computing (HPC) applications for distributed memory architectures. It provides
programming interfaces for C, C++, and FORTRAN. Some of the well-known MPI implementations
include OpenMPI [38], MVAPICH [39], MPICH [40], GridMPI [41], and LAM/MPI [42].
Similar to Pthreads, workload partitioning and task mapping have to be done by the
programmer, but message passing is a convenient way to express data transfer between different
processors. MPI barriers are used to specify that synchronization is needed. The barrier operation
blocks each process from continuing its execution until all processes have entered the barrier. A
typical usage of barriers is to ensure that the global data has been dispersed to the appropriate
processes.
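A minimal MPI sketch of the barrier usage described above (the broadcast value here is purely illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Root disperses a value to every process. */
    int value = (rank == 0) ? 42 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* No process continues until all processes have entered the barrier. */
    MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d of %d sees value %d\n", rank, size, value);

    MPI_Finalize();
    return 0;
}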
2.3.4 Hadoop MapReduce
Hadoop MapReduce is a software framework for developing parallel applications easily,
and is especially well suited for processing vast amounts of data (e.g., multi-terabyte data-sets)
in-parallel on large clusters (thousands of nodes), and on commodity hardware, in a reliable, fault-
tolerant manner [43]. A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output
of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them
and re-executes the failed tasks.
Typically the compute nodes and the storage nodes are the same. For example, the
MapReduce framework and the Hadoop Distributed File System are running on the same set of nodes.
This configuration allows the framework to effectively schedule tasks on the nodes where data is
already present, resulting in very high aggregate bandwidth across the cluster [44].
2.4 Computing with Graphic Processing Units
Although multicore architectures have made it possible for applications to overcome some of the physical limits encountered with purely sequential architectures, their degree of parallelization is not comparable to the parallelism available on graphics processing units (GPUs). Intrinsically, GPUs are designed for highly parallel problems. As graphics problems became more and more complex, new architectures and APIs were created, and GPUs became more and more programmable. In this section, we briefly review the current state of GPU computing, and the GPU's transition from a hardware implementation of standard graphics APIs to a fully programmable general purpose processing unit.
2.4.1 The Emergence of Programmable Graphics Hardware
Interactive 3D graphics applications have very different characteristics compared to general-purpose applications. Specifically, interactive 3D applications require high throughput and exhibit substantial parallelism. Since the late 1990s, custom hardware has been built to take advantage of the native parallelism in these applications. The early custom accelerators were designed in the form of fixed-function pipelines based on a hardware implementation of the OpenGL [9] standard and Microsoft's DirectX programming APIs. At each stage of the pipeline, a sequence of different operations was implemented in hardware units for specific tasks.
Given that the GPU was originally designed to produce visual realism in rendered images, fixed-function pipeline graphics hardware has some limitations that prevent it from performing efficiently. In the meantime, offline rendering systems such as Pixar's RenderMan [45] can be used to achieve impressive visual results by replacing the fixed-function pipeline with a more flexible programmable pipeline. In
the programmable pipeline, fixed-function operations are replaced by user-provided pieces of code
called shaders. Pixel shaders, vertex shaders, and geometry shaders were introduced to enable flexible processing at each programmable pipeline stage.
In early shader models, vertex and pixel shaders were implemented using very different instruction sets. Later, in 2006, OpenGL's Unified Shader Model and DirectX 10's Shader Model 4.0 provided consistent instruction sets across all shader types: geometry, vertex, and
pixel shaders. All three types of shaders have almost the same capabilities. For example, they can
perform the same set of arithmetic instructions and read from texture or data buffers.
Graphics hardware designers continued to explore the best ISA for the shader models.
Even before the unified shader model, ATI's Xenos graphics chip integrated in the Xbox 360 used a unified shader architecture. Most GPUs continued to use dedicated hardware units for each type of shader, even though they supported a unified shader model. Eventually, however, all major GPU makers chose a Unified Shader Architecture,
which allows a single type of processing unit to be used for all types of shaders. The Unified
Shader Architecture decouples the type of shaders from the processing unit, and allows a dynamic
assignment of shaders to the different processing cores. This flexibility leads to better workload
balance, allowing hardware resources to be allocated dynamically for different types of shaders,
based on the needs of the workload.
Figure 2.6 is a high-level block diagram illustration of a modern GPU architecture.
2.4.2 General Purpose GPUs
With the emergence of programmable graphics hardware, new shader languages and program-
ming APIs have been created to facilitate the programming effort. Since DirectX 9, Microsoft has
been using the High Level Shading Language (HLSL) [46], which supports shader construction
with C-like syntax, types, expressions, statements and functions. Similarly, the OpenGL Shading
Language (GLSL) [47] is the corresponding high level language targeting OpenGL shader programs.
NVIDIA's Cg [48] was a collaborative effort with Microsoft. The Cg compiler outputs both DirectX and
OpenGL shader programs. Although these shader languages are very popular across the graphics
community, mainstream programmers feel a lack of connection between the graphics primitives in
these shader languages and the constructs in general purpose programming languages.
With the introduction of unified shader architectures and unified shader models, a uniform
ISA makes it easier to design high-level languages for this workload. Some examples of these
higher-level languages include Brook [49], Scott [50], Glift [51], NVIDIA's CUDA [5] and the
Figure 2.6: High Level Block Diagram of a GPU. (Shader cores connected through an interconnection network to a shared L2 cache and global memory.)
Khronos Group's OpenCL [6], which is an extension of Brook. These high-level languages hide
the graphic primitives with programming constructs which are more familiar to general purpose
programmers. The availability of CUDA and OpenCL, currently the two most popular languages, has
dramatically increased the programmability of GPU hardware. As a result, GPUs have been widely
adopted in many general purpose platforms for executing data-parallel, computationally-intensive
workloads [52]. Many key applications possessing a high degree of data-level parallelism have been
successfully accelerated using GPUs.
GPUs have been included in the standard configuration for many desktop machines and
servers. The availability of high-level languages has allowed industry to support both graphics and
compute on the same GPU. According to the 42nd TOP500 list, GPUs are used in the No.2 and
No.6 fastest supercomputers in the world [53]. Intel Xeon Phi processors are used in the No.1 and
No.7 fastest supercomputers in the world. A total of fifty-three systems on the list use accelerator/co-
processor technology. Thirty-eight of these systems use NVIDIA GPU chips, two use ATI Radeon,
and there are now thirteen systems with Intel MIC technology (Xeon Phi).
Figure 2.7: OpenCL Architecture. (An application and its OpenCL kernels sit on top of the OpenCL framework, which comprises the OpenCL API and OpenCL C language, the OpenCL runtime, the OpenCL driver, and the GPU hardware.)
2.5 OpenCL
OpenCL (Open Computing Language) is an open standard for general purpose parallel
programming on CPUs, GPUs and other processors, giving software developers portable and efficient
access to the computing resources on these heterogeneous processing platforms [54]. OpenCL allows a heterogeneous platform to be viewed as a single platform with multiple computing devices. It is a
mature framework that includes a language definition, a set of APIs, compiler libraries, and a runtime
system to support software development. Figure 2.7 shows a high-level breakdown of the OpenCL
architecture.
Figure 2.8: An OpenCL Platform. (A host connected to compute devices; each compute device is divided into compute units, which are built from processing elements.)
2.5.1 An OpenCL Platform
An OpenCL framework adopts the concept of a platform, which has a host device with
multiple OpenCL devices interconnected [55]. An OpenCL device can be a CPU, a GPU, or any type of processing unit that supports the OpenCL standard. An OpenCL device can be divided into one
or more compute units (CUs), and a CU can be further divided into one or more processing elements
(PEs). Figure 2.8 shows how the OpenCL standard hierarchically describes a heterogeneous platform
with multiple OpenCL devices, multiple CUs and multiple PEs.
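As a minimal sketch of how a host program can discover this hierarchy with standard OpenCL calls (error checking is omitted for brevity):

#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint num_platforms;
    clGetPlatformIDs(8, platforms, &num_platforms);

    for (cl_uint p = 0; p < num_platforms; p++) {
        cl_device_id devices[8];
        cl_uint num_devices;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

        for (cl_uint d = 0; d < num_devices; d++) {
            char name[256];
            cl_uint cus;
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(cus), &cus, NULL);
            printf("platform %u, device %u: %s (%u compute units)\n",
                   p, d, name, cus);
        }
    }
    return 0;
}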
2.5.2 OpenCL Execution Model
The execution model of OpenCL consists of two parts: a host program running on the host
device, setting up data and scheduling execution on a compute device, and kernels executed on one
or more OpenCL devices [56]. Figure 2.9 shows the OpenCL execution model.
An OpenCL command queue is where the host interacts with an OpenCL device by queuing
computation kernels. Each command-queue is associated with a single device. There are three types
of commands in a command-queue:
Figure 2.9: The OpenCL Execution Model. (A host program enqueues kernels, e.g., foo(), bar(), baz(), and qux(), onto command queues associated with devices within a context.)
Kernel-enqueue commands: Enqueues a kernel for execution on a device.
Memory commands: Transfers data between the host and device memory, between memory objects, or maps and unmaps memory objects from the host address space.
Synchronization commands: Explicit synchronization points that define ordering constraints between commands.
Commands communicate their status through Event objects. Successful completion is
indicated by setting the event status to CL_COMPLETE. Unsuccessful completion results in abnormal
termination of the command which is indicated by setting the event status to a negative value. In
this case, the command-queue associated with the abnormally terminated command and all other
command-queues in the same context may no longer be available and their behavior is implementation
defined.
A command submitted to a device will not launch until prerequisites that constrain the
order of commands have been resolved. These prerequisites have two sources. First, they may
arise from commands submitted to a command-queue that constrain the order that commands are
launched. For example, commands that follow a command queue barrier will not launch until all
commands prior to the barrier are complete. The second source of prerequisites is dependencies
between commands expressed through events. A command may include an optional list of events.
The command will wait and not launch until all the events in the list are in the CL_COMPLETE state.
Using this mechanism, event objects define ordering constraints between commands and coordinate
execution between the host and one or more devices [54]. In our cross-platform runtime system,
we expand this mechanism to support dependencies between events across OpenCL devices from
different vendors.
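As a minimal host-side sketch of this mechanism (assuming the command-queue, kernel, and buffer have already been created, and omitting error checking), a kernel launch can be made to wait on a preceding non-blocking write through its event wait list:

#include <CL/cl.h>

/* Sketch only: the caller is assumed to have created the queue, kernel,
 * and buffer, and to supply host_data of nbytes bytes. */
void launch_after_write(cl_command_queue queue, cl_kernel kernel,
                        cl_mem buf, const void *host_data, size_t nbytes)
{
    cl_event write_done, kernel_done;

    /* Non-blocking write; completion is signaled through write_done. */
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, nbytes, host_data,
                         0, NULL, &write_done);

    /* The kernel will not launch until write_done reaches CL_COMPLETE. */
    size_t global = 1024;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           1, &write_done, &kernel_done);

    /* Block the host until the kernel command has completed. */
    clWaitForEvents(1, &kernel_done);
}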
A command may be submitted to a device, and yet there may be no visible side effects
except to wait on and satisfy event dependencies. Examples include markers, kernels executed over ranges with no work-items, or copy operations of zero size. Such commands may pass directly from the
ready state to the ended state.
Command execution can be blocking or non-blocking. Consider a sequence of OpenCL
commands. For blocking commands, the OpenCL API functions that enqueue commands do not
return until the command has completed. Alternatively, OpenCL functions that enqueue non-
blocking commands return immediately and require that a programmer defines dependencies between
enqueued commands to ensure that enqueued commands are not launched before needed resources
are available. In both cases, the actual execution of the command may occur asynchronously with
execution of the host program.
Multiple command-queues can be present within a single context. Multiple command-
queues execute commands independently. Event objects visible to the host program can be used to
define synchronization points between commands in multiple command queues. If such synchroniza-
tion points are established between commands in multiple command-queues, an implementation must ensure that the command-queues progress concurrently and correctly account for the dependencies
established by the synchronization points.
The core of the OpenCL execution model is defined by how the kernels execute. When a
kernel-enqueue command submits a kernel for execution, an index space is defined. The kernel, the
argument values associated with the arguments to the kernel, and the parameters that define the index
space define a kernel-instance. When a kernel-instance executes on a device, the kernel function
executes for each point in the defined index space. Each of these executing kernel functions is called
a work-item. The work-items associated with a given kernel-instance are managed by the device in
groups called work-groups. These work-groups define a coarse-grained decomposition of the index
space. Work-groups are further divided into sub-groups, which provide an additional level of control
over execution.
[Figure: an NDRange of a given global size decomposed into work-groups of a given work-group size, with each work-group containing work-items identified by their local coordinates.]
Figure 2.10: OpenCL work-items mapping to GPU devices.
2.5.2.1 Mapping OpenCL Work-items
Each work-item's global ID is an N-dimensional tuple. The global ID components are
values in the range from the global offset F in that dimension to F plus the number of elements in that dimension minus one.
If a kernel is compiled as an OpenCL 2.0 kernel [20], the size of work-groups in an
NDRange (the local size) need not be the same for all work-groups. In this case, any single
dimension for which the global size is not divisible by the local size will be partitioned into two
regions. One region will have work-groups that have the same number of work items as was specified
for that dimension by the programmer (the local size). The other region will have work-groups
with fewer than the number of work-items specified by the local size parameter in that dimension (the
remainder work-groups). Work-group sizes can be non-uniform in multiple dimensions, potentially
producing work-groups of up to 4 different sizes in a 2D range and 8 different sizes in a 3D range.
Each work-item is assigned to a work-group and is given a local ID to represent its position
within the work-group. A work-item's local ID is an N-dimensional tuple with components in the
range from zero to the size of the work-group in that dimension minus one.
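The following hypothetical kernel (not one of the benchmarks used in this thesis) shows how these indices are obtained inside a kernel:

    __kernel void scale(__global float *data, float alpha) {
        size_t g = get_global_id(0);   /* position in the NDRange (shown for illustration) */
        size_t l = get_local_id(0);    /* position within the work-group                   */
        size_t w = get_group_id(0);    /* ID of the enclosing work-group                   */
        /* Under OpenCL 2.0, get_local_size(0) may return the smaller size of
           a remainder work-group, while get_enqueued_local_size(0) returns
           the uniform size requested by the host. */
        data[g] = alpha * data[g];
    }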
[Figure: work-groups 0 through 2n-1, each with internal barriers, distributed across CPU threads 0 through n-1 for execution.]
Figure 2.11: OpenCL work-items mapping to CPU devices.
Work-groups are assigned IDs similarly. The number of work-groups in each dimension
is not directly defined but is inferred from the local and global NDRanges provided when a kernel
instance is enqueued. A work-group's ID is an N-dimensional tuple with components in the range 0
to the ceiling of the global size in that dimension divided by the local size in the same dimension,
minus one. As a result, the combination of a work-group ID and the local ID within a work-group
uniquely defines a work-item. Each work-item is identifiable in two ways: in terms of a global index,
and in terms of a work-group index plus a local index within a work-group.
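Following the OpenCL specification, the two indexing schemes are related, in each dimension d, by g_d = w_d * S_d + l_d + F_d, where g_d is the global ID, w_d the work-group ID, S_d the work-group (local) size, l_d the local ID, and F_d the global offset supplied when the kernel was enqueued.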
On a CPU device, work-items are mapped by a different mechanism. An example mapping
of OpenCL execution on a CPU is shown in Figure 2.11. In this example, one worker thread is
created per physical CPU core when executing a kernel. Each worker thread takes a work-group
from the ND-range and executes its associated work-items one by one in sequence. If an OpenCL
barrier is reached, the work-item's state is stored and execution of the following work-item begins.
When all work-items in the work-group have reached the barrier, execution returns to the first
work-item that stopped at the barrier, which then resumes
execution until the next synchronization point. In the absence of barriers, the first work-item will
run to the end of the kernel before switching to the next. In both cases, the CPU will continuously
process all the work-items until the entire work-group is executed. During the whole process, idle
CPU threads will look for any remaining work-groups in the ND-range and begin to process them.
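A simplified sketch of this scheduling loop is shown below (our own illustration for the barrier-free case; the type and function names are hypothetical). Handling barriers additionally requires saving and restoring per-work-item state, for example with user-level fibers.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stddef.h>

    typedef void (*work_item_fn)(size_t group, size_t local, void *args);

    typedef struct {
        atomic_size_t next_group;   /* shared cursor over the ND-range      */
        size_t        num_groups;   /* total work-groups in the ND-range    */
        size_t        group_size;   /* work-items per work-group            */
        work_item_fn  body;         /* compiled kernel body                 */
        void         *args;
    } ndrange_t;

    /* One such worker thread is created per physical CPU core. */
    static void *cpu_worker(void *arg) {
        ndrange_t *nd = (ndrange_t *)arg;
        for (;;) {
            size_t g = atomic_fetch_add(&nd->next_group, 1);  /* claim a group */
            if (g >= nd->num_groups)
                break;                                        /* range drained */
            /* Without barriers, each work-item runs to completion before the
               next one starts. */
            for (size_t l = 0; l < nd->group_size; ++l)
                nd->body(g, l, nd->args);
        }
        return NULL;
    }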
2.5.2.2 Kernel Execution
A kernel object is defined to include a function within the program object and a collection
of arguments connecting the kernel to a set of argument values [57]. The host program enqueues a
kernel object to the command queue, along with the NDRange and the work-group decomposition.
These define a kernel instance. In addition, an optional set of events may be defined when the kernel
is enqueued. The events associated with a particular kernel instance are used to constrain when the
kernel instance is launched with respect to other commands in the queue or with respect to commands
in other queues within the same context.
A kernel instance is submitted to a device. For an in-order command queue, the kernel
instances appear to launch and then execute in that same order.
Once these conditions are met, the kernel instance is launched and the work-groups
associated with the kernel instance are placed into a pool of ready-to-execute workgroups. The
device schedules work-groups from the pool for execution on the compute units of the device. The
kernel-enqueue command is complete when all work-groups associated with the kernel instance
end their execution, updates to global memory associated with a command are visible globally, and
the device signals successful completion by setting the event associated with the kernel-enqueue
command to CL_COMPLETE.
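The sequence can be sketched as follows (a minimal host-side fragment; the kernel, queue, and buffer names are hypothetical):

    size_t global[2] = {1024, 1024};     /* the NDRange                      */
    size_t local[2]  = {16, 16};         /* the work-group decomposition     */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    cl_event done;
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local,
                           0, NULL, &done);

    cl_int status;
    clGetEventInfo(done, CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(status), &status, NULL);
    /* status passes through CL_QUEUED, CL_SUBMITTED and CL_RUNNING, and
       becomes CL_COMPLETE once every work-group has finished and its global
       memory updates are visible. */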
While a command-queue is associated with only one device, a single device may be
associated with multiple command-queues. A device may also be associated with command queues
associated with different contexts within the same platform. The device will pull work-groups
from the pool and execute them on one or several compute units in any order; possibly interleaving
execution of work-groups from multiple commands. A conforming implementation may choose
to serialize the work-groups so a correct algorithm cannot assume that work-groups will execute
in parallel. There is no safe and portable way to synchronize across the independent execution of
work-groups since they can execute in any order.
The work-items within a single sub-group execute concurrently, but not necessarily in
parallel (i.e., they are not guaranteed to make independent forward progress). Therefore, only
high-level synchronization constructs (e.g. sub-group functions such as barriers) that apply to all the
work-items in a sub-group are well defined and included in OpenCL.
Sub-groups execute concurrently within a given work-group and with appropriate device
support may make independent forward progress with respect to each other, with respect to host
threads and with respect to any entities external to the OpenCL system but running on an OpenCL
device, even in the absence of work-group barrier operations. In this situation, sub-groups are able
to internally synchronize using barrier operations without synchronizing with each other and may
perform operations that rely on runtime dependencies on operations other sub-groups perform.
The work-items within a single work-group execute concurrently, but are only guaranteed
to make independent progress in the presence of sub-groups and device support. In the absence
of this capability, only high-level synchronization constructs (e.g., work-group functions such as
barriers), that apply to all the work-items in a work-group, are well defined and included in OpenCL
for synchronization within a work-group.
2.5.2.3 Synchronization
Synchronization across all work-items within a single work-group is carried out using a
work-group function [58]. These functions carry out collective operations across all the work-items
in a work-group. Available collective operations are: barrier, reduction, broadcast, prefix sum, and
evaluation of a predicate. A work-group function must occur within a converged control flow; i.e.,
all work-items in the work-group must encounter precisely the same work-group function. For
example, if a work-group function occurs within a loop, the work-items must encounter the same
work-group function in the same loop iterations. All the work-items of a work-group must execute
the work-group function and complete reads and writes to memory before any are allowed to continue
execution beyond the work-group function. Work-group functions that apply between work-groups
are not provided in OpenCL since OpenCL does not define forward progress or ordering relations
between work-groups, hence collective synchronization operations are not well defined.
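A hypothetical OpenCL 2.0 kernel using two such work-group functions (a barrier and a reduction) might look as follows; every work-item of the group must reach both calls:

    __kernel void block_sum(__global const float *in,
                            __global float *per_group,
                            __local float *tmp) {
        size_t l = get_local_id(0);
        size_t n = get_local_size(0);
        tmp[l] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);            /* all writes to tmp are now visible */
        /* Reading a neighbour's element is only safe after the barrier above. */
        float neighbour = tmp[(l + 1) % n];
        /* work_group_reduce_add is an OpenCL 2.0 work-group function: every
           work-item obtains the sum over the whole work-group. */
        float s = work_group_reduce_add(neighbour);
        if (l == 0)
            per_group[get_group_id(0)] = s;      /* one result per work-group */
    }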
Synchronization across all work-items within a single sub-group is carried out using a
sub-group function. These functions carry out collective operations across all the work-items in
a sub-group. Available collective operations are: barrier, reduction, broadcast, prefix sum, and
evaluation of a predicate. A sub-group function must occur within a converged control flow; i.e., all
work-items in the sub-group must encounter precisely the same sub-group function. For example,
if a sub-group function occurs within a loop, the work-items must encounter the same sub-group
function in the same loop iterations. All the work-items of a sub-group must execute the sub-group
function and complete reads and writes to memory before any are allowed to continue execution
beyond the sub-group function. Synchronization between sub-groups must either be performed using
work-group functions, or through memory operations. Using memory operations for sub-group
synchronization should be used carefully as forward progress of sub-groups relative to each other is
only supported optionally by OpenCL implementations.
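A sub-group analogue of the previous example (hypothetical; it assumes the cl_khr_subgroups extension or OpenCL 2.1) would be:

    #pragma OPENCL EXTENSION cl_khr_subgroups : enable
    __kernel void subgroup_sum(__global const float *in, __global float *out) {
        float x = in[get_global_id(0)];
        /* The reduction synchronizes and combines values only across the
           work-items of the calling sub-group. */
        float s = sub_group_reduce_add(x);
        if (get_sub_group_local_id() == 0)
            out[get_group_id(0) * get_num_sub_groups() + get_sub_group_id()] = s;
    }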
A synchronization point between a pair of commands (A and B) ensures that the results of
command A happen before command B is launched. This requires that any updates to memory
from command A complete and are made available to other commands before the synchronization
point completes. Likewise, this requires that command B waits until after the synchronization point
before loading values from global memory. The concept of a synchronization point works in a similar
fashion for commands such as a barrier that apply to two sets of commands. All the commands prior
to the barrier must complete and make their results available to following commands. Furthermore,
any commands following the barrier must wait for the commands prior to the barrier before loading
values and continuing their execution.
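Within a single (possibly out-of-order) command-queue, such a barrier can be expressed with clEnqueueBarrierWithWaitList, as in the hypothetical fragment below:

    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, nbytes, src, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, producer, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    /* Synchronization point: nothing enqueued after the barrier may launch
       until everything enqueued before it has completed and its results are
       visible.  (In an in-order queue this ordering already holds.) */
    clEnqueueBarrierWithWaitList(queue, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, consumer, 1, NULL, &gsz, NULL, 0, NULL, NULL);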
2.5.3 OpenCL Memory Model
The OpenCL memory model describes the structure, contents, and behavior of the memory
exposed by an OpenCL platform as an OpenCL program runs [59]. The model allows a programmer
to reason about values in memory as the host program and multiple kernel-instances execute.
An OpenCL program defines a context that includes a host, one or more devices, command-
queues, and memory exposed within the context. Consider the units of execution involved with such
a program. The host program runs as one or more host threads managed by the operating system
running on the host (the details of which are defined outside of OpenCL). There may be multiple
devices in a single context which all have access to memory objects defined by OpenCL. On a
single device, multiple work-groups may execute in parallel with potentially overlapping updates to
memory. Finally, within a single work-group, multiple work-items concurrently execute, once again
with potentially overlapping updates to memory.
The memory regions, and their relationship to the OpenCL Platform model, are summarized
in Figure 2.12. Local and private memories are always associated with a particular device. The
global and constant memories, however, are shared between all devices within a given context. An
OpenCL device may include a cache to support efficient access to these shared memories.
[Figure: the host with its host memory; global and constant memory shared by the devices and reached over PCIe; and, for each device (running Kernel A and Kernel B), compute units with local memory and processing elements with private memory.]
Figure 2.12: The OpenCL Memory Hierarchy.
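From the host's point of view, two of these regions are made visible through kernel arguments, as in the hypothetical fragment below: a buffer object placed in global memory, and a per-work-group local allocation whose size is chosen by the host.

    cl_int err;
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, nbytes, NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);       /* __global argument */
    clSetKernelArg(kernel, 1, 256 * sizeof(float), NULL);  /* __local argument  */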
To understand memory in OpenCL, it is important to appreciate the relationship between
these named address spaces. The four named address spaces available to a device are disjoint, which
means that they do not overlap. This is their logical relationship, however, and an implementation
may choose to let these disjoint named address spaces share physical memory.
Programmers often need functions callable from kernels, where the pointers manipulated
by those functions can point to multiple named address spaces. This saves a programmer from
the error-prone and wasteful practice of creating multiple copies of functions, one for each named
address space. Therefore, the global, local and private address spaces belong to a single generic
address space.
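A hypothetical OpenCL 2.0 kernel illustrates the benefit: the helper function below takes an unqualified (generic) pointer and therefore serves both __global and __local data with a single definition.

    static void scale(float *p, float alpha) {   /* generic address space      */
        *p *= alpha;
    }

    __kernel void use_generic(__global float *g, __local float *l) {
        size_t i = get_local_id(0);
        l[i] = g[get_global_id(0)];
        scale(&l[i], 2.0f);                  /* called with a __local pointer  */
        scale(&g[get_global_id(0)], 2.0f);   /* called with a __global pointer */
    }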
2.6 Heterogeneous Computing
To take full advantage of the resources on a heterogeneous platform, the programmer has
to manage the allocation of these resources. In this section, we introduce several projects that
were designed or extended to support heterogeneous computing platforms. All of these runtimes
and libraries provide higher-level software layers with convenient abstractions, which relieve the
programmer of the burden of managing resources on the targeted heterogeneous platform.
[Figure: three layers, with the application (C++ source using the Qilin API) on top of the Qilin system (dynamic compiler, code cache, scheduler, libraries, and development tools), which in turn runs on the hardware (CPU and GPU).]
Figure 2.13: Qilin Software Architecture
2.6.0.1 Qilin
Qilin [60] is a programming system recently developed for heterogeneous multiprocessors.
Figure 2.13 shows the software architecture of Qilin. At the application level, Qilin provides an API
that programmers use to describe parallelizable operations. By explicitly expressing these computations
through the API, the compiler does not have to extract any implicit parallelism from the serial code,
and instead can focus on performance tuning. Similar to OpenMP, the Qilin API is built on top of
C/C++ so that it can be easily adopted. But unlike standard OpenMP, where parallelization only
happens on the CPU, Qilin can exploit the hardware parallelism available on both the CPU and the
GPU.
Beneath the API layer is the Qilin system layer, which consists of a dynamic compiler
and its code cache, a number of libraries, a set of development tools, and a scheduler. The compiler
dynamically translates the API calls into native machine code. It also produces a near-optimal
mapping from computations to processing elements using an adaptive algorithm. To reduce compilation
overhead, translated code is stored in the code cache so that it can be reused without recompilation,
whenever possible. Once native machine code is available, it can be scheduled to run on the CPU
and/or the GPU by the scheduler. Libraries include commonly used functions such as BLAS and FFT.
Finally, debugging, visualization, and profiling tools can be provided to facilitate the development of
Qilin programs.
Qilin uses off-line profiling to obtain information about each task on each computing
device. This information is then used to partition tasks and create an appropriate performance model
for the targeted heterogeneous platform. However, the overhead to carry out the initial profiling
phase can be prohibitively high and results may be inaccurate if computation behavior is heavily
input dependent.
[Figure: an OpenCL application on top of the OpenCL Common Runtime for Linux on x86, which exposes a single platform containing Device 0 and Device 1, a shared context with its program and memory objects, and a command-queue per device.]
Figure 2.14: The OpenCL environment with the IBM OpenCL common runtime.
2.6.0.2 IBM OpenCL common runtime
IBM's OpenCL common runtime [61] improves the OpenCL programming experience
by removing the burden from the programmer of managing multiple OpenCL platforms and duplicated
resources, such as contexts and memory objects. In the conventional OpenCL programming
environment, programmers are responsible for managing the movement of memory between two
or more contexts when multiple OpenCL devices are present on the platform. In this case, the
application is forced to use host-side synchronization in order to move its memory objects between
coordinating contexts. Equipped with the common runtime, this movement and synchronization is
done automatically.
In addition, the common runtime also improves the OpenCL programming experience by
relieving the programmer of managing cross-queue scheduling and event dependencies. By
convention, OpenCL requires that command-queue event dependencies must originate from the same
context as that of the command-queue. In a multiple-context environment, this restriction forces
programmers to manage their own cross-queue scheduling and dependencies. Again, this requires
additional host-side synchronization in the application. With the common runtime, cross-queue
event dependencies and scheduling are handled for the programmer.
Finally, the common runtime improves application portability and resource usage, which
reduces application complexity. In the conventional OpenCL environment, coordination of OpenCL
resources is more than just an inconvenience. Managing resources comes with challenges of
application portability, which becomes an issue when code is tuned for a particular underlying
platform. Applications are forced to choose whether to support only one platform, potentially leaving
compute resources unused, or to add complexity to manage resources across a range of platforms.
Using the unifying platform provided by the IBM OpenCL common runtime, applications are more
portable and resources can be more easily exploited.
IBM's OpenCL common runtime is designed to improve the OpenCL programming expe-
rience by managing multiple OpenCL platforms and duplicated resources. It minimizes application
complexity by presenting the programming environment as a single OpenCL platform. Shared
OpenCL resources, such as data buffers, events, and kernel programs are transparently managed
across the installed vendor implementations. The result is simpler programming in heterogeneous
environments. However, even equipped with this commercially-developed common runtime, many
of the tasks arising from multiple contexts, such as scheduling decisions and data synchronization,
must still be performed manually by the programmer.
2.6.0.3 StarPU
StarPU [62] automatically schedules tasks across the different processing units of an
accelerator-based machine. Applications using StarPU do not have to deal with low-level concerns
such as data transfers or efficient load balancing, which are target-system dependent. StarPU
is a C library that provides an API to describe application data, and can asynchronously submit
tasks that are dispatched and executed transparently over the entire machine in an efficient way.
Providing a separation of concerns between writing efficient algorithms and mapping them on
complex accelerator-based machines therefore makes it possible to achieve portable performance,
tapping into the potential of both accelerators and multi-core architectures.
An application first has to register data with StarPU. Once a piece of data has been
registered, its state is fully described using an opaque data structure, called a handle. Programmers
must then divide their applications into sets of possibly inter-dependent tasks. In order to obtain
portable performance, programmers do not explicitly choose which processing units will process the
different tasks.
Each task is described by a structure that contains the list of handles of the data that the task
will manipulate, the corresponding access modes (i.e. read, write, etc.), and a multi-versioned kernel
called a codelet, which gathers the various kernel implementations available on the different types of
processing units. The different tasks are submitted asynchronously to StarPU, which automatically
decides where to execute them. Thanks to the data description stored in the handle data structure,
StarPU also ensures that coherent replicas of the different pieces of data accessed by a task are
automatically transferred to the appropriate processing unit. If StarPU selects a CUDA device to
execute a task, the CUDA implementation of the corresponding codelet will be provided with pointers
to locally replicated data allocated in the memory on the GPU.
Programmers need not worry about where the tasks are executed, nor how data replicas
are managed for these tasks. They simply register data, submit tasks with their implementations for
the various processing units, and then either wait for their termination or rely on task dependencies.
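As a rough illustration of this workflow, a vector scaling task might be registered and submitted as sketched below (based on the StarPU C API of that period; field and macro names may differ slightly across StarPU releases):

    #include <starpu.h>
    #include <stdint.h>

    /* CPU implementation of the codelet; CUDA or OpenCL variants could be
       added to the same codelet for heterogeneous execution. */
    static void scale_cpu(void *buffers[], void *cl_arg) {
        float   *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (unsigned i = 0; i < n; i++) v[i] *= 2.0f;
        (void)cl_arg;
    }

    static struct starpu_codelet cl = {
        .cpu_funcs = { scale_cpu },
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    void scale_vector(float *vec, unsigned n) {
        starpu_data_handle_t handle;
        starpu_init(NULL);
        /* Register the data; StarPU now manages its coherence and transfers. */
        starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                    (uintptr_t)vec, n, sizeof(float));
        struct starpu_task *task = starpu_task_create();
        task->cl = &cl;
        task->handles[0] = handle;
        starpu_task_submit(task);        /* StarPU chooses the processing unit */
        starpu_task_wait_for_all();
        starpu_data_unregister(handle);
        starpu_shutdown();
    }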
StarPU is a simple tasking API that provides numerical kernel designers with a convenient
way to execute parallel tasks on heterogeneous platforms, and incorporates a number of different
scheduling policies. StarPU is based on the integration of a resource management facility with a
task execution engine. Several scientific kernels [63][64] [65][66] have been deployed on StarPU to
utilize the computing power of heterogeneous platforms. However, StarPU is implemented in C and
the basic schedulable units (codelets) have to be implemented multiple times if they are targeting
multiple devices. This limits the migration of the codelets across platforms, and increases the
programmer's burden. To overcome this limitation, StarPU has initiated a recent effort to incorporate
OpenCL [67] as the front-end.
2.6.0.4 Maestro
The Maestro model [68] unifies the disparate, device-specific queues into a single, high-level
task queue. At runtime, Maestro queries OpenCL to obtain information about the available
GPUs or other accelerators in a given system. Based on this information, Maestro can transfer data
and divide work among the available devices automatically. This frees the programmer from having
to synchronize multiple devices and keep track of device-specific information.
Since OpenCL can execute on devices which differ radically in architecture and compu-
tational capabilities, it is difficult to develop simple heuristics with strong performance guarantees.
Hence, Maestro's optimizations rely solely on empirical data, instead of any performance model
or a priori knowledge. Maestro's general strategy for all optimizations can be summarized by the
steps shown in Figure 2.15.
This strategy is used to optimize a variety of parameters, including local work group
size, data transfer size, and the division of work across multiple devices. However, these dynamic
execution parameters are only one of the obstacles to true portability. Another obstacle is the choice
of hardware-specific kernel optimizations. For instance, some kernel optimizations may result in
excellent performance on a GPU, but reduce performance on a CPU. This remains an open problem.
Since the solution will no doubt involve editing kernel source code, it is beyond the scope of Maestro.
Maestro is an open source library for data orchestration on OpenCL devices. It provides
automatic data transfer, task decomposition across multiple devices, and auto-tuning of dynamic
execution parameters for selected problems. However, Maestro relies heavily on empirical data and
benchmark profiling beforehand. This limits its ability to run on applications with data-dependent
program flow and/or data dependencies.
2.6.0.5 Symphony
Symphony [69], previously known as MARE (Multicore Asynchronous Runtime Environ-
ment) [70], seamlessly integrates heterogeneous execution into a concurrent task graph and removes
the burden from the programmer of managing data transfers and explicit data copies between kernels
[Figure: estimate based on benchmarks; collect empirical data from execution; optimize based on results; repeat while performance continues improving; otherwise adopt the final performance strategy.]
Figure 2.15: Maestro's Optimization Flow
executing on different devices. At a low level, Symphony provides state-of-the-art algorithms for
work stealing and power optimizations that can hide hardware idiosyncrasies, allowing for portable
application development. In addition, Symphony is designed to support dynamic mapping of kernels
to heterogeneous execution units. Moreover, expert programmers can take charge of the execution
through a carefully designed system of attributes and di