
Intel oneAPI: a Performance Study

Bachelor Informatica

Tristan Blommers

June 15, 2020

Supervisor(s): Ana-Lucia Varbanescu

Informatica - Universiteit van Amsterdam


Abstract

Intel has created oneAPI, a framework aiming to simplify the programming of CPUs and accelerators using modern C++ features to express parallelism. In this work, we focus on studying the performance of this framework. Our analysis is based on comparing oneAPI against OpenMP. To this end, we select example applications from Intel's GitHub and we port equivalent versions to OpenMP. We then run multiple experiments for each of these applications, using both OpenMP and oneAPI, different data sizes, and different optimization options, and collect performance data. The analysis of these performance results illustrates that oneAPI has cases in which it performs better, and cases in which it performs worse than OpenMP. When oneAPI performs worse, the reason is mostly the overhead of working with a command queue or the set-up of how data is accessed. When oneAPI performs better, we manage to get the performance of OpenMP equal to that of oneAPI. These preliminary results demonstrate that oneAPI can be a viable solution for portable programming for multiple devices, though one should make sure that their application is well suited for oneAPI.


Contents

1 Introduction
  1.1 Research question and Approach
  1.2 Ethics
  1.3 Thesis outline

2 Theoretical background and related work
  2.1 Hardware platforms and portability
  2.2 Intel oneAPI
    2.2.1 Platform model
    2.2.2 Execution model
    2.2.3 Memory model
    2.2.4 Kernel model
  2.3 The thirteen dwarfs
  2.4 Related work

3 Empirical evaluation methodology
  3.1 Experimental setup
  3.2 Methodology
  3.3 Performance analysis tools and metrics

4 Vector addition
  4.1 Setting up the experiment
    4.1.1 Implementation differences in oneAPI and OpenMP
    4.1.2 Performance measurement
  4.2 Performance results and analysis
    4.2.1 Experiment 1: Execution phases and overhead
    4.2.2 Experiment 2: OneAPI vs. OpenMP
    4.2.3 Experiment 3: Closing the gap

5 Second order stencil
  5.1 Problem explanation
  5.2 Setting up the experiments
    5.2.1 Implementation differences oneAPI and OpenMP
    5.2.2 Performance measurement
  5.3 Performance results and analysis
    5.3.1 Experiment 1: OneAPI vs OpenMP
    5.3.2 Experiment 2: Closing the gap

6 Matrix multiplication
  6.1 Problem explanation
  6.2 Setting up the experiments
    6.2.1 Implementation differences oneAPI and OpenMP
    6.2.2 Performance measurement
  6.3 Performance results and analysis
    6.3.1 Experiment 1: OneAPI vs OpenMP
    6.3.2 Experiment 2: Closing the gap (CPU)
    6.3.3 Experiment 3: Closing the gap (GPU)

7 Conclusion and Future Work
  7.1 Future work


CHAPTER 1

Introduction

Modern workloads have a lot of diversity, from scientific simulations to machine learning, and from big data analysis to real-time computing. There is no single architecture that works best for every one of these workloads, which has led to a fair diversity of architectures available today.

To maximize performance, we often need a mix of Scalar, Vector, Matrix, and Spatial (SVMS) architectures (like CPUs, GPUs or FPGAs) to be deployed in the same computing systems [1]. This is where Intel oneAPI comes into play. With oneAPI, Intel claims to be able to deliver the tools needed to deploy applications and solutions across all these architectures, thus solving the code portability problem that traditional, native models cannot overcome.

The oneAPI programming model is supposed to simplify the programming of CPUs and accelerators using modern C++ features to express parallelism. This is achieved using a programming language called Data Parallel C++ (DPC++), which would run unmodified over these different architectures. Specifically, what Intel says sounds both promising and intriguing: "To deploy applications and solutions across SVMS architectures" and "Simplify programming and help developers improve efficiency and innovation" [2].

However, before we jump on board to code all our applications with oneAPI, there is one missing point: the performance analysis. There is currently no understanding of what the performance of this new programming model/framework is, compared to already existing models.

1.1 Research question and Approach

We aim to solve the problem mentioned above by answering the following research question: "Does simplifying programming using oneAPI come with any adverse effects on application performance?"

To answer this question, we propose an empirical evaluation using applications that were chosen from Intel's examples, and assessing their performance when running on CPUs and, where possible, GPUs. For comparison, we use OpenMP versions of the same applications. We postulate that the difference in performance between the two versions of each application - that is, the OpenMP and the oneAPI version - offers an indication of the performance gap between the two models. In the case of a gap, we further aim to improve the performance of the worse version by traditional optimizations. When successful, this optimization procedure can illustrate the causes of the gaps.


1.2 Ethics

First we would like to mention that all the used code, data and results from this study are open source. This allows for full transparency and reproducibility. We also verify the correctness of both the provided code as well as that of the ported code. All of our reported performance results are for code where correctness was ensured. It is thus never the case that we improve performance while accidentally changing the output of an application.

OneAPI enables a programmer to improve their code using parallelism. By using parallelism, one can improve the utilization of their computing resources. OneAPI also enables portability of code, meaning that one version of code can be used for multiple devices. This can greatly reduce the program development time, enables better productivity and reduces the time needed to test code. Enabling portability via oneAPI thus makes the development process more efficient. In conclusion, by using oneAPI we should be able to increase the speed at which innovative applications can be created, while also decreasing the resources needed to run said applications.

1.3 Thesis outline

To end our introductory chapter, we provide a brief outline of this thesis. Chapter 2 focuses on theoretical background and related work. In this chapter we first introduce oneAPI. We discuss the language it uses, how it divides tasks, and the programming model that it uses. We then briefly introduce the thirteen dwarfs of parallel programming and how our applications fit in this idea. Finally, we discuss some related papers which we used to confirm that our approach is viable.

Chapter 3 introduces our empirical evaluation methodology. In this chapter we introduce the different hardware and software platforms we have used, the methodology behind our empirical evaluation, and the metrics and tools we used for performance analysis.

Chapters 4, 5 and 6 explain our approach for every application in more detail. For every application, we explain briefly what the application entails, we present the specifics of implementing the application in oneAPI and OpenMP, and we point out the differences. Lastly, we discuss the specifics of the experiments we conduct, and present and analyse performance results. We end the thesis in Chapter 7, where we formulate an answer to our research question and sketch some future work directions.


CHAPTER 2

Theoretical background and related work

2.1 Hardware platforms and portability

OneAPI has been designed to support multiple types of architectures: CPUs (scalar), GPUs (vector), AI-specific (matrix), and FPGAs (spatial) (see Figure 2.1 for an illustration).

Figure 2.1: Platforms supported by oneAPI. Image from https://www.colfax-intl.com/training/intel-oneapi-training.

In this thesis, we focus on CPUs and GPUs. CPUs are scalar architectures, characterized by a few independent cores. GPUs are vector architectures (i.e., using the SIMD principle), where multiple threads execute the same instructions on different data.

CPUs are programmed using models like pthreads [3], TBB [4], or OpenMP [5]. GPUs can use native models (like CUDA for NVIDIA GPUs or HIP for AMD GPUs), or can use computation offloading models such as OpenACC [6] or OpenMP offloading [5]. We note that no specific native programming model has been introduced so far for Intel GPUs. Finally, both platforms can be programmed using OpenCL - a standard-driven programming model, which appeared in 2009 aiming to support several multi- and many-core architectures with the same code. SYCL [7] is a recent spin-off of the OpenCL standard, based on C++. We note that DPC++, the programming model of oneAPI, is heavily based on SYCL.

In this thesis, we aim to compare oneAPI against native programming models, and avoid OpenCL, thus avoiding a biased comparison of the SYCL-based DPC++ against its parent, OpenCL. To this end, we use OpenMP for CPUs and OpenMP offloading for GPUs.


2.2 Intel oneAPI

There might not be any published papers about oneAPI yet, but we can still explain how oneAPI functions. DPC++, the language used by oneAPI, enables code reuse for the host (e.g. CPU) and accelerators (e.g. GPU) using a single source language. Using mapping in the DPC++ code can allow an application to transition to (a set of) hardware that best accelerates the workload. The oneAPI programming model is based upon the SYCL specification [7]. This specification divides the programming model into four sub-models. These sub-models categorize the actions that programmers need to take to employ one or more devices as accelerators. The four sub-models are: the Platform model, the Execution model, the Memory model, and the Kernel model. We shall now explain in more detail what every sub-model entails.

2.2.1 Platform model

The platform model is used to specify a host and the devices it controls. A host is typically a CPU-based system which executes the main part of a program, i.e., the code that executes on the host and the code that acts as the interface between the host and device. A device is "an accelerator, a specialized component containing compute resources that can quickly execute a subset of operations typically more efficiently than the system's CPUs" [8].

A program can ask the host for all available platforms, which consist of one or multiple devices. Each device has one or more compute units available, which are used to run multiple operations in parallel. A compute unit contains one or more processing elements, which are used as individual engines for computation. The program can either specify which platform and device it wants to use, or it can let the oneAPI runtime choose a default device.
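As a small illustration of this querying step, the following is a minimal sketch (ours, not Intel's sample code) that lists the available platforms and devices and then lets the runtime pick a default device; it assumes the SYCL 1.2.1-style API that DPC++ builds on.

#include <CL/sycl.hpp>
#include <iostream>
using namespace cl::sycl;

int main() {
  // Ask for all available platforms and the devices they contain.
  for (const auto& plat : platform::get_platforms()) {
    std::cout << plat.get_info<info::platform::name>() << "\n";
    for (const auto& dev : plat.get_devices())
      std::cout << "  device: " << dev.get_info<info::device::name>() << "\n";
  }
  // Alternatively, let the runtime choose a default device for a queue.
  queue q{default_selector{}};
  std::cout << "queue runs on: "
            << q.get_device().get_info<info::device::name>() << "\n";
}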

In Figure 2.2, a platform model is visualised. The host (left) in this example is a CPU in a desktop computer. The host can choose to communicate with one or more devices. Note that the CPU itself is also an available device.

Figure 2.2: Visualisation of a platform model [8].


2.2.2 Execution model

The execution model is divided into two parts, the host execution model and the device execution model.

The host execution model specifies the execution and data management between the host and device. This is achieved by using command queues, which consist of command groups. Command groups are groups of commands such as accessors and kernel invocations.

The device execution model specifies how computation is accomplished on the accelerator. This part of the model controls how, and how much, parallelism is executed. It achieves this by using a hierarchy of ND-ranges, work-groups, sub-groups, and work-items. The total work is specified using an ND-range, which can be either one-, two- or three-dimensional. The total work is then divided over a number of work-groups, and these work-groups are again divided into sub-groups. Every sub-group then contains a certain number of work-items. A work-item is an item which executes the kernel code. In Figure 2.3, this process of dividing work is visualized for a three-dimensional set of data. The total work is specified using an ND-range of size X * Y * Z, a work-group has a size of X' * Y' * Z', a sub-group has size X', and a work-item is a single cube.

When executing kernel code, it can be necessary to know the location of a work-item in its sub-group, work-group or total ND-range. An example of this is when a work-item represents a cell in a grid, which uses its surrounding values to calculate a new value. To attain this location, a unique identification of the work-item is provided via built-in functions, for example the local_id, work_group_id and global_id functions of the nd_item class.
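As a small illustration (a sketch of ours, not Intel's sample code) of how a kernel obtains these identifiers; in the DPC++/SYCL API the corresponding nd_item member functions are named get_global_id, get_local_id and get_group:

// Kernel launched over an X-by-Y ND-range with X'-by-Y' work-groups;
// h is the command-group handler, and X, Y, Xp, Yp are assumed to be defined.
h.parallel_for(nd_range<2>{range<2>{X, Y}, range<2>{Xp, Yp}},
               [=](nd_item<2> item) {
                 size_t gx = item.get_global_id(0);  // position in the ND-range
                 size_t lx = item.get_local_id(0);   // position in the work-group
                 size_t wg = item.get_group(0);      // index of the work-group
                 // ... kernel body using gx, lx and wg ...
               });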

Figure 2.3: Visualisation of an execution model [9].

2.2.3 Memory model

The memory model defines how the host and devices interact with memory. In this model, the memory is located on and owned by either the host or the device. The model uses two types of memory objects to achieve this, namely buffers and images. To interact with one of these objects, an accessor is required. This accessor communicates the location of access, such as host or device, and the access mode, such as read-only, write-only or read-write. There are thus two general steps needed to work with the memory model:

1. Instantiate a buffer or image object. The host or device memory for the buffer or image is allocated as part of the instantiation or is wrapped around previously allocated memory on the host.

2. Instantiate an accessor object. This accessor communicates the location of access, such as host or device, and the access mode, such as read-only, write-only or read-write. A minimal sketch of both steps is shown below.
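The sketch below (ours, not from the programming guide) shows the two steps for a buffer wrapped around host memory; it assumes the SYCL header is included and a queue q has already been created.

std::vector<float> data(1024, 1.0f);
// Step 1: a buffer object wrapped around previously allocated host memory.
buffer<float, 1> data_buf(data.data(), range<1>{data.size()});

q.submit([&](handler& h) {
  // Step 2: an accessor stating where the buffer is accessed (on the device,
  // because it is created inside a command group) and how (read-write).
  auto acc = data_buf.get_access<access::mode::read_write>(h);
  h.parallel_for(range<1>{data.size()}, [=](id<1> i) { acc[i] += 1.0f; });
});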

2.2.4 Kernel model

The kernel model defines which code will be executed on a device. A developer has to choose which part of the code will run on the host and which part of the code will run on the device. The code that will be executed on the device is what we call the kernel code.

OneAPI supports single source code, meaning that the kernel code can be in the same file as the host code. For specific language requirements of the kernel, one can see Intel's programming guide for oneAPI [10]. The device code is specified using one of three language constructs: lambda expressions, functors, or kernel classes. The use of these three different language constructs allows flexibility in combining the host and device code.

The amount of parallelism desired can be requested using three separate mechanisms:

1. single_task: executes a single instance of the kernel with a single work-item.

2. parallel_for: executes a kernel in parallel across a range of processing elements.

3. parallel_for_work_group: executes a kernel in parallel, across a hierarchical range of processing elements, using local memory and barriers.

2.3 The thirteen dwarfs

In this section we briefly discuss Berkeley's 13 dwarfs and what they entail. The 13 dwarfs originate from "The Landscape of Parallel Computing Research: A View from Berkeley" [11] and are used to design and evaluate parallel programming models and architectures. A dwarf is an algorithmic method that captures a pattern of computation and communication.

Ideally, one would want to cover all 13 dwarfs in their tests. However, due to the limited amount of time available, this will not be the case in our thesis. The dwarfs that are included in our thesis can be found in Table 2.1, which lists a subset of the dwarfs. A full list of dwarfs and their descriptions is presented in "The Landscape of Parallel Computing Research: A View from Berkeley" [11].

Dwarf                  Description

Dense linear algebra   Consists of dense matrix and vector operations. It has a high ratio of
                       math-to-load operations and a high degree of data interdependency
                       between threads.

Structured grids       Organizes data in a regular multidimensional grid, where computation
                       proceeds as a series of grid updates. For each grid update, all points
                       are updated using values from a small neighborhood around each point.

Table 2.1: The covered dwarfs with brief descriptions [12]


2.4 Related work

To the best of our knowledge, no independent study on the performance of oneAPI exists. To get a better idea of how to start analysing the oneAPI performance, we investigate other, similar cases that benchmark the performance of portable programming models through a brief literature study.

The first paper we analysed was "A Comprehensive Performance Comparison of CUDA and OpenCL" [13], which compares CUDA and OpenCL. This research uses applications from benchmark suites (i.e., SHOC [14] and Rodinia [15]) to analyse the performance gap between CUDA and OpenCL. This paper had a lot of different types of applications, and used different performance metrics for different applications. Just like we plan to do, they ran their applications on multiple hardware platforms and compared the performance of CUDA and OpenCL on each platform.

The second paper we analysed was "Performance Gaps Between OpenMP and OpenCL for Multi-core CPUs" [16], which compares OpenMP and OpenCL. This paper also selected applications from the Rodinia benchmark suite. They ran their applications on three hardware platforms. They ran most of the experiments on one platform and validated their findings on the other two platforms. The main performance metric used in this paper is execution time (reported in ms), which we are also planning to use. They changed the performance of their OpenCL implementation by using compiler options and the performance of their OpenMP implementation by changing the number of threads used.

A third paper we analysed was "Evaluating the Performance and Portability of OpenCL" [17], which compares the performance of OpenCL to CUDA as well as the performance portability of an OpenCL program across different architectures. In our case, we were mostly interested in the performance comparison. In this paper, the authors introduce a few algorithms, which they then implemented in both CUDA and OpenCL for different devices. They map the CUDA and OpenCL implementations to their respective hardware platforms and measure the execution time of both.

Lastly, we analysed the work in "Assessing the performance portability of modern parallel programming models using TeaLeaf" [18]. This paper chose to compare multiple models in depth using only one application. This application uses three different sparse matrix solvers and covers 2 of the 13 parallel computing dwarfs, namely structured grid and sparse linear algebra [11]. They tested three solvers of this application on three different hardware platforms: a CPU, a GPU and a KNC (Knights Corner). For every solver, they used the exact same parameters. Their performance metric was execution time in seconds. They also investigated the development cost of each implementation, looking, for example, at productivity (the lines of code) and development complexity (the knowledge required by the programmer).


CHAPTER 3

Empirical evaluation methodology

In this chapter, we introduce the different hardware and software platforms we have used, the methodology behind our empirical evaluation, and the metrics and tools we used for performance analysis.

3.1 Experimental setup

We run our experiments on two machines: a personal computer (PC) and Intel's DevCloud (DevC)¹. In this thesis we show only the performance on the DevCloud, and we use the PC only to validate our findings. Table 3.1 presents the hardware specifications for both machines. Moreover, Table 3.2 lists the compiler specifications for both machines. We use DPC++ to compile our oneAPI code, gcc with OpenMP to compile our OpenMP CPU code, and icc with OpenMP to compile our OpenMP GPU code.

        CPU device                                   GPU device
PC      Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz     Nvidia GeForce GTX 1070
DevC    Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz       Intel(R) Gen9 HD Graphics NEO

Table 3.1: Device specifications

            gcc             DPC++                    OpenMP        icc
My PC       version 9.3.0   version 2021.1-beta05    version 4.5   not installed
DevCloud    version 7.4.0   version 2021.1-beta06    version 4.5   version 19.1

Table 3.2: Compiler specifications

3.2 Methodology

For our experiments, we decided to start by porting Intel's example applications to OpenMP. Intel has given a few example applications for the CPU and GPU, including their so-called "golden standard" sequential implementation [19][20]. These are applications that currently already work, including using the oneAPI framework, and should not have any performance issues due to the code.

¹We also wanted to run our experiments on an UvA machine, but due to troubles with permissions we were unable to install the oneAPI framework on an UvA machine.


For each selected application, we apply the following step-by-step analysis:

1. We build a parallel version using OpenMP. We do so by adding the right OpenMP pragmas to the sequential code, thus creating the corresponding parallel code. We note that, for GPUs, we use OpenMP offloading.

2. We empirically compare the performance obtained by the two different parallel versions of each application: the oneAPI version and the OpenMP version.

3. We select the version that performed the worst, and attempt to iteratively optimize its performance so that it performs either as well as, or better than, the alternative version. With this approach, we can discover, for example, optimizations that one of the compilers cannot apply.

4. If we cannot improve the performance of the worst performing version, we also attempt to decrease the performance of the best performing version (for example, by mimicking potential naive programmer errors). With this approach, we can also discover potential optimizations that one of the compilers cannot make.

OpenMP optimization

There are a few ways in which we can improve the performance of OpenMP code.

For CPUs First of all, we can change the number of threads used, expecting that using more threads improves performance (for big input-data sizes). Next, we can also change the scheduling method that OpenMP uses to divide iterations of associated loops into subsets [21]. Moreover, we can use compiler flags. Using compiler flags, we can, for example, force the vectorization of all loops using -ftree-vectorize. We can also use the optimize options provided by gcc, though this must be done with care. The compiler flag -O3, for example, enables optimizations that are expensive in terms of compile time and memory usage, which can actually slow down a system.
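As a minimal sketch of what these knobs look like in practice (the file name, flag combination, thread count and schedule below are illustrative, not the exact settings used in our experiments):

// Compiled, for example, with:
//   g++ -fopenmp -O3 -ftree-vectorize vector_add_omp.cpp
#include <omp.h>
#include <cstddef>

void scale(float* data, std::size_t n) {
  omp_set_num_threads(8);                      // vary the number of threads
  #pragma omp parallel for schedule(static)    // vary the scheduling method
  for (std::size_t i = 0; i < n; i++) {
    data[i] *= 2.0f;
  }
}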

For GPUs Optimizing OpenMP offloading is more difficult than optimizing OpenMP for CPUs. One can, for example, not use the previously mentioned compiler flags, because compiler flags such as -O3 only apply to the CPU part of the code. Instead, we need to tweak the parameters of the offload pragma manually. We could, for example, use simd to try and vectorize a code region, or we can try to map our data to a device's data environment more effectively. Apart from standard pragmas, such as schedule and simd, we also have access to a few additional pragmas specifically designed for OpenMP offloading, such as target teams and distribute [22].
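A minimal sketch (ours, not the code used in our experiments) of this kind of pragma-level tuning, mapping the arrays explicitly and requesting SIMD execution inside the distributed loop; the array names and sizes are illustrative and assumed to be defined on the host:

#pragma omp target teams distribute parallel for simd \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
for (int i = 0; i < n; i++) {
  c[i] = a[i] + b[i];
}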

OneAPI optimization

There are also several ways in which one can improve the performance of the oneAPI application even further. Similar to changing the number of threads, in oneAPI we can change the number of work-items in a work-group, which effectively means we alter the distribution of work per thread and/or core. Because of how work-groups are set up, we can change the size of a work-group in up to three dimensions. It is also possible that the default device selector did not choose the optimal device for the application. Thus, another way in which we can try to improve the performance is by targeting another device.


3.3 Performance analysis tools and metrics

The main performance metric for our experiments is execution time. The main concern about execution time as a metric is that it is not comparable between experiments running on different machines. For our experiments, however, we only compare the same application's versions in OpenMP and oneAPI on the same machine (either PC or DevC - see Table 3.1).

Another useful metric could be FLOPS, calculated as the number of operations performed by the application divided by the recorded execution time. When calculating FLOPS using the theoretical number of operations per application, the two versions - OpenMP and oneAPI - would have the same number of operations, and the results of the comparison would be equivalent to the comparison of execution time. There are more accurate ways to calculate FLOPS - i.e., by using tools to inspect the executable code and count operations to determine the actual number of FLOPS for each version, thus taking actual compiled-code differences into account. However, accessing tools for these measurements is not possible on DevCloud. Thus, we do not use FLOPS for our evaluation.

To measure execution time, we decided to use a clock from the chrono library, "a flexible collection of types that track time with varying degrees of precision" [23]. This library has three different clocks which we could use: system_clock, steady_clock and high_resolution_clock. The system_clock represents the system-wide real time wall clock. This clock "may not be monotonic: on most systems, the system time can be adjusted at any moment" [24]. The steady_clock on the other hand is monotonic. The time between ticks of this clock is constant and the clock itself is not linked to the wall clock time. The chrono library describes this clock as "most suitable for measuring intervals" [25]. Finally, the high_resolution_clock is the clock with the shortest tick period available. This clock "is not implemented consistently across different standard library implementations, and its use should be avoided" [26].

Based on these descriptions, the straightforward choice is to use the steady_clock to measure performance. To make sure that the precision of this clock is high enough, we ran a quick test using the code presented in Listing 3.1. The result was a precision of 1/1,000,000,000 s = 1 nanosecond for both the PC and DevCloud.

using std::chrono::steady_clock;

std::cout << "steady_clock" << std::endl;
std::cout << "precision = "
          << float(steady_clock::period::num) / steady_clock::period::den
          << std::endl;
std::cout << "steady = " << steady_clock::is_steady << std::endl;

Listing 3.1: Code used to test the precision of a clock

The calculation in Listing 3.1 works for the following reason. The period member of a clock is defined as "the tick period of the clock in seconds" [25]. It is a std::ratio, which is a template used to represent ratios at compile time, and has the two integral constant members used in our calculation, num and den. These members are the numerator and denominator of the fraction that defines the clock's period.
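For completeness, this is roughly how the steady_clock is used around a measured region (a sketch of ours; run_kernel stands in for the code being measured and is not a function from our applications):

#include <chrono>
#include <iostream>

// Hypothetical stand-in for the measured work.
void run_kernel() { /* ... application code ... */ }

int main() {
  using clk = std::chrono::steady_clock;
  auto start = clk::now();
  run_kernel();
  auto stop = clk::now();
  // Convert the tick difference to milliseconds, the unit we report.
  double ms = std::chrono::duration<double, std::milli>(stop - start).count();
  std::cout << "execution time = " << ms << " ms" << std::endl;
}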

Apart from measuring execution time, we apply a few other tools/techniques to compare the performance of two applications. One of these techniques is the comparison of assembly code output, which we acquire using gcc's -S flag. We, for example, compare different assembly code outputs to see whether vectorization has occurred in one application, but not in the other.

Another tool we use is -fopt-info, which requires the compiler to provide additional information, in the form of a compiler report, about any optimizations that were applied while compiling the code. For example, this report can tell users which loops have been vectorized and which loops have been unrolled. An example of this when using the optimization flag -O2 can be seen in Listing 3.2.


OpenMp_Stencils.cpp:166:31: note: loop vectorized
OpenMp_Stencils.cpp:166:31: note: loop versioned for vectorization because of possible aliasing
OpenMp_Stencils.cpp:196:27: note: loop vectorized
OpenMp_Stencils.cpp:196:27: note: loop versioned for vectorization because of possible aliasing
OpenMp_Stencils.cpp:62:43: note: loop vectorized
OpenMp_Stencils.cpp:51:27: note: loop vectorized
OpenMp_Stencils.cpp:51:27: note: loop versioned for vectorization because of possible aliasing
OpenMp_Stencils.cpp:48:26: note: basic block vectorized
OpenMp_Stencils.cpp:110:12: note: basic block vectorized
OpenMp_Stencils.cpp:162:32: note: basic block vectorized
OpenMp_Stencils.cpp:212:1: note: basic block vectorized
OpenMp_Stencils.cpp:162:32: note: basic block vectorized
OpenMp_Stencils.cpp:224:5: note: basic block vectorized
OpenMp_Stencils.cpp:224:5: note: basic block vectorized
OpenMp_Stencils.cpp:224:5: note: basic block vectorized
OpenMp_Stencils.cpp:224:5: note: basic block vectorized

Listing 3.2: Example output of compiler when using -fopt-info with -O2.


CHAPTER 4

Vector addition

The first application we selected from the examples provided by Intel is vector addition. This application calculates the sum of two (or more) arrays, element by element. Vector addition is an elementary linear algebra operation, and therefore it is easy to understand, both conceptually and at code level. As such, it is very useful as a "getting started tutorial" to learn oneAPI by example, while also comparing the performance for elementary linear algebra operations.

4.1 Setting up the experiment

The sequential C++-version of the code is presented in Listing 4.1.

for (size_t i = 0; i < sum_sequential.size(); i++) {
  sum_sequential[i] = first_vector[i] + second_vector[i];
}

Listing 4.1: Sequential implementation of the vector addition

4.1.1 Implementation differences in oneAPI and OpenMP

The oneAPI version creates a device-queue to which tasks are submitted. It then creates two read-only buffers to access the values to be added, and one write-only buffer to write the resulting values to.

The code has one parallel section and thus consists of one parallel_for loop, meaning one kernel. In this kernel, the data to be added is accessed using the read-only vector buffers. The result of adding the read values is then stored to the write-only sum buffer. The buffer destructors are responsible for copying the data back to the host after the device has finished the task.
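To make this structure concrete, a minimal sketch of such a kernel follows (our own reconstruction, not Intel's sample code verbatim; the variable names and element type are illustrative):

#include <CL/sycl.hpp>
#include <vector>
using namespace cl::sycl;

void vector_add(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& sum) {
  queue q;                                // device-queue receiving the task
  range<1> n{sum.size()};
  buffer<float, 1> a_buf(a.data(), n);    // read-only input
  buffer<float, 1> b_buf(b.data(), n);    // read-only input
  buffer<float, 1> s_buf(sum.data(), n);  // write-only output
  q.submit([&](handler& h) {
    auto A = a_buf.get_access<access::mode::read>(h);
    auto B = b_buf.get_access<access::mode::read>(h);
    auto S = s_buf.get_access<access::mode::write>(h);
    h.parallel_for(n, [=](id<1> i) { S[i] = A[i] + B[i]; });
  });
}  // the buffer destructors copy the result back to the host here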

OpenMP achieves parallelism by using one pragma - i.e., #pragma omp parallel for - instead of the kernel and queue that oneAPI uses. As a result, a set of threads will be created, and each one of these threads will execute a (similar) number of addition operations.
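In other words, the OpenMP version is essentially Listing 4.1 with the pragma added (a sketch; sum_parallel is our name for the output vector, not necessarily the name used in the code):

#pragma omp parallel for
for (size_t i = 0; i < sum_parallel.size(); i++) {
  sum_parallel[i] = first_vector[i] + second_vector[i];
}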

4.1.2 Performance measurement

To measure the difference in performance well, we decided to vary a number of variables.

First, we vary the data we used. The difference in data in this case simply means that we vary the number of elements in the vectors that are being added.

Second, we decided to vary the number of threads in OpenMP and the number of work-groups in oneAPI. The number of threads can easily be changed using omp_set_num_threads. The number of work-groups that can be used, however, changes depending on the size of the data used: the size of the data must be divisible by the chosen work-group size. This means that for a small data size it becomes impossible to use a number of work-groups higher than this small data size.

To make sure that the results were consistent, we decided to loop the program 1000 times for every combination of variables. A lower number of loops would make the results prone to measurement errors, while a higher number of loops would take too much time.

4.2 Performance results and analysis

4.2.1 Experiment 1: Execution phases and overhead

Preliminary performance results for oneAPI indicated quite high variability of the results when executing this simple application. Thus, to get a good idea of which parts of the execution might cause some overhead, we separated the oneAPI version into several phases, based on their functionality, and measured the execution time for each separate part.

Thus, we identified 4 different sections: Queue creation, Buffer creation, Accessor creation, and Queue submission. We then compared the first run, the second run, and the average of the subsequent 100 runs. We did this for a data size of 256, to minimize the effect of the vector addition itself on the queue submission. The results are presented in Table 4.1.

                    Execution time [ms]
                    First run    Second run    Avg of 100 runs
Queue creation      1305         3             3
Buffer creation     0.038        0.024         0.021
Accessor creation   0.039        0.024         0.022
Queue submission    171          121           119
Total time          1476         124           120

Table 4.1: Execution phases for the vector addition application implemented in oneAPI.

Looking at the results, we observe that the first run always performs worse than the subsequent runs. The cause for this decline in performance seems to mostly be the queue creation. This significant overhead of the queue creation seems to only happen when a device is being linked to a queue for the first time. All subsequent queue creations for the same device have a much smaller overhead (3 ms).

Apart from the queue creation, the first queue submission also seems to cause an overhead, albeit a smaller one. Once again this overhead only appears at the first queue submission to a device.

Because the overhead of the first queue creation, as well as that of the first queue submission, is so significant, we decided to leave out the first run when measuring the performance of an application from now on.

4.2.2 Experiment 2: OneAPI vs. OpenMP

Our next experiment focuses on the core analysis, namely comparing oneAPI's performance to that of OpenMP. This is the very first comparison, without applying any optimizations, enabling us to determine which one of the two models performs better. We run both versions 1000 times, for different data sizes (ranging from 256 to 100 million items). We report the average execution time across the 1000 runs. The results are presented in Table 4.2 and Figure 4.1.


              Execution time [ms]
Vector size   256    1k     10k    100k   500k   1M     5M     10M    50M    100M
Sequential    0.01   0.01   0.04   0.26   1.2    2.6    13     24     115    228
oneAPI        112    113    114    114    115    115    123    165    292    430
OpenMP        0.01   0.01   0.02   0.04   0.2    0.3    2.2    4.8    25     50

Table 4.2: Execution time (ms) for the original (unoptimized) vector-add versions.

Figure 4.1: Execution time (ms) comparison of vector-add. Please note the logarithmic scale on the vertical axis.

Looking at the results for both oneAPI and OpenMP, it is clear that the performance of oneAPI is significantly worse than that of OpenMP. Our next step was to try to bridge this performance gap by improving the performance of the oneAPI version.

4.2.3 Experiment 3: Closing the gap

We speculated that the cause for the performance difference visible in Figure 4.1 is the size of the work-groups used in oneAPI. Thus, we changed the implementation in oneAPI to manually specify the size of the work-groups. We only go up to a work-group size of 256, because that is the maximum size that our device allows. We test this new version for four different data sizes: 100 million, 10 million, 1 million and 256. We also added the performance of the automatically chosen work-group size to the figure, so that we can easily compare it against the manually chosen work-group sizes. The results are presented in Figure 4.2.
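Manually specifying the work-group size boils down to launching the kernel over an nd_range instead of a plain range; a minimal sketch (reusing the buffer names from the earlier vector-add sketch; data_size and wg_size are our names and are assumed to be defined):

q.submit([&](handler& h) {
  auto A = a_buf.get_access<access::mode::read>(h);
  auto B = b_buf.get_access<access::mode::read>(h);
  auto S = s_buf.get_access<access::mode::write>(h);
  // data_size work-items in total, grouped into work-groups of wg_size
  // items each; data_size must be divisible by wg_size.
  h.parallel_for(nd_range<1>{range<1>{data_size}, range<1>{wg_size}},
                 [=](nd_item<1> item) {
                   size_t i = item.get_global_id(0);
                   S[i] = A[i] + B[i];
                 });
});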


Figure 4.2: Execution time (ms) for sizes: 100 Million, 10 Million, 1 Million and 256

From this graph, it is clear that, as one may have already expected, the size of a work-group has more effect on the performance when the size of the data is bigger. This means it is in fact important to choose a suitable work-group size for the problem at hand. For vector-add, however, the work-group size that is automatically chosen by the oneAPI framework seems to perform as well as our manually chosen best work-group size.

It is, however, still apparent that oneAPI performs worse than OpenMP. What is even worse is that the problem does not seem to lie with the chosen work-group size. Instead, it seems to be an unexpectedly large overhead during the execution of the oneAPI version. This overhead also seems to get worse for bigger data sizes. Because we cannot see what oneAPI does behind the scenes, it becomes difficult to pinpoint the exact reason for this overhead. For now all we know is that an overhead exists, which depends on both the device as well as the size of the data being transferred, and which can be quite significant for applications such as vector-add.


CHAPTER 5

Second order stencil

After learning more about oneAPI and its general overhead, we decided to look at a more complex application, with a more significant computation load than vector addition. We selected a second oneAPI example to port to OpenMP: the Two-Dimensional Finite-Difference Wave Propagation in Isotropic Media (ISO2DFD).

5.1 Problem explanation

This application looks at the problem of solving a Partial Differential Equation (PDE) using a finite-difference method. The code sample "implements the solution to the wave equation for a 2D acoustic isotropic medium with constant density" [27]. This solution is implemented using a 2nd order stencil. The wave equation is expressed as follows:

\[
\frac{\partial^2 p}{\partial t^2} = v^2 \left( \frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2} \right)
\]

where p is the pressure, v is the medium velocity, x and y are the two Cartesian coordinates, and t is the time.

This problem can then be approached by discretizing the model into a grid of N × N points, and approximating the continuous partial derivatives using 2nd order central differences [27]. The formula for this discretized approach is:

\[
\frac{p^{n+1}_{i,j} - 2p^{n}_{i,j} + p^{n-1}_{i,j}}{\Delta t^2}
= v^2 \left( \frac{p^{n}_{i+1,j} - 2p^{n}_{i,j} + p^{n}_{i-1,j}}{\Delta x^2}
           + \frac{p^{n}_{i,j+1} - 2p^{n}_{i,j} + p^{n}_{i,j-1}}{\Delta y^2} \right)
\]

which leads to:

\[
p^{n+1}_{i,j} = 2p^{n}_{i,j} - p^{n-1}_{i,j}
+ \Delta t^2 v^2 \left( \frac{p^{n}_{i+1,j} - 2p^{n}_{i,j} + p^{n}_{i-1,j}}{\Delta x^2}
                      + \frac{p^{n}_{i,j+1} - 2p^{n}_{i,j} + p^{n}_{i,j-1}}{\Delta y^2} \right)
\]

where ∆x and ∆y are used for the distance between two cells, n is the current iteration (time-wise) and ∆t is the increment in time (from n = 0).

Figure 5.1: 2D stencil where the updated cell is orange [27]

Each cell in the 2D grid can be independently updated using this approach, because the new value of a cell at time iteration n + 1 does not depend on the value of other cells during the same time iteration (n + 1). The calculation solely uses the values of cells calculated in previous time steps. Figure 5.1 presents an illustration of the stencil computation: the orange cell value in iteration n + 1 is computed using its old value (from iteration n) and those of its four neighbours (from iteration n), colored in green in the figure.


5.2 Setting up the experiments

The sequential code of the application is presented in Listing 5.1.

void iso_2dfd_iteration_cpu(float* next, float* prev, float* vel,
                            const float dtDIVdxy, int nRows, int nCols,
                            int nIterations) {
  float* swap;
  for (unsigned int k = 0; k < nIterations; k += 1) {
    for (unsigned int i = 1; i < nRows - 1; i += 1) {
      for (unsigned int j = 1; j < nCols - 1; j += 1) {
        int gid = j + (i * nCols);
        float value = 0.0;
        value += prev[gid + 1] - 2.0 * prev[gid] + prev[gid - 1];
        value += prev[gid + nCols] - 2.0 * prev[gid] + prev[gid - nCols];
        value *= dtDIVdxy * vel[gid];
        next[gid] = 2.0 * prev[gid] - next[gid] + value;
      }
    }
    swap = next;
    next = prev;
    prev = swap;
  }
}

Listing 5.1: Sequential implementation of the second order stencil, from [27]

The algorithm works as follows: we first calculate gid, which is the global id of the current cell. Note that we do not need to do this in the oneAPI code, because oneAPI uses work-items, which each have their own id. All calculations following the gid calculation are simply the implementation of the final formula from Section 5.1 in code form.

Note that during the calculations, next represents step n − 1 and prev represents step n. It is only when the resulting value is assigned to next that it represents the values from step n + 1. And, finally, we swap the arrays such that, once again, next represents an older iteration and prev represents the current one.

5.2.1 Implementation differences oneAPI and OpenMP

To get a better understanding of how we implement the above problem in parallel, we first study the sequential version from Listing 5.1.

The oneAPI version of the stencil2D application creates one device-queue to which it submits tasks. This time, we have two read-write buffers and one read-only buffer. Specifically, the next and prev buffers are both read and updated every time step, and thus are read-write buffers. The vel buffer is only used to read the medium velocity, and thus it is a read-only buffer.

Because of how oneAPI divides the data into work-groups and work-items, we only need one queue with a parallel_for, even though we have two for-loops that we can parallelize. We do, however, define two parallel_for loops with slightly different kernels. The only difference is that the next and prev buffers are swapped when passing them as parameters. This allows us to swap the buffers without having to copy or write data. Listing 5.2 shows the corresponding code to help visualize this process. The iso_2dfd_iteration_global function executes the two inner for-loops of the sequential code from before, and k is the current time iteration.


if (k % 2 == 0)
  h.parallel_for(global_range, [=](id<2> it) {
    iso_2dfd_iteration_global(it, next.get_pointer(), prev.get_pointer(),
                              vel.get_pointer(), dtDIVdxy, nRows, nCols);
  });
else
  h.parallel_for(global_range, [=](id<2> it) {
    iso_2dfd_iteration_global(it, prev.get_pointer(), next.get_pointer(),
                              vel.get_pointer(), dtDIVdxy, nRows, nCols);
  });
});  // closes the surrounding command-group submission (not shown)

Listing 5.2: The key section of the OneAPI implementation of the second order stencil, from [27]

Just like we did for vector-add, we only add a #pragma omp parallel for to the sequential code to achieve parallelism with OpenMP (see the sketch below). We also still need to calculate the gid for OpenMP, which was not the case for oneAPI.
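A minimal sketch of what this looks like in the loop nest of Listing 5.1; we show the pragma on the outer spatial loop, which is the placement the data dependencies allow (the time loop must stay sequential because iteration n + 1 uses the results of iteration n); the exact placement in our code may differ slightly:

for (unsigned int k = 0; k < nIterations; k += 1) {
  // Cells within one time step are independent, so the spatial loops
  // can be divided over threads.
  #pragma omp parallel for
  for (int i = 1; i < nRows - 1; i += 1) {
    for (int j = 1; j < nCols - 1; j += 1) {
      int gid = j + (i * nCols);
      float value = 0.0f;
      value += prev[gid + 1] - 2.0f * prev[gid] + prev[gid - 1];
      value += prev[gid + nCols] - 2.0f * prev[gid] + prev[gid - nCols];
      value *= dtDIVdxy * vel[gid];
      next[gid] = 2.0f * prev[gid] - next[gid] + value;
    }
  }
  swap = next; next = prev; prev = swap;  // swap the arrays as before
}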

5.2.2 Performance measurement

We already got a general idea of the overhead of oneAPI when testing vector addition. We then decided to leave out the first run of every measurement, and we will continue to do so for this application.

The variables which we can vary for this application are the size of the grid, the number of time iterations, the size of a work-group per dimension for oneAPI, and the number of threads used for OpenMP.

We choose N such that the total execution time per data size does not exceed one hour. This choice implies that N depends on the size of the grid. For example, we use N = 1000 iterations for the smallest grid, and N = 10 for the largest one.

5.3 Performance results and analysis

5.3.1 Experiment 1: OneAPI vs OpenMP

For this application we also started with a comparison between oneAPI and OpenMP where no optimizations were yet applied. We ran the application N times (N ranges between 10 and 1000, depending on the grid size) for each grid size, and record the average execution time. The results for this experiment are presented in Table 5.1, with the numbers in the top row representing the size of one grid dimension (i.e. 100 represents a grid of 100 × 100 cells).

             Execution time [ms]
Grid size    10    100   1000     2000     5000      10000
Sequential   2     197   21,242   73,240   453,513   1,824,534
oneAPI       54    56    446      3,518    25,042    99,025
OpenMP       4     32    3,093    11,956   73,546    293,694

Table 5.1: Execution time (ms) for the original (unoptimized) stencil2D versions.

Figure 5.2 presents the execution time of the parallel versions. The graph shows that OpenMP performs worse than oneAPI before we apply any optimizations, except for very small grid sizes. This is the opposite of what we found for vector-add, where oneAPI performed worse than OpenMP before applying optimizations.


Figure 5.2: Execution time (ms) of the original OpenMP and oneAPI code (also seen in Table 5.1). Please note the logarithmic scale on the vertical axis.

The results also seem to indicate that, for stencil2D, the overhead of the queue submission is less significant than for vector-add, except for very small grid sizes. This happens because the execution time of this application is significantly higher than that of vector-add, and thus the impact of the overhead becomes negligible.

5.3.2 Experiment 2: Closing the gap

Following the first comparison, we can decide where we should start optimizing our application. Because OpenMP performed worse than oneAPI, we attempt to optimize OpenMP. Because the performance of OpenMP was significantly worse, we made an educated guess, based on related work [16], that oneAPI used vectorization, while OpenMP did not.

To fix this difference, we first tried adding the -O2 compiler flag, which performs nearly all supported gcc optimizations that do not involve a space-speed trade-off [28], combined with the -ftree-vectorize flag, which attempts to auto-vectorize all loops. Next to this, we also tried using -O3, which adds a few extra optimization flags, among which -ftree-vectorize [28]. For both runs, we also tried adding the -march=native flag, which enables non-trivial vector instructions. The measured speedup of these versions, calculated relative to the original OpenMP execution time, is presented in Figure 5.3. The results indicate that the best-performing mix of compiler options is -O3 with -march=native.

Using the "winning combination" of compiler flags, we compared the optimized version of OpenMP against the oneAPI version. This comparison can be seen in Table 5.2 and Figure 5.4. We added the unoptimized OpenMP performance to the results to show the difference with the optimized OpenMP performance.

                       Execution time [ms]
Grid size              10    100   1000    2000     5000     10000
oneAPI                 54    56    446     3,518    25,042   99,025
OpenMP (unoptimized)   4     32    3,093   11,956   73,546   293,694
OpenMP (optimized)     3     8     356     3,309    24,698   97,841

Table 5.2: Execution time (ms) of oneAPI compared to the optimized OpenMP.


Figure 5.3: Speedup of OpenMP using compiler flags

Figure 5.4: Execution time (ms) for oneAPI and two OpenMP versions, before and after optimization (also seen in Table 5.2), demonstrating a decrease in the performance gap between the two models. Please note the logarithmic scale on the vertical axis.

The results presented in Figure 5.4 demonstrate that we managed to significantly improve the performance of OpenMP, compared to oneAPI. The optimized OpenMP code performs as well as, and sometimes even slightly better than, the oneAPI code. It thus seems that oneAPI without any optimizations applied by a developer performs as well as OpenMP with optimizations applied by a developer. We specify "by a developer", because we are not sure which optimizations are automatically applied by oneAPI.

We also remark that the performance advantage that OpenMP has over oneAPI is smaller for larger grid sizes. The cases where this advantage is clearly visible are cases where the execution time is rather small. We believe that in these cases the reason for the gap between the models is the same as in the case of vector addition: the overhead of using the queue dominates the oneAPI execution time.


CHAPTER 6

Matrix multiplication

Now that we have learned more about the performance of oneAPI on the CPU, we decided to also look into how well oneAPI offloads to the GPU. The application we selected for this is matrix multiplication, which is also a oneAPI example application. We use this application to offload computations on 2D arrays to the GPU, using both DPC++ and OpenMP. Moreover, for completeness, and to validate our earlier findings, we decided to also run this application on the CPU.

6.1 Problem explanation

This application looks at the relatively simple problem of multiplying two large matrices. Before explaining how this process works, it is important to know when two matrices can and cannot be multiplied.

Two matrices can only be multiplied if their dimensions fulfill a certain condition: the number of columns of the first matrix must be equal to the number of rows of the second matrix. The resulting matrix will have the number of rows of the first matrix and the number of columns of the second matrix. So, if matrix A is an m × n matrix and matrix B is an n × p matrix, then the resulting matrix C will have the dimensions m × p.

To find the result matrix C, we have to take the dot product of each row in A with each column in B. The dot product is computed by multiplying the matching row and column elements, and adding up all these multiplication results. For example, in Equation 6.1, for the first row of A and the first column of B, we get the following calculation: [1, 2, 3] · [7, 9, 11] = (1 × 7) + (2 × 9) + (3 × 11) = 58.

\[
\underbrace{\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}}_{A}
\times
\underbrace{\begin{bmatrix} 7 & 8 \\ 9 & 10 \\ 11 & 12 \end{bmatrix}}_{B}
=
\underbrace{\begin{bmatrix} 58 & 64 \\ 139 & 154 \end{bmatrix}}_{C}
\tag{6.1}
\]


6.2 Setting up the experiments

The sequential code of the application is presented in Listing 6.1.

// Initializing a_host
for (i = 0; i < M; i++)
  for (j = 0; j < N; j++) a_host[i][j] = 1.0;

// Initializing b_host
for (i = 0; i < N; i++)
  for (j = 0; j < P; j++) b_host[i][j] = i + 1.;

// Initializing the result matrix, c_host, to zero
for (i = 0; i < M; i++)
  for (j = 0; j < P; j++) c_host[i][j] = 0;

// Loop in which we perform the matrix multiplication.
for (i = 0; i < M; i++) {
  for (j = 0; j < P; j++) {
    for (k = 0; k < N; k++) {
      c_host[i][j] += a_host[i][k] * b_host[k][j];
    }
  }
}

Listing 6.1: Sequential implementation of matrix multiplication.

In Listing 6.1, M is the number of rows in a_host, P is the number of columns in b_host, and N is the number of columns in a_host and the number of rows in b_host. These are the same variables used for the dimensions as in Section 6.1.

This algorithm works as follows. First we initialize the three necessary matrices. Normally the initialization is random, but for this example, specific choices are made for simplicity. We have the 2D array a_host, which represents a matrix of ones. We then have the 2D array b_host, of which every column is the sequence 1, 2, ..., N. Finally, the 2D array c_host stores the multiplication result. We make sure to initialize every value of c_host to 0.

After initializing the matrices, we start the matrix multiplication. This is achieved by utilizing three nested for-loops, as can be seen in Listing 6.1. Every loop represents either the row of a_host, the column of b_host, or the column of a_host and row of b_host. Using the values of these three loops we can calculate the dot product of the rows of a_host and the columns of b_host, one element at a time.

6.2.1 Implementation differences oneAPI and OpenMP

Implementation using oneAPI

As usual, the oneAPI version of the application starts by creating a device-queue to which it will submit tasks. The difference is that this time we use two different ways of initializing this queue. We create the queue using the cpu_selector, to have a queue which offloads to the CPU, and we use the gpu_selector to create a queue which offloads to the GPU. The process of creating the queue this way can be seen in Listing 6.2. The printTargetInfo function is a custom function which is simply used to check whether we are targeting the right offload device.

gpu_selector device_selector;
queue device_queue(device_selector);
printTargetInfo(device_queue);

Listing 6.2: Code which creates a queue using the gpu_selector


We create read-only buffers for the A and B matrices and a write-only buffer for the C matrix. The three buffers that we create for this application are two-dimensional. The size of each buffer depends on the dimensions of the matrix it is for. In Listing 6.3 an example of the buffer sizes can be found, using the dimensions that were also used in Section 6.1.

buffer<double, 2> a_buffer(a_host, range<2>{M, N});
buffer<double, 2> b_buffer(b_host, range<2>{N, P});
buffer<double, 2> c_buffer(c_host, range<2>{M, P});

Listing 6.3: Code used to create buffers

The oneAPI kernel code of the application is presented in Listing 6.4.

// Get the N variable
int WidthA = a.get_range()[1];

// Executing kernel
cgh.parallel_for(range<2>{M, P}, [=](id<2> index) {
  // Get global position in Y direction
  int row = index[0];
  // Get global position in X direction
  int col = index[1];

  double sum = 0.0;
  // Compute the result of one element in c
  for (int i = 0; i < WidthA; i++) {
    sum += A[row][i] * B[i][col];
  }

  C[index] = sum;
});

Listing 6.4: The oneAPI kernel code for matrix multiplication
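Listing 6.4 only shows the body of the command group. As a sketch of how this fragment could be embedded in a queue submission, the code below obtains A, B, and C as accessors on the buffers from Listing 6.3; the accessor names and access modes are our assumption (read-only for A and B, write-only for C), and the kernel is left unnamed as in the listing.

device_queue.submit([&](handler &cgh) {
    // Assumed accessors: read-only for A and B, write-only for C.
    auto A = a_buffer.get_access<access::mode::read>(cgh);
    auto B = b_buffer.get_access<access::mode::read>(cgh);
    auto C = c_buffer.get_access<access::mode::write>(cgh);

    // Width of A (the N dimension), taken here from the buffer's range.
    int WidthA = a_buffer.get_range()[1];

    cgh.parallel_for(range<2>{M, P}, [=](id<2> index) {
        int row = index[0];
        int col = index[1];
        double sum = 0.0;
        for (int i = 0; i < WidthA; i++) {
            sum += A[row][i] * B[i][col];
        }
        C[index] = sum;
    });
});
device_queue.wait();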

Implementation using OpenMP

For OpenMP, choosing and targeting an offload device works slightly differently. To enable offloading to the GPU, we first add the -qnextgen -fiopenmp -fopenmp-targets=spir64 compiler flags [29]. The -fiopenmp option invokes OpenMP with support for both the standard and the offload OpenMP pragmas. We then set the environment variable LIBOMPTARGET_DEVICETYPE=gpu (via export) to make sure that we are targeting a GPU device.

We then use OpenMP pragmas to select the code to offload to the GPU. We first use target to offload a code region to the device and to create a data environment on the device. We combine this with data map(map-type: variables) to map variables to the device data environment for the extent of the region.

We then create a team of threads using the parallel construct, and we use teams to create a league of teams. The distribute construct is used to distribute a for loop across the master threads of all teams of the current teams region. An example combining these pragmas can be seen in Listing 6.5.


#pragma omp target data map(to: a[0:M][0:N], b[0:N][0:P]) map(tofrom: c[0:M][0:P])
#pragma omp target teams distribute parallel for private(i, j, k)
for (i = 0; i < M; i++) {
    for (j = 0; j < P; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Listing 6.5: OpenMP code using pragmas to offload to a device
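For the CPU measurements discussed below, no offloading is involved. Although this variant is not printed explicitly, a minimal sketch of the multicore OpenMP version could simply parallelize the outer loop (using the same a, b, c, i, j, and k as in Listing 6.5):

// Hypothetical CPU-only OpenMP variant: parallelize the outer loop over rows.
#pragma omp parallel for private(j, k)
for (i = 0; i < M; i++) {
    for (j = 0; j < P; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}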

6.2.2 Performance measurement

The first variables we can change for this application are the dimensions of the matrices. Just as in Section 6.1, matrix A has dimensions m × n, matrix B has dimensions n × p, and the result matrix C has dimensions m × p. For simplicity's sake, we define the matrices such that we have three square matrices of equal size. This means that m = n = p, as can be seen in Listing 6.6. We can vary the dimensions of our matrices by changing the value of SIZE.

#define SIZE 4000
#define M SIZE   // Rows 1st matrix
#define N SIZE   // Columns 1st matrix, rows 2nd matrix
#define P SIZE   // Columns 2nd matrix

Listing 6.6: Matrix dimension definitions

Just like with vector addition and the second order stencil, we can also change the work-group size per dimension for oneAPI, and the number of threads used for the OpenMP CPU version. Apart from this, we can try different optimization flags for the CPU version of the OpenMP code. For the GPU version of the OpenMP code, we need to tweak the pragmas used when offloading the code to a target device.
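To set the work-group size per dimension explicitly in oneAPI, the parallel_for from Listing 6.4 can be launched with an nd_range instead of a plain range. The sketch below assumes a tuning macro WG that does not appear in the original listings, and requires M and P to be divisible by WG:

#define WG 16   // assumed work-group size per dimension (tuning parameter)

cgh.parallel_for(nd_range<2>{range<2>{M, P}, range<2>{WG, WG}},
                 [=](nd_item<2> item) {
    int row = item.get_global_id(0);
    int col = item.get_global_id(1);
    double sum = 0.0;
    for (int i = 0; i < WidthA; i++) {
        sum += A[row][i] * B[i][col];
    }
    C[item.get_global_id()] = sum;
});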

6.3 Performance results and analysis

6.3.1 Experiment 1: OneAPI vs. OpenMP

For this application, we also started with a comparison between oneAPI and OpenMP before applying any optimizations. We ran the application MAX_ITER times (MAX_ITER ranges between 10 and 1000, depending on the data size) for each SIZE, and recorded the average execution time. We ran this experiment for both the CPU and the GPU. The results for the CPU are presented in Table 6.1, and the results for the GPU are presented in Table 6.2.
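The timing follows the same approach as in the earlier experiments: a std::chrono clock around the measured region, averaged over MAX_ITER repetitions. A minimal sketch of such a measurement loop is given below; the wrapper function name is ours, not taken from the original code.

#include <chrono>

void run_matrix_multiplication();   // assumed wrapper around the code being measured

// Time one run of the multiplication and average over max_iter repetitions.
double measure_average_ms(int max_iter) {
    double total_ms = 0.0;
    for (int run = 0; run < max_iter; run++) {
        auto start = std::chrono::steady_clock::now();
        run_matrix_multiplication();
        auto end = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(end - start).count();
    }
    return total_ms / max_iter;
}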

Figure 6.1 and Figure 6.2 present the speed-up of the parallel versions, using the sequential version as reference. The graphs show that, before we apply any optimizations, OpenMP performs worse than oneAPI on both the CPU and the GPU.

              Average execution time [ms]
SIZE          100    250    500    1k      2k       4k        8k
Sequential    2.1    31.0   237    1,909   16,058   135,683   1,109,614
oneAPI        0.9    2.4    14     132     1,160    11,573    164,856
OpenMP        1.6    8.0    71     660     7,119    64,146    632,356

Table 6.1: Execution time (ms) for the original (unoptimized) CPU versions.


           Average execution time [ms]
SIZE       100    250    500    1k      2k       4k        8k
oneAPI     1.6    4      19     155     1,267    10,171    90,517
OpenMP     1.8    5      30     210     2,391    21,346    167,424

Table 6.2: Execution time (ms) for the original (unoptimized) GPU versions.

Figure 6.1: Speedup of the original OpenMP and oneAPI CPU code compared to the sequential code.

Figure 6.2: Speedup of the original OpenMP and oneAPI GPU code compared to the sequential code.

From Figure 6.1 and Figure 6.2, we conclude that OpenMP performs worse than oneAPI on both the CPU and the GPU before we apply any optimizations. This matches our findings on the CPU in Chapter 5, where OpenMP also performed worse before we applied any optimizations. We also note that the performance difference between oneAPI and OpenMP is more significant on the CPU than on the GPU.


6.3.2 Experiment 2: Closing the gap (CPU)

Following the first comparison, we can decide where to start optimizing our application. We first look at the performance on the CPU. Because OpenMP performed worse than oneAPI, we attempt to optimize OpenMP. Just like we did for the second order stencil in Chapter 5, we made an educated guess, based on related work [16], that oneAPI used vectorization while OpenMP did not. We therefore added the -O3 -march=native compiler flags to "force" vectorization.

We then compared the version of OpenMP using optimization flags against the oneAPI version and the default OpenMP version. This comparison can be seen in Table 6.3.

                         Average execution time [ms]
SIZE                     100    250    500    1k     2k      4k       8k
oneAPI                   0.9    2      14     132    1,160   11,573   164,856
OpenMP - No flags        1.6    8      71     660    7,119   64,146   632,356
OpenMP - Using flags     0.3    4      13     464    5,063   48,369   578,980

Table 6.3: Execution time (ms) of oneAPI compared to OpenMP, with and without compiler flags.

The results presented in Table 6.3 demonstrate that we managed to improve the performance using compiler flags, compared to the original OpenMP version. However, the gap with oneAPI remains significant. We thus continue optimizing the OpenMP version, so that we can get closer to oneAPI's performance.

Next, we tried switching the two inner for-loops of the matrix multiplication. We now have the loop order shown in Listing 6.7, instead of the order in Listing 6.5.

for (i = 0; i < M; i++) {
    for (k = 0; k < N; k++) {
        for (j = 0; j < P; j++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Listing 6.7: Matrix multiplication code where the two inner loops have been switched

We measure the execution time of the new code, both with and without optimization flags, and compare it against the default OpenMP version and the OpenMP version using only optimization flags. Table 6.4 and Figure 6.3 compare the execution times, and Figure 6.4 compares the speedup relative to the default OpenMP code.

                              Average execution time [ms]
SIZE                          100    250    500    1k     2k      4k       8k
OpenMP - Default              1.6    8      71     660    7,119   64,146   632,356
OpenMP - Using flags          0.3    4      13     464    5,063   48,369   578,980
OpenMP - Switched loops       1.5    8      55     455    3,608   28,972   231,483
OpenMP - Switched + flags     0.2    2      8      47     508     4,119    37,934

Table 6.4: Execution time (ms) of the four OpenMP versions, before and after applying the optimizations.


Figure 6.3: Execution time (ms) for the four OpenMP versions, before and after applying optimization methods (also seen in Table 6.4). Please note the logarithmic scale on the vertical axis.

Figure 6.4: Speedup of three OpenMP versions compared to the default OpenMP code.

Looking at the results in Figure 6.3 and Figure 6.4, we first of all notice that switching the loops led to an increase in performance. We attribute this to improved locality of reference in the computer's memory hierarchy. We also notice that the optimization flags have a much bigger impact after switching the loops. The reason is that there is no longer a dependency chain [30], meaning that the loop can be vectorized using only -O3.

We conclude that the best optimization is the combination of the compiler flags and the loop switch. To see whether this closed the gap with oneAPI, we compare the performance of our "winning combination" of optimization methods with the performance of oneAPI, as presented in Figure 6.5.


Figure 6.5: Execution time (ms) for oneAPI and two OpenMP versions, before and after optimization. Please note the logarithmic scale on the vertical axis.

The results presented in Figure 6.5 demonstrate that we managed to significantly improve OpenMP's performance. Not only did we close the performance gap, but OpenMP now even performs better for every value of SIZE. We tried changing the loop order for oneAPI as well, but we did not manage to do so due to how the oneAPI code is set up. We conclude that for this application oneAPI performs better than OpenMP before we optimize the code, but OpenMP performs better after optimization.

6.3.3 Experiment 3: Closing the gap (GPU)

Because OpenMP also performed worse than oneAPI on the GPU, we again attempt to optimize OpenMP. This time, however, we cannot use our compiler flags, such as -O3, because these flags only affect the CPU part of the code. In our previous experiment we saw that switching the two inner loops improved the performance on the CPU, and we thus decided to try this for the GPU as well.

We compared the version of OpenMP with the loop switch against the oneAPI version and the default OpenMP version. This comparison can be seen in Table 6.5 and Figure 6.6.

                            Average execution time [ms]
SIZE                        100    250    500    1k     2k      4k       8k
oneAPI                      1.6    4      19     155    1,267   10,171   90,517
OpenMP - Default            1.8    5      30     210    2,391   21,346   167,424
OpenMP - Switched loops     2      9      60     496    4,168   34,212   265,292

Table 6.5: Execution time (ms) of oneAPI compared to OpenMP, with and without switched loops.


Figure 6.6: Execution time (ms) for oneAPI and two OpenMP versions, before and after switching the loops (also seen in Table 6.5). Please note the logarithmic scale on the vertical axis.

The results presented in Figure 6.6 demonstrate that switching the loops actually made the performance worse, instead of better as it did on the CPU. We believe the reason is that the GPU requires a different access pattern to achieve good memory coalescing. When we reorder the loops, the memory coalescing gets worse, which worsens the performance.

As GPUs do not use this kind of vectorization, and there are no specific OpenMP pragmas that can be used to automatically improve the performance of offloaded code, we could not add any other optimizations to our OpenMP GPU code. We consider the only remaining option for improving the GPU OpenMP performance to be manually tweaking the parallelism by further parameterizing the teams pragma, as sketched below. This tedious operation, however, is left for future work.
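Such manual tuning would amount to adding clauses like num_teams and thread_limit to the offload pragma. These clauses are part of OpenMP, but the values below are placeholders rather than tuned results; we did not evaluate this variant.

// Hypothetical future-work variant: explicitly sizing the league of teams.
#pragma omp target teams distribute parallel for num_teams(64) thread_limit(128) private(i, j, k)
for (i = 0; i < M; i++) {
    for (j = 0; j < P; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}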


CHAPTER 7

Conclusion and Future Work

In this research, we performed an empirical performance analysis of Intel's new programming model, oneAPI. Specifically, our goal was to answer the following research question: "Does simplifying programming using oneAPI come with any adverse effects on application performance?"

We answered this research question by studying the behavior of three¹ applications that implement parallel algorithms, for both oneAPI and OpenMP. For each application, we applied a step-by-step analysis through which we compared different versions using both models, aiming to quantify (and close) any performance gap.

The applications we studied are vector addition, a second order stencil, and matrix multiplication. For the first of these three applications, we observed a large initial performance gap, and we were able to determine its cause. For the second application, we also observed a performance gap, though in the other direction, and we managed to close this gap completely. For our third application, we observed a performance gap on both the CPU and the GPU. We did not manage to reduce this gap on the GPU, but we did manage to reverse it on the CPU, meaning that OpenMP actually performed better than oneAPI on the CPU.

The first lesson learned about oneAPI's performance is that there is a significant overhead when working with the command queue. This overhead is much worse for the first run of an application than for its subsequent runs, but even in the subsequent runs the overhead can sometimes be large enough to be noticeable. Whether the overhead is significant depends on the complexity of the application and the size of the data. For example, for vector addition the overhead is always significant, and its size depends strongly on the data size. For the second order stencil and matrix multiplication, however, the overhead is only significant for very small data sizes, as it does not appear to grow faster than the execution time of the application itself.

The second lesson learned about oneAPI's performance is, as one may have expected, that the work-group size has more effect on the performance for larger data sizes. We also noticed that the oneAPI runtime is quite good at choosing a suitable work-group size for an application: the work-group size chosen by the runtime performed as well as the best work-group size we found manually.

The third and final lesson we learned about oneAPI's performance is that oneAPI can perform better than OpenMP without adding any optimizations to either implementation. The reason for this difference in performance seemed to come mostly from the fact that oneAPI vectorized its loops, while OpenMP did not. To try and close this gap, we forced OpenMP to also vectorize its loops using gcc's -O3 compiler flag. We achieved a significant improvement of the performance of OpenMP, compared to oneAPI, and managed to completely eliminate the performance gap on the CPU. On the GPU we did not manage to vectorize the loops, and oneAPI thus still performs better when offloading to the GPU. From the researched applications, we conclude that there are applications for which an unoptimized oneAPI version performs better than OpenMP.

¹ The third application is work-in-progress.

In short, the answer to our research question can be summarized as follows: simplifying programming using oneAPI can come with adverse effects on application performance, but it can also improve application performance. The biggest adverse effects we found are caused by the overhead of using oneAPI, and the biggest improvements in performance were due to oneAPI's superior code optimizations.

7.1 Future work

There are still some points of interest left uncovered.

We first suggest further investigation into the command queue. It is currently not clear to us why the queue submission has such a significant overhead, or whether there is a way to prevent this overhead from appearing. We could try to solve this with more testing, especially on different machines. There is a possibility that the overhead depends on the device used.

Another interesting research direction is to understand which oneAPI optimizations are automatically applied for our applications, and under which conditions these optimizations are triggered. Once we know which optimizations are applied to a oneAPI application, we can make a better comparison with other native models.

Finally, we believe a comprehensive study needs to be conducted to determine oneAPI's advantages and disadvantages. Such a study must include more applications, programming models, and hardware platforms. A good goal could be to cover all of Berkeley's 13 dwarfs.


Bibliography

[1] oneAPI programming guide. Available at: https://software.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html.

[2] Admin. Intel® oneAPI: unified X-architecture programming model, Apr 2020. Available at: https://software.intel.com/en-us/oneapi.

[3] Blaise Barney. POSIX threads programming. Lawrence Livermore National Laboratory, 2009. Available at: https://computing.llnl.gov/tutorials/pthreads/.

[4] Intel® Threading Building Blocks. Available at: https://software.intel.com/content/www/us/en/develop/tools/threading-building-blocks.html.

[5] OpenMP (Open Multi-Processing), Apr 2020. Available at: https://www.openmp.org/.

[6] OpenACC (for open accelerators). Available at: https://www.openacc.org/.

[7] SYCL - C++ single-source heterogeneous programming for acceleration offload, Jan 2014. Available at: https://www.khronos.org/sycl/.

[8] Platform model. Available at: https://software.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-programming-model/platform-model.html.

[9] Execution model. Available at: https://software.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-programming-model/execution-model.html.

[10] Kernel C++ language requirements. Available at: https://software.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-programming-model/kernel-programming-model/c-language-requirements.html.

[11] Krste Asanovic, Ras Bodik, Bryan Catanzaro, Joseph Gebis, Parry Husbands, Kurt Keutzer, David Patterson, William Plishker, John Shalf, Samuel Williams, and Katherine Yelick. The landscape of parallel computing research: a view from Berkeley, December 2006.

[12] Wu-chun Feng, Heshan Lin, Thomas Scogland, and Jing Zhang. OpenCL and the 13 dwarfs: a work in progress. In Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, ICPE '12, pages 291–294, New York, NY, USA, 2012. Association for Computing Machinery. Available at: https://doi.org/10.1145/2188286.2188341.

[13] Jianbin Fang, Ana Lucia Varbanescu, and Henk J. Sips. A comprehensive performance comparison of CUDA and OpenCL. In 2011 International Conference on Parallel Processing, pages 216–225, 2011.

[14] Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, pages 63–74, New York, NY, USA, 2010. Association for Computing Machinery. Available at: https://doi.org/10.1145/1735688.1735702.


[15] A software scheduling solution to avoid corrupted units on GPUs. J. Parallel Distrib. Comput., 90(C):1–8, April 2016. Available at: https://doi.org/10.1016/j.jpdc.2016.01.001.

[16] J. Shen, J. Fang, H. Sips, and A. L. Varbanescu. Performance gaps between OpenMP and OpenCL for multi-core CPUs. In 2012 41st International Conference on Parallel Processing Workshops, pages 116–125, 2012.

[17] Jarno van der Sanden. Evaluating the performance and portability of OpenCL. Master's thesis, TU Eindhoven, 2011.

[18] Matthew Martineau, Simon McIntosh-Smith, and Wayne Gaudin. Assessing the performance portability of modern parallel programming models using TeaLeaf. Concurrency and Computation: Practice and Experience, 29(15):e4117, 2017.

[19] Intel. intel/basekit-code-samples, May 2020. Available at: https://github.com/intel/BaseKit-code-samples/tree/master/DPC++Compiler.

[20] Intel. intel/hpckit-code-samples, Apr 2020. Available at: https://github.com/intel/HPCKit-code-samples.

[21] OpenMP* loop scheduling. Available at: https://software.intel.com/content/www/us/en/develop/articles/openmp-loop-scheduling.html.

[22] OpenMP application programming interface. Available at: https://www.openmp.org/wp-content/uploads/openmp-examples-4.5.0.pdf.

[23] Date and time utilities. Available at: https://en.cppreference.com/w/cpp/chrono.

[24] std::chrono::system_clock. Available at: https://en.cppreference.com/w/cpp/chrono/system_clock.

[25] std::chrono::steady_clock. Available at: https://en.cppreference.com/w/cpp/chrono/steady_clock.

[26] std::chrono::high_resolution_clock. Available at: https://en.cppreference.com/w/cpp/chrono/high_resolution_clock.

[27] Code sample: two-dimensional finite-difference wave propagation in isotropic media (ISO2DFD). Available at: https://software.intel.com/content/www/us/en/develop/articles/code-sample-two-dimensional-finite-difference-wave-propagation-in-isotropic-media-iso2dfd.html.

[28] GCC optimization flags. Available at: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.

[29] OpenMP 5.0 target with Intel compilers. Available at: https://software.intel.com/content/www/us/en/develop/articles/openmp-50-target-with-intel-compilers.html.

[30] Agner Fog. Optimizing software in C++: an optimization guide for Windows, Linux and Mac platforms, 2004.
