Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
TRANSCRIPT
![Page 1: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/1.jpg)
Making the most out of Heterogeneous Chips with CPU,
GPU and FPGA
Rafael Asenjo, Dept. of Computer Architecture,
University of Malaga, Spain.
![Page 2: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/2.jpg)
2
![Page 3: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/3.jpg)
Xilinx Zynq UltraScale+ MPSoC
3
4x ARM Cortex-A53 CPUs
ARM Mali-400 GPU
2x ARM Cortex-R5 CPUs
FPGA
![Page 4: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/4.jpg)
Agenda
• Motivation
• Hardware: iGPUs and iFPGAs
• Software
  – Programming models for chips with iGPUs and iFPGAs
• Our heterogeneous templates based on TBB
  – parallel_for
  – parallel pipeline
4
![Page 5: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/5.jpg)
Motivation
• A new mantra: Power and Energy saving
• In all domains
5
![Page 6: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/6.jpg)
Motivation
• GPUs came to the rescue:
  – Massive data-parallel code at a low price in terms of power
  – Supercomputers and servers: NVIDIA
• GREEN500 Top 5 (June 2017)
• Top500.org (Nov 2017):
  – 87 systems w. NVIDIA
  – 12 systems w. Xeon Phi
  – 5 systems w. PEZY
6
(Green500 June 2017 top five: all Xeon + Tesla P100 systems)
![Page 7: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/7.jpg)
7
Green500 Nov 2017: the top four systems are Xeon + PEZY.
![Page 8: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/8.jpg)
Motivation
• There is (parallel) life beyond supercomputers:
8
![Page 9: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/9.jpg)
Motivation
• Plenty of GPUs elsewhere:
  – Integrated GPUs on more than 90% of shipped processors
9
![Page 10: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/10.jpg)
Motivation
• Plenty of GPUs on desktops and laptops: – Desktops (35 – 130 W) and laptops (15 – 60 W):
10
Intel Kaby Lake AMD APU Kaveri
http://www.techspot.com/photos/article/770-amd-a8-7600-kaveri/
http://www.anandtech.com/show/10610/intel-announces-7th-gen-kaby-lake-14nm-plus-six-notebook-skus-desktop-coming-in-january/2
GPU
![Page 11: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/11.jpg)
Motivation
• Plenty of integrated GPUs in mobile devices too.
11
Huawei HiSilicon Kirin 960 (2 - 6 W)
http://www.anandtech.com/show/10766/huawei-announces-hisilicon-kirin-960-a73-g71
Huawei Mate 9
Mali
![Page 12: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/12.jpg)
Motivation
• And user-programmable DSPs as well:
12
Qualcomm Snapdragon 820/821 (2 - 6 W)
https://www.qualcomm.com/products/snapdragon/processors/820
Samsung Galaxy S7 Google Pixel
![Page 13: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/13.jpg)
Motivation
• And FPGAs too…
  – ARM + iFPGA
    • Altera Cyclone V
    • Xilinx Zynq-7000
  – Intel + Altera's FPGA
    • Heterogeneous Architecture Research Platform
  – CPU + iGPU + iFPGA
    • Xilinx Zynq UltraScale+: 4x Cortex-A53, 2x Cortex-R5, Mali-400 GPU, FPGA
13 http://www.analyticsengines.com/developer-blog/xilinx-announce-new-zynq-architecture/
![Page 14: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/14.jpg)
Motivation
• Overwhelming heterogeneity:
14
![Page 15: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/15.jpg)
Motivation
• Plenty of room for improvement
  – Want to make the most out of the CPU, the GPU and the FPGA
  – Taking productivity into account:
    Productivity = Performance − Pain
    • High-level abstractions in order to hide HW details
    • Lack of high-level productive programming models
    • "Homogeneous programming of heterogeneous architectures"
  – Taking energy consumption into account:
    Performance → Performance / Watt
    Performance → Performance / Joule
    Productivity = Performance / Joule − Pain
15
![Page 16: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/16.jpg)
Agenda
• Motivation
• Hardware: iGPUs and iFPGAs
• Software
  – Programming models for chips with iGPUs and iFPGAs
• Our heterogeneous templates based on TBB
  – parallel_for
  – parallel pipeline
16
![Page 17: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/17.jpg)
Hardware: iGPUs
17
Intel Skylake: Iris Pro Graphics
AMD Kaveri: Graphics Core Next
HiSilicon Kirin 960: ARM Mali-G71
![Page 18: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/18.jpg)
Intel Skylake – Kaby Lake
18
• Modular design
  – 2 or 4 cores
  – GPU:
    • GT1: 12 EU
    • GT2: 24 EU
    • GT3: 48 EU
    • GT4: 72 EU
• SKU options
  – 2+2 (4.5 W, 15 W)
  – 4+2 (35 W – 91 W)
  – 2+3 (15 W, 28 W, ...)
  – 4+4 (65 W)
  – ...
http://www.anandtech.com/show/6355/intels-haswell-architecture
![Page 19: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/19.jpg)
Intel Skylake – Kaby Lake
19
https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf
![Page 20: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/20.jpg)
Intel Processor Graphics
20
GT2GT3
GT4
![Page 21: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/21.jpg)
Intel Processor Graphics
21
• Peak performance:
  – 72 EUs
  – 2 x 4-wide SIMD units per EU
  – 2 flops/cycle (fmadd)
  – 1 GHz
  – 7 in-flight wavefronts per EU (for latency hiding)
  → 1.15 TFLOPS
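The 1.15 TFLOPS figure is just the product of the numbers above; a small helper (the function name is ours, for illustration) reproduces the arithmetic:

```cpp
// Peak GFLOPS = units x SIMD lanes per unit x flops per cycle x GHz.
// Gen9 GT4: 72 EUs, each with 2 x 4-wide SIMD units (8 lanes),
// 2 flops/cycle via fused multiply-add, at 1 GHz.
constexpr double peak_gflops(int units, int lanes_per_unit,
                             int flops_per_cycle, double ghz) {
    return units * lanes_per_unit * flops_per_cycle * ghz;
}

constexpr double gen9_gt4 = peak_gflops(72, 2 * 4, 2, 1.0); // 1152 GFLOPS ~ 1.15 TFLOPS
```

The same product form gives the AMD GCN and Mali-G71 peak numbers on the later slides.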
![Page 22: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/22.jpg)
Intel Processor Graphics
23
• Terminology mapping:
  – NDRange ≈ grid
  – work-group ≈ block
  – EU-thread (SIMD16) ≈ wavefront ≈ warp: 16 work-items ≈ 16 threads
• Each SIMD16 wavefront takes 2 clock cycles
• IMT: fine-grained multi-threading
  – Latency-hiding mechanism
![Page 23: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/23.jpg)
Intel Virt. Technology for Directed I/O (Intel VT-d)
24
Cache coherent path
GPU
CPU
![Page 24: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/24.jpg)
AMD Kaveri
• Steamroller microarch (2 – 4 “Cores”) + 8 GCN Cores.
25http://wccftech.com/
CU CU CU CU
CU CU CU CU
![Page 25: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/25.jpg)
AMD Graphics Core Next (GCN)
• In Kaveri:
  – 8 Compute Units (CU)
  – Each CU: 4 x 16-wide SIMD
  – Total: 512 FPUs
  – 866 MHz
• Max GFLOPS = 0.86 GHz x 512 FPUs x 2 (fmad) = 880 GFLOPS
26
![Page 26: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/26.jpg)
OpenCL execution on GCN
Work-group → wavefronts (64 work-items) → pools
27
(Diagram: a work-group's wavefronts dispatched round-robin over SIMD0–SIMD3 of CU0, labelled SIMD0-0, SIMD1-0, ..., by pool number.)
4 pools: 4 wavefronts in flight per SIMD
4 clock cycles to execute each wavefront
![Page 27: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/27.jpg)
HiSilicon Kirin 960
• ARM64 big.LITTLE (Cortex A73+Cortex A53) + Mali G71
28
Mali
![Page 28: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/28.jpg)
Mali G71
• Supports OpenCL 2.0
• 4, 6, 8, 16 or 32 cores
• Each core:
  – 3 x 4-wide SIMD units
• 32 cores x 12 lanes x 2 fmad x 0.85 GHz = 652 GFLOPS
29
![Page 29: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/29.jpg)
HSA (Heterogeneous System Architecture)
• HSA Foundation's goal: productivity on heterogeneous HW
  – CPU, GPU, DSPs, ...
• Founders:
30
![Page 30: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/30.jpg)
Advantages of iGPUs
• Discrete and integrated GPUs have different goals:
  – NVIDIA Pascal: 3584 CUDA cores, 250 W, 10 TFLOPS
  – Intel Iris Pro Graphics 580: 72 EU x 8 SIMD, ~15 W, 1.15 TFLOPS
  – Mali-G71: 32 cores x 3 x 4-wide SIMD, ~2 W, 0.652 TFLOPS
• CPU and GPU are both first-class citizens.
  – SVM and platform atomics allow for closer collaboration
• Avoids PCI data transfers and the associated power dissipation
  – The Operating System doesn't get in the way → less overhead
• CPU and GPU may have similar performance
  – It's more likely that they can collaborate
• Cheaper!
31
![Page 31: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/31.jpg)
Hardware: iFPGAs
32
Altera Cyclone V: FPGA + 2x A9
Xilinx Zynq UltraScale+: FPGA + 4x A53 + 2x R5 + GPU
Intel Xeon+Arria10: Xeon + FPGA connected via QPI
![Page 32: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/32.jpg)
FPGA Architecture
33
Courtesy: Javier Navaridas, U. Manchester.
![Page 33: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/33.jpg)
CLB: Configurable Logic Block
34
(Diagram: a CLB with a switch matrix, fast connects, and four slices, X0Y0–X1Y1; a left-hand SLICEM and a right-hand SLICEL, each built from LUTs and flip-flops, chained via CIN/COUT and SHIFTIN/SHIFTOUT.)
Courtesy: Jose Luis Nunez-Yanez, U. Bristol.
![Page 34: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/34.jpg)
FPGA Architecture
35
![Page 35: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/35.jpg)
How do we “mold this liquid silicon”?
• Hardware overlays
  – Map a CPU or GPU onto the FPGA
✓ 100s of simple CPUs can fit on large FPGAs
✓ The CPU or GPU overlay can change/adapt
✗ The resulting overlay is not as efficient as a "real" one
✗ Same drawbacks as "general-purpose" processors
36
CPU: GPU:
FPGAs for Software Programmers, Dirk Koch, Daniel Ziener
![Page 36: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/36.jpg)
How do we “mold this liquid silicon”?
✓ No FETCH nor DECODE of instructions → already hardwired
✓ Data movement (exec. units ⇔ MEM) reduction → power savings
✓ Not constrained by a fixed ISA:
  ✓ Bit-level parallelism; shifts, bit masks, ...; variable-precision arithmetic
✗ Lower frequency (hundreds of MHz)
✗ In-order execution (proposal from David Kaeli's group to go OoO)
✗ Programming effort
37
FPGA:
Hardware thread reordering to boost OpenCL throughput on FPGAs, ICCD'16
FPGAs for Software Programmers, Dirk Koch, Daniel Ziener
![Page 37: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/37.jpg)
Xilinx Zynq UltraScale+ MPSoC
• Available:
  – SW coherence (cache flush) → slow
  – HW coherence (MESI coherence protocol) → fast. Ports:
    • Accelerator Coherency Port (ACP)
    • Cache-coherent interconnect (CCI) ACE-Lite ports
      – Part of the ARM® AMBA® AXI and ACE protocol specification
38
(4x A53 + GPU + 2x R5 + FPGA)
![Page 38: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/38.jpg)
Intel Broadwell+Arria10
• Cache Coherent Interconnect (CCI)– New IP inside the FPGA: Intel Quick Path Interconnect with CCI– AFU (Accelerated Function Unit) can access cached data
39
![Page 39: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/39.jpg)
Towards CCI everywhere
• CCIX consortium (http://www.ccixconsortium.com)– New chip-to-chip interconnect operating at 25Gbps– Protocol built for FPGAs, GPUs, network/storage adapters, ...– Already in Xilinx Virtex UltraScale+
40
![Page 40: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/40.jpg)
Which architecture suits me best?
41
• All of them are Turing complete, but
• Depending on the problem:
  – CPU:
    • For control-dominant problems
    • Frequent context switching
  – GPU:
    • Data-parallel problems (vectorized/SIMD)
    • Little control flow and little synchronization
  – FPGA:
    • Stream-processing problems
    • Large data sets processed in a pipelined fashion
    • Low-power constraints
![Page 41: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/41.jpg)
Agenda
• Motivation
• Hardware: iGPUs and iFPGAs
• Software
  – Programming models for chips with iGPUs and iFPGAs
• Our heterogeneous templates based on TBB
  – parallel_for
  – parallel pipeline
42
![Page 42: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/42.jpg)
http://imgs.xkcd.com/comics/standards.png
43
HOW STANDARDS PROLIFERATE (see: A/C chargers, character encodings, instant messaging, etc.)
![Page 43: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/43.jpg)
Programming models for heterogeneous: GPUs
• Targeted at exploiting one device at a time
  – CUDA (NVIDIA)
  – OpenCL (Khronos Group standard)
  – OpenACC (C, C++ or Fortran + directives → OpenMP 4.0)
  – C++AMP (Windows' extension of C++; recently HSA announced its own version)
  – SYCL (Khronos Group's specification for single-source C++ programming)
  – ROCm (Radeon Open Compute: HCC, HIP, for AMD GCN –HSA–)
  – RenderScript (Google's Java API for Android)
  – ParallDroid (Java + directives, from ULL, Spain)
  – Many more (Numba Python, IBM Java, Matlab, R, JavaScript, …)
• Targeted at exploiting both devices simultaneously (discrete GPUs)
  – Qilin (C++ and Qilin API compiled to TBB+CUDA)
  – OmpSs (OpenMP-like directives + Nanos++ runtime + Mercurium compiler)
  – XKaapi
  – StarPU
• Targeted at exploiting both devices simultaneously (integrated GPUs)
  – Qualcomm Symphony
  – Intel Concord
44
![Page 44: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/44.jpg)
Programming models for heterogeneous: GPUs
Targeted at exploiting ONE device at a time:
45
  parallel_for → CPU or GPU
  CUDA, OpenCL, OpenMP 4.0, C++AMP, SYCL, ROCm, RenderScript, ParallDroid, Numba Python, JavaScript, Matlab, R

Targeted at exploiting SEVERAL devices at the same time:
  parallel_for + compiler + runtime → CPU and GPU
  Qilin, OmpSs, XKaapi, StarPU, Qualcomm Symphony, Concord
![Page 45: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/45.jpg)
Programming models for heterogeneous: FPGAs
• HDL-like languages– SystemVerilog, Bluespec System Verilog– Extend RTL languages by adding new features
• Parallel libraries– CUDA, OpenCL– Instrument existing parallel libraries to generate RTL code
• C-based languages– SDSoC, LegUp, ROCCC, Impulse C, Vivado, Calypto Catapult C– Compile* C into intermediate representation and from there to HDL– Functions become Finite State Machines and variables mem. blocks
• Higher level languages– Kiwi (C#), Lime (Java), Chisel (Scala)– Translate* input language into HDL
46
* Normally a subset thereof
Courtesy: Javier Navaridas, U. Manchester.
![Page 46: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/46.jpg)
OpenCL 2.0
• OpenCL 2.0 supports:– Coherent SVM (Shared Virtual Memory) & Platform atomics– Example: allocating an array of atomics:
47
CPU Host Code:

  clContext = ...; clCommandQueue = ...; clKernel = ...;
  cl_svm_mem_flags flags = CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS;
  atomic_int *data;
  data = (atomic_int *) clSVMAlloc(clContext, flags, NUM * sizeof(atomic_int), 0);
  clSetKernelArgSVMPointer(clKernel, 0, data);
  clSetKernelArg(clKernel, 1, sizeof(int), &NUM);
  clEnqueueNDRangeKernel(clCommandQueue, clKernel, ...);
  atomic_store_explicit((atomic_int*)&data[0], 99, memory_order_release);

GPU Kernel Code:

  kernel void myKernel(global int *data, global int *NUM)
  {
    int i = atomic_load_explicit((global atomic_int *)&data[0], memory_order_acquire);
    ...
  }
![Page 47: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/47.jpg)
C++AMP
• C++ Accelerated Massive Parallelism• Example: Vector Add (C[i]=A[i]+B[i])
48
(Code screenshot: declare the A, B and C buffers, initialize A and B, and run the addition on the GPU.)
![Page 48: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/48.jpg)
SYCL Parallel STL
• Parallel STL aimed at the C++17 standard:
  – std::sort(vec.begin(), vec.end());        // Vector sort
  – std::sort(seq, vec.begin(), vec.end());   // Explicit SEQUENTIAL
  – std::sort(par, vec.begin(), vec.end());   // Explicit PARALLEL
• SYCL exposes the execution policy (user-defined)
50
  std::vector<int> v(N);
  float ratio = 0.5;
  sycl::het_sycl_policy<class ForEach> policy(ratio);
  std::for_each(policy, v.begin(), v.end(),
                [=](int& val) { val = val * val; });

Courtesy: Antonio Vilches
50% of the iterations on the CPU, 50% of the iterations on the GPU
https://github.com/KhronosGroup/SyclParallelSTL
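The `het_sycl_policy` above is a user-defined policy from the talk, not part of the upstream SyclParallelSTL; at its core such a policy must split the iteration space by the given ratio. A minimal sketch in plain C++, with names of our own choosing:

```cpp
#include <cstddef>
#include <utility>

// Sketch: split [0, n) into a GPU share and a CPU share according to `ratio`,
// the fraction of iterations offloaded to the GPU. Illustrative only; the
// real policy also dispatches each half to the corresponding device.
std::pair<std::size_t, std::size_t> split_range(std::size_t n, float ratio) {
    std::size_t gpu_iters = static_cast<std::size_t>(n * ratio);
    return {gpu_iters, n - gpu_iters}; // {GPU share, CPU share}
}
```

With ratio = 0.5 and N = 1000 this yields 500 iterations for each device, matching the slide's 50/50 split.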
![Page 49: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/49.jpg)
Qualcomm Symphony
• Point Kernel: write once, run everywhere– Pattern tuning extends to heterogeneous load hints
51
  SYMPHONY_POINT_KERNEL_3(vadd, int, i, float*, a, int, na,
                          float*, b, int, nb, float*, c, int, nc,
                          { c[i] = a[i] + b[i]; });
  ...
  symphony::buffer_ptr buf_a(1024), buf_b(1024), buf_c(1024);
  ...
  symphony::range<1> r(1024);
  symphony::pfor_each(r, vadd_pk, buf_a, buf_b, buf_c,
                      symphony::pattern::tuner().set_cpu_load(10)
                                                .set_gpu_load(50)
                                                .set_dsp_load(40));

Executed across CPU (10%) + GPU (50%) + DSP (40%). A kernel for each device is automatically generated.
![Page 50: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/50.jpg)
Intel Concord
• C++ heterogeneous programming framework
• Papers:
  – Rashid Kaleem et al. Adaptive heterogeneous scheduling on integrated GPUs. PACT 2014.
    • Intel Heterogeneous Research Compiler (iHRC) at https://github.com/IntelLabs/iHRC/
  – Naila Farooqui et al. Affinity-aware work-stealing for integrated CPU-GPU processors. PPoPP '16. PhD dissertation: https://smartech.gatech.edu/bitstream/handle/1853/54915/FAROOQUI-DISSERTATION-2016.pdf
52
  class Foo {
    int *a, *b, *c;
    void operator()(int i) { c[i] = a[i] + b[i]; }
  };

  Foo *f = new Foo();
  parallel_for(0, N, *f);   // Executed across CPU+GPU, dynamic partition
![Page 51: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/51.jpg)
Xilinx SDSoC
• Without SDSoC:
53
CPU code:

  main() {
    init(A, B, C);
    mmult(A, B, D);
    madd(C, D, E);
    consume(E);
  }

Courtesy: Xilinx
(Diagram: mmult and madd run on the FPGA, exchanging A, B, C, D and E with the CPU.)
![Page 52: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/52.jpg)
Xilinx SDSoC
• With SDSoC:– Eclipse-based SDK
54
FPGA
CPU
Refine Code:
• Re-factoring code to be more HLS-friendly
• Adding code annotations (i.e., pragmas)
  #pragma SDS data copy(Array)
  #pragma SDS zero_copy(Array)
  #pragma SDS data access_pattern
  #pragma SDS data sys_port(...)
  #pragma HLS PIPELINE II=1
  #pragma HLS unroll
  #pragma HLS inline
![Page 53: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/53.jpg)
Altera OpenCL
55
Host Code:

  main() {
    read_data( ... );
    manipulate( ... );
    clEnqueueWriteBuffer( ... );
    clEnqueueNDRange(..., sum, ...);
    clEnqueueReadBuffer( ... );
    display_result( ... );
  }

Kernel Code:

  __kernel void
  sum(__global float *a,
      __global float *b,
      __global float *y)
  {
    int i = get_global_id(0);
    y[i] = a[i] + b[i];
  }
Standard C compiler → .EXE / .OUT → runs on the CPU
OpenCL aoc compiler → Verilog → .AOCX bitstream → configures the FPGA
![Page 54: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/54.jpg)
Altera OpenCL optimizations
• Locality:– Shift-Register Pattern (SRP)– Local Memory (LM)– Constant Memory (CM)– Load-Store Units Buffers (LSU)
• Loop Unrolling (LU)– Deeper pipelines– Optimized tree-like reductions
• Kernel vectorization (SIMD)– Wider ALUs
• Compute Unit Replication (CUR)– Replicate Pipelines– Downsides: memory divergence ⬆, complexity ⬆, frequency ⬇
56
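The "optimized tree-like reductions" that loop unrolling enables can be sketched in plain C++: instead of one serial accumulator chain, elements are combined pairwise in log-depth levels, the shape the FPGA compiler can pipeline. This is an illustration of the shape, not compiler output:

```cpp
#include <vector>
#include <cstddef>

// Tree reduction sketch: combine elements pairwise, doubling the stride each
// level, so the sum finishes in ~log2(n) levels instead of n-1 serial adds.
double tree_reduce(std::vector<double> v) {
    for (std::size_t stride = 1; stride < v.size(); stride *= 2)
        for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride)
            v[i] += v[i + stride];          // each level's adds are independent
    return v.empty() ? 0.0 : v[0];
}
```

On the FPGA the independent adds within each level become parallel adders in the generated pipeline.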
![Page 55: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/55.jpg)
Shift Register Pattern
• Example: 1D stencil computation
57
(Diagram: the NDRange version without SRP re-reads neighbouring input elements for every multiply-add; the single-task SRP version keeps the filter window in shft_reg and shifts it each iteration.)
64x faster than naive (17-element filter)
"Tuning Stencil Codes in OpenCL for FPGAs", ICCD'16
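The shift-register pattern can be mimicked in software. A sketch for a 3-tap filter (the ICCD'16 work uses 17 taps; 3 keeps the sketch short), with names of our own choosing:

```cpp
#include <vector>
#include <array>
#include <cstddef>

// Shift-register pattern sketch for a 3-tap 1D stencil. Each iteration shifts
// the window by one element and reuses the other taps instead of re-reading
// memory, which is what the FPGA maps onto a chain of registers.
std::vector<float> stencil3(const std::vector<float>& in,
                            const std::array<float, 3>& coef) {
    std::vector<float> out;
    if (in.size() < 3) return out;
    std::array<float, 3> shreg = {in[0], in[1], in[2]}; // the "shift register"
    for (std::size_t i = 3; ; ++i) {
        out.push_back(coef[0] * shreg[0] + coef[1] * shreg[1] + coef[2] * shreg[2]);
        if (i == in.size()) break;
        shreg = {shreg[1], shreg[2], in[i]}; // shift: one new memory read per output
    }
    return out;
}
```

Only one new element is read per output, versus three reads per output in the naive version.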
![Page 56: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/56.jpg)
Agenda
• Motivation
• Hardware: iGPUs and iFPGAs
• Software
  – Programming models for chips with iGPUs and iFPGAs
• Our heterogeneous templates based on TBB
  – parallel_for
  – parallel pipeline
58
![Page 57: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/57.jpg)
Our work: parallel_for
• Heterogeneous parallel for based on TBB– Dynamic scheduling / Adaptive partitioning for irregular codes– CPU and GPU chunk size re-computed every time
59
(Diagram: Gantt chart for Barnes-Hut on an Exynos 5. Four A15 cores and three A7 cores compute chunks while one A7 feeds the GPU; the time axis marks the time to execute each chunk of iterations on the GPU.)
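The "chunk size re-computed every time" idea can be sketched as follows (our illustration, not the authors' code): after each GPU chunk, use the observed throughput to size the next chunk for a target scheduling interval.

```cpp
#include <cstddef>
#include <algorithm>

// Adaptive chunk sizing sketch: given how long the last chunk took, pick the
// next chunk so it should run for ~target_ms at the observed throughput,
// clamped to the remaining iterations.
std::size_t next_chunk(std::size_t last_chunk, double last_ms, double target_ms,
                       std::size_t remaining) {
    double throughput = last_chunk / last_ms;            // iterations per ms
    auto next = static_cast<std::size_t>(throughput * target_ms);
    return std::min(std::max<std::size_t>(next, 1), remaining);
}
```

If a 320-iteration chunk took 10 ms and the target interval is 20 ms, the next chunk grows to 640 iterations; near the end it is clamped to what remains, so all devices finish together.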
![Page 58: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/58.jpg)
New scheduler and API
60
![Page 59: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/59.jpg)
New API
61
(Code screenshot: the new API lets the user enable the GPU, exploit the zero-copy buffer, and supply a scheduler class.)
![Page 60: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/60.jpg)
62
![Page 61: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/61.jpg)
Related work: Qilin
63
• Static approach: bulk oracle, offline search (11 executions)
  – Work partitioning between CPU and GPU
  – One big chunk with 70% of the iterations: on the GPU
  – The remaining 30% processed on the cores
(Chart: "Barnes Hut: offline search for static partition", execution time in seconds vs. percentage of the iteration space offloaded to the GPU, from 0% to 100% in steps of 10%.)
![Page 62: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/62.jpg)
Related work: Intel Concord
64
• Papers:
  – Rashid Kaleem et al. Adaptive heterogeneous scheduling on integrated GPUs. PACT 2014.
  – Naila Farooqui et al. Affinity-aware work-stealing for integrated CPU-GPU processors. PPoPP '16 (poster).
(Flowchart: assign a chunk size just for the GPU; compute that chunk on the GPU while the CPU computes the rest; when the GPU is done, partition the rest of the iteration space according to the relative speeds.)
![Page 63: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/63.jpg)
Related Work: HDSS
65
• Belviranli et al, A Dynamic Self-Scheduling Scheme for Heterogeneous Multiprocessor Architectures, TACO 2013
(Embedded figure from the TACO paper: block assignments over time for a two-threaded run, 1 GPU thread + 1 CPU thread, of the Histogram application. HDSS works in two phases: an adaptive phase that processes a small, bounded fraction of the data to estimate each processor's computational weight, and a completion phase that starts with the largest possible blocks and steadily decreases block sizes so all units finish at nearly the same time. Experiments on a discrete Fermi GPU.)
Debunking big chunks
• Example: irregular Barnes-Hut benchmark
66
![Page 65: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/65.jpg)
Debunking big chunks
A larger chunk causes more memory traffic
67
(Diagram, shown for two chunk sizes: the array of bodies with the updated chunk, the written bodies and the read bodies highlighted; the larger chunk touches more of the array.)
![Page 66: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/66.jpg)
Debunking big chunks
• Barnes-Hut: GPU throughput along the iteration space
  – For two different time steps
  – Different GPU chunk sizes
68
(Charts: "Barnes-Hut: throughput variation" for time steps 0 and 5, GPU throughput along the iteration space for chunk sizes 320, 640, 1280 and 2560. The best chunk size changes both along the iteration space and between time steps, e.g. from 640 to 2560.)
![Page 67: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/67.jpg)
LogFit main idea
70
(Diagram: execution timeline for the GPU and Cores 1-4; the time to execute each chunk of iterations on the GPU drives an Exploration Phase, a Stable Phase and a Final Phase. Platform: Intel Core i7 Haswell with on-chip GPU, four cores and shared L3.)
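The exploration/stable idea can be sketched as a chunk-size search: grow the GPU chunk while measured throughput keeps improving, then settle on the best size. This is our simplification; `logfit_chunk` and the 5% improvement threshold are assumptions, not LogFit's actual fitting step (which fits a logarithmic curve to throughput samples).

```python
# Simplified sketch of LogFit's chunk-size adaptation (hypothetical names).

def logfit_chunk(history, chunk, growth=2):
    """history: list of (chunk_size, throughput) samples, oldest first.
    Returns the chunk size to use for the next GPU offload."""
    if len(history) < 2:
        return chunk * growth            # exploration: keep growing
    (c0, t0), (c1, t1) = history[-2], history[-1]
    if t1 > t0 * 1.05:                   # still >5% better: keep exploring
        return chunk * growth
    return c1 if t1 >= t0 else c0        # stable phase: keep the best size
```

For instance, starting at chunk 320 the sketch doubles the chunk while throughput climbs, and falls back to the previous size once a larger chunk stops paying off.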
![Page 68: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/68.jpg)
Parallel_for: Performance per joule
• Static: oracle-like static partition of the work based on profiling
• Concord: Intel's approach; GPU share provided by the user once
• HDSS: Belviranli et al.'s approach; GPU share computed once
• LogFit: our work
72
(Chart: performance per joule for GPU alone and GPU + 1, 2, 3, 4 cores, on an Intel Haswell Core i7-4770.)
Antonio Vilches et al., "Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips," Procedia Computer Science, 2015.
![Page 69: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/69.jpg)
LogFit: better for irregular benchmarks
73
![Page 70: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/70.jpg)
LogFit working on FPGA
• Energy probe: AccelPower Cap (Luis Piñuel, U. Complut.)
74
(Photos: DE1-SoC board with an Altera Cyclone V; AccelPower Cap + BeagleBone; shunt resistor measuring the power flow.)
![Page 71: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/71.jpg)
LogFit working on FPGA
75
(Chart: LogFit (2 cores + FPGA) vs. Static.)
• Terasic DE1: Altera Cyclone V (FPGA + 2 Cortex-A9)
• Energy probe: AccelPower Cap (Luis Piñuel, U. Complut.)
![Page 72: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/72.jpg)
Our work: pipeline
• ViVid, an object detection application
• Contains three main kernels that form a pipeline
• We would like to answer the following questions:
– Granularity: coarse- or fine-grained parallelism?
– Mapping of stages: where do we run them (CPU/GPU)?
– Number of cores: how many when running on the CPU?
– Optimum: which metric do we optimize (time, energy, both)?
76
(Diagram: three-stage pipeline. Input frame → Stage 1: Filter → Stage 2: Histogram → Stage 3: Classifier → output. Intermediate data: index matrix, response matrix, histograms, detection response.)
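The three kernels compose naturally as stages. The sketch below shows only the structure; the stage bodies are placeholders standing in for the real filter/histogram/classifier kernels (which in the talk run on CPU, GPU or FPGA).

```python
# Structural sketch of the ViVid pipeline; stage bodies are placeholders.

def filter_stage(frame):
    # Produces an index matrix and a response matrix per frame.
    return {"index": [i % 3 for i in frame], "response": [x * 2 for x in frame]}

def histogram_stage(filtered):
    # Builds histograms from the index/response matrices.
    hist = {}
    for idx, resp in zip(filtered["index"], filtered["response"]):
        hist[idx] = hist.get(idx, 0) + resp
    return hist

def classifier_stage(hist):
    # Produces a detection response from the histograms
    # (here: the bin with the largest accumulated response).
    return max(hist, key=hist.get)

def vivid_pipeline(frames):
    return [classifier_stage(histogram_stage(filter_stage(f))) for f in frames]
```

In the real system each stage is a TBB pipeline filter whose body dispatches to a CPU or accelerator implementation; here the composition is serial for clarity.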
![Page 73: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/73.jpg)
Granularity
• Coarse grain: CG
• Medium grain: MG
• Fine grain: FG
– Fine grain also on the CPU via AVX intrinsics
77
(Diagrams, C = core. Coarse grain: each pipeline stage processes an item on a single CPU core. Medium grain: each stage processes an item across several CPU cores. Fine grain: stages process items across the GPU's cores.)
![Page 74: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/74.jpg)
Our work: pipeline
78
(Diagram: the eight stage-to-device mappings for the three-stage pipeline, each encoded as three bits selecting CPU or GPU per stage: 111, 100, 010, 001, 110, 101, 011, 000.)
![Page 75: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/75.jpg)
Accounting for all alternatives
• In general: nC CPU cores, 1 GPU and p pipeline stages

# alternatives = 2^p × (nC + 2)

• For Rodinia's SRAD benchmark (p = 6, nC = 4) → 384 alternatives
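The count is easy to reproduce; `num_alternatives` is a name of ours, encoding the slide's formula directly:

```python
# Size of the configuration space: 2^p stage-to-device mappings times
# (nC + 2) choices for the worker setup, per the slide's formula.

def num_alternatives(p, n_cores):
    return 2 ** p * (n_cores + 2)
```

For SRAD (p = 6, nC = 4) this gives 2^6 × 6 = 384, matching the slide; exhaustive search over that many runs is what the analytical model below avoids.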
79
(Diagram: a generalized pipeline with p stages; each stage can run on the GPU or on up to nC CPU cores.)
Mapping streaming applications on commodity multi-CPU and GPU on-chip processors, IEEE Trans. Parallel and Distributed Systems, 2016.
![Page 76: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/76.jpg)
Framework and Model
• Key idea:
1. Run only on the GPU
2. Run only on the CPU
3. Analytically extrapolate for heterogeneous execution
4. Find out the best configuration → RUN
80
(Diagram: DP-MG configuration; each stage can run on the GPU or across the CPU cores. The homogeneous runs collect λ and E.)
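The extrapolation step might look like the sketch below (our simplification with hypothetical names, not the paper's model): add the measured homogeneous throughputs λ of the devices a configuration uses, derive its energy per item from the summed powers, and pick the best configuration without running it.

```python
# Sketch: extrapolate heterogeneous configurations from homogeneous runs.

def estimate(configs, lam_cpu_core, lam_gpu, p_cpu_core, p_gpu):
    """configs: list of (n_cores, use_gpu) pairs.
    Returns (best_config, throughput, energy_per_item), maximizing throughput.
    lam_* are measured throughputs (items/s), p_* average powers (W)."""
    best = None
    for n, gpu in configs:
        lam = n * lam_cpu_core + (lam_gpu if gpu else 0.0)
        power = n * p_cpu_core + (p_gpu if gpu else 0.0)
        energy_per_item = power / lam if lam > 0 else float("inf")
        if best is None or lam > best[1]:
            best = ((n, gpu), lam, energy_per_item)
    return best
```

Summing throughputs assumes the devices overlap perfectly and never starve each other, which is exactly the optimistic assumption an analytical model of this kind has to make; the scheduler then validates the chosen configuration at run time.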
![Page 77: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/77.jpg)
Environmental Setup: Benchmarks
• Four benchmarks:
– ViVid (low- and high-definition inputs)
– SRAD
– Tracking
– Scene Recognition
81
(Pipelines:
– ViVid: Input → Filter → Histogram → Classifier → Out (3 stages)
– SRAD: Input → Extraction → Prep. → Reduction → Comp. 1 → Comp. 2 → Statistics → Out (6 stages)
– Tracking: Input → Tracking → Out (1 stage)
– Scene Recognition: Input → Feature → SVM → Out (2 stages))
![Page 78: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/78.jpg)
Does Higher Throughput imply Lower Energy?
82
(Charts: ViVid HD on Haswell; three panels plotting throughput- and energy-related metrics against the number of threads (CG).)
![Page 79: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/79.jpg)
SRAD on Haswell
84
(Charts: three panels plotting metrics for SRAD on Haswell against the number of threads (CG).)
(Diagram: DP-CG configuration for SRAD's six-stage pipeline; each stage can run on the GPU or on a CPU core.)
![Page 80: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/80.jpg)
Results on big.LITTLE architecture
85
• On Odroid XU3: 4 Cortex A15 + 4 Cortex A7 + GPU Mali
(Diagram: three-stage pipeline, C = core. Configurations compared: 1 thr (A15); 2 thr (2 A15); 4 thr (4 A15); 5 thr (4 A15 + 1 A7); 9 thr (4 A15 + 4 A7); 1 thr (GPU); 5 thr (GPU + 4 A15); 9 thr (GPU + 4 A15 + 4 A7).)
![Page 81: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/81.jpg)
Pipeline on CPU-FPGAs
• Replace the GPU with an FPGA:
– Core i7-4770K 3.5 GHz Haswell quad-core CPU
– Altera DE5-Net FPGA board from Terasic
– Intel HD 6000 embedded GPU
– Nvidia Quadro K600 discrete GPU (192 CUDA cores)
• Homogeneous results:
86
| Device | Time (s) | Throughput (fps) | Power (W) | Energy (J) |
|---|---|---|---|---|
| CPU (1 core) | 167.610 | 0.596 | 37 | 6201 |
| CPU (4 cores) | 50.310 | 1.987 | 92 | 4628 |
| FPGA | 13.480 | 7.418 | 5 | 67 |
| On-chip GPU | 7.855 | 12.730 | 16 | 125 |
| Discrete GPU | 13.941 | 7.173 | 47 | 655 |
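As a quick sanity check (ours, not from the slides), the Energy column is consistent with energy = power × time within rounding for every device:

```python
# Verify energy = power * time for each row of the homogeneous-results table.
rows = {
    "CPU (1 core)":  (167.610, 37, 6201),
    "CPU (4 cores)": (50.310, 92, 4628),
    "FPGA":          (13.480, 5, 67),
    "On-chip GPU":   (7.855, 16, 125),
    "Discrete GPU":  (13.941, 47, 655),
}
for name, (time_s, power_w, energy_j) in rows.items():
    # Allow 1% slack for the rounding in the reported figures.
    assert abs(power_w * time_s - energy_j) / energy_j < 0.01, name
```

The check also makes the headline numbers vivid: the FPGA is slower than the on-chip GPU here but draws so little power that it uses roughly half the energy.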
![Page 82: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/82.jpg)
Pipeline on CPU-FPGA
• Heterogeneous results based on our TBB scheduler:
– Accelerator + CPU
– Mapping 001
87
On-chip GPU + CPU ~ FPGA + CPU
![Page 83: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/83.jpg)
Conclusions
• Plenty of heterogeneous on-chip architectures out there
• Why not use all the devices?
– We need to find the best mapping/distribution/scheduling out of the many possible alternatives.
• Programming models and runtimes aimed at this goal are in their infancy; we are striving to solve this.
• Challenges:
– Hide HW complexity while providing a productive programming model
– Consider energy in the partitioning/scheduling decisions
– Minimize the overhead of the adaptation policies
88
![Page 84: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/84.jpg)
Future Work
• Exploit CPU-GPU-FPGA chips and coherency
• Predict energy consumption on heterogeneous CPU-GPU-FPGA chips
• Consider energy consumption in the partitioning heuristic
• Rebuild the scheduler engine on top of the new TBB library (OpenCL node)
• Tackle other irregular codes
89
![Page 85: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/85.jpg)
Future Work: TBB OpenCL node
(Flow graph: get_next_image → preprocess → buffer → detect_with_A and detect_with_B (OpenCL nodes) → make_decision.)
Can express pipelining, task parallelism and data parallelism.
![Page 86: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/86.jpg)
Future Work: other irregular codes
91
(Diagrams: a 19-node tree traversed top-down and bottom-up under three schemes: a) selective, b) concurrent, c) asynchronous. Legend: nodes and frontiers computed on the CPU, the GPU, or both.)
C.H.G. Vázquez, "Library based solutions for algorithms with complex patterns," PhD thesis, 2015.
![Page 87: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA](https://reader030.vdocuments.net/reader030/viewer/2022021422/5a6d3d857f8b9ad6418b50c3/html5/thumbnails/87.jpg)
Collaborators
• Mª Ángeles González Navarro (UMA, Spain)
• Andrés Rodríguez (UMA, Spain)
• Francisco Corbera (UMA, Spain)
• Jose Luis Nunez-Yanez (University of Bristol)
• Antonio Vilches (UMA, Spain)
• Alejandro Villegas (UMA, Spain)
• Juan Gómez Luna (U. Cordoba, Spain)
• Rubén Gran (U. Zaragoza, Spain)
• Darío Suárez (U. Zaragoza, Spain)
• Maria Jesús Garzarán (Intel and UIUC, USA)
• Mert Dikmen (UIUC, USA)
• Kurt Fellows (UIUC, USA)
• Ehsan Totoni (UIUC, USA)
• Luis Remis (Intel, USA)
92