High-Performance, Low-Power Data Processing on Embedded Many-Core and FPGA Architectures
Dr. Alessandro [email protected]
Prof. Davide [email protected]
Agenda
• Many-core introduction
• CIRI ICT OpenMP technologies
  – Productive parallel programming models
  – Accelerator virtualization for high-performance and power-efficient computation
  – Resource sharing among applications
  – Heterogeneous unified shared memory
• Conclusions
Collaborations
• Industrial
• EU projects:
  – FP7 – ICT, Grant N° 288574
  – ERC, Grant N° 291125
  – P-SOCRATES
• Academia
The Advent of the Heterogeneous Many-Core Architecture Era in High-Performance Embedded Systems
Embedded systems must be capable of processing workloads traditionally tailored for workstations or HPC machines.
Nvidia Tegra K1: two levels of heterogeneity:
• Host processor (4 powerful cores + 1 energy-efficient core)
• Parallel many-core co-processor (192-core accelerator: Nvidia Kepler GPU)
Multi-Processor Systems-on-Chip (MPSoCs): computing units embedded in the same die, designed to deliver high performance at low power consumption, i.e., high energy efficiency (GOPS/Watt). Various design schemes are available:
• Targeting ADAPTIVITY: heterogeneity and specialization for efficient computing (e.g., ARM big.LITTLE).
• Targeting PARALLELISM: massively parallel many-core accelerators to maximize GOPS/Watt (e.g., GPUs, GPGPUs, PMCAs).
Nvidia K1 (Jetson)
Hardware features:
• Dimensions: 5" x 5" (127 mm x 127 mm) board
• Tegra K1 SoC (1 to 5 Watts):
  – NVIDIA Kepler GPU with 192 CUDA cores (326 GFLOPS)
  – NVIDIA "4-Plus-1" 2.32 GHz ARM quad-core Cortex-A15
  – DRAM: 2 GB DDR3L 933 MHz
I/O features: mini-PCIe, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO
Software features: CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codecs (including H.264), OpenCV4Tegra
$200. Big community of users!
Nvidia TX1
Hardware features:
• Dimensions: 8" x 8" board
• Tegra TX1 SoC (15 Watts):
  – NVIDIA Maxwell GPU with 256 NVIDIA CUDA cores (1 TFLOP/s)
  – Quad-core ARM Cortex-A57 MPCore processor
  – 4 GB LPDDR4 memory
I/O features: PCIe x4, 5 MP CSI, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO
Software features: CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codecs (including H.264), OpenCV4Tegra
$599. Workstation-comparable performance.
TI Keystone II
Hardware features:
• Dimensions: 8" x 8" board
• TI 66AK2H12 SoC (14 Watts):
  – 8x C6600 DSPs @ 1.2 GHz (304 GMACs)
  – Quad-core ARM Cortex-A15 MPCore
  – Up to 4 GB DDR3 memory
I/O features: PCIe, SD/MMC card, USB 3.0/2.0, 2x Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 38x GPIO, HyperLink, SRIO, 20x 64-bit timers, security accelerators
Software features: OpenMP, OpenCL
Evaluation board: $1000. Targeted as a signal-processing accelerator.
Kalray MPPA
Evaluation board: $1000.
Targeting:
• High-performance, time-critical missions
• Aerospace / military / autonomous driving
• Industrial robotics
Programmable many-core accelerator (PMCA)
Challenges
• Fast programmability, high-productivity programming techniques
• Time predictability for industrial applications
• Accelerator virtualization for high-performance and power-efficient computation:
  – Resource sharing among applications
  – Heterogeneous unified shared memory
[Chart: measured thread-level parallelism (0 to 2) for popular mobile apps (iFunny, NetFlix, Candy Crush Saga, My Talking Tom, BS Player, LinkedIn, Google Drive, Instagram, YouTube, Dropbox, Facebook, Twitter) on Android and Apple devices.]
Tests [*] based on common mobile applications show that real platforms are still far from materializing the potential parallelism provided by the hardware:
TLP avg (52 apps) ≈ 1.22 on Android; TLP avg (52 apps) ≈ 1.36 on Apple.
Performance is not a free meal.
[*] "Analysis of the Effective Use of Thread-Level Parallelism in Mobile Applications: A Preliminary Study on iOS and Android Devices", Ethan Bogdan, Hongin Yun.
Parallel Programming Models
• Proprietary programming models
• Khronos standard for heterogeneous computing (OpenCL)
• Standard for shared-memory systems (OpenMP)
• Academic proposals: OmpSs, OpenHMPP, …
OpenMP
▲ De-facto standard for shared-memory programming
▲ Support for nested (multi-level) parallelism → good for clusters
▲ Annotations to incrementally convey parallelism to the compiler → increased ease of use
▲ Based on well-understood programming practices (shared memory, C language) → increased productivity

"OpenCL for programming shared memory multicore CPUs" by Akhtar Ali, Usman Dastgeer, Christoph Kessler: 2x to 10x fewer lines of code.

But…
▼ Designed for uniform SMPs with a main shared memory
▼ Lacks constructs to control accelerators:
  1. …and a compilation toolchain to deal with multiple ISAs…
  2. …and multiple runtime systems too!

(Source: Intel's Parallel Universe magazine, May 2014.)
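A minimal illustration of the incremental-annotation point above (a generic loop sketch, not taken from the slides): a single pragma parallelizes sequential C, and the program stays correct when OpenMP is disabled.

    /* One annotation distributes the iterations across a thread team;
       without -fopenmp the pragma is ignored and the loop runs serially. */
    void scale(float *v, int n, float k) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            v[i] *= k;
    }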
Open-Next OpenMP Runtime 4.0
• What's new? UNTIED tasks (a minimal sketch follows).
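As a hedged illustration of what UNTIED tasks mean in user code (a classic recursive Fibonacci, not taken from the slides): an untied task suspended at a scheduling point may be resumed by any thread in the team, which improves load balancing on deep recursive task trees.

    #include <stdio.h>

    /* Minimal sketch: untied tasks on a recursive workload. */
    long fib(int n) {
        long a, b;
        if (n < 2) return n;
        #pragma omp task untied shared(a)
        a = fib(n - 1);
        #pragma omp task untied shared(b)
        b = fib(n - 2);
        #pragma omp taskwait     /* join both child tasks */
        return a + b;
    }

    int main(void) {
        long r;
        #pragma omp parallel
        #pragma omp single       /* one thread seeds the task tree */
        r = fib(30);
        printf("fib(30) = %ld\n", r);
        return 0;
    }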
Open-Next OpenMP Runtime 4.0: comparison with other OpenMP implementations
• Platform: x86 (Intel Haswell, 2 × 8 cores @ 2.40 GHz)
• libgomp: GNU OpenMP implementation (GCC 4.9.2)
• iomp: Intel OpenMP implementation (ICC 15.0.2)
[Chart: speedup (0 to 16) on the RECURSIVE benchmark, comparing the Open-Next runtime against libgomp and iomp.]
Open-Next OpenMP Runtime 4.0: comparison with other tasking runtimes
• Platform: x86 (Intel Haswell, 2 × 8 cores @ 2.40 GHz)
• nanos: BSC OmpSs (Mercurium 15.06 + Nanos++)
• Intel Cilk Plus: ICC 15.0.2
• Intel TBB: ICC 15.0.2
• Wool: GCC 4.9.2
[Chart: speedup (0 to 16) on the RECURSIVE benchmark, comparing nanos, libgomp, iomp, Intel Cilk Plus (ICC 15.0.2), Intel TBB (ICC 15.0.2), and Wool (GCC 4.9.2).]
Time Predictability
• At compile time, generate the TDG (task dependency graph), including timing information to account for task communication (a minimal OpenMP sketch follows the diagram).
• At design time, assign the TDG to OS threads (mapping).
• At run time, schedule OS threads to achieve both predictability and high performance (scheduling).

[Diagram: C/C++ source code annotated with #pragma omp goes through the compiler plus timing analysis, producing binary code with newTask() calls and an eTDG; a static scheduler performs the design-time mapping; a dispatcher and the many-core OpenMP RTE handle run-time scheduling.]
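To make the TDG concept concrete, here is a minimal sketch (generic OpenMP 4.0 depend clauses, not the project's actual toolchain input) of how task dependencies written in the source induce the graph a compiler can extract:

    #include <stdio.h>

    int main(void) {
        int a = 0, b = 0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: a)    /* TDG node 1 */
            a = 1;
            #pragma omp task depend(out: b)    /* TDG node 2, independent of node 1 */
            b = 2;
            #pragma omp task depend(in: a, b)  /* TDG node 3, waits for nodes 1 and 2 */
            printf("a + b = %d\n", a + b);
        }
        return 0;
    }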
Open-Next Offload using OpenMP

    void main() {
      int a[];
      int ker_id;

      /* some code here */

      #pragma omp offload \
              shared (a) \
              name ("myker", ker_id) \
              nowait
      {
        #pragma omp parallel sections \
                proc_bind (spread)
        {
          #pragma omp section
          TASK_A();
          #pragma omp section
          TASK_B();
        }
      }

      /* some more code here */

      #pragma omp wait (ker_id)
    }

    TASK_A() {
      int i;
      #pragma omp parallel proc_bind (close)
      #pragma omp for
      for ( i = 0; …. )
        do_smthg();
    }

• offload: new OpenMP directive used to offload the execution of a code block to the accelerator.
• shared clause: specifies data that needs to be shared between the host and the accelerator.
• name clause (new): retrieves a handle (an ID) to the kernel instance at run time, necessary to wait for asynchronous offloads.
• nowait clause: specifies an asynchronous offload.
• All standard OpenMP constructs and the custom extensions can be used within an offload block.
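For reference, the standard OpenMP accelerator model expresses a similar asynchronous offload with the target directive (introduced in OpenMP 4.0; the nowait clause on target is standard since 4.5). A minimal sketch using only standard clauses, not the Open-Next extensions:

    #include <stdio.h>
    #define N 1024

    int main(void) {
        int a[N];
        for (int i = 0; i < N; ++i) a[i] = i;

        /* Deferred target task: the host continues past the offload. */
        #pragma omp target map(tofrom: a) nowait
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] *= 2;

        /* ... host code overlapping with the offload ... */

        #pragma omp taskwait   /* wait for the asynchronous target region */
        printf("a[1] = %d\n", a[1]);
        return 0;
    }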
Open-Next Offload using OpenMP: early evaluation
[Charts: measured speedup (0 to 40) vs. number of kernel repetitions for six kernels: FAST, CT, Mahala., Strassen, NCC, SHOT.]
"Simplifying Heterogeneous Embedded SoC Programming with Directive-Based Offload", Marongiu, Capotondi, Tagliavini, Benini, IEEE Transactions on Industrial Informatics, 2015.
Accelerator Resource Sharing

[Diagram: legacy applications and multiple programming models (OpenCL, OpenMP, TBB, OpenVX) sit on top of a low-level runtime and a hardware abstraction layer.]
• There is no dominant standard parallel programming model (PPM).
• Goal: improve the overall utilization of accelerators in multi-user environments.
• On PMCAs, runtime environments (RTEs) are typically developed on top of bare metal.
Accelerator Resource Sharing

[Diagram: applications O1, O2, O3, ..., ON, written in OpenCL, OpenMP, and OpenVX, run on the host; a driver with lightweight spatial partitioning support exposes the PMCA as multiple virtual accelerators.]
Runtime Efficiency: Computer Vision Use-Case
• ORB object detector (OpenCL, 4 clusters) [1]
• Face detector (OpenCL, 1 cluster) [2]
• FAST corner detector (OpenMP, 1 cluster) [3]
• Removed-object detector (OpenMP, 4 clusters) [4]
[1] Rublee, Ethan, et al. "ORB: an efficient alternative to SIFT or SURF." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[2] Jones, Michael, et al. "Fast multi-view face detection." Mitsubishi Electric Research Lab TR-2003-96 (2003).
[3] Rosten, et al. "Faster and better: A machine learning approach to corner detection." IEEE Transactions on Pattern Analysis and Machine Intelligence 32.1 (2010): 105-119.
[4] Magno, Michele, et al. "Multimodal abandoned/removed object detection for low power video surveillance systems." Advanced Video and Signal Based Surveillance (AVSS '09), Sixth IEEE International Conference on. IEEE, 2009.
[Chart: runtime efficiency (% vs. ideal, 0% to 100%) against number of frames (10 to 10000), comparing MPM-MO, SPM-MO (0%, 25%, 50%, 100%), and SPM-SO. The proposed approach is 90% efficient w.r.t. ideal: +30% efficiency w.r.t. SPM-MO and +40% w.r.t. SPM-SO.]
Heterogeneous Unified Shared Memory
• Shared memory for accelerators in embedded SoCs: there is no clear view of the practical implementation aspects and performance implications of virtual memory support.
• Today's reality: memory partitioning.
  – Coherent virtual memory for the host.
  – The accelerator can only access a contiguous section of shared main memory; no virtual memory.
  – Explicit data management involving copies: limited programmability, low performance (see the sketch below).
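A hedged sketch of what this copy-based flow looks like on the host side; contig_alloc, contig_free, and accel_run are hypothetical placeholders for a platform's contiguous-memory allocator and kernel launch, not an actual API from the slides:

    #include <string.h>
    #include <stdlib.h>

    /* Hypothetical platform calls: allocate/free a physically contiguous
       buffer the accelerator can access, and launch a kernel on it. */
    extern void *contig_alloc(size_t bytes);
    extern void  contig_free(void *p);
    extern void  accel_run(void *buf, size_t bytes);

    void offload_with_copies(int *data, size_t n) {
        size_t bytes = n * sizeof(int);

        /* Copy-in: host virtual memory -> contiguous shared section. */
        int *buf = contig_alloc(bytes);
        memcpy(buf, data, bytes);

        accel_run(buf, bytes);      /* accelerator works on the copy */

        /* Copy-out: results back into host virtual memory. */
        memcpy(data, buf, bytes);
        contig_free(buf);
    }

With an IOMMU providing heterogeneous unified shared memory, both memcpy calls and the dedicated allocator disappear: the accelerator dereferences the host's virtual pointer directly.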
Heterogeneous Unified Shared Memory
Open-Next goal: lightweight virtual memory support
• Sharing of virtual address pointers
• Transparent to the application developer
• Zero-copy offload, higher predictability
• Low complexity, low area, low cost
Heterogeneous Unified Shared Memory
• Heterogeneous systems increase computing power and energy efficiency.
• Host and accelerator communicate via coherent shared memory.
• IOMMU for hUMA (heterogeneous uniform memory access) in high-end SoCs.
• The host executes control-intensive and sequential tasks; highly parallel tasks are offloaded to the accelerator at fine granularity.
• OFFLOAD with ZERO-COPY (transparent) virtual pointer sharing: moves the complexity from the software to the hardware.
Not only many-core accelerators: FPGA CNN deep-learning accelerator.
HETEROGENEOUS UNIFIED SHARED MEMORY: Low-Cost IOMMU
• Host: dual-core ARM Cortex-A9, Linux kernel 3.13
• Accelerator: PULP implemented in the FPGA (http://www.pulp-platform.org/)
• First open-source RISC-V core
Open-Next CIRI-ICT Activities
• Identification of the reference programming model for the multi- and many-core platforms implementing the project's use cases
• Implementation of software mechanisms to ease programming and to make data exchange more efficient in heterogeneous "shared memory" architectures composed of a host with virtual memory support and accelerators without virtual memory support (e.g., GPU, DSP, FPGA)
• Implementation of software mechanisms for the high-level management of functions accelerated through dedicated hardware (FPGA)
• Identification of possible extensions to the programming model for the next generation of real-time industrial plants
• Porting of significant kernels extracted from the applications implementing the use cases, and performance analysis
Open-Next CIRI-ICT Unibo
• Your industrial use cases!
• More than 10 years of experience in embedded many-core programming
• 36 person-months on industrial use-case exploration
• Move from workstations to efficient embedded systems!
High-Performance, Low-Power Data Processing on Embedded Many-Core and FPGA Architectures