High-Performance, Low-Power Data Processing on Embedded Many-Core and FPGA Architectures
Dr. Alessandro [email protected]
Prof. Davide [email protected]
Agenda
• Many-core introduction
• CIRI ICT OpenMP technologies
  – Productive parallel programming models
  – Accelerator virtualization for high-performance and power-efficient computation
  – Resource sharing among applications
  – Heterogeneous unified shared memory
• Conclusions
Collaborations
• Industrial
• EU projects:
  – FP7 – ICT, Grant N° 288574
  – ERC, Grant N° 291125
  – P-SOCRATES
• Academia
The Advent of the Heterogeneous Many-Core Architecture Era in High-Performance Embedded Systems
Embedded systems must be capable of processing workloads traditionally tailored for workstations or HPC machines.
Nvidia Tegra K1: two levels of heterogeneity:
• Host processor (4 powerful cores + 1 energy-efficient core)
• Parallel many-core co-processor (192-core accelerator: Nvidia Kepler GPU)
Multi-Processor Systems-on-Chip (MPSoCs): computing units embedded in the same die, designed to deliver high performance at low power consumption, i.e., high energy efficiency (GOPS/Watt). Various design schemes are available:
• Targeting ADAPTIVITY: heterogeneity and specialization for efficient computing (e.g., ARM big.LITTLE).
• Targeting PARALLELISM: massively parallel many-core accelerators to maximize GOPS/Watt (e.g., GPUs, GPGPUs, PMCAs).
Nvidia K1 (Jetson)
Hardware features:
• Dimensions: 5" x 5" (127 mm x 127 mm) board
• Tegra K1 SoC (1 to 5 Watts):
  – NVIDIA Kepler GPU with 192 CUDA cores (326 GFLOPS)
  – NVIDIA "4-Plus-1" 2.32 GHz ARM quad-core Cortex-A15
  – DRAM: 2 GB DDR3L 933 MHz
I/O features: mini-PCIe, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO
Software features: CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codecs (including H.264), OpenCV4Tegra
$200. Big community of users!
Nvidia TX1
Hardware features:
• Dimensions: 8" x 8" board
• Tegra TX1 SoC (15 Watts):
  – NVIDIA Maxwell GPU with 256 NVIDIA CUDA cores (1 TFLOP/s)
  – Quad-core ARM Cortex-A57 MPCore processor
  – 4 GB LPDDR4 memory
I/O features: PCIe x4, 5 MP CSI, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO
Software features: CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codecs (including H.264), OpenCV4Tegra
$599. Workstation-comparable performance.
TI Keystone II
Hardware features:
• Dimensions: 8" x 8" board
• TI 66AK2H12 SoC (14 Watts):
  – 8x C6600 DSPs @ 1.2 GHz (304 GMACs)
  – Quad-core ARM Cortex-A15 MPCore
  – Up to 4 GB DDR3 memory
I/O features: PCIe, SD/MMC card, USB 3.0/2.0, 2x Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 38x GPIO, HyperLink, SRIO, 20x 64-bit timers, security accelerators
Software features: OpenMP, OpenCL
Evaluation board: $1000. Targeted as a signal-processing accelerator.
Kalray MPPA
Evaluation board: $1000.
Targeting:
• High-performance, time-critical missions
• Aerospace / military / autonomous driving
• Industrial robotics
Programmable many-core accelerator (PMCA)
Challenges
• Fast programmability, high-productivity programming techniques
• Time predictability for industrial applications
• Accelerator virtualization for high-performance and power-efficient computation:
  – Resource sharing among applications
  – Heterogeneous unified shared memory
[Chart: measured thread-level parallelism (0 to 2) for popular mobile apps (iFunny, NetFlix, Candy Crush Saga, My Talking Tom, BS Player, LinkedIn, Google Drive, Instagram, YouTube, Dropbox, Facebook, Twitter) on Android and Apple devices.]
Tests [*] based on common mobile applications show that real platforms are still far from materializing the potential parallelism provided by the hardware:
TLP avg (52 apps) ≈ 1.22 on Android; TLP avg (52 apps) ≈ 1.36 on Apple.
Performance is not a free meal.
[*] "Analysis of the Effective Use of Thread-Level Parallelism in Mobile Applications: A Preliminary Study on iOS and Android Devices", Ethan Bogdan, Hongin Yun.
Parallel Programming Models
• Proprietary programming models
• Khronos standard for heterogeneous computing (OpenCL)
• Standard for shared-memory systems (OpenMP)
• Academic proposals: OmpSs, OpenHMPP, …
OpenMP
▲ De-facto standard for shared-memory programming
▲ Support for nested (multi-level) parallelism → good for clusters
▲ Annotations to incrementally convey parallelism to the compiler → increased ease of use
▲ Based on well-understood programming practices (shared memory, C language) → increased productivity

"OpenCL for programming shared memory multicore CPUs" by Akhtar Ali, Usman Dastgeer, Christoph Kessler: 2x to 10x fewer lines of code.

But…
▼ Designed for uniform SMPs with a main shared memory
▼ Lacks constructs to control accelerators:
  1. …and a compilation toolchain to deal with multiple ISAs…
  2. …and multiple runtime systems too!

(Source: Intel's Parallel Universe magazine, May 2014.)
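A minimal illustration of the incremental-annotation point above (a generic loop sketch, not taken from the slides): a single pragma parallelizes sequential C, and the program stays correct when OpenMP is disabled.

    /* One annotation distributes the iterations across a thread team;
       without -fopenmp the pragma is ignored and the loop runs serially. */
    void scale(float *v, int n, float k) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            v[i] *= k;
    }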
Open-Next OpenMP Runtime 4.0
• What's new? UNTIED tasks (a minimal sketch follows).
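As a hedged illustration of what UNTIED tasks mean in user code (a classic recursive Fibonacci, not taken from the slides): an untied task suspended at a scheduling point may be resumed by any thread in the team, which improves load balancing on deep recursive task trees.

    #include <stdio.h>

    /* Minimal sketch: untied tasks on a recursive workload. */
    long fib(int n) {
        long a, b;
        if (n < 2) return n;
        #pragma omp task untied shared(a)
        a = fib(n - 1);
        #pragma omp task untied shared(b)
        b = fib(n - 2);
        #pragma omp taskwait     /* join both child tasks */
        return a + b;
    }

    int main(void) {
        long r;
        #pragma omp parallel
        #pragma omp single       /* one thread seeds the task tree */
        r = fib(30);
        printf("fib(30) = %ld\n", r);
        return 0;
    }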
Open-Next OpenMP Runtime 4.0: comparison with other OpenMP implementations
• Platform: x86 (Intel Haswell, 2 × 8 cores @ 2.40 GHz)
• libgomp: GNU OpenMP implementation (GCC 4.9.2)
• iomp: Intel OpenMP implementation (ICC 15.0.2)
[Chart: speedup (0 to 16) on the RECURSIVE benchmark, comparing the Open-Next runtime against libgomp and iomp.]
Open-Next OpenMP Runtime 4.0: comparison with other tasking runtimes
• Platform: x86 (Intel Haswell, 2 × 8 cores @ 2.40 GHz)
• nanos: BSC OmpSs (Mercurium 15.06 + Nanos++)
• Intel Cilk Plus: ICC 15.0.2
• Intel TBB: ICC 15.0.2
• Wool: GCC 4.9.2
[Chart: speedup (0 to 16) on the RECURSIVE benchmark, comparing nanos, libgomp, iomp, Intel Cilk Plus (ICC 15.0.2), Intel TBB (ICC 15.0.2), and Wool (GCC 4.9.2).]
Time Predictability
• At compile time, generate the TDG (task dependency graph), including timing information to account for task communication (a minimal OpenMP sketch follows the diagram).
• At design time, assign the TDG to OS threads (mapping).
• At run time, schedule OS threads to achieve both predictability and high performance (scheduling).

[Diagram: C/C++ source code annotated with #pragma omp goes through the compiler plus timing analysis, producing binary code with newTask() calls and an eTDG; a static scheduler performs the design-time mapping; a dispatcher and the many-core OpenMP RTE handle run-time scheduling.]
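To make the TDG concept concrete, here is a minimal sketch (generic OpenMP 4.0 depend clauses, not the project's actual toolchain input) of how task dependencies written in the source induce the graph a compiler can extract:

    #include <stdio.h>

    int main(void) {
        int a = 0, b = 0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: a)    /* TDG node 1 */
            a = 1;
            #pragma omp task depend(out: b)    /* TDG node 2, independent of node 1 */
            b = 2;
            #pragma omp task depend(in: a, b)  /* TDG node 3, waits for nodes 1 and 2 */
            printf("a + b = %d\n", a + b);
        }
        return 0;
    }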
Open-Next Offload using OpenMP

    void main() {
      int a[];
      int ker_id;

      /* some code here */

      #pragma omp offload \
              shared (a) \
              name ("myker", ker_id) \
              nowait
      {
        #pragma omp parallel sections \
                proc_bind (spread)
        {
          #pragma omp section
          TASK_A();
          #pragma omp section
          TASK_B();
        }
      }

      /* some more code here */

      #pragma omp wait (ker_id)
    }

    TASK_A() {
      int i;
      #pragma omp parallel proc_bind (close)
      #pragma omp for
      for ( i = 0; …. )
        do_smthg();
    }

• offload: new OpenMP directive used to offload the execution of a code block to the accelerator.
• shared clause: specifies data that needs to be shared between the host and the accelerator.
• name clause (new): retrieves a handle (an ID) to the kernel instance at run time, necessary to wait for asynchronous offloads.
• nowait clause: specifies an asynchronous offload.
• All standard OpenMP constructs and the custom extensions can be used within an offload block.
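For reference, the standard OpenMP accelerator model expresses a similar asynchronous offload with the target directive (introduced in OpenMP 4.0; the nowait clause on target is standard since 4.5). A minimal sketch using only standard clauses, not the Open-Next extensions:

    #include <stdio.h>
    #define N 1024

    int main(void) {
        int a[N];
        for (int i = 0; i < N; ++i) a[i] = i;

        /* Deferred target task: the host continues past the offload. */
        #pragma omp target map(tofrom: a) nowait
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] *= 2;

        /* ... host code overlapping with the offload ... */

        #pragma omp taskwait   /* wait for the asynchronous target region */
        printf("a[1] = %d\n", a[1]);
        return 0;
    }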
Open-Next Offload using OpenMP: early evaluation
[Charts: measured speedup (0 to 40) vs. number of kernel repetitions for six kernels: FAST, CT, Mahala., Strassen, NCC, SHOT.]
"Simplifying Heterogeneous Embedded SoC Programming with Directive-Based Offload", Marongiu, Capotondi, Tagliavini, Benini, IEEE Transactions on Industrial Informatics, 2015.
Accelerator Resource Sharing

[Diagram: legacy applications and multiple programming models (OpenCL, OpenMP, TBB, OpenVX) sit on top of a low-level runtime and a hardware abstraction layer.]
• There is no dominant standard parallel programming model (PPM).
• Goal: improve the overall utilization of accelerators in multi-user environments.
• On PMCAs, runtime environments (RTEs) are typically developed on top of bare metal.
Accelerator Resource Sharing

[Diagram: applications O1, O2, O3, ..., ON, written in OpenCL, OpenMP, and OpenVX, run on the host; a driver with lightweight spatial partitioning support exposes the PMCA as multiple virtual accelerators.]
Runtime Efficiency: Computer Vision Use-Case
• ORB object detector (OpenCL, 4 clusters) [1]
• Face detector (OpenCL, 1 cluster) [2]
• FAST corner detector (OpenMP, 1 cluster) [3]
• Removed-object detector (OpenMP, 4 clusters) [4]
[1] Rublee, Ethan, et al. "ORB: an efficient alternative to SIFT or SURF." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[2] Jones, Michael, et al. "Fast multi-view face detection." Mitsubishi Electric Research Lab TR-2003-96 (2003).
[3] Rosten, et al. "Faster and better: A machine learning approach to corner detection." IEEE Transactions on Pattern Analysis and Machine Intelligence 32.1 (2010): 105-119.
[4] Magno, Michele, et al. "Multimodal abandoned/removed object detection for low power video surveillance systems." Advanced Video and Signal Based Surveillance (AVSS '09), Sixth IEEE International Conference on. IEEE, 2009.
[Chart: runtime efficiency (% vs. ideal, 0% to 100%) against number of frames (10 to 10000), comparing MPM-MO, SPM-MO (0%, 25%, 50%, 100%), and SPM-SO. The proposed approach is 90% efficient w.r.t. ideal: +30% efficiency w.r.t. SPM-MO and +40% w.r.t. SPM-SO.]
Heterogeneous Unified Shared Memory
• Shared memory for accelerators in embedded SoCs: there is no clear view of the practical implementation aspects and performance implications of virtual memory support.
• Today's reality: memory partitioning.
  – Coherent virtual memory for the host.
  – The accelerator can only access a contiguous section of shared main memory; no virtual memory.
  – Explicit data management involving copies: limited programmability, low performance (see the sketch below).
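A hedged sketch of what this copy-based flow looks like on the host side; contig_alloc, contig_free, and accel_run are hypothetical placeholders for a platform's contiguous-memory allocator and kernel launch, not an actual API from the slides:

    #include <string.h>
    #include <stdlib.h>

    /* Hypothetical platform calls: allocate/free a physically contiguous
       buffer the accelerator can access, and launch a kernel on it. */
    extern void *contig_alloc(size_t bytes);
    extern void  contig_free(void *p);
    extern void  accel_run(void *buf, size_t bytes);

    void offload_with_copies(int *data, size_t n) {
        size_t bytes = n * sizeof(int);

        /* Copy-in: host virtual memory -> contiguous shared section. */
        int *buf = contig_alloc(bytes);
        memcpy(buf, data, bytes);

        accel_run(buf, bytes);      /* accelerator works on the copy */

        /* Copy-out: results back into host virtual memory. */
        memcpy(data, buf, bytes);
        contig_free(buf);
    }

With an IOMMU providing heterogeneous unified shared memory, both memcpy calls and the dedicated allocator disappear: the accelerator dereferences the host's virtual pointer directly.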
Heterogeneous Unified Shared Memory
Open-Next goal: lightweight virtual memory support
• Sharing of virtual address pointers
• Transparent to the application developer
• Zero-copy offload, higher predictability
• Low complexity, low area, low cost
Heterogeneous Unified Shared Memory
• Heterogeneous systems increase computing power and energy efficiency.
• Host and accelerator communicate via coherent shared memory.
• IOMMU for hUMA (heterogeneous uniform memory access) in high-end SoCs.
• The host executes control-intensive and sequential tasks; highly parallel tasks are offloaded to the accelerator at fine granularity.
• OFFLOAD with ZERO-COPY (transparent) virtual pointer sharing: moves the complexity from the software to the hardware.
Not only many-core accelerators: FPGA CNN deep-learning accelerator.
HETEROGENEOUS UNIFIED SHARED MEMORY: Low-Cost IOMMU
• Host: dual-core ARM Cortex-A9, Linux kernel 3.13
• Accelerator: PULP implemented in the FPGA (http://www.pulp-platform.org/)
• First open-source RISC-V core
Open-Next CIRI-ICT Activities
• Identification of the reference programming model for the multi- and many-core platforms implementing the project's use cases
• Implementation of software mechanisms to ease programming and to make data exchange more efficient in heterogeneous "shared memory" architectures composed of a host with virtual memory support and accelerators without virtual memory support (e.g., GPU, DSP, FPGA)
• Implementation of software mechanisms for the high-level management of functions accelerated through dedicated hardware (FPGA)
• Identification of possible extensions to the programming model for the next generation of real-time industrial plants
• Porting of significant kernels extracted from the applications implementing the use cases, and performance analysis
Open-Next CIRI-ICT Unibo
• Your industrial use cases!
• More than 10 years of experience in embedded many-core programming
• 36 person-months on industrial use-case exploration
• Move from workstations to efficient embedded systems!
High-Performance, Low-Power Data Processing on Embedded Many-Core and FPGA Architectures