TRANSCRIPT
High Performance Embedded Systems
July 2020
Electronics Engineering Department
Electronics Master Program
MPSoCs
Outline
2
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
3
Multiprocessors Architecture and Taxonomy
Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/
Intel 4004 Core i9??
4
Multiprocessors Architecture and Taxonomy
Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/
Intel 4004 Core i9
5
Multiprocessors Architecture and Taxonomy
Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/
Exynos 7420 finFET transistors
6
Multiprocessors Architecture and Taxonomy
Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/
Exynos 7420 finFET transistors
7
Multiprocessors Architecture and Taxonomy
Taken from: https://www.researchgate.net/publication/257711815_Where_Photovoltaics_Meets_Microelectronics/figures?lo=1
8
Multiprocessors Architecture and Taxonomy
Taken from: https://www.semiconductor-digest.com/2020/03/10/transistor-count-trends-continue-to-track-with-moores-law/
9
Multiprocessors Architecture and Taxonomy
Taken from: https://www.elprocus.com/difference-between-soc-system-on-chip-single-board-computer/
SoC
10
Multiprocessors Architecture and Taxonomy
Taken from: http://soc.inha.ac.kr/index.php/Project
2-Parallel Radix-2^4 FFT/IFFT Processor Chip for MB-OFDM UWB Communications
11
Multiprocessors Architecture and Taxonomy
Taken from: PrSoC: Programmable System-on-chip (SoC) for silicon prototyping IEEE 2008
12
Multiprocessors Architecture and Taxonomy
Taken from: https://www.elprocus.com/difference-between-soc-system-on-chip-single-board-computer/
SoC
MPSoC
13
Multiprocessors Architecture and Taxonomy
Taken from: https://commons.wikimedia.org/wiki/File:ARM-Cortex-A9.gif
MPSoCs?
14
Multiprocessors Architecture and Taxonomy
SoC
Taken from: W. Wolf Multiprocessor Systems-On-Chip
• Is an integrated circuit that implements
most or all of the functions of a
complete electronic system.
• The most fundamental characteristic of
an SoC is complexity.
15
Multiprocessors Architecture and Taxonomy
SoC
Taken from: W. Wolf Multiprocessor Systems-On-Chip
Many product categories:
• Cell phones.
• Telecommunications and networking.
• Digital television.
• Video games.
• …
16
Multiprocessors Architecture and Taxonomy
SoC Example
Taken from: W. Wolf Multiprocessor Systems-On-Chip
Processing Elements
17
Multiprocessors Architecture and Taxonomy
SoC Example
Taken from: W. Wolf Multiprocessor Systems-On-Chip
Memory
18
Multiprocessors Architecture and Taxonomy
SoC Example
Taken from: W. Wolf Multiprocessor Systems-On-Chip
Communications
19
Multiprocessors Architecture and Taxonomy
SoC Example
Taken from: W. Wolf Multiprocessor Systems-On-Chip
MPSoCs?
20
Multiprocessors Architecture and Taxonomy
MPSoCs?
Wait!
What is a Parallel Architecture?
21
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
22
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
23
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
SoC
HW+SW
24
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
SoC
HW+SW
Technology has advanced
25
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
SoC
HW+SW
Technology has advanced
26
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
SoC
HW+SW
MPSoCs: technology has advanced
27
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Serial Communication
Parallel Communication
28
Multiprocessors Architecture and Taxonomy
Here we go
What are MPSoCs?
Taken from: W. Wolf Multiprocessor Systems-On-Chip
29
Multiprocessors Architecture and Taxonomy
What are MPSoCs?
“Are the latest incarnation of very large-scale integration (VLSI)
technology”
Taken from: W. Wolf Multiprocessor Systems-On-Chip
???
30
Multiprocessors Architecture and Taxonomy
What are MPSoCs?
“Are the latest incarnation of very large-scale integration (VLSI)
technology”
Taken from: W. Wolf Multiprocessor Systems-On-Chip
???
• Silicon
• Power
• Area
• …
31
Multiprocessors Architecture and Taxonomy
What are MPSoCs?
“Are the latest incarnation of very large-scale integration (VLSI)
technology”
“A single integrated circuit can contain over
100 million transistors, and the International Technology Roadmap
for Semiconductors predicts that chips with a billion transistors are
within reach”
Taken from: W. Wolf Multiprocessor Systems-On-Chip
32
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs
“The multiprocessor System-on-Chip (MPSoC) is a system-on-a-chip
(SoC) which uses multiple processors (see multi-core), usually
targeted for embedded applications”.
SoC
HW+SW
MPSoCs Understood!!
33
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs
“The multiprocessor system-on-chip (MPSoC) uses multiple CPUs
along with other hardware subsystems to implement a system”. -
Wayne Wolf.
Multiprocessor = Multicore?
34
Multiprocessors Architecture and Taxonomy
General Structure MPSoCs
Processing Elements (PE)
• Chosen in relation to the application context and requirements.
• Homogeneous MPSoCs.
• Heterogeneous MPSoCs.
• Interconnection Elements
• Buses.
• NoCs (Networks on Chip). More information here.
Taken from: M. Aguilar MPSoCs
35
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Advantages of MPSoCs
• Performance
• Powerful platform (cores).
• Users.
• Applications.
• Tasks within the same application.
Power Consumption
• Lower power through the parallel approach.
36
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
37
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoC Benefits
• Wireless.
• Multimedia: video and audio.
• Health.
• Military.
• Avionics.
• Aerospace.
38
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Multiprocessor = Multicore?
39
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Multiprocessor
• Platform with several CPUs.
• A parallel approach is used.
Multicore
• Platform with only one CPU.
• Multiple cores inside the CPU.
40
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs Software
41
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Parallel Approaches
Parallel
Approaches
42
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Parallel Approaches
Parallel
Approaches
Bits
Instructions
Data
Tasks
Threads
Data
43
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs Architecture?
44
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs
PEs
45
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs
Homogeneous Heterogeneous
PEs
46
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Heterogeneous MPSoCs
• Different PEs, for example:
• GPUs (Graphics Processing Units).
• DSPs.
• HW accelerators.
• NoC infrastructure.
• Better performance and power consumption.
• Used in embedded systems:
• Portable systems.
• Power consumption.
47
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Homogeneous MPSoCs
• Identical PEs make up the SoC.
• A single PE is instantiated several times.
• The instances are connected by a communication infrastructure.
• Flexibility and scalability.
• Worse power consumption.
48
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs Taxonomy?
49
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Processor Organization
• Serial
  • SISD
    • Uniprocessor
    • Overlapped operations
    • Multi-ALU
• Parallel
  • SIMD
    • Vector processor
    • Array processor
  • MISD
  • MIMD
    • Tightly coupled (shared memory)
      • Symmetric multiprocessor (SMP)
      • Nonuniform memory access (NUMA)
    • Loosely coupled (distributed memory)
      • Clusters
50
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Where are located MPSoCs?
51
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Processor Organization
• Serial
  • SISD
    • Uniprocessor
    • Overlapped operations
    • Multi-ALU
• Parallel
  • SIMD
    • Vector processor
    • Array processor
  • MISD
  • MIMD
    • Tightly coupled (shared memory)
      • Symmetric multiprocessor (SMP)
      • Nonuniform memory access (NUMA)
    • Loosely coupled (distributed memory)
      • Clusters
MPSoCs sit in the MIMD branch.
52
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs and Parallel Computing Lectures Notes
MIMD
• This architecture executes
different operations over
different data streams.
• The multiprocessing approach, and
MPSoCs with it, is located in this
category.
53
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs Architecture
• PEs: Homogeneous / Heterogeneous.
• Memory Access: Uniform Access (UMA) / Non-Uniform Access (NUMA).
• Processors Symmetry: SMP (Symmetric Multi-processing) / AMP (Asymmetric Multi-processing).
• Memory Architecture: Shared Memory / Distributed Memory.
54
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
ARM Cortex A9
55
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Analog Devices - Blackfin
56
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
TI Davinci DM355
57
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
TI OMAP5
58
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
ST Microelectronics Nomadik
59
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Nexperia
60
Multiprocessors Architecture and Taxonomy
Taken from: http://linuxgizmos.com/new-arm-cortex-a72-nearly-twice-as-fast-as-cortex-a57/
Cortex-A72
Outline
61
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
62
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
63
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others
All these can be implemented on any architecture.
64
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others
All these can be implemented on any architecture.
65
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Shared Memory
• Tasks share a common address space, which they read and write
asynchronously.
• Various mechanisms such as locks/semaphores may be used to control access to
the shared memory.
• Advantages
• No need to explicitly communicate data between tasks, which simplifies programming.
• Disadvantages
• Care is needed when managing memory to avoid synchronization conflicts.
• Harder to control data locality.
66
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
In Hardware
• Shared memory systems use:
• UMA (Uniform Memory Access)
• NUMA (Non- Uniform Memory
Access)
• COMA (Cache-only memory
architecture)
In Software
• Inter-process communication (IPC).
• Virtual memory mapping (a minimal sketch follows).
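As a software-side illustration, here is a minimal shared-memory sketch in C, assuming a POSIX system; the object names /demo_shm and /demo_sem are hypothetical, and a real program would check every return value.

```c
/* Minimal POSIX shared-memory sketch: a counter in a shared object,
 * guarded by a named semaphore (the lock/semaphore idea above). */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Create (or open) a named shared-memory object and size it. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(int));
    int *counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

    /* A named semaphore controls access to the shared location. */
    sem_t *lock = sem_open("/demo_sem", O_CREAT, 0600, 1);

    sem_wait(lock);                  /* enter critical section */
    (*counter)++;                    /* shared read-modify-write */
    printf("counter = %d\n", *counter);
    sem_post(lock);                  /* leave critical section */

    munmap(counter, sizeof(int));
    close(fd);
    sem_close(lock);
    return 0;
}
```

Run two instances of this program concurrently and the semaphore keeps their updates from interleaving; compile on Linux with `gcc demo.c -lrt -pthread`.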
67
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others
All these can be implemented on any architecture.
68
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Threads
• A thread can be considered as a
subroutine in the main program.
• Threads communicate with each other
through the global memory.
• Commonly associated with shared
memory architectures and operating
systems.
• POSIX Threads (pthreads).
• OpenMP.
A short pthreads sketch follows.
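A minimal pthreads sketch of the idea above, with illustrative names: each worker thread behaves like a subroutine of the main program and communicates through global memory.

```c
/* Each thread writes its own slot of a global array; main() joins the
 * threads and combines the results through the shared memory. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static int partial[NTHREADS];          /* global memory shared by threads */

static void *worker(void *arg) {
    int id = *(int *)arg;
    partial[id] = id * id;             /* per-thread work */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    int sum = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        sum += partial[i];             /* read results from global memory */
    }
    printf("sum = %d\n", sum);
    return 0;
}
```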
69
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Threads
Advantages
• Responsiveness.
• Faster execution.
• Lower resource consumption.
• Better system utilization.
• Simplified sharing and communication.
• Parallelization.
Drawbacks
• Synchronization complexity.
• A crashing thread can bring down the whole process.
70
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others.
All these can be implemented on any architecture.
71
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Message Passing
• A set of tasks that use their own local memory
during computation.
• Data exchange through sending and receiving
messages.
• Data transfer usually requires cooperative
operations to be performed by each process.
• For example, a send operation must have a
matching receive operation.
• MPI.
• A minimal send/receive sketch follows.
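A minimal MPI sketch of this cooperation, assuming an MPI implementation is installed: rank 0's send is matched by rank 1's receive.

```c
/* Two tasks with private local memory exchange one integer by message
 * passing; MPI_Send on one side must be matched by MPI_Recv on the other. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 42;                       /* data in local memory */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", payload);
    }
    MPI_Finalize();
    return 0;
}
```

Launch with at least two tasks, e.g. `mpirun -np 2 ./demo`.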
72
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others.
All these can be implemented on any architecture.
73
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Data Parallel
• Consider the following characteristics:
• Parallel work performs operations on a data set,
organized into a common structure.
• Tasks work collectively on the same data structure,
with each task working on a different partition.
• Tasks perform the same operation on their partition.
• On shared memory architectures, all tasks may have
access to the data structure through global memory.
• On distributed memory architectures, the data structure is
split up and resides as “chunks” in the local memory
of each task.
• More information here; a short data-parallel OpenMP sketch follows.
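A short data-parallel sketch in C with OpenMP (one possible realization of the pattern): every thread performs the same operation on its own partition of a shared array.

```c
/* The parallel-for splits the iteration space into partitions; each
 * thread applies the same scaling operation to its chunk of a[]. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N];                  /* common data structure */

int main(void) {
    #pragma omp parallel for         /* same op, different partitions */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```

Compile with `gcc -fopenmp demo.c`.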
74
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others
All these can be implemented on any architecture.
75
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Hybrid
• Combines several models (for example, OpenMP with MPI).
• Single Program Multiple Data (SPMD)
• A single program is executed by all tasks simultaneously.
• Multiple Program Multiple Data (MPMD)
• Uses multiple executables; each task can execute the same program as, or a
different program from, the other tasks.
A hybrid MPI+OpenMP sketch follows.
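A hedged hybrid sketch in SPMD style, combining MPI tasks with OpenMP threads: one program, executed by every task, with threads inside each task.

```c
/* SPMD hybrid: message-passing tasks across nodes, shared-memory
 * threads within each task. No MPI calls happen inside the parallel
 * region, so plain MPI_Init suffices here. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("task %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}
```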
76
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others. (Depends on the architecture)
All these can be implemented on any architecture.
77
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Others
• MCAPI (Multicore Association)
• Poly-Platform
• CUDA
78
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Others
• MCAPI (Multicore Association)
• Poly-Platform
• CUDA
79
Parallel Execution Mechanism
Taken from: https://en.wikipedia.org/wiki/Multicore_Association
MCAPI (Multicore Association)
• The Multicore Association was founded in 2005.
• Its first specification is referred to as MCAPI.
• Based on message passing.
• Targets systems that are heterogeneous in hardware, toolchain, and
programming language.
• Active working groups:
• MCAPI.
• Virtualization.
• Open Asymmetric Multiprocessing (OpenAMP).
80
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Others
• MCAPI (Multicore Association)
• Poly-Platform
• CUDA
81
Parallel Execution Mechanism
Taken from: http://polycoresoftware.com/poly-platform
Poly-Platform
• A collection of productivity tools.
• Supports the migration process.
• Its main focus is multicore platforms.
• Provides driver support for several SoCs, operating systems, and transports.
82
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Others
• MCAPI (Multicore Association)
• Poly-Platform
• CUDA
83
Parallel Execution Mechanism
Taken from: https://en.wikipedia.org/wiki/CUDA
CUDA
• Initial release 2007.
• Parallel computing platform and
application programming interface.
• Created by NVIDIA.
• GPU-based approach.
• Supported on Windows, Linux, and
macOS.
Outline
84
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
85
Multiprocessors Design Techniques
Taken from: W.Wolf High-Performance Embedded Computing
Embedded Systems Design Flows
• Co-design flows.
• Platform-based design.
• Two-stage process.
• Programming platforms.
• Standards-Based design.
MPSoCs?
86
Multiprocessors Design Techniques
Challenges
• Software development is a major challenge for MPSoC designers.
• Software that runs on the multiprocessor must be high performance, real time,
and low power.
• Each MPSoC requires its own software development environment: compiler,
debugger, simulator, and other tools.
• Better understanding of how to abstract tasks properly to capture the essential
characteristics of their low-level behavior for system-level analysis.
Taken from: W. Wolf Multiprocessor Systems on Chip
87
Multiprocessors Design Techniques
Taken from: W. Wolf Multiprocessor Systems on Chip
Challenges
• Networks-on-chips have emerged over the past few years as an architectural
approach to the design of single-chip multiprocessors.
• FPGAs have emerged as a viable alternative to application-specific integrated
circuits (ASICs) in many markets. FPGA fabrics are also starting to be
integrated into SoCs.
88
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Challenges
• Sequential C code is not easy to replace.
• Algorithm specifications contain parallelism (models of computation:
KPN, SDF, etc.).
• Avoid introducing new programming languages.
• Automatic parallelization and parallel programming.
• Platform-based design (SW synthesis) or combined SW and HW synthesis.
89
Multiprocessors Design Techniques
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
Challenges
All MPSoC designs have the following requirements:
• Speed.
• Power.
• Area.
• Application Performance.
• Time to market.
90
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
MPSoCs Programming
• Task mapping to processors or cores.
• Inter-processor communication management.
• Data transfer engine management.
• Shared resource management.
• Memory management
• Debugging.
91
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
MPSoCs Exploration
• Separate computation from communication.
92
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Virtual Processing Unit VPU
• Load simulator: It is a high-level simulation of
the core behavior.
• Functional simulator: Native execution of
tasks; scheduling is handled by the VPU OS.
93
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Virtual Processing Unit VPU
Allows spatial and temporal modeling of task mapping to PEs
94
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Virtual Platform
• It is a software model that allows the exploration of hardware and software.
• It allows hardware platform exploration and optimization.
• Software development, debugging and optimization.
• Concurrent hardware and software design.
95
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Virtual Platform
• Requirements:
• High speed in terms of simulation process.
• Compromise between simulation speed and precision.
• Flexibility.
• Usability by developers not experts in hardware.
96
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
97
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
98
Multiprocessors Design Techniques
Core-based Strategy
• Core-based synthesis strategy for the IBM CoreConnect bus.
• Coral tool automates many of the tasks required to stitch together multiple
cores using virtual components.
• Each virtual component describes the interfaces for a class of real
components.
• Coral can synthesize some combinational logic.
• Coral also checks the connections between cores using Boolean decision
diagrams.
Taken from: W.Wolf High-Performance Embedded Computing
99
Multiprocessors Design Techniques
Core-based Strategy
CoreConnect provides three types of buses:
• A high-speed processor local bus (PLB).
• An on-chip peripheral bus (OPB).
• A device control register (DCR) bus for configuration and status information.
Taken from: W.Wolf High-Performance Embedded Computing
100
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Core-based Strategy
101
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
102
Multiprocessors Design Techniques
Wrappers
• Treats both hardware and software as
components.
• A wrapper is a design unit that interfaces a
module to another module.
• A wrapper can be hardware or software
and may include both.
• The wrapper performs only low-level
adaptations, such as protocol
transformation.
Taken from: W. Wolf High-Performance Embedded Computing
103
Multiprocessors Design Techniques
Wrappers
Heterogeneous multiprocessors introduce several types of problems:
• Many chips have multiple communication networks to match the network to
the processing needs. Synchronizing communication across network
boundaries is more difficult than communicating within a network.
• Specialized hardware is often needed to accelerate interprocess
communication and free the CPU for more interesting computations.
• The communication primitives should be at a higher level of abstraction than
shared memory.
Taken from: W.Wolf High-Performance Embedded Computing
104
Multiprocessors Design Techniques
Wrappers
When a dedicated CPU is added to the system, its software must be adapted
in several ways:
1. The software must be updated to support the platform’s communication
primitives.
2. Optimized implementations of the host processor’s communication
functions must be provided for interprocessor communication.
3. Synchronization functions must be provided.
Taken from: W.Wolf High-Performance Embedded Computing
105
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
106
Multiprocessors Design Techniques
System-Level Design
• An abstract platform is created from a combination of system requirements,
models of the software, and models of the hardware components.
• Abstract platform is analyzed to determine the application’s performance
and power/energy consumption.
• Based on the results of this analysis, software is allocated and scheduled
onto the platform.
• The result is a golden abstract architecture that can be used to build the implementation.
Taken from: W.Wolf High-Performance Embedded Computing
107
Multiprocessors Design Techniques
System-Level Design
Taken from: W.Wolf High-Performance Embedded Computing
108
Multiprocessors Design Techniques
System-Level Design
Major elements of an abstract architecture:
1. Software tasks are described by their data and
scheduling dependencies; they
interface to an API.
2. Hardware components consist of a core and an
interface.
3. The hardware/software integration is modeled by
the communication network that connects the CPUs
that run the software and the hardware IP
cores.
Taken from: W.Wolf High-Performance Embedded Computing
109
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
110
Multiprocessors Design Techniques
Platform-based Design
• Design space: platform selection
• Platform programming
• Multi-CPUs
• Concurrency
• Real-Time
• The platform developer must provide
tools (compilers, editors,
debuggers, simulators, etc.).
Taken from: Introduction to Embedded Systems
111
Multiprocessors Design Techniques
Platform-based Design
• Start with functional specifications:
• Task graphs (see the sketch below).
• Nodes: tasks to complete.
• Edges: communication and
dependences between tasks.
• Execution times annotated on the nodes.
• Data volumes communicated on the edges.
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
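One possible C encoding of such a task graph is sketched below; the type names (Task, Edge, TaskGraph) and all numbers are illustrative assumptions, not part of the slides.

```c
/* Nodes carry execution times; edges carry the data volume communicated
 * between dependent tasks. */
#include <stdio.h>

typedef struct {
    const char *name;
    double exec_time_ms;      /* execution time annotated on the node */
} Task;

typedef struct {
    int src, dst;             /* dependence between two tasks */
    int bytes;                /* data communicated on the edge */
} Edge;

typedef struct {
    Task tasks[8];
    Edge edges[8];
    int n_tasks, n_edges;
} TaskGraph;

int main(void) {
    /* Tiny two-task graph: t0 produces 1024 bytes consumed by t1. */
    TaskGraph g = {
        .tasks = { {"t0", 1.5}, {"t1", 3.0} },
        .edges = { {0, 1, 1024} },
        .n_tasks = 2, .n_edges = 1
    };
    for (int i = 0; i < g.n_edges; i++)
        printf("%s -> %s : %d bytes\n",
               g.tasks[g.edges[i].src].name,
               g.tasks[g.edges[i].dst].name, g.edges[i].bytes);
    return 0;
}
```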
112
Multiprocessors Design Techniques
Platform-based Design
• Map tasks onto pre-designed HW.
• Use an extended task graph for SW and
communication.
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
113
Multiprocessors Design Techniques
Platform-based Design
• Map tasks onto pre-designed HW.
• Use an extended task graph for SW and
communication.
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
114
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
115
Multiprocessors Design Techniques
Component Based Design
• Conceptual MPSoC platform.
• SW, Processor, IP, Communication
Fabric.
• Parallel Development
• Use APIs.
• Quicker time to market.
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
116
Multiprocessors Design Techniques
Component Based Design
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
117
Multiprocessors Design Techniques
Multicore Application Programming Studio (MAPS)
• Developed at RWTH Aachen University in Germany.
• It is a platform that offers tools and technologies for MPSoC programming.
• Main features are:
• Sequential C code partitioning.
• Parallel programming model.
• Mapping and scheduling.
• Different types of applications.
• Functional Verification (Virtual Platform).
• Multiple applications environment.
• IDE easy to use.
Taken from: M. Aguilar SoC Lectures Notes
118
Multiprocessors Design Techniques
MAPS Flow
Taken from: M. Aguilar SoC Lectures Notes
119
Multiprocessors Design Techniques
MAPS Flow
Taken from: M. Aguilar SoC Lectures Notes
120
Multiprocessors Design Techniques
MAPS Programming Model: C for Process Networks (CPN)
• Embedded systems have traditionally been programmed in C.
• CPN is a language developed as an extension of ANSI C in order to
describe process networks (KPN and SDF).
• A compiler called cpn-cc performs a source-to-source transformation that
converts CPN code into standard C code using the APIs of the target
architecture.
Taken from: M. Aguilar SoC Lectures Notes
121
Multiprocessors Design Techniques
MAPS Programming Model: C for Process Networks (CPN)
Taken from: M. Aguilar SoC Lectures Notes
122
Multiprocessors Design Techniques
MAPS Virtual Platform (MVP)
• MAPS Virtual Platform (MVP)
• High level: abstract PEs based on SystemC.
• Low level: ISS-based (Instruction Set Simulator) virtual platform.
• “mPhone”: a virtual smartphone.
Taken from: M. Aguilar SoC Lectures Notes
123
Multiprocessors Design Techniques
Virtual Processing Element
• It is a parameterizable processing element.
• Clock frequency.
• Type (RISC, VLIW, DSP, etc).
• Scheduling algorithm (Round robin, EDF, based on priorities, etc).
Taken from: M. Aguilar SoC Lectures Notes
Outline
124
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
125
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems
126
Memory Systems
Memory Systems
• The memory system is a traditional bottleneck in computing.
• Not only are memories slower than processors, but processor clock rates
are increasing much faster than memory cycle times.
Taken from: W. Wolf High-Performance Embedded Computing and
https://www.taringa.net/+serviciotecnico/consulta-cuello-de-botella-cpu-debil-en-gpu-potente_15casq
127
Memory Systems
Memory Systems
Taken from: Multi-core architectures
128
Memory Systems
Memory Systems
Taken from: MPSoCs Hardware platforms Lectures Notes
129
Memory Systems
Memory Systems
• Start with a look at parallel memory systems in scientific multiprocessors.
• Consider models for memory and motivations for heterogeneous memory
systems.
• Look at what sorts of consistency mechanisms are needed in embedded
multiprocessors.
Taken from: W. Wolf High-Performance Embedded Computing
130
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems
Homogeneous Heterogeneous
131
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems
Homogeneous Heterogeneous
132
Memory Systems
Memory Systems
To understand memory systems, consider the following case study:
• Scientific processors traditionally use parallel, homogeneous memory
systems to increase system performance.
• Multiple memory banks allow several memory accesses to occur
simultaneously.
Taken from: W. Wolf High-Performance Embedded Computing
133
Memory Systems
Memory Systems
• Each bank is separately addressable.
Taken from: W. Wolf High-Performance Embedded Computing
134
Memory Systems
Memory Systems
• If the memory system has n banks,
then n accesses can be performed in
parallel.
• This is known as the peak access
rate.
Taken from: W. Wolf High-Performance Embedded Computing
135
Memory Systems
Memory Systems
• Real programs cannot keep the memory busy all of
the time.
• A simple statistical model lets us
estimate performance of a random-
access program.
Taken from: W. Wolf High-Performance Embedded Computing
136
Memory Systems
Memory Systems
• Assume that the program accesses a
certain number of sequential
locations, then moves to some other
location.
• Where:
• λ is the probability of a
nonsequential memory access (a
branch in the code or a nonconsecutive
data location).
• k is the number of sequential accesses in a run.
Taken from: W. Wolf High-Performance Embedded Computing
137
Memory Systems
Memory Systems
• Where the probability of a run of exactly k sequential accesses is:
$p(k) = \lambda\,(1-\lambda)^{k-1}$
• And the mean length of a sequential
access sequence over $m$ banks is:
$L_b = \dfrac{1-(1-\lambda)^m}{\lambda}$
(a worked example follows)
Taken from: W. Wolf High-Performance Embedded Computing
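A worked example of the model above, with assumed numbers that are not from the slides ($\lambda = 0.1$, $m = 8$ banks):

```latex
L_b = \frac{1-(1-\lambda)^m}{\lambda}
    = \frac{1-0.9^{8}}{0.1}
    \approx 5.70
```

So with these assumed values a program performs, on average, about 5.7 consecutive accesses before branching away; a designer could size the bank interleaving accordingly.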
138
Memory Systems
Memory Systems
• Use program statistics to estimate
the average probability of
nonsequential accesses, design the
memory system accordingly.
• Use software techniques to
maximize the length of access
sequences wherever possible.
Taken from: W. Wolf High-Performance Embedded Computing
139
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems
Homogeneous Heterogeneous
140
Memory Systems
Memory Systems
• Embedded systems can make use of multiple-bank memory systems, but they
also make use of more heterogeneous memory architectures.
• They do so to improve the real-time performance and lower the power
consumption of the memory system.
Taken from: W. Wolf High-Performance Embedded Computing
141
Memory Systems
Memory Systems
Why do heterogeneous memory systems
improve real-time performance?
Taken from: W. Wolf High-Performance Embedded Computing
142
Memory Systems
Memory Systems
• The energy required to perform a memory access depends in part on the size of
the memory block being accessed.
• A heterogeneous memory may be able to use smaller memory blocks, reducing
the access time.
• Energy per access also depends on the number of ports on the memory block.
• By reducing the number of units that can access a given part of memory, the
heterogeneous memory system can reduce the energy required to access that
part of the memory space.
Taken from: W. Wolf High-Performance Embedded Computing
143
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems
Homogeneous Heterogeneous
Consistent Memory Systems
144
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Consistent Memory Systems
• Shared variables
• Cache consistency
• Snooping caches
145
Memory Systems
Memory Systems
• Shared variables
• We must worry about whether two processors see the same state of a shared variable.
• If the reads and writes of two processors are interleaved, one processor may overwrite
the variable just after another has written it, causing that processor to erroneously
assume the value of the variable.
• Use critical sections, guarded by semaphores, to ensure that critical operations occur in
the right order.
• Use atomic test-and-set operations (often called spin locks) to guard small pieces of
memory (a sketch follows).
Taken from: W. Wolf High-Performance Embedded Computing
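A minimal sketch of such a spin lock built on C11 atomics; an embedded MPSoC port would substitute the platform's own test-and-set primitive.

```c
/* atomic_flag_test_and_set is the atomic test-and-set; a thread spins
 * until the flag was previously clear, then owns the critical section. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int shared_counter;               /* the guarded shared variable */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        while (atomic_flag_test_and_set(&lock))
            ;                            /* spin until the lock is free */
        shared_counter++;                /* critical section */
        atomic_flag_clear(&lock);        /* release */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %d\n", shared_counter);   /* always 200000 */
    return 0;
}
```

Compile with `gcc -std=c11 -pthread demo.c`.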
146
Memory Systems
Memory Systems
• Cache consistency
• If two processors access the same
memory location, then each may have
a copy of the location in its own cache.
• If one processing element writes that
location, then the other will not
immediately see the change and will
make an incorrect computation.
Taken from: W. Wolf High-Performance Embedded Computing
147
Memory Systems
Memory Systems
• Snooping Cache
• This type of cache contains extra
logic that watches the
multiprocessor interconnect for
memory transactions.
• When it sees a write to a location
that it currently contains, it
invalidates that location.
Taken from: W. Wolf High-Performance Embedded Computing
148
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems Architecture
• Shared memory
• Distributed memory
• Hybrid memory
149
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems Architecture
• Shared memory
• Distributed memory
• Hybrid memory
150
Memory Systems
Memory Systems
• Shared Memory
• Shared memory parallel computers vary
widely, but generally have in common the
ability for all processors to access all
memory as global address space.
• Multiple processors can operate
independently but share the same memory
resources.
Taken from: W. Wolf High-Performance Embedded Computing,
https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
151
Memory Systems
Memory Systems
• Shared Memory
• Changes in a memory location effected by
one processor are visible to all other
processors.
• Historically, shared memory machines
have been classified as UMA and NUMA,
based upon memory access times.
Taken from: W. Wolf High-Performance Embedded Computing,
https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
152
Memory Systems
Memory Systems
• Shared Memory (Uniform Memory
Access UMA)
• Most commonly represented today by
Symmetric Multiprocessor (SMP)
machines.
• Identical processors.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
153
Memory Systems
Memory Systems
• Shared Memory (Uniform Memory
Access UMA)
• Equal access and access times to
memory.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
154
Memory Systems
Memory Systems
• Shared Memory (Uniform Memory Access
UMA)
• Sometimes called CC-UMA - Cache
Coherent UMA. Cache coherent means if one
processor updates a location in shared
memory, all the other processors know about
the update. Cache coherency is accomplished
at the hardware level.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
155
Memory Systems
Memory Systems
• Shared Memory (Non-Uniform Memory
Access NUMA)
• Often made by physically linking two or
more SMPs.
• One SMP can directly access memory of
another SMP.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
156
Memory Systems
Memory Systems
• Shared Memory (Non-Uniform Memory
Access NUMA)
• Not all processors have equal access time to
all memories.
• Memory access across the link is slower.
• If cache coherency is maintained, then it may
also be called CC-NUMA - Cache Coherent
NUMA.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
157
Memory Systems
Memory Systems
• Shared Memory
• Advantages
• Global address space provides a user-
friendly programming perspective to
memory.
• Data sharing between tasks is both fast
and uniform due to the proximity of
memory to CPUs.
Taken from: W. Wolf High-Performance Embedded Computing,,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
158
Memory Systems
Memory Systems
• Shared Memory
• Disadvantages
• Primary disadvantage is the lack of
scalability between memory and CPUs.
Adding more CPUs geometrically
increases traffic on the shared memory-CPU
path, and for cache coherent systems,
geometrically increase traffic associated with
cache/memory management.
Taken from: W. Wolf High-Performance Embedded Computing,,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
159
Memory Systems
Memory Systems
• Shared Memory
• Disadvantages
• Programmer responsibility for
synchronization constructs that ensure
"correct" access of global memory.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
160
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems Architecture
• Shared memory
• Distributed memory
• Hybrid memory
161
Memory Systems
Memory Systems
• Distributed Memory
• Like shared memory systems, distributed
memory systems vary widely but share a
common characteristic.
• Distributed memory systems require a
communication network to connect inter-
processor memory.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
162
Memory Systems
Memory Systems
• Distributed Memory
• Processors have their own local memory.
Memory addresses in one processor do not
map to another processor, so there is no
concept of global address space across all
processors.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
163
Memory Systems
Memory Systems
• Distributed Memory
• Because each processor has its own local
memory, it operates independently.
Changes it makes to its local memory have
no effect on the memory of other
processors. Hence, the concept of cache
coherency does not apply.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
164
Memory Systems
Memory Systems
• Distributed Memory
• When a processor needs access to data in
another processor, it is usually the task of
the programmer to explicitly define how
and when data is communicated.
Synchronization between tasks is likewise
the programmer's responsibility.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
165
Memory Systems
Memory Systems
• Distributed Memory
• The network "fabric" used for data transfer
varies widely, though it can be as simple as
Ethernet.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
166
Memory Systems
Memory Systems
• Distributed Memory
• Advantages
• Memory is scalable with the number
of processors. Increase the number of
processors and the size of memory
increases proportionately.
Taken from: W. Wolf High-Performance Embedded Computing,
https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
167
Memory Systems
Memory Systems
• Distributed Memory
• Advantages
• Each processor can rapidly access its
own memory without interference and
without the overhead incurred with
trying to maintain global cache
coherency.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
168
Memory Systems
Memory Systems
• Distributed Memory
• Advantages
• Cost effectiveness: can use
commodity, off-the-shelf processors
and networking.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
169
Memory Systems
Memory Systems
• Distributed Memory
• Disadvantages
• The programmer is responsible for
many of the details associated with data
communication between processors.
• It may be difficult to map existing data
structures, based on global memory, to
this memory organization.
Taken from: W. Wolf High-Performance Embedded Computing,
https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
170
Memory Systems
Memory Systems
• Distributed Memory
• Disadvantages
• Non-uniform memory access times -
data residing on a remote node takes
longer to access than node local data.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
171
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems Architecture
• Shared memory
• Distributed memory
• Hybrid memory
172
Memory Systems
Memory Systems
• Hybrid Memory
• The largest and fastest computers in the
world today employ both shared and
distributed memory architectures.
• The shared memory component can be a
shared memory machine and/or graphics
processing units (GPU).
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
173
Memory Systems
Memory Systems
• Hybrid Memory
• The distributed memory component is
the networking of multiple shared
memory/GPU machines, which know
only about their own memory - not the
memory on another machine. Therefore,
network communications are required to
move data from one machine to another.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
174
Memory Systems
Memory Systems
• Hybrid Memory
• Current trends seem to indicate that this
type of memory architecture will
continue to prevail and increase at the
high end of computing for the
foreseeable future.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
175
Memory Systems
Memory Systems
• Hybrid Memory
• Advantages and Disadvantages
• Whatever is common to both shared and
distributed memory architectures.
• Increased scalability is an important
advantage.
• Increased programmer complexity is an
important disadvantage.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
176
Memory Systems
Design Memory Systems?
Taken from: W. Wolf High-Performance Embedded Computing,
177
Memory Systems
Design Memory Systems
A simple model of memory components for parallel memory design would include
three major parameters of a memory component of a given size.
• Area: The physical size of the logical component. This is most important in chip design, but it also
relates to cost in board design.
• Performance: The access time of the component. There may be more than one parameter, with
variations for read and write times, page mode accesses, and so on.
• Energy: The energy required per access. If performance is characterized by multiple modes, energy
consumption will exhibit similar modes. (A small sketch of this model follows.)
Taken from: W. Wolf High-Performance Embedded Computing,
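A small sketch of this three-parameter model in C; the field names and all numbers are illustrative assumptions, not values from Wolf's book.

```c
/* Each memory component is characterized by area, access time, and
 * energy per access; a smaller block costs less energy per access,
 * as the heterogeneous-memory discussion above argued. */
#include <stdio.h>

typedef struct {
    double area_mm2;        /* Area: physical size of the component */
    double access_ns;       /* Performance: access time */
    double energy_pj;       /* Energy: energy per access */
} MemComponent;

int main(void) {
    MemComponent small = { 0.5, 1.2, 5.0 };   /* assumed figures */
    MemComponent large = { 4.0, 3.5, 22.0 };

    long accesses = 1000000;
    printf("small bank: %.1f uJ\n", accesses * small.energy_pj * 1e-6);
    printf("large bank: %.1f uJ\n", accesses * large.energy_pj * 1e-6);
    return 0;
}
```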
178
Memory Systems
Design Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing,
179
Memory Systems
Memory Systems
Taken from: https://www.xataka.com/ordenadores/el-cuello-de-botella-de-la-ley-de-moore-no-esta-en-los-procesadores-sino-en-las-memorias
180
Memory Systems
Memory Systems
Taken from: https://www.xataka.com/ordenadores/el-cuello-de-botella-de-la-ley-de-moore-no-esta-en-los-procesadores-sino-en-las-memorias
Outline
181
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
182
Processors Symmetry
Taken from: W. Wolf High-Performance Embedded Computing
Multi-processing
• Symmetric (SMP)
• Asymmetric (AMP)
183
Processors Symmetry
Taken from: W. Wolf High-Performance Embedded Computing
Multi-processing
• Symmetric (SMP)
• Asymmetric (AMP)
184
Processors Symmetry
Taken from: M. Aguilar SoCs
Symmetric Multi-processing (SMP)
• A system with multiple processors or cores that communicate through a single
shared memory and are controlled by a single operating system.
185
Processors Symmetry
Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/
Symmetric Multi-processing (SMP)
• Identical: all the processors are treated equally, i.e. all are identical.
• Communication: shared memory is the mode of communication among
processors.
• Complexity: complex in design, as all units share the same memory and data
bus.
• Expensive: costlier in nature.
• Unlike asymmetric systems, where a task is done only by the master processor, here
operating system tasks are handled individually by the processors (a sketch follows).
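A small Linux-only sketch of this symmetry, assuming the GNU affinity extensions to pthreads: one worker is created per core and pinned to it, although on an SMP the scheduler could equally place the threads anywhere, since all cores are identical.

```c
/* Query the core count, then pin one thread per core via the GNU
 * pthread_attr_setaffinity_np extension. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg) {
    (void)arg;
    printf("running on core %d\n", sched_getcpu());
    return NULL;
}

int main(void) {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncores > 64) ncores = 64;
    pthread_t tid[64];

    for (long i = 0; i < ncores; i++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);       /* any core would do: all are equal */
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&tid[i], &attr, worker, NULL);
        pthread_attr_destroy(&attr);
    }
    for (long i = 0; i < ncores; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```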
186
Processors Symmetry
Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/
Symmetric Multi-processing (SMP)
• Applications
• This concept finds its application in parallel processing, where time-sharing
systems (TSS) assign tasks to different processors running in parallel
with each other, and also in TSS that use multithreading, i.e. multiple threads
running simultaneously.
187
Processors Symmetry
Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/
Symmetric Multi-processing (SMP)
• Advantages
• Throughput: since tasks can be run by all the processors, unlike in
asymmetric systems, the degree of throughput (processes executed per unit
time) increases.
• Reliability: a failing processor does not bring down the whole system, as all
processors are equally capable, though throughput drops a little.
188
Processors Symmetry
Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/
Symmetric Multi-processing (SMP)
• Disadvantages
• Complex design: since all the processors are treated equally by the OS,
designing and managing such an OS becomes difficult.
• Costlier: as all the processors share the common main memory, a larger
memory is required, which is more expensive.
189
Processors Symmetry
Taken from: https://www.enea.com/globalassets/downloads/operating-systems/enea-oseck/enea-smp-platform-for-xilinx-zynq-datasheet.pdf
Symmetric Multi-processing (SMP)
190
Processors Symmetry
Taken from: https://www.enea.com/globalassets/downloads/operating-systems/enea-oseck/enea-smp-platform-for-xilinx-zynq-datasheet.pdf
Symmetric Multi-processing (SMP)
More information here
191
Processors Symmetry
Taken from: W. Wolf High-Performance Embedded Computing
Multi-processing
• Symmetric (SMP)
• Asymmetric (AMP)
192
Processors Symmetry
Taken from: M. Aguilar SoC Lectures Notes
Asymmetric Multi-processing (AMP)
• A system with multiple processors or cores that communicate through a single
shared memory, where each processor or core is controlled by an independent
operating system (which may be the same or different).
193
Processors Symmetry
Asymmetric Multi-processing (AMP)
• Characteristics
• Processors are not treated equally.
• Operating system tasks are done by the master processor.
• No direct communication between processors, as they are coordinated by the
master processor.
• Processing follows a master-slave model.
• Systems are cheaper.
• Systems are easier to design.
Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/
194
Processors Symmetry
Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf
Asymmetric Multi-processing (AMP)
195
Processors Symmetry
Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf
Asymmetric Multi-processing (AMP)
196
Processors Symmetry
Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf
Asymmetric Multi-processing (AMP)
197
Processors Symmetry
Asymmetric Multi-processing (AMP)
Taken from: https://github.com/OpenAMP/open-amp
198
Processors Symmetry
Asymmetric Multi-processing (AMP)
Taken from: https://github.com/OpenAMP/open-amp
199
Processors Symmetry
Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf
Asymmetric Multi-processing (AMP)
Outline
200
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
201
Co-processing
Taken from: https://www.researchgate.net/publication/250840737_Automatic_Generation_of_Application-
Specific_Architectures_for_Heterogeneous_MPSoC_through_Combination_of_Processors/figures
202
Co-processing
Taken from: https://www.researchgate.net/publication/250840737_Automatic_Generation_of_Application-
Specific_Architectures_for_Heterogeneous_MPSoC_through_Combination_of_Processors/figures
203
Co-processing
Taken from: https://www.researchgate.net/publication/250840737_Automatic_Generation_of_Application-
Specific_Architectures_for_Heterogeneous_MPSoC_through_Combination_of_Processors/figures
204
Co-processing
Taken from: http://www.cecs.uci.edu/~papers/esweek06/codes/p288.pdf
205
Co-processing
Taken from: https://www.researchgate.net/publication/221656884_A_Generic_Wrapper_Architecture_for_Multi-
Processor_SoC_Cosimulation_and_Design/figures?lo=1
206
Co-processing
Taken from: https://link.springer.com/chapter/10.1007/978-3-319-01113-4_1
207
Co-processing
What is a coprocessor?
208
Co-processing
A coprocessor is:
• A computer processor used to supplement the functions of the primary processor.
• Several kinds of operations can be performed by a coprocessor, such as:
• Floating point (FPU).
• Graphics processing.
• Signal processing.
• Cryptography.
• Etc.
Taken from: https://youtu.be/xrMUv9ZVKY0
209
Co-processing
A coprocessor is:
• By offloading processor-intensive tasks from the main processor, a coprocessor can
accelerate system performance.
• Coprocessors allow a line of computers to be customized, so that customers who
do not need extra performance need not pay for it.
Taken from: https://youtu.be/xrMUv9ZVKY0
210
Co-processing
Functions
• A coprocessor may not be a general-purpose processor.
• Coprocessors cannot fetch instructions from memory, execute program flow
control instructions, do input/output operations, manage memory, and so on.
• The coprocessor requires the host (main) processor to fetch the coprocessor
instructions and handle all other operations aside from the coprocessor functions.
• In some architectures the coprocessor is a more general-purpose computer but
carries out only a limited range of functions under the close control of a
supervisory processor.
Taken from: https://youtu.be/xrMUv9ZVKY0
211
Co-processing
Taken from: https://www.doulos.com/knowhow/arm/using_your_c_compiler_to_exploit_neon/Resources/using_your_c_compiler_to_exploit_neon.pdf
Coprocessor
212
Co-processing
NEON Arm
• In the ARMv7-A architecture, ARM introduced a powerful SIMD implementation called
NEON™.
• NEON is a coprocessor which comes with its own instruction set for vector
operations.
• Most vector operations carry out the same operation on all elements of their
operand vector(s) in parallel.
• Using your C compiler to exploit NEON™ Advanced SIMD.
Taken from: https://youtu.be/xrMUv9ZVKY0
213
Co-processing
NEON Arm
• The goal of NEON is to provide a powerful, yet comparatively easy to program,
SIMD instruction set that covers integer data types of up to 64-bit width as well
as single-precision floating point (32 bit).
• NEON has no dedicated register file; instead it shares its sixteen 128-bit registers
with the vector floating-point unit.
• Executed on the same processor core, NEON performance is influenced by
context-switching overhead, non-deterministic memory access latency
(cache/MMU access), and interrupt handling (an intrinsics sketch follows).
Taken from: https://youtu.be/xrMUv9ZVKY0
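A short NEON intrinsics sketch in C, assuming an ARM toolchain (for example GCC with -mfpu=neon on ARMv7-A): a single vector instruction applies the same addition to four 32-bit float lanes at once.

```c
/* vld1q_f32 loads four lanes into a 128-bit Q register; vaddq_f32
 * adds all four lanes in one instruction; vst1q_f32 stores them back. */
#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    float32x4_t va = vld1q_f32(a);
    float32x4_t vb = vld1q_f32(b);
    float32x4_t vc = vaddq_f32(va, vb);  /* same op on every lane */
    vst1q_f32(c, vc);

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```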
214
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
215
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
216
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
217
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
218
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
219
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
220
Co-processing
DSPs
Taken from: Introduccion a los Sistemas Empotrados Lectures Notes
221
Co-processing
DSPs
Taken from: M. Aguilar SoC Lectures Notes
222
Co-processing
DSPs
Taken from: M. Aguilar SoC Lectures Notes
223
Co-processing
GPU
Taken from: https://www.anandtech.com/show/14101/nvidia-announces-jetson-nano
224
Co-processing
GPU
Taken from: https://www.anandtech.com/show/14101/nvidia-announces-jetson-nano
225
Co-processing
Flight controller UAV
Taken from: https://cdn.sparkfun.com/assets/d/d/9/9/3/Pixhawk4-DataSheet.pdf
226
Co-processing
Flight controller UAV
Taken from: https://cdn.sparkfun.com/assets/d/d/9/9/3/Pixhawk4-DataSheet.pdf
227
References
[1] Lecture Notes, Tecnologico de Costa Rica, Course SoC.
[2] W. Wolf. High-Performance Embedded Computing: Architectures, Applications
and Methodologies. Elsevier, United States of America, 2007.
[3] E. A. Lee and S. A. Seshia. Introduction to Embedded Systems, 2017.
Lecture notes and materials are available on TEC-Digital and at the web portals:
www.ie.tec.ac.cr/sarriola/HPEC
www.ie.tec.ac.cr/joaraya
228