TRANSCRIPT
High Performance Embedded Systems
July 2020
Electronics Engineering Department
Electronics Master Program
MPSoCs
Outline
2
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
3
Multiprocessors Architecture and Taxonomy
Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/
Intel 4004 Core i9??
4
Multiprocessors Architecture and Taxonomy
Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/
Intel 4004 Core i9
5
Multiprocessors Architecture and Taxonomy
Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/
Exynos 7420 finFET transistors
6
Multiprocessors Architecture and Taxonomy
Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/
Exynos 7420 finFET transistors
7
Multiprocessors Architecture and Taxonomy
Taken from: https://www.researchgate.net/publication/257711815_Where_Photovoltaics_Meets_Microelectronics/figures?lo=1
8
Multiprocessors Architecture and Taxonomy
Taken from: https://www.semiconductor-digest.com/2020/03/10/transistor-count-trends-continue-to-track-with-moores-law/
9
Multiprocessors Architecture and Taxonomy
Taken from: https://www.elprocus.com/difference-between-soc-system-on-chip-single-board-computer/
SoC
10
Multiprocessors Architecture and Taxonomy
Taken from: http://soc.inha.ac.kr/index.php/Project
2-Parallel Radix-2^4 FFT/IFFT Processor Chip for MB-OFDM UWB Communications
11
Multiprocessors Architecture and Taxonomy
Taken from: PrSoC: Programmable System-on-chip (SoC) for silicon prototyping IEEE 2008
12
Multiprocessors Architecture and Taxonomy
Taken from: https://www.elprocus.com/difference-between-soc-system-on-chip-single-board-computer/
SoC
MPSoC
13
Multiprocessors Architecture and Taxonomy
Taken from: https://commons.wikimedia.org/wiki/File:ARM-Cortex-A9.gif
MPSoCs?
14
Multiprocessors Architecture and Taxonomy
SoC
Taken from: W. Wolf Multiprocessor Systems-On-Chip
• Is an integrated circuit that implements
most or all of the functions of a
complete electronic system.
• The most fundamental characteristic of
an SoC is complexity.
15
Multiprocessors Architecture and Taxonomy
SoC
Taken from: W. Wolf Multiprocessor Systems-On-Chip
Many product categories:
• Cell phones.
• Telecommunications and networking.
• Digital television.
• Video games.
• …
16
Multiprocessors Architecture and Taxonomy
SoC Example
Taken from: W. Wolf Multiprocessor Systems-On-Chip
Processing Elements
17
Multiprocessors Architecture and Taxonomy
SoC Example
Taken from: W. Wolf Multiprocessor Systems-On-Chip
Memory
18
Multiprocessors Architecture and Taxonomy
SoC Example
Taken from: W. Wolf Multiprocessor Systems-On-Chip
Communications
19
Multiprocessors Architecture and Taxonomy
SoC Example
Taken from: W. Wolf Multiprocessor Systems-On-Chip
MPSoCs?
20
Multiprocessors Architecture and Taxonomy
MPSoCs?
Wait!
What is a Parallel Architecture?
21
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
22
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
23
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
SoC
HW+SW
24
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
SoC
HW+SW
Technology has advanced
25
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
SoC
HW+SW
Technology has advanced
26
Multiprocessors Architecture and Taxonomy
Parallel Architecture
“A large collection of processing elements that communicate and cooperate to
solve large problems fast”. - Almasi.
Taken from: M. Aguilar MPSoCs
SoC
HW+SW
MPSoCs: technology has advanced
27
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Serial Communication
Parallel Communication
28
Multiprocessors Architecture and Taxonomy
Here we go
What are MPSoCs?
Taken from: W. Wolf Multiprocessor Systems-On-Chip
29
Multiprocessors Architecture and Taxonomy
What are MPSoCs?
“Are the latest incarnation of very large-scale integration (VLSI)
technology”
Taken from: W. Wolf Multiprocessor Systems-On-Chip
???
30
Multiprocessors Architecture and Taxonomy
What are MPSoCs?
“Are the latest incarnation of very large-scale integration (VLSI)
technology”
Taken from: W. Wolf Multiprocessor Systems-On-Chip
???
• Silicon
• Power
• Area
• …
31
Multiprocessors Architecture and Taxonomy
What are MPSoCs?
“Are the latest incarnation of very large-scale integration (VLSI)
technology”
“A single integrated circuit can contain over
100 million transistors, and the International Technology Roadmap
for Semiconductors predicts that chips with a billion transistors are
within reach”
Taken from: W. Wolf Multiprocessor Systems-On-Chip
32
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs
“The multiprocessor System-on-Chip (MPSoC) is a system-on-a-chip
(SoC) which uses multiple processors (see multi-core), usually
targeted for embedded applications”.
SoC
HW+SW
MPSoCs Understood!!
33
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs
“The multiprocessor system-on-chip (MPSoC) uses multiple CPUs
along with other hardware subsystems to implement a system”. -
Wayne Wolf.
Multiprocessor = Multicore?
34
Multiprocessors Architecture and Taxonomy
General Structure MPSoCs
Processing Elements (PE)
• Chosen in relation to the application context and requirements.
• Homogeneous MPSoCs.
• Heterogeneous MPSoCs.
• Interconnection Elements
• Buses.
• NoCs (Networks on Chip). More information here.
Taken from: M. Aguilar MPSoCs
35
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Advantages of MPSoCs
• Performance
• Powerful platform (cores).
• Users.
• Applications.
• Tasks within the same application.
Power Consumption
• Lower power through the parallel approach.
36
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
37
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoC Benefits
• Wireless.
• Multimedia: video and audio.
• Health.
• Military.
• Avionics.
• Aerospace.
38
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Multiprocessor = Multicore?
39
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Multiprocessor
• Platform with several CPUs.
• A parallel approach is used.
Multicore
• Platform with only one CPU.
• Multiple cores inside the CPU.
40
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs Software
41
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Parallel Approaches
Parallel
Approaches
42
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Parallel Approaches
Parallel
Approaches
Bits
Instructions
Data
Tasks
Threads
Data
43
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs Architecture?
44
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs
PEs
45
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs
Homogeneous Heterogeneous
PEs
46
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Heterogeneous MPSoCs
• Different PEs, for example:
• GPUs (Graphics Processing Units).
• DSPs.
• HW accelerators.
• NoC infrastructure.
• Better performance and power consumption.
• Used in embedded systems:
• Portable systems.
• Power consumption.
47
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Homogeneous MPSoCs
• Identical PEs make up the SoC.
• A single PE is instantiated several times.
• The instances are connected by a communication infrastructure.
• Flexibility and scalability.
• Worse power consumption.
48
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs Taxonomy?
49
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Processor Organization
• Serial
  • SISD
    • Uniprocessor
    • Overlapped operations
    • Multi-ALU
• Parallel
  • SIMD
    • Vector processor
    • Array processor
  • MISD
  • MIMD
    • Tightly coupled (shared memory)
      • Symmetric multiprocessor (SMP)
      • Nonuniform memory access (NUMA)
    • Loosely coupled (distributed memory)
      • Clusters
50
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Where are located MPSoCs?
51
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Processor Organization
• Serial
  • SISD
    • Uniprocessor
    • Overlapped operations
    • Multi-ALU
• Parallel
  • SIMD
    • Vector processor
    • Array processor
  • MISD
  • MIMD
    • Tightly coupled (shared memory)
      • Symmetric multiprocessor (SMP)
      • Nonuniform memory access (NUMA)
    • Loosely coupled (distributed memory)
      • Clusters
MPSoCs sit in the MIMD branch.
52
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs and Parallel Computing Lectures Notes
MIMD
• This architecture executes
different operations over
different data streams.
• The multiprocessing approach, and
MPSoCs with it, is located in this
category.
53
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
MPSoCs Architecture
• PEs: Homogeneous / Heterogeneous.
• Memory Access: Uniform Access (UMA) / Non-Uniform Access (NUMA).
• Processors Symmetry: SMP (Symmetric Multi-processing) / AMP (Asymmetric Multi-processing).
• Memory Architecture: Shared Memory / Distributed Memory.
54
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
ARM Cortex A9
55
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Analog Devices - Blackfin
56
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
TI Davinci DM355
57
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
TI OMAP5
58
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
ST Microelectronics Nomadik
59
Multiprocessors Architecture and Taxonomy
Taken from: M. Aguilar MPSoCs
Nexperia
60
Multiprocessors Architecture and Taxonomy
Taken from: http://linuxgizmos.com/new-arm-cortex-a72-nearly-twice-as-fast-as-cortex-a57/
Cortex-A72
Outline
61
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
62
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
63
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others
All these can be implemented on any architecture.
64
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others
All these can be implemented on any architecture.
65
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Shared Memory
• Tasks share a common address space, which they read and write
asynchronously.
• Various mechanisms such as locks/semaphores may be used to control access to
the shared memory.
• Advantages
• No need to explicitly communicate data between tasks, which simplifies programming.
• Disadvantages
• Care is needed when managing memory to avoid synchronization conflicts.
• Harder to control data locality.
66
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
In Hardware
• Shared memory systems use:
• UMA (Uniform Memory Access)
• NUMA (Non- Uniform Memory
Access)
• COMA (Cache-only memory
architecture)
In Software
• Inter-process communication (IPC).
• Virtual memory mapping (a minimal sketch follows).
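As a software-side illustration, here is a minimal shared-memory sketch in C, assuming a POSIX system; the object names /demo_shm and /demo_sem are hypothetical, and a real program would check every return value.

```c
/* Minimal POSIX shared-memory sketch: a counter in a shared object,
 * guarded by a named semaphore (the lock/semaphore idea above). */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Create (or open) a named shared-memory object and size it. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(int));
    int *counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

    /* A named semaphore controls access to the shared location. */
    sem_t *lock = sem_open("/demo_sem", O_CREAT, 0600, 1);

    sem_wait(lock);                  /* enter critical section */
    (*counter)++;                    /* shared read-modify-write */
    printf("counter = %d\n", *counter);
    sem_post(lock);                  /* leave critical section */

    munmap(counter, sizeof(int));
    close(fd);
    sem_close(lock);
    return 0;
}
```

Run two instances of this program concurrently and the semaphore keeps their updates from interleaving; compile on Linux with `gcc demo.c -lrt -pthread`.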
67
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others
All these can be implemented on any architecture.
68
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Threads
• A thread can be considered as a
subroutine in the main program.
• Threads communicate with each other
through the global memory.
• Commonly associated with shared
memory architectures and operating
systems.
• POSIX Threads (pthreads).
• OpenMP.
A short pthreads sketch follows.
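A minimal pthreads sketch of the idea above, with illustrative names: each worker thread behaves like a subroutine of the main program and communicates through global memory.

```c
/* Each thread writes its own slot of a global array; main() joins the
 * threads and combines the results through the shared memory. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static int partial[NTHREADS];          /* global memory shared by threads */

static void *worker(void *arg) {
    int id = *(int *)arg;
    partial[id] = id * id;             /* per-thread work */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    int sum = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        sum += partial[i];             /* read results from global memory */
    }
    printf("sum = %d\n", sum);
    return 0;
}
```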
69
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Threads
Advantages
• Responsiveness.
• Faster execution.
• Lower resource consumption.
• Better system utilization.
• Simplified sharing and communication.
• Parallelization.
Drawbacks
• Synchronization complexity.
• A crashing thread can bring down the whole process.
70
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others.
All these can be implemented on any architecture.
71
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Message Passing
• A set of tasks that use their own local memory
during computation.
• Data exchange through sending and receiving
messages.
• Data transfer usually requires cooperative
operations to be performed by each process.
• For example, a send operation must have a
matching receive operation.
• MPI.
• A minimal send/receive sketch follows.
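A minimal MPI sketch of this cooperation, assuming an MPI implementation is installed: rank 0's send is matched by rank 1's receive.

```c
/* Two tasks with private local memory exchange one integer by message
 * passing; MPI_Send on one side must be matched by MPI_Recv on the other. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 42;                       /* data in local memory */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", payload);
    }
    MPI_Finalize();
    return 0;
}
```

Launch with at least two tasks, e.g. `mpirun -np 2 ./demo`.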
72
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others.
All these can be implemented on any architecture.
73
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Data Parallel
• Consider the following characteristics:
• Parallel work performs operations on a data set,
organized into a common structure.
• Tasks work collectively on the same data structure,
with each task working on a different partition.
• Tasks perform the same operation on their partition.
• On shared memory architectures, all tasks may have
access to the data structure through global memory.
• On distributed memory architectures, the data structure is
split up and resides as “chunks” in the local memory
of each task.
• More information here; a short data-parallel OpenMP sketch follows.
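A short data-parallel sketch in C with OpenMP (one possible realization of the pattern): every thread performs the same operation on its own partition of a shared array.

```c
/* The parallel-for splits the iteration space into partitions; each
 * thread applies the same scaling operation to its chunk of a[]. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N];                  /* common data structure */

int main(void) {
    #pragma omp parallel for         /* same op, different partitions */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```

Compile with `gcc -fopenmp demo.c`.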
74
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others
All these can be implemented on any architecture.
75
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Hybrid
• Combines several models (for example, OpenMP with MPI).
• Single Program Multiple Data (SPMD)
• A single program is executed by all tasks simultaneously.
• Multiple Program Multiple Data (MPMD)
• Uses multiple executables; each task can execute the same program as, or a
different program from, the other tasks.
A hybrid MPI+OpenMP sketch follows.
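A hedged hybrid sketch in SPMD style, combining MPI tasks with OpenMP threads: one program, executed by every task, with threads inside each task.

```c
/* SPMD hybrid: message-passing tasks across nodes, shared-memory
 * threads within each task. No MPI calls happen inside the parallel
 * region, so plain MPI_Init suffices here. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("task %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}
```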
76
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Consider the following approaches
• Shared memory.
• Threads.
• Message Passing.
• Data Parallel.
• Hybrid.
• Others. (Depends on the architecture)
All these can be implemented on any architecture.
77
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Others
• MCAPI (Multicore Association)
• Poly-Platform
• CUDA
78
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Others
• MCAPI (Multicore Association)
• Poly-Platform
• CUDA
79
Parallel Execution Mechanism
Taken from: https://en.wikipedia.org/wiki/Multicore_Association
MCAPI (Multicore Association)
• The Multicore Association was founded in 2005.
• Its first specification is referred to as MCAPI.
• Based on message passing.
• Targets systems that are heterogeneous in hardware, toolchain, and
programming language.
• Active working groups:
• MCAPI.
• Virtualization.
• Open Asymmetric Multiprocessing (OpenAMP).
80
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Others
• MCAPI (Multicore Association)
• Poly-Platform
• CUDA
81
Parallel Execution Mechanism
Taken from: http://polycoresoftware.com/poly-platform
Poly-Platform
• A collection of productivity tools.
• Supports the migration process.
• Its main focus is multicore platforms.
• Provides driver support for several SoCs, operating systems, and transports.
82
Parallel Execution Mechanism
Taken from: Parallel Computing Lectures Notes
Others
• MCAPI (Multicore Association)
• Poly-Platform
• CUDA
83
Parallel Execution Mechanism
Taken from: https://en.wikipedia.org/wiki/CUDA
CUDA
• Initial release 2007.
• Parallel computing platform and
application programming interface.
• Created by NVIDIA.
• GPU-based approach.
• Supported on Windows, Linux, and
macOS.
Outline
84
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
85
Multiprocessors Design Techniques
Taken from: W.Wolf High-Performance Embedded Computing
Embedded Systems Design Flows
• Co-design flows.
• Platform-based design.
• Two-stage process.
• Programming platforms.
• Standards-Based design.
MPSoCs?
86
Multiprocessors Design Techniques
Challenges
• Software development is a major challenge for MPSoC designers.
• Software that runs on the multiprocessor must be high performance, real time,
and low power.
• Each MPSoC requires its own software development environment: compiler,
debugger, simulator, and other tools.
• Better understanding of how to abstract tasks properly to capture the essential
characteristics of their low-level behavior for system-level analysis.
Taken from: W. Wolf Multiprocessor Systems on Chip
87
Multiprocessors Design Techniques
Taken from: W. Wolf Multiprocessor Systems on Chip
Challenges
• Networks-on-chips have emerged over the past few years as an architectural
approach to the design of single-chip multiprocessors.
• FPGAs have emerged as a viable alternative to application-specific integrated
circuits (ASICs) in many markets. FPGA fabrics are also starting to be
integrated into SoCs.
88
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Challenges
• Sequential C code is not easy to replace.
• Algorithm specifications contain parallelism (models of computation:
KPN, SDF, etc.).
• Avoid introducing new programming languages.
• Automatic parallelization and parallel programming.
• Platform-based design (SW synthesis) or combined SW and HW synthesis.
89
Multiprocessors Design Techniques
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
Challenges
All MPSoC designs have the following requirements:
• Speed.
• Power.
• Area.
• Application Performance.
• Time to market.
90
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
MPSoCs Programming
• Task mapping to processors or cores.
• Inter-processor communication management.
• Data transfer engine management.
• Shared resource management.
• Memory management
• Debugging.
91
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
MPSoCs Exploration
• Separate computation from communication.
92
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Virtual Processing Unit VPU
• Load simulator: It is a high-level simulation of
the core behavior.
• Functional simulator: Native execution of
tasks; scheduling is handled by the VPU OS.
93
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Virtual Processing Unit VPU
Allows spatial and temporal modeling of task mapping to PEs
94
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Virtual Platform
• It is a software model that allows the exploration of hardware and software.
• It allows hardware platform exploration and optimization.
• Software development, debugging and optimization.
• Concurrent hardware and software design.
95
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Virtual Platform
• Requirements:
• High speed in terms of simulation process.
• Compromise between simulation speed and precision.
• Flexibility.
• Usability by developers not experts in hardware.
96
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
97
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
98
Multiprocessors Design Techniques
Core-based Strategy
• Core-based synthesis strategy for the IBM CoreConnect bus.
• Coral tool automates many of the tasks required to stitch together multiple
cores using virtual components.
• Each virtual component describes the interfaces for a class of real
components.
• Coral can synthesize some combinational logic.
• Coral also checks the connections between cores using Boolean decision
diagrams.
Taken from: W.Wolf High-Performance Embedded Computing
99
Multiprocessors Design Techniques
Core-based Strategy
CoreConnect provides three types of buses:
• A high-speed processor local bus (PLB).
• An on-chip peripheral bus (OPB).
• A device control register (DCR) bus for configuration and status information.
Taken from: W.Wolf High-Performance Embedded Computing
100
Multiprocessors Design Techniques
Taken from: SoC Lectures Notes
Core-based Strategy
101
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
102
Multiprocessors Design Techniques
Wrappers
• Treats both hardware and software as
components.
• A wrapper is a design unit that interfaces a
module to another module.
• A wrapper can be hardware or software
and may include both.
• The wrapper performs only low-level
adaptations, such as protocol
transformation.
Taken from: W. Wolf High-Performance Embedded Computing
103
Multiprocessors Design Techniques
Wrappers
Heterogeneous multiprocessors introduce several types of problems:
• Many chips have multiple communication networks to match the network to
the processing needs. Synchronizing communication across network
boundaries is more difficult than communicating within a network.
• Specialized hardware is often needed to accelerate interprocess
communication and free the CPU for more interesting computations.
• The communication primitives should be at a higher level of abstraction than
shared memory.
Taken from: W.Wolf High-Performance Embedded Computing
104
Multiprocessors Design Techniques
Wrappers
When a dedicated CPU is added to the system, its software must be adapted
in several ways:
1. The software must be updated to support the platform’s communication
primitives.
2. Optimized implementations of the host processor’s communication
functions must be provided for interprocessor communication.
3. Synchronization functions must be provided.
Taken from: W.Wolf High-Performance Embedded Computing
105
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
106
Multiprocessors Design Techniques
System-Level Design
• An abstract platform is created from a combination of system requirements,
models of the software, and models of the hardware components.
• Abstract platform is analyzed to determine the application’s performance
and power/energy consumption.
• Based on the results of this analysis, software is allocated and scheduled
onto the platform.
• The result is a golden abstract architecture that can be used to build the implementation.
Taken from: W.Wolf High-Performance Embedded Computing
107
Multiprocessors Design Techniques
System-Level Design
Taken from: W.Wolf High-Performance Embedded Computing
108
Multiprocessors Design Techniques
System-Level Design
Major elements of an abstract architecture:
1. Software tasks are described by their data and
scheduling dependencies; they
interface to an API.
2. Hardware components consist of a core and an
interface.
3. The hardware/software integration is modeled by
the communication network that connects the CPUs
that run the software and the hardware IP
cores.
Taken from: W.Wolf High-Performance Embedded Computing
109
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
110
Multiprocessors Design Techniques
Platform-based Design
• Design space: platform selection
• Platform programming
• Multi-CPUs
• Concurrency
• Real-Time
• The platform developer must provide
tools (compilers, editors,
debuggers, simulators, etc.).
Taken from: Introduction to Embedded Systems
111
Multiprocessors Design Techniques
Platform-based Design
• Start with functional specifications:
• Task graphs (see the sketch below).
• Nodes: tasks to complete.
• Edges: communication and
dependences between tasks.
• Execution times annotated on the nodes.
• Data volumes communicated on the edges.
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
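One possible C encoding of such a task graph is sketched below; the type names (Task, Edge, TaskGraph) and all numbers are illustrative assumptions, not part of the slides.

```c
/* Nodes carry execution times; edges carry the data volume communicated
 * between dependent tasks. */
#include <stdio.h>

typedef struct {
    const char *name;
    double exec_time_ms;      /* execution time annotated on the node */
} Task;

typedef struct {
    int src, dst;             /* dependence between two tasks */
    int bytes;                /* data communicated on the edge */
} Edge;

typedef struct {
    Task tasks[8];
    Edge edges[8];
    int n_tasks, n_edges;
} TaskGraph;

int main(void) {
    /* Tiny two-task graph: t0 produces 1024 bytes consumed by t1. */
    TaskGraph g = {
        .tasks = { {"t0", 1.5}, {"t1", 3.0} },
        .edges = { {0, 1, 1024} },
        .n_tasks = 2, .n_edges = 1
    };
    for (int i = 0; i < g.n_edges; i++)
        printf("%s -> %s : %d bytes\n",
               g.tasks[g.edges[i].src].name,
               g.tasks[g.edges[i].dst].name, g.edges[i].bytes);
    return 0;
}
```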
112
Multiprocessors Design Techniques
Platform-based Design
• Map tasks onto pre-designed HW.
• Use an extended task graph for SW and
communication.
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
113
Multiprocessors Design Techniques
Platform-based Design
• Map tasks onto pre-designed HW.
• Use an extended task graph for SW and
communication.
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
114
Multiprocessors Design Techniques
Design Techniques
• Core-based Strategy.
• Wrappers.
• System-level design flow.
• Platform-based design.
• Component-based design.
Taken from: W.Wolf High-Performance Embedded Computing
115
Multiprocessors Design Techniques
Component Based Design
• Conceptual MPSoC platform.
• SW, Processor, IP, Communication
Fabric.
• Parallel Development
• Use APIs.
• Quicker time to market.
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
116
Multiprocessors Design Techniques
Component Based Design
Taken from: MPSoCs https://slideplayer.com/slide/8773117/
117
Multiprocessors Design Techniques
Multicore Application Programming Studio (MAPS)
• Developed at RWTH Aachen University in Germany.
• It is a platform that offers tools and technologies for MPSoC programming.
• Main features are:
• Sequential C code partitioning.
• Parallel programming model.
• Mapping and scheduling.
• Different types of applications.
• Functional Verification (Virtual Platform).
• Multiple applications environment.
• IDE easy to use.
Taken from: M. Aguilar SoC Lectures Notes
118
Multiprocessors Design Techniques
MAPS Flow
Taken from: M. Aguilar SoC Lectures Notes
119
Multiprocessors Design Techniques
MAPS Flow
Taken from: M. Aguilar SoC Lectures Notes
120
Multiprocessors Design Techniques
MAPS Programming Model: C for Process Networks (CPN)
• Embedded systems have traditionally been programmed in C.
• CPN is a language developed as an extension of ANSI C in order to
describe process networks (KPN and SDF).
• A compiler called cpn-cc performs a source-to-source transformation that
converts CPN code into standard C code using the APIs of the target
architecture.
Taken from: M. Aguilar SoC Lectures Notes
121
Multiprocessors Design Techniques
MAPS Programming Model: C for Process Networks (CPN)
Taken from: M. Aguilar SoC Lectures Notes
122
Multiprocessors Design Techniques
MAPS Virtual Platform (MVP)
• MAPS Virtual Platform (MVP)
• High level: abstract PEs based on SystemC.
• Low level: ISS-based (Instruction Set Simulator) virtual platform.
• “mPhone”: a virtual smartphone.
Taken from: M. Aguilar SoC Lectures Notes
123
Multiprocessors Design Techniques
Virtual Processing Element
• It is a parameterizable processing element.
• Clock frequency.
• Type (RISC, VLIW, DSP, etc).
• Scheduling algorithm (Round robin, EDF, based on priorities, etc).
Taken from: M. Aguilar SoC Lectures Notes
Outline
124
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
125
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems
126
Memory Systems
Memory Systems
• The memory system is a traditional bottleneck in computing.
• Not only are memories slower than processors, but processor clock rates
are increasing much faster than memory cycle times.
Taken from: W. Wolf High-Performance Embedded Computing and
https://www.taringa.net/+serviciotecnico/consulta-cuello-de-botella-cpu-debil-en-gpu-potente_15casq
127
Memory Systems
Memory Systems
Taken from: Multi-core architectures
128
Memory Systems
Memory Systems
Taken from: MPSoCs Hardware platforms Lectures Notes
129
Memory Systems
Memory Systems
• Start with a look at parallel memory systems in scientific multiprocessors.
• Consider models for memory and motivations for heterogeneous memory
systems.
• Look at what sorts of consistency mechanisms are needed in embedded
multiprocessors.
Taken from: W. Wolf High-Performance Embedded Computing
130
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems
Homogeneous Heterogeneous
131
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems
Homogeneous Heterogeneous
132
Memory Systems
Memory Systems
To understand memory systems, consider the following case study:
• Scientific processors traditionally use parallel, homogeneous memory
systems to increase system performance.
• Multiple memory banks allow several memory accesses to occur
simultaneously.
Taken from: W. Wolf High-Performance Embedded Computing
133
Memory Systems
Memory Systems
• Each bank is separately addressable.
Taken from: W. Wolf High-Performance Embedded Computing
134
Memory Systems
Memory Systems
• If the memory system has n banks,
then n accesses can be performed in
parallel.
• This is known as the peak access
rate.
Taken from: W. Wolf High-Performance Embedded Computing
135
Memory Systems
Memory Systems
• Real programs cannot keep the memory busy all of
the time.
• A simple statistical model lets us
estimate performance of a random-
access program.
Taken from: W. Wolf High-Performance Embedded Computing
136
Memory Systems
Memory Systems
• Assume that the program accesses a
certain number of sequential
locations, then moves to some other
location.
• Where:
• λ is the probability of a
nonsequential memory access (a
branch in the code or a nonconsecutive
data location).
• k is the number of sequential accesses in a run.
Taken from: W. Wolf High-Performance Embedded Computing
137
Memory Systems
Memory Systems
• Where the probability of a run of exactly k sequential accesses is:
$p(k) = \lambda\,(1-\lambda)^{k-1}$
• And the mean length of a sequential
access sequence over $m$ banks is:
$L_b = \dfrac{1-(1-\lambda)^m}{\lambda}$
(a worked example follows)
Taken from: W. Wolf High-Performance Embedded Computing
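A worked example of the model above, with assumed numbers that are not from the slides ($\lambda = 0.1$, $m = 8$ banks):

```latex
L_b = \frac{1-(1-\lambda)^m}{\lambda}
    = \frac{1-0.9^{8}}{0.1}
    \approx 5.70
```

So with these assumed values a program performs, on average, about 5.7 consecutive accesses before branching away; a designer could size the bank interleaving accordingly.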
138
Memory Systems
Memory Systems
• Use program statistics to estimate
the average probability of
nonsequential accesses, design the
memory system accordingly.
• Use software techniques to
maximize the length of access
sequences wherever possible.
Taken from: W. Wolf High-Performance Embedded Computing
139
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems
Homogeneous Heterogeneous
140
Memory Systems
Memory Systems
• Embedded systems can make use of multiple-bank memory systems, but they
also make use of more heterogeneous memory architectures.
• They do so to improve the real-time performance and lower the power
consumption of the memory system.
Taken from: W. Wolf High-Performance Embedded Computing
141
Memory Systems
Memory Systems
Why do heterogeneous memory systems
improve real-time performance?
Taken from: W. Wolf High-Performance Embedded Computing
142
Memory Systems
Memory Systems
• The energy required to perform a memory access depends in part on the size of
the memory block being accessed.
• A heterogeneous memory may be able to use smaller memory blocks, reducing
the access time.
• Energy per access also depends on the number of ports on the memory block.
• By reducing the number of units that can access a given part of memory, the
heterogeneous memory system can reduce the energy required to access that
part of the memory space.
Taken from: W. Wolf High-Performance Embedded Computing
143
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems
Homogeneous Heterogeneous
Consistent Memory Systems
144
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Consistent Memory Systems
• Shared variables
• Cache consistency
• Snooping caches
145
Memory Systems
Memory Systems
• Shared variables
• We must worry about whether two processors see the same state of a shared variable.
• If the reads and writes of two processors are interleaved, one processor may overwrite
the variable just after another has written it, causing that processor to erroneously
assume the value of the variable.
• Use critical sections, guarded by semaphores, to ensure that critical operations occur in
the right order.
• Use atomic test-and-set operations (often called spin locks) to guard small pieces of
memory (a sketch follows).
Taken from: W. Wolf High-Performance Embedded Computing
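A minimal sketch of such a spin lock built on C11 atomics; an embedded MPSoC port would substitute the platform's own test-and-set primitive.

```c
/* atomic_flag_test_and_set is the atomic test-and-set; a thread spins
 * until the flag was previously clear, then owns the critical section. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int shared_counter;               /* the guarded shared variable */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        while (atomic_flag_test_and_set(&lock))
            ;                            /* spin until the lock is free */
        shared_counter++;                /* critical section */
        atomic_flag_clear(&lock);        /* release */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %d\n", shared_counter);   /* always 200000 */
    return 0;
}
```

Compile with `gcc -std=c11 -pthread demo.c`.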
146
Memory Systems
Memory Systems
• Cache consistency
• If two processors access the same
memory location, then each may have
a copy of the location in its own cache.
• If one processing element writes that
location, then the other will not
immediately see the change and will
make an incorrect computation.
Taken from: W. Wolf High-Performance Embedded Computing
147
Memory Systems
Memory Systems
• Snooping Cache
• This type of cache contains extra
logic that watches the
multiprocessor interconnect for
memory transactions.
• When it sees a write to a location
that it currently contains, it
invalidates that location.
Taken from: W. Wolf High-Performance Embedded Computing
148
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems Architecture
• Shared memory
• Distributed memory
• Hybrid memory
149
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems Architecture
• Shared memory
• Distributed memory
• Hybrid memory
150
Memory Systems
Memory Systems
• Shared Memory
• Shared memory parallel computers vary
widely, but generally have in common the
ability for all processors to access all
memory as global address space.
• Multiple processors can operate
independently but share the same memory
resources.
Taken from: W. Wolf High-Performance Embedded Computing,
https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
151
Memory Systems
Memory Systems
• Shared Memory
• Changes in a memory location effected by
one processor are visible to all other
processors.
• Historically, shared memory machines
have been classified as UMA and NUMA,
based upon memory access times.
Taken from: W. Wolf High-Performance Embedded Computing,
https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
152
Memory Systems
Memory Systems
• Shared Memory (Uniform Memory
Access UMA)
• Most commonly represented today by
Symmetric Multiprocessor (SMP)
machines.
• Identical processors.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
153
Memory Systems
Memory Systems
• Shared Memory (Uniform Memory
Access UMA)
• Equal access and access times to
memory.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
154
Memory Systems
Memory Systems
• Shared Memory (Uniform Memory Access
UMA)
• Sometimes called CC-UMA - Cache
Coherent UMA. Cache coherent means if one
processor updates a location in shared
memory, all the other processors know about
the update. Cache coherency is accomplished
at the hardware level.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
155
Memory Systems
Memory Systems
• Shared Memory (Non-Uniform Memory
Access NUMA)
• Often made by physically linking two or
more SMPs.
• One SMP can directly access memory of
another SMP.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
156
Memory Systems
Memory Systems
• Shared Memory (Non-Uniform Memory
Access NUMA)
• Not all processors have equal access time to
all memories.
• Memory access across the link is slower.
• If cache coherency is maintained, then it may
also be called CC-NUMA - Cache Coherent
NUMA.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
157
Memory Systems
Memory Systems
• Shared Memory
• Advantages
• Global address space provides a user-
friendly programming perspective to
memory.
• Data sharing between tasks is both fast
and uniform due to the proximity of
memory to CPUs.
Taken from: W. Wolf High-Performance Embedded Computing,,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
158
Memory Systems
Memory Systems
• Shared Memory
• Disadvantages
• Primary disadvantage is the lack of
scalability between memory and CPUs.
Adding more CPUs geometrically
increases traffic on the shared memory-CPU
path, and for cache coherent systems,
geometrically increase traffic associated with
cache/memory management.
Taken from: W. Wolf High-Performance Embedded Computing,,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
159
Memory Systems
Memory Systems
• Shared Memory
• Disadvantages
• Programmer responsibility for
synchronization constructs that ensure
"correct" access of global memory.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
160
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems Architecture
• Shared memory
• Distributed memory
• Hybrid memory
161
Memory Systems
Memory Systems
• Distributed Memory
• Like shared memory systems, distributed
memory systems vary widely but share a
common characteristic.
• Distributed memory systems require a
communication network to connect inter-
processor memory.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
162
Memory Systems
Memory Systems
• Distributed Memory
• Processors have their own local memory.
Memory addresses in one processor do not
map to another processor, so there is no
concept of global address space across all
processors.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
163
Memory Systems
Memory Systems
• Distributed Memory
• Because each processor has its own local
memory, it operates independently.
Changes it makes to its local memory have
no effect on the memory of other
processors. Hence, the concept of cache
coherency does not apply.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
164
Memory Systems
Memory Systems
• Distributed Memory
• When a processor needs access to data in
another processor, it is usually the task of
the programmer to explicitly define how
and when data is communicated.
Synchronization between tasks is likewise
the programmer's responsibility.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
165
Memory Systems
Memory Systems
• Distributed Memory
• The network "fabric" used for data transfer
varies widely, though it can be as simple as
Ethernet.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
166
Memory Systems
Memory Systems
• Distributed Memory
• Advantages
• Memory is scalable with the number
of processors. Increase the number of
processors and the size of memory
increases proportionately.
Taken from: W. Wolf High-Performance Embedded Computing,
https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
167
Memory Systems
Memory Systems
• Distributed Memory
• Advantages
• Each processor can rapidly access its
own memory without interference and
without the overhead incurred with
trying to maintain global cache
coherency.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
168
Memory Systems
Memory Systems
• Distributed Memory
• Advantages
• Cost effectiveness: can use
commodity, off-the-shelf processors
and networking.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
169
Memory Systems
Memory Systems
• Distributed Memory
• Disadvantages
• The programmer is responsible for
many of the details associated with data
communication between processors.
• It may be difficult to map existing data
structures, based on global memory, to
this memory organization.
Taken from: W. Wolf High-Performance Embedded Computing,
https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
170
Memory Systems
Memory Systems
• Distributed Memory
• Disadvantages
• Non-uniform memory access times -
data residing on a remote node takes
longer to access than node local data.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
171
Memory Systems
Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing
Memory Systems Architecture
• Shared memory
• Distributed memory
• Hybrid memory
172
Memory Systems
Memory Systems
• Hybrid Memory
• The largest and fastest computers in the
world today employ both shared and
distributed memory architectures.
• The shared memory component can be a
shared memory machine and/or graphics
processing units (GPU).
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
173
Memory Systems
Memory Systems
• Hybrid Memory
• The distributed memory component is
the networking of multiple shared
memory/GPU machines, which know
only about their own memory - not the
memory on another machine. Therefore,
network communications are required to
move data from one machine to another.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
174
Memory Systems
Memory Systems
• Hybrid Memory
• Current trends seem to indicate that this
type of memory architecture will
continue to prevail and increase at the
high end of computing for the
foreseeable future.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
175
Memory Systems
Memory Systems
• Hybrid Memory
• Advantages and Disadvantages
• Whatever is common to both shared and
distributed memory architectures.
• Increased scalability is an important
advantage.
• Increased programmer complexity is an
important disadvantage.
Taken from: W. Wolf High-Performance Embedded Computing,
https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch
176
Memory Systems
Design Memory Systems?
Taken from: W. Wolf High-Performance Embedded Computing,
177
Memory Systems
Design Memory Systems
A simple model of memory components for parallel memory design would include
three major parameters of a memory component of a given size.
• Area: The physical size of the logical component. This is most important in chip design, but it also
relates to cost in board design.
• Performance: The access time of the component. There may be more than one parameter, with
variations for read and write times, page mode accesses, and so on.
• Energy: The energy required per access. If performance is characterized by multiple modes, energy
consumption will exhibit similar modes. (A small sketch of this model follows.)
Taken from: W. Wolf High-Performance Embedded Computing,
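A small sketch of this three-parameter model in C; the field names and all numbers are illustrative assumptions, not values from Wolf's book.

```c
/* Each memory component is characterized by area, access time, and
 * energy per access; a smaller block costs less energy per access,
 * as the heterogeneous-memory discussion above argued. */
#include <stdio.h>

typedef struct {
    double area_mm2;        /* Area: physical size of the component */
    double access_ns;       /* Performance: access time */
    double energy_pj;       /* Energy: energy per access */
} MemComponent;

int main(void) {
    MemComponent small = { 0.5, 1.2, 5.0 };   /* assumed figures */
    MemComponent large = { 4.0, 3.5, 22.0 };

    long accesses = 1000000;
    printf("small bank: %.1f uJ\n", accesses * small.energy_pj * 1e-6);
    printf("large bank: %.1f uJ\n", accesses * large.energy_pj * 1e-6);
    return 0;
}
```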
178
Memory Systems
Design Memory Systems
Taken from: W. Wolf High-Performance Embedded Computing,
179
Memory Systems
Memory Systems
Taken from: https://www.xataka.com/ordenadores/el-cuello-de-botella-de-la-ley-de-moore-no-esta-en-los-procesadores-sino-en-las-memorias
180
Memory Systems
Memory Systems
Taken from: https://www.xataka.com/ordenadores/el-cuello-de-botella-de-la-ley-de-moore-no-esta-en-los-procesadores-sino-en-las-memorias
Outline
181
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
182
Processors Symmetry
Taken from: W. Wolf High-Performance Embedded Computing
Multi-processing
• Symmetric (SMP)
• Asymmetric (AMP)
183
Processors Symmetry
Taken from: W. Wolf High-Performance Embedded Computing
Multi-processing
• Symmetric (SMP)
• Asymmetric (AMP)
184
Processors Symmetry
Taken from: M. Aguilar SoCs
Symmetric Multi-processing (SMP)
• A system with multiple processors or cores that communicate through a single
shared memory and are controlled by a single operating system.
185
Processors Symmetry
Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/
Symmetric Multi-processing (SMP)
• Identical: all the processors are treated equally, i.e. all are identical.
• Communication: shared memory is the mode of communication among
processors.
• Complexity: complex in design, as all units share the same memory and data
bus.
• Expensive: costlier in nature.
• Unlike asymmetric systems, where a task is done only by the master processor, here
operating system tasks are handled individually by the processors (a sketch follows).
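A small Linux-only sketch of this symmetry, assuming the GNU affinity extensions to pthreads: one worker is created per core and pinned to it, although on an SMP the scheduler could equally place the threads anywhere, since all cores are identical.

```c
/* Query the core count, then pin one thread per core via the GNU
 * pthread_attr_setaffinity_np extension. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg) {
    (void)arg;
    printf("running on core %d\n", sched_getcpu());
    return NULL;
}

int main(void) {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncores > 64) ncores = 64;
    pthread_t tid[64];

    for (long i = 0; i < ncores; i++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);       /* any core would do: all are equal */
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&tid[i], &attr, worker, NULL);
        pthread_attr_destroy(&attr);
    }
    for (long i = 0; i < ncores; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```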
186
Processors Symmetry
Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/
Symmetric Multi-processing (SMP)
• Applications
• This concept finds its application in parallel processing, where time-sharing
systems (TSS) assign tasks to different processors running in parallel
with each other, and also in TSS that use multithreading, i.e. multiple threads
running simultaneously.
187
Processors Symmetry
Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/
Symmetric Multi-processing (SMP)
• Advantages
• Throughput: since tasks can be run by all the processors, unlike in
asymmetric systems, the degree of throughput (processes executed per unit
time) increases.
• Reliability: a failing processor does not bring down the whole system, as all
processors are equally capable, though throughput drops a little.
188
Processors Symmetry
Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/
Symmetric Multi-processing (SMP)
• Disadvantages
• Complex design: since all the processors are treated equally by the OS,
designing and managing such an OS becomes difficult.
• Costlier: as all the processors share the common main memory, a larger
memory is required, which is more expensive.
189
Processors Symmetry
Taken from: https://www.enea.com/globalassets/downloads/operating-systems/enea-oseck/enea-smp-platform-for-xilinx-zynq-datasheet.pdf
Symmetric Multi-processing (SMP)
190
Processors Symmetry
Taken from: https://www.enea.com/globalassets/downloads/operating-systems/enea-oseck/enea-smp-platform-for-xilinx-zynq-datasheet.pdf
Symmetric Multi-processing (SMP)
More information here
191
Processors Symmetry
Taken from: W. Wolf High-Performance Embedded Computing
Multi-processing
• Symmetric (SMP)
• Asymmetric (AMP)
192
Processors Symmetry
Taken from: M. Aguilar SoC Lectures Notes
Asymmetric Multi-processing (AMP)
• A system with multiple processors or cores that communicate through a single
shared memory, where each processor or core is controlled by an independent
operating system (which may be the same or different).
193
Processors Symmetry
Asymmetric Multi-processing (AMP)
• Characteristics
• Processors are not treated equally.
• Operating system tasks are done by the master processor.
• No direct communication between processors, as they are coordinated by the
master processor.
• Processing follows a master-slave model.
• Systems are cheaper.
• Systems are easier to design.
Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/
194
Processors Symmetry
Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf
Asymmetric Multi-processing (AMP)
195
Processors Symmetry
Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf
Asymmetric Multi-processing (AMP)
196
Processors Symmetry
Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf
Asymmetric Multi-processing (AMP)
197
Processors Symmetry
Asymmetric Multi-processing (AMP)
Taken from: https://github.com/OpenAMP/open-amp
198
Processors Symmetry
Asymmetric Multi-processing (AMP)
Taken from: https://github.com/OpenAMP/open-amp
199
Processors Symmetry
Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf
Asymmetric Multi-processing (AMP)
Outline
200
• Multiprocessors Architecture and Taxonomy
• Parallel Execution Mechanism
• Multiprocessors Design Techniques
• Memory Systems
• Processors Symmetry
• Co-processing
201
Co-processing
Taken from: https://www.researchgate.net/publication/250840737_Automatic_Generation_of_Application-
Specific_Architectures_for_Heterogeneous_MPSoC_through_Combination_of_Processors/figures
202
Co-processing
Taken from: https://www.researchgate.net/publication/250840737_Automatic_Generation_of_Application-
Specific_Architectures_for_Heterogeneous_MPSoC_through_Combination_of_Processors/figures
203
Co-processing
Taken from: https://www.researchgate.net/publication/250840737_Automatic_Generation_of_Application-
Specific_Architectures_for_Heterogeneous_MPSoC_through_Combination_of_Processors/figures
204
Co-processing
Taken from: http://www.cecs.uci.edu/~papers/esweek06/codes/p288.pdf
205
Co-processing
Taken from: https://www.researchgate.net/publication/221656884_A_Generic_Wrapper_Architecture_for_Multi-
Processor_SoC_Cosimulation_and_Design/figures?lo=1
206
Co-processing
Taken from: https://link.springer.com/chapter/10.1007/978-3-319-01113-4_1
207
Co-processing
What is a coprocessor?
208
Co-processing
A coprocessor is:
• A computer processor used to supplement the functions of the primary processor.
• Several kinds of operations can be performed by a coprocessor, such as:
• Floating point (FPU).
• Graphics processing.
• Signal processing.
• Cryptography.
• Etc.
Taken from: https://youtu.be/xrMUv9ZVKY0
209
Co-processing
A coprocessor is:
• By offloading processor-intensive tasks from the main processor, a coprocessor can
accelerate system performance.
• Coprocessors allow a line of computers to be customized, so that customers who
do not need extra performance need not pay for it.
Taken from: https://youtu.be/xrMUv9ZVKY0
210
Co-processing
Functions
• A coprocessor may not be a general-purpose processor.
• Coprocessors cannot fetch instructions from memory, execute program flow
control instructions, do input/output operations, manage memory, and so on.
• The coprocessor requires the host (main) processor to fetch the coprocessor
instructions and handle all other operations aside from the coprocessor functions.
• In some architectures the coprocessor is a more general-purpose computer but
carries out only a limited range of functions under the close control of a
supervisory processor.
Taken from: https://youtu.be/xrMUv9ZVKY0
211
Co-processing
Taken from: https://www.doulos.com/knowhow/arm/using_your_c_compiler_to_exploit_neon/Resources/using_your_c_compiler_to_exploit_neon.pdf
Coprocessor
212
Co-processing
NEON Arm
• In the ARMv7-A architecture, ARM introduced a powerful SIMD implementation called
NEON™.
• NEON is a coprocessor which comes with its own instruction set for vector
operations.
• Most vector operations carry out the same operation on all elements of their
operand vector(s) in parallel.
• Using your C compiler to exploit NEON™ Advanced SIMD.
Taken from: https://youtu.be/xrMUv9ZVKY0
213
Co-processing
NEON Arm
• The goal of NEON is to provide a powerful, yet comparatively easy to program,
SIMD instruction set that covers integer data types of up to 64-bit width as well
as single-precision floating point (32 bit).
• NEON has no dedicated register file; instead it shares its sixteen 128-bit registers
with the vector floating-point unit.
• Executed on the same processor core, NEON performance is influenced by
context-switching overhead, non-deterministic memory access latency
(cache/MMU access), and interrupt handling (an intrinsics sketch follows).
Taken from: https://youtu.be/xrMUv9ZVKY0
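A short NEON intrinsics sketch in C, assuming an ARM toolchain (for example GCC with -mfpu=neon on ARMv7-A): a single vector instruction applies the same addition to four 32-bit float lanes at once.

```c
/* vld1q_f32 loads four lanes into a 128-bit Q register; vaddq_f32
 * adds all four lanes in one instruction; vst1q_f32 stores them back. */
#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    float32x4_t va = vld1q_f32(a);
    float32x4_t vb = vld1q_f32(b);
    float32x4_t vc = vaddq_f32(va, vb);  /* same op on every lane */
    vst1q_f32(c, vc);

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```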
214
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
215
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
216
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
217
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
218
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
219
Co-processing
NEON Arm
Taken from: https://youtu.be/xrMUv9ZVKY0
220
Co-processing
DSPs
Taken from: Introduccion a los Sistemas Empotrados Lectures Notes
221
Co-processing
DSPs
Taken from: M. Aguilar SoC Lectures Notes
222
Co-processing
DSPs
Taken from: M. Aguilar SoC Lectures Notes
223
Co-processing
GPU
Taken from: https://www.anandtech.com/show/14101/nvidia-announces-jetson-nano
224
Co-processing
GPU
Taken from: https://www.anandtech.com/show/14101/nvidia-announces-jetson-nano
225
Co-processing
Flight controller UAV
Taken from: https://cdn.sparkfun.com/assets/d/d/9/9/3/Pixhawk4-DataSheet.pdf
226
Co-processing
Flight controller UAV
Taken from: https://cdn.sparkfun.com/assets/d/d/9/9/3/Pixhawk4-DataSheet.pdf
227
References
[1] Lecture Notes, Tecnologico de Costa Rica, Course SoC.
[2] W. Wolf. High-Performance Embedded Computing: Architectures, Applications
and Methodologies. Elsevier, United States of America, 2007.
[3] E. A. Lee and S. A. Seshia. Introduction to Embedded Systems, 2017.
Lecture notes and materials are available on TEC-Digital and at the web portals:
www.ie.tec.ac.cr/sarriola/HPEC
www.ie.tec.ac.cr/joaraya
228