Computer and Hardware Architecture II
TRANSCRIPT
Parallelism – Microscopic vs Macroscopic
• Microscopic parallelism – hardware solutions inside system components providing parallel computations without being visible to the user, e.g.:
• Registers
• Memory
• Parallel buses
• Instruction pipelines
• Macroscopic parallelism - duplicated large-scale components providing parallelism on system level
• Dual- or Quad-core processors
• Vector or Graphics processors
• Co-processors
• I/O processors
Parallelism – Symmetric vs Asymmetric
• Symmetric parallelism – uses replications of identical processing elements that can operate in parallel
• Multicore processors
• Asymmetric parallelism – uses a set of processing elements that operate in parallel but differ in function
• PC with CPU, Graphics processor, math processor, I/O processor
Parallelism – Fine-grain vs Coarse-grain
• Fine-grain parallelism – computers providing parallel computations on the level of instructions or data items
• Vector processors
• Digital signal processors with special SIMD instructions
• Coarse-grain parallelism – computers providing parallelism on the level of programs or larger data structures
• Dual- or Quad-core processors
Parallelism – Explicit vs Implicit
• Explicit parallelism – the programmer needs to control how available parallelism is exploited in the code, e.g. through partitioning into parallel processes, constraints and special instructions.
• Implicit parallelism – hardware can exploit parallelism in the executed code without constraints or any special instructions defined by the programmer
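As a minimal sketch of explicit parallelism, the Python fragment below shows the programmer partitioning a summation into four chunks and mapping them onto worker threads. The data set and chunk count are illustrative assumptions, not from the slides.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker computes the sum of its own partition of the data.
    return sum(chunk)

data = list(range(1, 101))
# Explicit parallelism: the programmer decides how to partition the
# data (four interleaved chunks) and maps them onto parallel workers.
chunks = [data[i::4] for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))
print(total)  # 5050
```

The partitioning, the number of workers, and the final combination step are all visible in the code; with implicit parallelism the hardware would extract such parallelism without any of these annotations.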
Flynn’s taxonomy
• In 1966, Michael J. Flynn proposed a classification of computers
The classification crosses the number of instruction streams (one or many) with the number of data streams (one or many):
• SISD: Single instruction stream, single data stream
• SIMD: Single instruction stream, multiple data streams
• MISD: Multiple instruction streams, single data stream
• MIMD: Multiple instruction streams, multiple data streams
Flynn’s taxonomy - SISD
[Diagram: a single processor fed by one instruction stream and one data stream]
• Capable of executing single instructions, operating on a single data stream
• E.g. the conventional von Neumann architecture
Flynn’s taxonomy - SIMD
[Diagram: several processors fed by one common instruction stream, each operating on its own data stream]
• Capable of executing the same instruction on all processing elements, operating on different data streams
• E.g. vector processors
Flynn’s taxonomy - MISD
[Diagram: several processors fed by different instruction streams, all operating on the same data stream]
• Executes different instructions on each processing element operating on the same data stream
• (Useful for only a limited number of applications)
Flynn’s taxonomy - MIMD
[Diagram: several processors, each with its own instruction stream and its own data stream]
• Executes multiple instructions on multiple data streams
• E.g. multiprocessors
System Bus Architectures
Multi-master point-to-point communication over a single system bus requires bus arbitration.
Processors, co-processors and DMA controllers typically operate as bus masters.
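One common arbitration policy is round-robin, sketched below in Python. The function and its interface are illustrative assumptions for this course context, not a description of any specific bus controller.

```python
def round_robin_arbiter(requests, last_granted):
    """Grant the bus to the next requesting master after last_granted.

    requests: list of booleans, one entry per bus master.
    Returns the index of the granted master, or None if nobody requests.
    """
    n = len(requests)
    # Scan masters starting just after the previously granted one, so
    # every master eventually gets a turn (fairness).
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None

# Masters 0 and 2 request the bus; master 0 was granted last time,
# so round-robin fairness hands the bus to master 2.
print(round_robin_arbiter([True, False, True], last_granted=0))  # 2
```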
System Bus Architectures
Time multiplexing of data and addresses on common lines
• Lower cost
• Lower performance
System Bus Architectures
• A computer can be designed to use multiple buses for different purposes
• A cheaper solution is to include a bridge
• Typically used for e.g. USB or Ethernet
System Bus Architectures
Conclusions:
• A system bus can only perform one transfer at a time
• It is thus a limited resource for communication
• More than one master can compete for access to this resource, e.g. processors, co-processors and DMA controllers
How to mitigate limitations on communication over a system bus?
AXI4 channel – switch
The Xilinx AXI4 bus is a derivative of the Arm AMBA bus, developed for SoC applications. The picture shows a switch for AXI4.
It connects one or more similar AXI memory-mapped masters to one or more similar memory-mapped slaves.
Reference: Xilinx User Guide 1037
AXI4 and AXI4-Lite bus
Consists of five channels:
• Read address channel
• Write address channel
• Read data channel
• Write data channel
• Write response channel
Data can move simultaneously in both directions. AXI4 allows bursts of up to 256 data transfers using only one address. AXI4-Lite allows only single data transactions.
[Diagram: a master connected to a slave]
A master takes the initiative for a data transfer; the slave responds.
AXI4-Stream implementation
• Used for high-speed data-centric streaming applications, e.g. video
• TLAST indicates packet boundaries
• TVALID indicates valid data
Reference: Xilinx User Guide 1037
AXI4-Stream Interconnect
Parallel routing of traffic between N masters and M slaves
Reference: Xilinx User Guide 1037
Multiprocessor architectures
Challenges for multiprocessor architectures:
• Communication
• Coordination
• Contention
Challenges
• Communication – must be scalable to handle communication between a large number of processors
• Coordination – a strategy for how to distribute tasks among all processors is required
• Contention – situations where two or more processors try to access a resource at the same time. This problem grows rapidly with an increasing number of processors
• In particular, problems occur with memory accesses
• Caching can mitigate this but introduces another problem:
• Cache coherence – how to guarantee that cache memories, local to each processor, carry the same data for common memory locations?
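A toy Python model of the cache coherence problem, assuming a simple write-invalidate policy. The data structures and the address used are illustrative assumptions; this is not a real coherence protocol such as MESI.

```python
# Hypothetical toy model: two per-processor caches over one shared
# memory, kept coherent by invalidating remote copies on every write.
memory = {0x100: 7}
caches = [dict(), dict()]          # one private cache per processor

def read(cpu, addr):
    if addr not in caches[cpu]:    # cache miss: fetch from memory
        caches[cpu][addr] = memory[addr]
    return caches[cpu][addr]

def write(cpu, addr, value):
    memory[addr] = value
    caches[cpu][addr] = value
    for other, cache in enumerate(caches):
        if other != cpu:           # invalidate stale copies elsewhere
            cache.pop(addr, None)

read(0, 0x100); read(1, 0x100)     # both caches now hold 7
write(0, 0x100, 42)                # CPU 0 writes; CPU 1's copy is invalidated
print(read(1, 0x100))              # 42, not the stale 7
```

Without the invalidation loop, CPU 1 would keep reading the stale value 7 from its private cache, which is exactly the coherence problem the slide describes.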
Data Pipelining
[Diagram: an input data stream passing through pipeline stages 1–5 to an output data stream]
• A pipeline divides a larger computational task into a series of smaller tasks
• Benefits:
• Smaller tasks are less complex to describe
• Allows for reuse of code modules
• Reveals coarse-grained parallelism that can be mapped to a multi-processor architecture for increased throughput
Data Pipelining
[Diagram: an input data stream passing through pipeline stages 1–5 to an output data stream]
• Necessary conditions:
• A partitionable problem
• Low communication overhead
• Processor speed equivalent to that of a single processor
Data Pipelining
[Diagram: an input data stream passing through pipeline stages 1–5 to an output data stream]
• The common stage cycle time $t_s$ must exceed every stage delay: $t_s > t_1 \wedge t_s > t_2 \wedge \dots \wedge t_s > t_5$
• Throughput $= \frac{1}{t_s}$ [data items / time unit]
• Latency $= t_1 + t_2 + t_3 + t_4 + t_5$ [time units]
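The throughput and latency of a five-stage pipeline can be evaluated numerically; the stage delays below are illustrative assumptions.

```python
# Stage delays in time units (illustrative assumptions).
stage_delays = [3, 5, 2, 4, 1]

# All stages advance in lock step, so the common cycle time t_s must
# be at least the slowest stage delay.
t_s = max(stage_delays)            # 5
throughput = 1 / t_s               # data items per time unit
latency = sum(stage_delays)        # t_1 + t_2 + ... + t_5

print(t_s, throughput, latency)    # 5 0.2 15
```

Note that the slowest stage alone limits throughput: speeding up any of the other four stages would reduce latency slightly but leave throughput unchanged.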
Data Flow Graph
[Diagram: an input data stream flowing through actors 1–4 to an output data stream]
A data flow graph describes computations without including any information on how the computation is going to be done. Hence, only data flow, and no control flow, is described.
This programming paradigm is supported by functional languages such as DFL, suitable for digital signal processing systems and also ideal for capturing pipelined computations.
Imperative languages such as C and C++ model both control flow and data flow and are poorly suited for capturing parallelism.
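A minimal dataflow-style sketch in Python, using generators as actors so that only the flow of data between stages is described, never an explicit schedule. The three actor functions are illustrative assumptions.

```python
# Each actor is a generator that consumes an input stream and yields an
# output stream; chaining them describes only the data flow.
def scale(stream, k):
    for x in stream:
        yield x * k

def offset(stream, b):
    for x in stream:
        yield x + b

def clip(stream, lo, hi):
    for x in stream:
        yield max(lo, min(hi, x))

source = range(5)                       # input data stream: 0, 1, 2, 3, 4
sink = clip(offset(scale(source, 3), -2), 0, 10)
print(list(sink))                       # [0, 1, 4, 7, 10]
```

Because the actors only pass values downstream, the chain could equally well be mapped onto one pipeline stage per processor, as the earlier slides describe.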
Data Pipelining on FPGA logic
[Diagram: a combinatorial network CN with propagation delay $\tau$ driving a D flip-flop register clocked by Clk, between the input and output data streams]
• A large combinatorial network is driving an output register
• The propagation delay time for CN is $\tau$
• The maximum frequency for the clock signal Clk then becomes $f_{max} = \frac{1}{\tau}$
Data Pipelining on FPGA logic
[Diagram: CN partitioned into CN 1 … CN M with delays $\tau_1 \dots \tau_M$, a D flip-flop register after each stage, all clocked by Clk]
• Assume that CN is partitionable into M smaller combinatorial networks
• Insert registers in between all combinatorial nets
• The clock period must exceed every stage delay: $t_{Clk} > \tau_1 \wedge t_{Clk} > \tau_2 \wedge \dots \wedge t_{Clk} > \tau_M$
• The new maximum clock frequency becomes $f_{max} = \frac{1}{\tau_{max}}$, where $\tau_{max} = \max_i \tau_i$
• Latency $= M \cdot t_{Clk}$ [time units]
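A small numeric sketch of the effect of pipelining a combinatorial network, assuming illustrative sub-network delays in nanoseconds.

```python
# Propagation delays (ns) of the M sub-networks after partitioning
# (illustrative assumptions).
taus = [4.0, 6.0, 5.0, 3.0, 2.0]        # tau_1 .. tau_M, with M = 5

# Unpipelined: one big CN with the summed delay limits the clock.
f_max_unpipelined = 1 / sum(taus)       # 1 / 20 ns = 0.05 GHz
# Pipelined: the clock is limited only by the slowest stage.
f_max_pipelined = 1 / max(taus)         # 1 / 6 ns, roughly 0.167 GHz
# Each result now takes M clock periods to emerge.
latency = len(taus) * max(taus)         # M * t_Clk = 30 ns

print(f_max_unpipelined, f_max_pipelined, latency)
```

Pipelining here raises the clock frequency by more than 3x at the cost of a longer latency, which matches the throughput-versus-latency trade-off in the formulas above.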
Power in computational logic
• The dynamic energy $E_d$ consumed when changing the state of a CMOS logic output:
$E_d = \frac{1}{2} \cdot C \cdot V_{DD}^2$
• $C$ is the total capacitive load of the output
• $V_{DD}$ is the supply voltage
• The average dynamic power: $P_d = C \cdot V_{DD}^2 \cdot f_{clk}$
• We can conclude that power dissipation is proportional to the clock frequency and proportional to the square of the supply voltage
• Trying to increase the speed of a processor by simply increasing the clock frequency as physical scaling of technology advances can only be done until the power wall is reached
• With current technology, $P_{powerwall} = 100 \,\text{Watts}/\text{cm}^2$
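The average dynamic power formula can be checked with a small calculation; the values of C, V_DD and f_clk below are illustrative assumptions, not measured figures.

```python
# Average dynamic power P_d = C * Vdd^2 * f_clk (formula from the slide).
# All numbers are illustrative assumptions.
C = 1e-9          # total switched capacitance: 1 nF
Vdd = 1.2         # supply voltage in volts
f_clk = 2e9       # clock frequency: 2 GHz

P_d = C * Vdd**2 * f_clk
print(P_d)        # ~2.88 W

# Doubling the clock frequency doubles the power (linear in f_clk),
# while Vdd enters quadratically.
assert C * Vdd**2 * (2 * f_clk) == 2 * P_d
```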
Power in computational logic
• The delay time $\tau$ for a gate can be approximated as
$\tau = k \cdot \frac{b \cdot V_{DD}}{V_{DD} - V_{th}}$
• $V_{th}$ is the CMOS threshold voltage, and $k$, $b$ are technology-dependent constants
• The delay $\tau$ depends mostly on $k$ and $b$ for larger supply voltages $V_{DD}$
• The delay $\tau$ increases dramatically when $V_{DD}$ is decreased close to $V_{th}$
• Dynamic voltage and frequency scaling means that both the supply voltage and the clock frequency are adjusted so that a processor can deliver just enough speed
• A reduction of both frequency and supply voltage results in a dramatic reduction of dynamic power consumption
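A sketch of why scaling voltage and frequency together is so effective, under the idealised assumption that clock frequency can track supply voltage linearly; the numbers are illustrative.

```python
# DVFS sketch: scale both supply voltage and clock frequency by s < 1.
# Since P = C * Vdd^2 * f, power then scales roughly as s^3.
def dynamic_power(C, vdd, f):
    return C * vdd**2 * f

C, vdd, f = 1e-9, 1.2, 2e9     # illustrative assumptions
s = 0.5                        # run at half voltage and half frequency
p_full = dynamic_power(C, vdd, f)
p_scaled = dynamic_power(C, s * vdd, s * f)
print(p_scaled / p_full)       # ~0.125, i.e. roughly an 8x power reduction
```

The same halving of frequency alone (keeping V_DD fixed) would only halve the power, which is why DVFS adjusts both together.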
Using sleep mode to control energy consumption
$E_s = P_s \cdot t_s$ — energy consumed during shutdown
$E_w = P_w \cdot t_w$ — energy consumed during wakeup
$E_{run} = P_{run} \cdot t$ — energy consumed when running the processor for time $t$
$E_{sleep} = E_s + E_w + P_{off} \cdot (t - t_s - t_w)$ — energy consumed when going to sleep for time $t$
Energy can be saved when $E_{sleep} < E_{run}$
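The break-even condition can be evaluated with a small script; all of the power and time figures below are illustrative assumptions.

```python
# Break-even check for sleep mode, using the slide's energy model.
# All numbers are illustrative assumptions.
P_s, t_s = 0.5, 0.010     # shutdown: 0.5 W for 10 ms
P_w, t_w = 0.8, 0.020     # wakeup:   0.8 W for 20 ms
P_off = 0.001             # 1 mW leakage while sleeping
P_run = 0.2               # 200 mW when running (idle)

def e_sleep(t):
    # E_sleep = E_s + E_w + P_off * (t - t_s - t_w)
    return P_s * t_s + P_w * t_w + P_off * (t - t_s - t_w)

def e_run(t):
    return P_run * t

t = 1.0                        # a one-second idle interval
print(e_sleep(t) < e_run(t))   # True: sleeping saves energy here
```

For very short idle intervals the fixed shutdown and wakeup energies dominate and sleeping costs more than it saves, which is why the break-even interval matters in practice.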
Example – Battery powered oil detector for wastewater
• A smart sensor can detect petroleum contamination in wastewater
• Numerous sensors are installed at selected checkpoints, which allows tracing of the sources of contamination
• The sensor's task is to measure the wastewater every 15 minutes and send alarm data over a radio link whenever contamination is detected
• This task finishes in milliseconds, while the rest of the 15-minute cycle is spent sleeping
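A back-of-the-envelope energy budget for the sensor's duty cycle; the power figures and the 5 ms active time are illustrative assumptions, only the 15-minute cycle comes from the example.

```python
# Energy budget sketch for the wastewater sensor: active for a few
# milliseconds every 15 minutes, sleeping the rest of the cycle.
# Power figures are illustrative assumptions.
P_active = 0.15            # 150 mW while measuring and transmitting
t_active = 0.005           # 5 ms of work per cycle
P_sleep = 10e-6            # 10 uW in sleep mode
t_cycle = 15 * 60          # 15-minute cycle, in seconds

e_cycle = P_active * t_active + P_sleep * (t_cycle - t_active)
duty = t_active / t_cycle
print(e_cycle, duty)       # roughly 0.0097 J per cycle at a tiny duty cycle
```

With these assumptions almost the entire per-cycle energy is spent on sleep-mode leakage rather than the measurement itself, which is why such devices can run for years on a battery.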