Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Post on 02-Jan-2016

Page 1: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Lecture 10: Hardware Accelerators

Ingo Sander

ingo@kth.se

Page 2: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Introduction: Hardware Accelerator

Page 3: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

April 20, 2023 IL2206 Embedded Systems 3

Design Constraint Propagation

A design constraint on system level leads to new design constraints on subsystem level

Constraint on system: t < 500 ms

Constraints on subsystems:
  P1: t < 100 ms
  P2: t < 250 ms
  P3: t < 150 ms

Page 4: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Design Constraint Propagation

An estimation tool can give the execution time of a subsystem

What happens, if a subsystem is too slow?

Constraint on system: t < 500 ms

Subsystem  Constraint   Execution time
P1         t < 100 ms   95 ms
P2         t < 250 ms   280 ms (Too Slow!)
P3         t < 150 ms   145 ms
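The check on this slide can be sketched in a few lines of Python. The variable and function names are illustrative assumptions; the numbers are the ones shown on the slide.

```python
# Budgets propagated from the system-level constraint, in ms.
budgets_ms = {"P1": 100, "P2": 250, "P3": 150}
# Execution times reported by the estimation tool, in ms.
estimates_ms = {"P1": 95, "P2": 280, "P3": 145}


def violations(budgets, estimates):
    """Return the subsystems whose estimate does not satisfy t < budget."""
    return [name for name in budgets if estimates[name] >= budgets[name]]


too_slow = violations(budgets_ms, estimates_ms)
total_ms = sum(estimates_ms.values())
print("too slow:", too_slow)
print("total estimated time:", total_ms, "ms")
```

With these numbers only P2 misses its budget, and the summed estimate also exceeds the 500 ms system constraint.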

Page 5: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

How to improve the performance of a microprocessor system?

Improve your code
Choose a faster version of your microprocessor
Add additional computational units that perform special functions:
  Standard Component (Graphics Processor)
  Coprocessor (Floating-Point Processor)
  Additional Microprocessor
  Hardware Accelerator

Page 6: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Hardware Accelerators

If the overall performance of a uniprocessor system is too low, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator!

The hardware accelerator is a component that works together with the processor and executes key functions much faster than the processor.

© 2000 Wolf (Morgan Kaufmann)

Page 7: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Accelerated System Architecture

[Figure: CPU, accelerator, memory, and I/O connected by a bus. Numbered interactions: 1 Request, 2 Data, 3 Result.]

Request and Result may also require access to memory

© 2000 Wolf (Morgan Kaufmann)

Page 8: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

An Accelerator is not a Co-Processor

A co-processor is connected to the CPU and executes special instructions. Instructions are dispatched by the CPU.

An accelerator appears as a device on the bus

© 2000 Wolf (Morgan Kaufmann)

Page 9: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Amdahl’s Law

Amdahl’s law states that the performance improvement of an improved unit is limited by the fraction of time the unit is in use!

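$$\text{Speedup} = \frac{\text{ExecutionTime}_{\text{Old}}}{\text{ExecutionTime}_{\text{Enhanced}}} = \frac{1}{(1 - \text{Fraction}_{\text{Enhanced}}) + \dfrac{\text{Fraction}_{\text{Enhanced}}}{\text{Speedup}_{\text{Enhanced}}}}$$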

Fraction denotes the fraction of the original execution time in which the enhancement can be used!

Page 10: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Example (Hennessy & Patterson)

An application uses the floating-point square root 20% of the time and floating-point operations 50% of the time. Is it better to
  implement a square root unit that speeds up this operation by a factor of 10, or to
  improve the floating-point instructions in general so that they run 2 times faster?

Page 11: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Example (Hennessy & Patterson)

Square Root: Speedup = 1 / ((1-0.2)+0.2/10) = 1/0.82 = 1.22

Floating-Point: Speedup = 1 / ((1-0.5)+0.5/2) = 1/0.75 = 1.33
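The two results can be reproduced with a small Python helper. This is a minimal sketch; the function name `amdahl_speedup` and its parameter names are chosen here for illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the run time is accelerated."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


sqrt_unit = amdahl_speedup(0.2, 10)  # dedicated square root unit, 10x faster
faster_fp = amdahl_speedup(0.5, 2)   # all floating-point instructions 2x faster

print(f"square root unit:      {sqrt_unit:.2f}")
print(f"faster floating point: {faster_fp:.2f}")
```

The modest 2x improvement wins because it applies to a larger share of the execution time.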

Page 12: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Amdahl’s Law: Lessons to be learned

The maximum speedup that is possible is limited by the fraction!
Assume infinite speedup: Speedup = 1 / ((1-F)+F/Infinity) = 1/(1-F)

Fraction F    0.1   0.3   0.5  0.9
Max. Speedup  1.11  1.43  2    10

Improve the common cases!
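The upper bound 1/(1-F) can be tabulated directly. A minimal sketch, with `max_speedup` as an illustrative name:

```python
def max_speedup(fraction):
    """Upper bound on the speedup if the enhanced part took no time at all."""
    return 1.0 / (1.0 - fraction)


for f in (0.1, 0.3, 0.5, 0.9):
    print(f"F = {f}: max. speedup = {max_speedup(f):.2f}")
```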

Page 13: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Amdahl’s Law for Parallel Architectures

Amdahl’s law can even be used for parallel architectures, where sequential code is parallelized and runs on identical parallel units!

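$$\text{Speedup} = \frac{\text{ExecutionTime}_{\text{Serial}}}{\text{ExecutionTime}_{\text{Parallel}}} = \frac{1}{(1 - \text{Fraction}_{\text{Parallel}}) + \dfrac{\text{Fraction}_{\text{Parallel}}}{\text{ParallelUnits}}}$$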

Fraction denotes the fraction of the code that can be parallelized!
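As an illustration, the parallel form can be evaluated for a growing number of units. The parallel fraction of 0.9 below is an assumed value and does not come from the lecture.

```python
def parallel_speedup(fraction_parallel, parallel_units):
    """Amdahl's law for code that runs on identical parallel units."""
    return 1.0 / ((1.0 - fraction_parallel) + fraction_parallel / parallel_units)


# Assumed example: 90% of the code can be parallelized.
for units in (2, 4, 16, 1024):
    print(f"{units:5d} units: speedup = {parallel_speedup(0.9, units):.2f}")
```

Even with very many units the speedup stays below 1/(1-0.9) = 10, because the serial 10% is not accelerated.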

Page 14: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Design: Hardware Accelerator

Page 15: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Design of a hardware accelerator

Which functions shall be implemented in hardware and which functions in software?

Hardware/software co-design: joint design of hardware and software architectures

The hardware accelerator can be implemented as an application-specific integrated circuit (ASIC) or in a field-programmable gate array (FPGA).

Page 16: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Hardware Software Co-Design

[Figure: hardware/software co-design flow with the boxes System Model (original program, concurrent processes), Estimation Library, Partitioning & Mapping, SW-Model (C/C++), SW Compilation, Executable Program, HW-Model (VHDL), HW Synthesis, Netlist, and Verification for both the SW and the HW path]

Which functions shall go to HW and SW?

Good estimates are needed for good partitioning

Page 17: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Hardware/Software Co-Design

Hardware/software co-design covers the following problems:
  Co-Specification: The creation of specifications that describe both the hardware and the software of a system
  Co-Synthesis: The automatic or semi-automatic design of hardware and software to meet a specification
  Co-Simulation: The simultaneous simulation of hardware and software elements on different levels of abstraction

Page 18: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Co-Synthesis

Four tasks are included in co-synthesis:
  Partitioning: The functionality of the system is divided into smaller, interacting computation units
  Allocation: The decision which computational resources are used to implement the functionality of the system
  Scheduling: If several system functions have to share the same resource, the usage of the resource must be scheduled in time
  Mapping: The selection of a particular allocated computational unit for each computation unit

All these tasks depend on each other!

Page 19: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Partitioning

During partitioning the functionality of the system is partitioned into several parts (corresponding to the allocated/available components)
Many possible partitions exist
Analysis is done by evaluating the costs of different partitions

[Figure: two alternative partitions of the same graph with nodes A, B, C, D, and E]

Page 20: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Estimation

In order to get a good partitioning, there is a need for good figures about
  the performance (execution time) of a function on different components
  the communication time

Page 21: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Estimation: Accuracy and Fidelity

The accuracy of an estimate is a measure of how close the estimate is to the actual value on the real implementation

The fidelity of an estimation method is defined as the percentage of correctly predicted comparisons between design implementations

Page 22: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Fidelity

Though accuracy is much higher in (2) than in (1), the estimates are not very useful for the partitioning process because of the low fidelity!

This can cause bad design decisions!

[Figure: two charts, (1) and (2), each plotting the estimate and the measurement of a quality metric for the designs A, B, and C. Chart 1: Fidelity = 100%. Chart 2: Fidelity = 33% (only A > C correct).]
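Fidelity can be computed by comparing every pair of designs. The sketch below assumes that a comparison counts as correct when the estimates order the two designs the same way as the measurements; all numbers are made up so that the two cases give 100% and 33%.

```python
from itertools import combinations


def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def fidelity(estimated, measured):
    """Percentage of design pairs whose ordering is predicted correctly."""
    pairs = list(combinations(measured, 2))
    correct = sum(
        sign(estimated[a] - estimated[b]) == sign(measured[a] - measured[b])
        for a, b in pairs
    )
    return 100.0 * correct / len(pairs)


# Hypothetical quality metrics for three designs.
measured = {"A": 2.0, "B": 2.1, "C": 1.9}
inaccurate_same_rank = {"A": 4.0, "B": 4.2, "C": 3.8}   # far off, same ordering
accurate_wrong_rank = {"A": 2.05, "B": 1.95, "C": 2.0}  # close, wrong ordering

print(fidelity(inaccurate_same_rank, measured))
print(fidelity(accurate_wrong_rank, measured))
```

In the second case only the comparison A > C is predicted correctly, although every single estimate is closer to its measurement than in the first case.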

Page 23: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Hardware/Software Co-Design

Strategies:

1. Start with an "all-software" configuration
   While (Constraints are not satisfied)
      Move the SW function that gives the best improvement to HW
   (implemented in COSYMA [Ernst, Henkel, Benner 1993])

2. Start with an "all-hardware" configuration
   While (Constraints are satisfied)
      Move the most costly HW component to SW
   (implemented in Vulcan [Gupta, DeMicheli 1995])
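The first strategy can be sketched as a greedy loop. This is an illustration only and not the COSYMA implementation: it assumes that the functions run strictly one after another, that the only constraint is a deadline, and that communication costs are ignored. The function names and timing values are hypothetical.

```python
def partition_all_software_first(functions, deadline):
    """Greedy sketch: move functions to hardware until the deadline is met.

    functions maps a name to (sw_time, hw_time), both estimated.
    Returns the set of functions moved to hardware and the resulting time.
    """
    in_hw = set()

    def total_time():
        return sum(hw if name in in_hw else sw
                   for name, (sw, hw) in functions.items())

    while total_time() > deadline:
        candidates = [name for name in functions if name not in in_hw]
        if not candidates:
            raise ValueError("deadline cannot be met even with all functions in HW")
        # Move the software function that gives the best improvement.
        best = max(candidates, key=lambda n: functions[n][0] - functions[n][1])
        in_hw.add(best)
    return in_hw, total_time()


# Hypothetical estimates: name -> (software time, hardware time)
functions = {"f": (5, 2), "g": (5, 2), "h": (5, 4)}
hw_functions, time_needed = partition_all_software_first(functions, deadline=12)
print(hw_functions, time_needed)
```

The second strategy would run the same loop in the opposite direction, starting with every function in hardware and moving the most costly component back to software while the constraints still hold.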

Page 24: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Papers on HW/SW Co-Design

R. Ernst et al. Hardware-software cosynthesis for microcontrollers. IEEE Design & Test of Computers. December 1993.

R. K. Gupta and G. de Micheli. Hardware-software cosynthesis for digital systems. IEEE Design & Test of Computers. December 1993.

G. de Micheli and R. K. Gupta. Hardware/software co-design. Proceedings of the IEEE. March 1997.

… (and much much more)

Electronic versions of these and other papers can be accessed through the KTH Library (www.lib.kth.se)

Page 25: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

System design tasks

Design a heterogeneous multiprocessor architecture.
  Processing element (PE): CPU, accelerator, etc.
Divide tasks among the processing elements
Verify that
  Functionality of the system is correct
  System meets the performance constraints

Page 26: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Why accelerators?

Better cost/performance.
  Custom logic may be able to perform an operation faster than a CPU of equivalent cost.
  CPU cost is a non-linear function of performance.
  To improve performance by choosing a faster CPU may be very expensive!

[Figure: CPU cost plotted against performance]

© 2000 Wolf (Morgan Kaufmann)

Page 27: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Accelerated system design

First, determine that the system really needs to be accelerated.
  Which core function(s) shall be accelerated? (Partitioning)
  How much faster is the accelerator on the core function?
  How much is the data transfer overhead?
Design tasks:
  performance analysis;
  scheduling and allocation.
Design the accelerator itself.
Design CPU interface to accelerator.

© 2000 Wolf (Morgan Kaufmann)

Page 28: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Performance analysis

Critical parameter is speedup: how much faster is the system with the accelerator?

Must take into account:
  Accelerator execution time.
  Data transfer time.
  Synchronization with the master CPU.
    The accelerator needs to know when it can start its computation
    The CPU needs to know when the results are ready

© 2000 Wolf (Morgan Kaufmann)

Page 29: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Single- vs. multi-threaded

One critical factor is available parallelism:
  single-threaded/blocking: CPU waits for accelerator;
  multithreaded/non-blocking: CPU continues to execute along with accelerator.
To multithread, CPU must have useful work to do.
  But software must also support multithreading.

© 2000 Wolf (Morgan Kaufmann)

Page 30: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Sources of parallelism

Overlap I/O and accelerator computation.
  Perform operations in batches, read in second batch of data while computing on first batch.
Find other work to do on the CPU.
  May reschedule operations to move work after accelerator initiation.

© 2000 Wolf (Morgan Kaufmann)

Page 31: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Total execution time

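[Figure: two execution graphs over the CPU processes P1 to P4 and the accelerator task A1. Single-threaded: the CPU waits while A1 runs on the accelerator. Multi-threaded: after a split the CPU continues to execute while A1 runs on the accelerator, and both paths meet again at a join.]

© 2000 Wolf (Morgan Kaufmann)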

Page 32: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Communication Overhead: Data input/output times

Bus transactions include:
  flushing register/cache values to main memory;
  time required for CPU to set up transaction;
  overhead of data transfers by bus packets, handshaking, etc.

© 2000 Wolf (Morgan Kaufmann)

Page 33: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Accelerator execution time

Total accelerator execution time:

$$t_{accel} = t_{in} + t_x + t_{out}$$

where t_in is the data input time, t_x is the accelerated computation time, and t_out is the data output time.

© 2000 Wolf (Morgan Kaufmann)

Page 34: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Execution time analysis

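Single-threaded: Count execution time of all component processes.

Multi-threaded: Find longest path through execution.

[Figure: two timelines, each with a CPU row (P1, P2, P3, P4) and an accelerator row (A1, made up of t_in, t_x, and t_out, where t_in and t_out are communication overhead). Single-threaded: the CPU waits while A1 runs, so the execution time is the sum of all parts. Multi-threaded: A1 overlaps with work on the CPU, which shortens the execution time.]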
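The two rules can be written down as a short sketch. The process times and the assumption that P2 and P3 overlap with A1 are hypothetical and only serve to illustrate the difference between adding up all times and taking the longest path.

```python
def accelerator_time(t_in, t_x, t_out):
    """t_accel = t_in + t_x + t_out"""
    return t_in + t_x + t_out


def single_threaded(cpu_times, t_accel):
    """The CPU blocks while the accelerator runs, so all times add up."""
    return sum(cpu_times.values()) + t_accel


def multi_threaded(before, overlapped, after, t_accel):
    """Longest path: the CPU does `overlapped` work while the accelerator runs."""
    return before + max(overlapped, t_accel) + after


# Hypothetical process times in time units.
cpu = {"P1": 4, "P2": 3, "P3": 1, "P4": 4}
t_a1 = accelerator_time(t_in=1, t_x=3, t_out=1)

t_single = single_threaded(cpu, t_a1)
# Assumed dependency structure: P1, then A1 in parallel with P2 and P3, then P4.
t_multi = multi_threaded(cpu["P1"], cpu["P2"] + cpu["P3"], cpu["P4"], t_a1)
print(t_single, t_multi)
```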

Page 35: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Example for Accelerator Architecture

[Figure: CPU, memory (Mem), and DMA connected by a bus. The accelerator consists of a bus interface, a read unit, a write unit, registers, and the accelerator core.]

© 2000 Wolf (Morgan Kaufmann)

Page 36: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Accelerator/CPU interface

Accelerator registers provide control registers for CPU.

Data registers can be used for small data objects.

Accelerator may include special-purpose read/write logic. Especially valuable for large data transfers.

© 2000 Wolf (Morgan Kaufmann)
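A register-based interface can be modeled in software to illustrate the protocol. The register names, bit assignments, and the squaring "core" below are all hypothetical; a real accelerator defines its own register map, and the CPU would access it through memory-mapped addresses rather than Python attributes.

```python
class ToyAccelerator:
    """Software model of an accelerator as seen through its registers."""

    START = 0x1  # control register bit, written by the CPU
    DONE = 0x1   # status register bit, set by the accelerator

    def __init__(self):
        self.control = 0
        self.status = 0
        self.data_in = 0
        self.data_out = 0

    def step(self):
        """One 'clock tick': run the core if the START bit is set."""
        if self.control & self.START:
            self.data_out = self.data_in * self.data_in  # stand-in for the core
            self.control &= ~self.START
            self.status |= self.DONE


def run_blocking(acc, value):
    """Single-threaded use: the CPU waits until the result is ready."""
    acc.data_in = value       # small data object goes through a data register
    acc.status = 0
    acc.control |= acc.START  # tell the accelerator that it can start
    while not (acc.status & acc.DONE):
        acc.step()            # in hardware the accelerator runs on its own
    return acc.data_out


result = run_blocking(ToyAccelerator(), 7)
print(result)
```

For large data objects the operands would stay in main memory and the read/write logic of the accelerator would fetch them, with only addresses and lengths passed through registers.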

Page 37: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Caching problems

Main memory provides the primary data transfer mechanism to the accelerator.

Programs must ensure that caching does not invalidate main memory data (Assume a cache in CPU).

© 2000 Wolf (Morgan Kaufmann)

Page 38: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Possible Problems with Caches

1. CPU reads location S.

2. Accelerator writes location S.

3. CPU reads location S.

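[Figure: CPU with a cache holding S, main memory, and accelerator. Arrows 1 to 3 mark the three steps; the read in step 3 is served from the cache and returns the wrong value!]

© 2000 Wolf (Morgan Kaufmann)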
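The three steps can be replayed with a toy model of a cache that is not kept coherent. The dictionaries and function names are illustrative assumptions; the initial value 1 and the written value 42 are arbitrary.

```python
memory = {"S": 1}  # main memory
cache = {}         # CPU cache, not aware of other bus masters


def cpu_read(addr):
    """Read through the cache: only a miss goes to main memory."""
    if addr not in cache:
        cache[addr] = memory[addr]
    return cache[addr]


def accelerator_write(addr, value):
    """The accelerator writes to main memory; the cache is not informed."""
    memory[addr] = value


first = cpu_read("S")        # 1. CPU reads location S (miss, now cached)
accelerator_write("S", 42)   # 2. Accelerator writes location S
second = cpu_read("S")       # 3. CPU reads location S again (cache hit)
print(first, second, memory["S"])
```

The second read still returns the old value although main memory already holds the new one.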

Page 39: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Cache Coherence Problem

Cache coherence problems also appear in multiprocessor systems
  Cache and main memory do not have the same contents
  The Avalon bus, like most on-chip buses, does not have a built-in mechanism to avoid these problems

[Figure: processors P1 to Pn, each with its own cache, connected by a bus to main memory]

Page 40: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Cache Coherence with Write-Through Caches

How to tackle cache coherence?
  Idea: Caches must be aware of the transactions on the bus!
  Add extra hardware and define a protocol to be able to detect invalid data in the caches
  Take action if cache or memory (in case of write-back caches) is invalid

[Figure: processors P1 to Pn, each with a cache that snoops the bus to main memory. Cache coherence protocol: each cache line is in state V (valid) or I (invalid). Legend: Cache-Memory Transition, Bus Snooping.]

More about cache coherence protocols in IL2207 SoC Architectures
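A minimal sketch of the snooping idea, assuming a write-through cache with the two line states V (valid) and I (invalid) and an invalidate-on-write policy. The class and method names are made up for this illustration and do not describe a particular bus or protocol.

```python
class SnoopingCache:
    """Cache whose lines are either valid ("V") or invalid ("I")."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> (state, value)

    def read(self, addr):
        state, value = self.lines.get(addr, ("I", None))
        if state == "I":  # miss or invalidated line: fetch from memory
            value = self.memory[addr]
            self.lines[addr] = ("V", value)
        return value

    def snoop_write(self, addr):
        """Called for every write that another bus master performs."""
        if addr in self.lines:
            self.lines[addr] = ("I", None)


class Bus:
    """Bus that lets all caches observe writes to main memory."""

    def __init__(self, memory, caches):
        self.memory = memory
        self.caches = caches

    def write(self, addr, value):
        self.memory[addr] = value
        for cache in self.caches:
            cache.snoop_write(addr)


memory = {"S": 1}
cache = SnoopingCache(memory)
bus = Bus(memory, [cache])

first = cache.read("S")   # line becomes valid
bus.write("S", 42)        # accelerator write, the snooping cache invalidates S
second = cache.read("S")  # invalid line forces a fresh read from memory
print(first, second)
```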

Page 41: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

What to do, if no cache coherence protocol exists?

Designer has to be aware of possible cache coherence problems

Disciplined programming is needed
  Use commands to explicitly bypass the cache if there is a risk of a cache coherence problem

Page 42: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Example: Accelerator

[Figure, left: data-flow graph in which f(x) and g(y) feed h, computing h(f(x),g(y)). Figure, right: architecture with the components P, A, and M.]

Page 43: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Execution Times

Both P and A have sufficient registers

P and A cannot access the bus simultaneously

A memory access (load or store) takes 1 time unit

Execution times in time units:

Function  P  A
f         5  2
g         5  2
h         5  -

Page 44: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Single-Processor Solution

[Figure: data-flow graph h(f(x),g(y)) with f, g, and h all mapped to P]

Step          Time
Load x        1
Load y        1
f             5
g             5
h             5
Store h(...)  1
Total         18

Page 45: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Processor-Accelerator Solution I

[Figure: data-flow graph h(f(x),g(y)) with f and g mapped to A and h mapped to P]

Step     Unit  Time
Load x   A     1
Load y   A     1
f        A     2
g        A     2
Store f  A     1
Store g  A     1
Load f   P     1
Load g   P     1
h        P     5
Store h  P     1
Total          16

Still Single-Thread!

Page 46: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Processor-Accelerator Solution II

[Figure: data-flow graph h(f(x),g(y)) with f mapped to A, and g and h mapped to P]

P            A
Load y   1
g        5   Load x   1
             f        2
             Store f  1
Load f   1
h        5
Store h  1
Total    13

Exploitation of parallelism leads to fast solution!
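The totals of the three solutions (18, 16, and 13 time units) can be recomputed from the execution-time table. The sketch below hard-codes the three schedules from the slides; the constant and variable names are chosen for illustration.

```python
MEM = 1                             # one memory access (load or store)
P_TIME = {"f": 5, "g": 5, "h": 5}   # execution times on the processor P
A_TIME = {"f": 2, "g": 2}           # execution times on the accelerator A

# Single processor: load x, load y, f, g, h, store result.
single = 2 * MEM + P_TIME["f"] + P_TIME["g"] + P_TIME["h"] + MEM

# Solution I: A computes f and g, P computes h, nothing overlaps.
solution_1 = (2 * MEM + A_TIME["f"] + A_TIME["g"] + 2 * MEM  # work on A
              + 2 * MEM + P_TIME["h"] + MEM)                 # work on P

# Solution II: P loads y first. While P computes g, A loads x, computes f,
# and stores f, so the two bus accesses of A never collide with P.
p_branch = P_TIME["g"]
a_branch = MEM + A_TIME["f"] + MEM
solution_2 = MEM + max(p_branch, a_branch) + MEM + P_TIME["h"] + MEM

print(single, solution_1, solution_2)
```

In Solution II the accelerator branch (4 time units) is completely hidden behind the computation of g on P (5 time units).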

Page 47: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

System integration and debugging

Try to debug the CPU/accelerator interface separately from the accelerator core.

Build equipment to test the accelerator.

Hardware/software co-simulation can be useful.

© 2000 Wolf (Morgan Kaufmann)

Page 48: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Summary

The use of a hardware accelerator can lead to a more efficient solution
  In particular when the parallelism in the functionality can be exploited
Hardware/software co-design techniques can be used for the design of an accelerator
You have to be aware of cache coherence problems if the processor or accelerator uses a cache

Page 49: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Configurable Processor Cores

Ingo Sander

ingo@kth.se

Page 50: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Motivation for Configurable Processor Cores

Observations
  Time-to-market is critical
  Development time for software is much shorter than for hardware
  Hardware can be customized and has much better performance than a software solution

Page 51: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Why Configurable Processor Cores?

Idea
  Combine the advantages of hardware and software in the form of a customizable processor to achieve
    Clearly shorter Time-To-Market than hardware
    Clearly better performance than software
  Provide a processor platform with a basic architecture that can be extended by additional optimized units (MAC, Floating-Point Unit)
  Own instructions together with own customized hardware can be defined for the processor

Page 52: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Example for a configurable processor: Xtensa (Tensilica)

The Xtensa processor core
  targets system-on-chip applications
  is configurable, extensible and synthesizable
  has
    Base Instruction Set Architecture
    Configurable Functions (Parametrised)
    Optional Functions
    Designer-Defined Functions and Registers (for Acceleration of Specific Algorithms)

Page 53: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Xtensa Processor Core

Page 54: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Basic Xtensa Core

32-bit architecture
Base configuration:
  32-bit ALU
  Up to 64 general purpose registers
  6 special purpose registers
  80 base instructions
  Improved 16- and 24-bit RISC instruction encoding

Page 55: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Optional Architecture

Execution Units
  Multipliers, 16 and 32 bits
  MAC-Unit, Floating-Point Unit
Interface Options
Memory Subsystem Options
  Memory Management Options
  Local Data and Instruction Caches
  Separate RAM, ROM Areas for Data and Instruction

Page 56: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Tensilica Extension Language

The Tensilica extension language is used to describe new instructions, registers and execution units that are then automatically added to the Xtensa processor

Page 57: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Xtensa Processor: Design Process

Page 58: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Design Flow

1. Choose basic Xtensa processor
2. Specify algorithm in C
3. Compile to target processor
4. Profile and check if design constraints are met
5. If constraints are met, everything is fine, otherwise
6. Choose optional functions (e.g. Multiplier) or design new instructions for the critical part => improved architecture
7. Adjust your code for the new architecture
8. Go back to 3.

Page 59: Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

Summary

The Xtensa concept provides
  Not only a configurable architecture
  But also a design methodology

The idea is to take the best of both the hardware and the software world in order to
  Have good performance
  Short Time-to-Market

Xtensa processors can be used as parts of a system-on-chip architecture

Other extensible cores exist, such as the Nios II from Altera