© 2007 Elsevier

Lecture 6: Embedded Processors

Embedded Computing Systems

Mikko Lipasti, adapted from M. Schulte

Based on slides and textbook from Wayne Wolf

High Performance Embedded Computing

Topics

Embedded microprocessor market.
Categories of CPUs.
RISC, DSP, and multimedia processors.
CPU mechanisms.


Demand for Embedded Processors

Embedded processors account for:
  Over 97% of total processors sold.
  Over 60% of total sales from processors.
Sales are expected to increase by roughly 15% each year.


Flynn’s taxonomy of processors

Single-instruction single-data (SISD).
Single-instruction multiple-data (SIMD).
Multiple-instruction multiple-data (MIMD).
Multiple-instruction single-data (MISD).

What is an example of each? Which would you expect to see in embedded systems?


Other axes of comparison

RISC vs. CISC: instruction set style.
Instruction issue width.
Static vs. dynamic scheduling for multiple-issue machines.
Scalar vs. vector processing.
Single-threaded vs. multithreading.

A single CPU can fit into multiple categories.


Embedded vs. general-purpose processors

Embedded processors may be customized for a category of applications.
  Customization may be narrow or broad.
We may judge embedded processors using different metrics:
  Code size.
  Energy efficiency.
  Memory system performance.
  Predictability.


Embedded RISC processors

RISC processors often have simple, highly pipelinable instructions.
Pipelines of embedded RISC processors have grown over time:
  ARM7 has a 3-stage pipeline.
  ARM9 has a 5-stage pipeline.
  ARM11 has an 8-stage pipeline.

Figure: ARM11 pipeline [ARM05].


RISC processor families

ARM:
  ARM7 has in-order execution and no memory management or branch prediction.
  ARM9 and ARM11 have out-of-order execution, memory management, and branch prediction.
MIPS:
  MIPS32 4K has a 5-stage pipeline; the 4KE family has a DSP extension; the 4KS is designed for security.
PowerPC:
  The PowerPC 400 series includes several embedded processors; Motorola and IBM offer superscalar versions of the PowerPC.


Embedded DSP Processors

DSP processors feature:
  Deterministic execution times.
  Fast multiply-accumulate instructions.
  Multiple data accesses per cycle.
  Specialized addressing modes.
  Efficient support for loops and interrupts.
  Efficient processing of “streaming” data.

Embedded DSP processors are optimized to perform DSP algorithms: speech coding, filtering, convolution, fast Fourier transforms, discrete cosine transforms.

$$y_k = \sum_{n=0}^{N} b_n \, x_{k-n}$$
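
The sum above has the form of a finite impulse response (FIR) filter, where each term is one multiply-accumulate. The following is a minimal Python sketch of that computation, not code from the lecture. The function name `fir_filter` is illustrative, and it assumes input samples before the start of the signal are zero.

```python
def fir_filter(b, x):
    """Compute y[k] = sum over n of b[n] * x[k - n].

    Assumption: x[j] is treated as 0 for j < 0.
    """
    y = []
    for k in range(len(x)):
        acc = 0  # plays the role of the DSP accumulator
        for n, coeff in enumerate(b):
            if k - n >= 0:
                acc += coeff * x[k - n]  # one multiply-accumulate per tap
        y.append(acc)
    return y
```

The inner loop is what a DSP's multiply-accumulate instruction, dual data accesses, and zero-overhead loop support are meant to speed up: each tap needs one coefficient fetch, one sample fetch, one multiply, and one add.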


Example: TI C55x/C54x DSPs

40-bit arithmetic (32-bit values + 8 guard bits).
Barrel shifter.
17 x 17 multiplier.
Two address generators.
Lots of special-purpose registers and addressing modes.
Coprocessors for compute-intensive functions, including pixel interpolation, motion estimation, and DCT/IDCT computations.
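
To see why the 8 guard bits matter, the sketch below accumulates 32-bit values in a 40-bit two's-complement accumulator and compares that with a plain 32-bit one. It is a simplified model, not a description of the C55x datapath. The helper names are illustrative, and saturating the result when narrowing back to 32 bits is an assumption (real hardware behavior depends on the processor mode).

```python
ACC_BITS = 40  # 32-bit value plus 8 guard bits, as stated on the slide
OUT_BITS = 32


def wrap_signed(value, bits):
    """Wrap an integer into a two's-complement field of the given width."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value


def saturate(value, bits):
    """Clamp to the signed range of the given width (assumed store behavior)."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def accumulate(values, acc_bits=ACC_BITS):
    """Sum values in an accumulator that wraps at acc_bits."""
    acc = 0
    for v in values:
        acc = wrap_signed(acc + v, acc_bits)
    return acc
```

With four additions of the largest positive 32-bit value, the 40-bit accumulator holds the exact sum, while a 32-bit accumulator wraps around to a small negative number. In this model the guard bits let intermediate sums grow by up to a factor of 2^8 before the accumulator itself overflows.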


TI C55x microarchitecture


Parallelism extraction

Static:
  Use compiler to analyze program.
  Simpler CPU.
  Can’t depend on data values.
  Example: VLIW.
Dynamic:
  Use hardware to identify opportunities.
  More complex CPU.
  Can make use of data values.
  Example: superscalar.


VLIW architectures

Each very long instruction word (VLIW) performs multiple operations in parallel.
Needs a good compiler that understands the architecture.
Allows deterministic execution times.
Code growth can be reduced by allowing:
  Operations within an instruction to be performed sequentially.
  A given field to specify different types of operations.

Figure: VLIW instruction formats. Fields: Branch | Memory | Memory | Arithmetic | Logic | Vector. Compacted fields: Branch/Mem | Mem/Arith | Vector | Arith/Logic | Seq.


Simple VLIW architecture

Large register file feeds multiple function units.

Figure: a register file feeding an E box with function units ALU, ALU, Load/store, Load/store, FU. Example instruction word: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP
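
The example instruction word fills four of the five slots and pads the last with a NOP. The sketch below shows, in a very simplified form, how a compiler might pack operations into such words. The slot layout is taken from the figure; the function name `pack_bundles`, the unit labels, and the assumption that all operations are mutually independent are illustrative simplifications. A real VLIW compiler must also respect data dependencies and latencies.

```python
SLOTS = ["ALU", "ALU", "LS", "LS", "FU"]  # slot layout from the figure


def pack_bundles(ops):
    """Greedily pack (unit, text) operations into fixed-format VLIW words.

    Assumption: the operations are independent, so only slot types matter.
    """
    for unit, _ in ops:
        if unit not in SLOTS:
            raise ValueError(f"no slot for unit {unit!r}")
    bundles = []
    pending = list(ops)
    while pending:
        bundle = ["NOP"] * len(SLOTS)
        remaining = []
        for unit, text in pending:
            for i, slot in enumerate(SLOTS):
                if slot == unit and bundle[i] == "NOP":
                    bundle[i] = text  # first free slot of the right type
                    break
            else:
                remaining.append((unit, text))  # no free slot: next word
        bundles.append(bundle)
        pending = remaining
    return bundles
```

Unfilled slots become NOPs, which is the source of the code growth that VLIW instruction encodings try to limit.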


Clustered VLIW architecture

Register file and function units are divided into clusters.
What are the advantages/disadvantages of having clusters in VLIW architectures?

Figure: two clusters, each with its own execution units and register file, connected by a cluster bus.


TI C62x/C67x DSPs

VLIW with up to 8 instructions/cycle.
32 32-bit registers.
Function units:
  Two multipliers.
  Six ALUs.
All instructions execute conditionally.


TI C6x data operations

8/16/32-bit arithmetic.
40-bit operations.
Bit manipulation operations.
C67x processors add floating-point arithmetic.


C6x block diagram

Figure: C6x block diagram. Blocks: Execute (Data path 1/Reg file 1, Data path 2/Reg file 2), Program RAM/cache (512K bits), Data RAM (512K bits), DMA, timers, serial, JTAG, PLL, bus.


Texas Instruments C62x

N. Seshan, “High VelociTI processing [Texas Instruments VLIW DSP architecture]”, IEEE Signal Processing Magazine, v. 15, no. 2, pp. 86-101, 117, 1998.


Emerging DSP Architectures

Parallelism at multiple levels:
  Multiple processors
    System-on-a-chip designs
  Multiple simultaneous tasks
    Multithreaded processors
  Multiple instructions per cycle
    Very long instruction word (VLIW) architectures
  Multiple operations per instruction
    Single instruction multiple data (SIMD) instructions
Architecture/compiler pairs improve performance and help manage application complexity.


Superscalar processors

Instructions are dynamically scheduled.
  Dependencies are checked at run time in hardware.
Used to some extent in embedded processors.
  Embedded Pentium is two-issue in-order.
  Some PowerPCs are superscalar.

What advantages/disadvantages do VLIW processors have compared to superscalar processors?
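
As a rough illustration of the run-time dependency check, the sketch below decides whether two adjacent instructions could issue in the same cycle on a two-issue in-order machine. The tuple format `(dest, src1, src2)` and the function name `can_dual_issue` are illustrative assumptions. Real issue logic also checks structural hazards and latencies, which are ignored here.

```python
def can_dual_issue(first, second):
    """Return True if `second` may issue alongside `first`.

    Each instruction is a (dest, src1, src2) tuple of register names.
    Only register dependencies between the pair are checked.
    """
    dest1 = first[0]
    dest2, *sources2 = second
    raw = dest1 in sources2  # second reads a value first has not yet produced
    waw = dest1 == dest2     # both write the same register
    return not (raw or waw)
```

A superscalar processor performs this kind of comparison in hardware on every cycle, whereas a VLIW compiler resolves it once at compile time.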


SIMD and subword parallelism

Many special-purpose SIMD machines.
  All processors perform the same operation on different data.
Subword parallelism is widely used for video.
  ALU is divided into subwords for independent operations on small operands.
Vector processing is another form of SIMD processing.
  These terms are often used interchangeably.


SIMD Instructions

Recent multimedia processors commonly support single instruction multiple data (SIMD) instructions.
  The same operation is performed on multiple data operands using a single instruction.
  Exploits low precision and high data parallelism of multimedia applications.

A3 A2 A1 A0

B3 B2 B1 B0

A3+B3 A2+B2 A1+B1 A0+B0
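
The lane-wise addition in the figure can be imitated in software by treating one 32-bit word as four 8-bit subwords. The sketch below is illustrative only. The 8-bit lane width, the wraparound (non-saturating) arithmetic, and the helper names are assumptions, not properties of any particular multimedia instruction set.

```python
LANES = 4
LANE_BITS = 8
LANE_MASK = (1 << LANE_BITS) - 1


def pack(lanes):
    """Pack [A0, A1, A2, A3] into one word, with A0 in the low byte."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a word back into [A0, A1, A2, A3]."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """Add four 8-bit lanes with wraparound; no carry crosses a lane boundary."""
    low7 = 0x7F7F7F7F   # low 7 bits of every lane
    high = 0x80808080   # top bit of every lane
    # Add the low 7 bits (cannot overflow a lane), then combine the top
    # bits with XOR so their carry-out is discarded.
    return ((a & low7) + (b & low7)) ^ ((a ^ b) & high)
```

A hardware SIMD unit gets the same effect by cutting the carry chain of the ALU at the subword boundaries, so one add instruction performs all four additions.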


Dynamic behavior of loops in MediaBench

The loops of media applications are in many cases not very deep.
Path ratio = (instructions executed per iteration) / (total number of loop instructions).
What does the path ratio reveal?
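
The path ratio definition translates directly into a one-line calculation. The sketch below uses made-up numbers purely for illustration; they are not MediaBench measurements, and the function name is hypothetical.

```python
def path_ratio(instructions_per_iteration, total_loop_instructions):
    """Fraction of a loop's static instructions executed in one iteration."""
    if total_loop_instructions <= 0:
        raise ValueError("loop must contain at least one instruction")
    return instructions_per_iteration / total_loop_instructions


# Hypothetical loop: 40 instructions in the body, 10 executed per iteration.
example = path_ratio(10, 40)
```

A ratio well below 1, such as the 0.25 in this hypothetical case, would indicate that a typical iteration skips most of the loop body, so control flow inside the loop is significant.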


TriMedia TM-1 characteristics

Floating-point support.
Sub-word parallelism support.
VLIW.
Additional custom operations.


TriMedia TM-1

Figure: TM-1 block diagram. Blocks: memory interface, video in, video out, audio in, audio out, I2C, serial, timers, image co-processor, VLD co-processor, PCI, VLIW CPU.


Multithreading

Low-level parallelism mechanism.
Interleaved multithreading (IMT) alternately fetches instructions from separate threads.
  Often used with VLIW and vector processors.
Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle.
  Often used with superscalar processors.

What advantages/disadvantages does IMT have relative to SMT?
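
The fetch pattern of interleaved multithreading can be sketched as a round-robin over per-thread instruction queues. This is a toy model with illustrative names: it fetches one instruction per cycle, skips threads that have finished, and ignores stalls and thread priorities.

```python
def interleave_fetch(threads):
    """Return the fetch order for IMT as a list of (thread, instruction).

    `threads` maps a thread name to its list of instructions. Threads are
    visited in insertion order, one instruction per visit.
    """
    queues = {name: list(instrs) for name, instrs in threads.items()}
    trace = []
    while any(queues.values()):
        for name, queue in queues.items():
            if queue:
                trace.append((name, queue.pop(0)))
    return trace
```

For example, `interleave_fetch({"T0": ["a0", "a1"], "T1": ["b0"]})` alternates between the two threads until T1 runs out. An SMT model would instead pull instructions from several queues within the same cycle.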


Dynamic voltage scaling (DVS)

Power scales with V^2, while performance scales roughly as V.
Reduce operating voltage, add parallel operating units to make up for lower clock speed.
DVS doesn’t work well in processors with high leakage power.
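
The trade of voltage for parallelism can be checked with a small calculation. The sketch below assumes the usual dynamic-power model P = C * V^2 * f (so power at a fixed clock scales with V^2), a clock frequency proportional to V, perfectly parallel work, and no leakage. All quantities are ratios relative to a single unit at nominal voltage, and the function names are illustrative.

```python
def throughput(v_ratio, units=1):
    """Relative throughput: each unit's clock scales linearly with voltage."""
    return units * v_ratio


def dynamic_power(v_ratio, units=1):
    """Relative dynamic power under P = C * V^2 * f, with f proportional to V."""
    freq_ratio = v_ratio  # assumption: clock frequency tracks voltage
    return units * v_ratio ** 2 * freq_ratio
```

Under these assumptions, two units at half voltage give `throughput(0.5, 2) == 1.0`, the same as one unit at full voltage, while `dynamic_power(0.5, 2) == 0.25`. Leakage power does not shrink in the same way, which is why the benefit fades in high-leakage processes.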


Dynamic voltage and frequency scaling (DVFS)

Scale both voltage and clock frequency.
Can use control algorithms to match performance to application, reduce power.
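
One simple control policy, shown here only as a sketch, is to pick the slowest operating point that still meets a task's deadline. The table of (frequency, voltage) pairs and the function name are hypothetical; real processors expose their own discrete operating points, and practical governors also account for switching overhead and workload prediction.

```python
# Hypothetical operating points: (clock frequency in Hz, supply voltage in V).
OPERATING_POINTS = [
    (200_000_000, 0.9),
    (400_000_000, 1.0),
    (800_000_000, 1.2),
]


def choose_operating_point(cycles_needed, deadline_s, points=OPERATING_POINTS):
    """Return the lowest-frequency point that finishes within the deadline.

    Falls back to the fastest point if no point can meet the deadline.
    """
    for freq, volt in sorted(points):
        if cycles_needed / freq <= deadline_s:
            return freq, volt
    return max(points)
```

Because lower frequency permits lower voltage, choosing the slowest adequate point reduces energy per operation while still matching performance to the application.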


Razor architecture

Razor runs the clock faster than the worst case allows.
Uses a specialized latch to detect errors.
Recovers only on errors, gains average-case performance.
