
Copyright © 2008 ARM Limited. All rights reserved. The ARM logo is a registered trademark of ARM Ltd.

All other trademarks are the property of their respective owners and are acknowledged


Developing optimized signal processing software on the Cortex-M4 processor

Shyam Sadasivan, ARM

November, 2010

1. Introduction

A microcontroller, according to the oft-quoted Wikipedia, is a small computer on a single integrated

circuit consisting internally of a CPU, clock, timers, I/O ports, and memory. It also says that a digital

signal processor (DSP) is a specialized microprocessor with an optimized architecture for the fast

operational needs of digital signal processing, which is concerned with the representation of

signals by a sequence of numbers or symbols and the processing of these signals.

What it does not tell us is that these two worlds are coming closer every day.

32-bit microcontrollers have changed the embedded landscape in the recent past. End-users are

finding easy to use 32-bit technology within their grasp for their performance hungry signal

processing applications. With the ARM microcontroller partnership offering an incredible range of

products based upon Cortex-M processors, the choice of performance, peripherals and software is

now richer than ever before. Looking further into the future, processing demands will continue to

increase as designs incorporate both control and signal processing features into a single device.

Figure 1 : A typical embedded system with both control and signal processing requirements

2. Digital signal controller (DSC)

The digital signal controller creates an efficient blend of digital control and signal processing. It

addresses the requirements of an increasingly converging but still demanding market.

Figure 2 : Digital Signal Controllers – efficient hybrid of MCU and DSP characteristics




The focus of this paper – the central processor in the Digital Signal Controller

One of the biggest challenges of most systems requiring digital signal processing is to manage the

data flow through the system. The input and output data can represent different real world signals

including motor position, audio signals, video signals, RF signals (GPS, etc.), sensors etc. Moreover,

there are many characteristics of a DSC including the CPU, the peripherals, memory, number of GPIO

pins, connectivity, etc., that contribute to its applicability for a particular design. For the purpose of

this paper, we will focus on the characteristics of the central processor, the architecture of which

significantly influences the software techniques employed for optimum signal processing

throughput.

3. The ARM Cortex-M4 processor – an excellent CPU for 32-bit DSCs

A processor specifically designed for DSC devices is the ARM Cortex-M4 processor. This new

processor extends the ARM Cortex-M family of processors into signal processing markets through a

software compatible upgrade migration path for Cortex-M0 and Cortex-M3 users.

Cortex-M4 - microcontroller characteristics

The Cortex-M family of processors has a set of common technologies that make them an excellent

candidate for microcontroller applications. These features have already gained a lot of popularity

through the success of the Cortex-M0 and Cortex-M3 and form a key reason for the high rate of

adoption of the Cortex-M processors in the microcontroller marketplace today.

FEATURE: CHARACTERISTICS

RISC processor core: High performance 32-bit CPU; Deterministic operation; Low latency 3-stage pipeline

Thumb-2 technology: Optimal blend of 16/32-bit instructions; Very high code density; No compromise on performance

Low power modes: Integrated sleep state support; Multiple power domains; Architected software control

Nested Vectored Interrupt Controller (NVIC): Low latency, low jitter interrupt response; No need for assembly programming; Interrupt service routines in pure C

Tools and RTOS support: Broad 3rd party tools support; Cortex Microcontroller Software Interface Standard (CMSIS); Maximizes software effort reuse

CoreSight debug and trace: JTAG or 2-pin Serial Wire Debug (SWD) connection; Support for multiple processors; Support for real-time trace

Table 1 : Microcontroller characteristics of the Cortex-M4 processor




Cortex-M4 - signal processing characteristics

The Cortex-M4 processor builds upon the microcontroller features of the Cortex-M family and

introduces signal processing performance typically only associated with DSPs until now. The

features that make this possible are detailed in Table 2 below.

FEATURE: CHARACTERISTICS

Harvard architecture: 32-bit AHB-Lite interface for instruction fetches; 32-bit AHB-Lite interface for data and debug accesses

Single cycle 16,32-bit MAC: Wide range of MAC instructions; Choice of 32 or 64 bit accumulator; Instructions execute in a single cycle

Single cycle SIMD arithmetic: 4 parallel 8-bit adds or subtracts; 2 parallel 16-bit adds or subtracts; Instructions execute in a single cycle

Single cycle dual 16-bit MAC: 2 parallel 16 bit MAC operations; Choice of 32 or 64 bit accumulator; Instructions execute in a single cycle

Floating point unit: IEEE 754 standard compliant; Single precision floating-point unit; Fused MAC for higher precision

Others: Saturating math; Barrel shifter

Table 2 : Signal processing characteristics of the Cortex-M4 processor

4. Cortex-M4 processor signal processing features in detail

Harvard architecture

The Cortex-M4 processor is based on the Harvard architecture characterized by separate buses for

instructions and data. By being able to read both instruction and data from memory at the same

time, the Cortex-M4 processor can perform many operations in parallel, speeding application

execution. The 32-bit AHB-Lite ICode interface fetches instructions from the code space. The 32-bit

AHB-Lite DCode interface accesses data from the code memory space. The peripheral bus enables

access to components outside of the Cortex-M4 processor system.

8-bit & 16-bit packed data types

The data registers on the Cortex-M4 processor are 32 bits wide. Many signal processing applications, such as speech, audio, communications, and image processing, manipulate 8-bit and 16-bit data samples. To boost performance further, a 32-bit register can hold two 16-bit data samples or four 8-bit samples, and a single instruction can operate on all of them at once.




The Cortex-M4 provides a wide range of Single-Instruction-Multiple-Data (SIMD) functions to ensure

that such algorithms can execute in the minimum number of processor cycles. With SIMD, a

numeric operation will simultaneously apply to two 16-bit or four 8-bit values.
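The packed layout can be illustrated on any host with a portable model. The following is a minimal sketch, not the Cortex-M4 implementation: the helper names pack16, lane_lo, lane_hi and add16x2 are hypothetical, and add16x2 only models the behaviour of a wrapping dual 16-bit add, in which no carry crosses from the low lane into the high lane.

```c
#include <stdint.h>

/* Pack two signed 16-bit samples into one 32-bit word (lo in bits 0-15). */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Sign-extend the low 16 bits of v without relying on
   implementation-defined narrowing conversions. */
static int16_t sign16(uint32_t v)
{
    v &= 0xFFFFu;
    return (int16_t)((v >= 0x8000u) ? (int32_t)v - 65536 : (int32_t)v);
}

static int16_t lane_lo(uint32_t w) { return sign16(w); }
static int16_t lane_hi(uint32_t w) { return sign16(w >> 16); }

/* Model of a dual 16-bit add: each half is added independently and wraps
   modulo 2^16, so a carry out of the low lane never reaches the high lane. */
static uint32_t add16x2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}
```

On the Cortex-M4 the lane-wise add is a single instruction; the model only makes the data layout visible.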

SIMD arithmetic

The Cortex-M4 has the ability to perform arithmetic operations on packed 8- and 16-bit data. The

various flavours of these instructions are shown in Table 3 below.

INSTRUCTION  S        Q           SH        U         UQ          UH
TYPE         Signed   Signed      Signed    Unsigned  Unsigned    Unsigned
                      Saturating  Halving             Saturating  Halving

ADD8         SADD8    QADD8       SHADD8    UADD8     UQADD8      UHADD8
SUB8         SSUB8    QSUB8       SHSUB8    USUB8     UQSUB8      UHSUB8
ADD16        SADD16   QADD16      SHADD16   UADD16    UQADD16     UHADD16
SUB16        SSUB16   QSUB16      SHSUB16   USUB16    UQSUB16     UHSUB16

Table 3 : Cortex-M4 SIMD arithmetic instructions

There are also other powerful instructions which allow you to exchange half words of the second

operand register and perform different operations on each half. There are also unsigned sum of

differences instructions that can work on pixel data from images and are quite useful in applications

like motion estimation.
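As an illustration of the sum of differences operation on pixel data, here is a minimal portable sketch. It models what the ARMv7E-M USAD8 instruction computes on four packed 8-bit pixels; the function names sad8x4 and row_sad are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Model of an unsigned sum of absolute differences over the four
   8-bit lanes of two 32-bit words. */
static uint32_t sad8x4(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t pa = (a >> (8 * lane)) & 0xFFu;
        uint32_t pb = (b >> (8 * lane)) & 0xFFu;
        sum += (pa > pb) ? (pa - pb) : (pb - pa);
    }
    return sum;
}

/* Accumulate the SAD of two pixel rows, each stored as nwords packed
   32-bit words, as a motion estimation block match would. */
static uint32_t row_sad(const uint32_t *cur, const uint32_t *ref, int nwords)
{
    uint32_t acc = 0;
    for (int i = 0; i < nwords; i++) {
        acc += sad8x4(cur[i], ref[i]);
    }
    return acc;
}
```

A lower SAD between a candidate block and the reference block indicates a better match.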

Single cycle 16,32-bit MAC

One of the most important features of a processor for digital signal processing is an efficient single

cycle MAC responsible for speeding up a majority of DSP algorithms. The Cortex-M4 has a variety of

single-cycle MAC instructions for both 16 and 32-bit data as shown in Table 4 below.

OPERATION DESCRIPTION

16 x 16 = 32 16-bit signed multiply yielding 32-bit result

16 x 16 + 32 = 32 16-bit signed multiply with 32-bit accumulate

16 x 16 + 64 = 64 16-bit signed multiply with 64-bit accumulate

16 x 32 = 32 16-bit by 32-bit signed multiply returning 32-most-significant-bits

(16 x 32) + 32 = 32 16-bit by 32-bit signed multiply with 32-bit accumulate

32 x 32 = 32 32-bit multiply

32 ± (32 x 32) = 32 32-bit multiply accumulate/subtract

32 x 32 = 64 Signed/unsigned multiply to long

(32 x 32) + 64 = 64 Signed/unsigned multiply to long with accumulate

(32 x 32) + 32 + 32 = 64 32-bit unsigned multiply with double 32-bit accumulation yielding 64-bit result

32 ± (32 x 32) = 32 (upper) 32-bit multiply with 32-most-significant-bit accumulate/subtract

(32 x 32) = 32 (upper) 32-bit multiply returning 32-most-significant-bits

Table 4 : Single cycle 16 and 32-bit MAC operations of the Cortex-M4 processor
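The "(32 x 32) = 32 (upper)" row is the one most often used for fractional (Q31) arithmetic. The following is a minimal sketch of that operation in portable C, assuming Q31 operands; mul_hi32, q31_mul and floor_div_2_32 are hypothetical names, and the code models the result rather than the instruction itself.

```c
#include <stdint.h>

/* Floor of p / 2^32, written without shifting a negative value so the
   result does not depend on implementation-defined behaviour. */
static int64_t floor_div_2_32(int64_t p)
{
    const int64_t two32 = 4294967296LL;
    return (p >= 0) ? p / two32 : -((-p + two32 - 1) / two32);
}

/* Signed 32 x 32 multiply keeping the most significant 32 bits of the
   64-bit product. */
static int32_t mul_hi32(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;   /* full 64-bit product */
    return (int32_t)floor_div_2_32(p);
}

/* Q31 x Q31 -> Q31: the upper word of the product is in Q30, so it is
   doubled. The least significant bit is lost, and the single overflow
   case a == b == INT32_MIN must be avoided by the caller. */
static int32_t q31_mul(int32_t a, int32_t b)
{
    return mul_hi32(a, b) * 2;
}
```

For example, 0.5 in Q31 is 0x40000000, and q31_mul of that value with itself gives 0x20000000, which is 0.25.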




Single cycle dual 16-bit MAC

The Cortex-M4 processor can even perform two 16-bit MACs in parallel in a single cycle. This

effectively doubles the raw computational power of the core for 16-bit data and gives it a clear edge

compared to 16-bit devices.

OPERATION INSTRUCTION

(16 x 16) ± (16 x 16) = 32 Sum/difference of dual 16-bit signed multiply

(16 x 16) ± (16 x 16) + 32 = 32 Dual 16-bit signed multiply with single 32-bit accumulator

(16 x 16) ± (16 x 16) + 64 = 64 Dual 16-bit signed multiply with single 64-bit accumulator

Table 5 : Single cycle dual 16-bit MAC operations of the Cortex-M4 processor

Figure 3 shows the use of packed data for a dual 16-bit multiply operation with a single 64-bit

accumulator.

Figure 3 : Cortex-M4 packed data and dual 16-bit MAC
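The dual 16-bit multiply with a 64-bit accumulator can be modelled in portable C as follows. This is a sketch of the arithmetic only, under the assumption that the low halves are multiplied together and the high halves are multiplied together; lane16 and dual_mac16 are hypothetical names.

```c
#include <stdint.h>

/* Sign-extend one 16-bit lane of a packed word (hi != 0 selects bits 16-31). */
static int32_t lane16(uint32_t w, int hi)
{
    uint32_t v = hi ? (w >> 16) : (w & 0xFFFFu);
    return (v >= 0x8000u) ? (int32_t)v - 65536 : (int32_t)v;
}

/* (16 x 16) + (16 x 16) + 64 = 64: both lane products are added to a
   64-bit accumulator. */
static int64_t dual_mac16(uint32_t x, uint32_t y, int64_t acc)
{
    int32_t p_lo = lane16(x, 0) * lane16(y, 0);   /* at most 2^30 in magnitude */
    int32_t p_hi = lane16(x, 1) * lane16(y, 1);
    return acc + p_lo + p_hi;
}
```

The 64-bit accumulator matters in long filters: two worst-case products already exceed the range of a signed 32-bit value, as the second check in a test of this model would show.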

Single precision floating point unit ( FPU )

The FPU in the Cortex-M4 processor offers a much wider dynamic range than fixed-point arithmetic, because every floating point value carries its own exponent. It is also easier to program, since designers need not worry about the scaling constraints imposed by fixed-point processing. So far, the availability of floating point hardware in microcontrollers has been limited due to the higher silicon area costs. The low cost Cortex-M4 processor FPU now opens the path to a wide range of floating point enabled DSC devices.

The Cortex-M4 FPU provides functionality compliant with the IEEE 754 standard. The FPU supports single-precision data-processing instructions and data types. Some of the floating point operations supported are shown in Table 6 below.

FLOATING POINT OPERATION CYCLE COUNT

Add/Subtract 1

Divide 14

Multiply 1

Multiply Accumulate (MAC) 3

Fused MAC 3

Square Root 14


Table 6 : Selected Cortex-M4 FPU operations and execution times




Saturating math

Sample values in fixed-point signal processing algorithms have to live within a well-defined numeric

range. If numbers get too small, the effects of quantization noise degrade performance; if numbers

get too large there is the risk of overflow. Fixed-point algorithms require careful scaling and are

almost always designed to overflow under certain conditions. Standard integer arithmetic handles overflow in the worst conceivable way: sample values wrap around upon overflow, leading to huge discontinuities in the signal.

Figure 4 : Processing with saturation

To mitigate these effects, the Cortex-M4 processor contains saturating math operations. When a value overflows, it is saturated (i.e. clipped) to the largest positive or most negative representable value. The saturation occurs in the same cycle as the arithmetic operation and incurs no overhead.
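The difference between wrap-around and saturation is easy to see in a small portable model. This is an illustrative sketch for 16-bit samples; add_wrap16 and add_sat16 are hypothetical names, and on the Cortex-M4 the saturating form is a single instruction rather than the comparisons shown here.

```c
#include <stdint.h>

/* Standard integer behaviour: the 16-bit result wraps modulo 2^16. */
static int16_t add_wrap16(int16_t a, int16_t b)
{
    uint16_t r = (uint16_t)((uint16_t)a + (uint16_t)b);
    return (r >= 0x8000u) ? (int16_t)((int32_t)r - 65536) : (int16_t)r;
}

/* Saturating behaviour: the result is clipped to the 16-bit range. */
static int16_t add_sat16(int16_t a, int16_t b)
{
    int32_t r = (int32_t)a + (int32_t)b;
    if (r > INT16_MAX) return INT16_MAX;
    if (r < INT16_MIN) return INT16_MIN;
    return (int16_t)r;
}
```

Adding 30000 and 10000 wraps to -25536 with standard arithmetic, a sign flip that is audible as a click in audio, whereas the saturating add clips to 32767.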

Barrel shifter

Shifting operations are also quite common in fixed-point DSP algorithms. Shifting is used, for

example, to provide additional guard bits to protect against saturation. Most devices can typically

shift values one bit left or right, but repeated shift operations are often required. The Cortex-M4

can shift data an arbitrary number of bits left or right in a single cycle, leading to more efficient code.

5. Ease of use - programming fully in C

Adding hardware into a microcontroller that cannot be easily used is a futile exercise. Keeping

programming simple is absolutely crucial to ease adoption of high performance hardware.

Microcontrollers have attempted for many years to make leading technology available to the mass

market by making complex applications possible through very easy to use software tools. The

Cortex-M4 processor and its supporting software ecosystem also look to extend this ease-of-use

paradigm to traditionally hard-to-use signal processing features.




Software tools

The integrated signal processing features of the Cortex-M4 simplify the development of application software by offering a single tool-chain and processing device, when compared to architectures containing separate applications processors coupled with programmable DSPs or fixed-function accelerators. The single tool-chain environment speeds time-to-market as software plays an increasingly important role in product development.

Many of the high performance signal processing instructions of the Cortex-M4 processor can be

taken advantage of through the compiler. When further optimization is required, C compilers

support intrinsic functions for low-level assembly operations. Intrinsics allow you to leverage the

power of assembly programming in a C development environment while hiding much of the

complexity of pure assembly language.

Cortex Microcontroller Software Interface Standard ( CMSIS )

Typically, industries use standards to improve product quality and enable component sharing across

projects. The electronics industry is full of such standards, but the microcontroller market has many

proprietary CPU architectures which prevent the introduction of efficient software standards. This

situation is rapidly changing primarily due to wide adoption of ARM Cortex-M processors. For the

first time ever, the embedded microcontroller industry has the ability to standardize on a single

popular hardware platform.

Figure 5 : CMSIS – a hardware abstraction layer providing consistent access to CPU and peripherals

ARM has created the Cortex Microcontroller Software Interface Standard (CMSIS) that enables

silicon vendors and middleware providers to create software that can be easily integrated. CMSIS

has been developed in close partnership with several key silicon and software vendors. CMSIS is a

vendor-independent hardware abstraction layer that provides a common approach to interfacing

peripherals, real-time operating systems, and middleware components. The standard is scalable to




ensure that it is suitable for all Cortex-M series processor microcontrollers from the smallest 8KB

device up to devices with sophisticated communication peripherals such as Ethernet or USB-OTG.

CMSIS has been designed as an open software standard usable by everyone.

CMSIS has been extended specifically to support the Cortex-M4 processor. There will also very soon

be an extensive library of DSP software routines available along with the CMSIS standard. This library

will include filters, transforms, vector math, matrix math etc fully developed in C and heavily

optimized for the Cortex-M4 instruction set architecture.

6. Cortex-M4 programming examples and optimization strategies

Some of the most often used functions in signal processing algorithms are:

Fast Fourier transforms (FFT): used in audio compression, spread spectrum communication, noise removal, etc.

Infinite impulse response (IIR) filters: used in audio equalization, motor control, etc.

Finite impulse response (FIR) filters: used in data communications, echo cancellation (adaptive versions), smoothing data, etc.

The most important observation here is that all of these algorithms depend heavily on the MAC

operation. A high performance MAC is a key feature for optimizing these algorithms.

This section will focus on an FIR filter example and detail various software optimization strategies

that will result in highly optimized Cortex-M4 algorithms.

The FIR filter is one of the classic functions of signal processing and occurs frequently in

communications, audio, and video applications. A filter of length N requires N coefficients h[0], h[1],

…, h[N-1] , N state variables x[n], x[n-1], …, x[n-(N-1)] and N multiply accumulates.

Figure 6 : FIR filter

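The FIR filter output is the sum of the N most recent input samples, each weighted by its coefficient:

\[ y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k] \]

In Figure 6, the input x[n] passes through a chain of unit delays (z^{-1}); the delayed samples are weighted by h[0] to h[4] and summed to give y[n]. The other two algorithms have the same multiply accumulate structure: a second-order IIR section forms y[n] from the terms b_0 x[n], b_1 x[n-1], b_2 x[n-2], a_1 y[n-1] and a_2 y[n-2], and an FFT butterfly forms Y_1[k] and Y_2[k] from X_1[k] and X_2[k] together with a complex exponential factor.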
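The FIR definition (N coefficients, N state variables, N multiply accumulates per output) translates directly into C. The following is a minimal reference sketch, assuming single-precision data and treating samples before the start of the input as zero; fir_sample is a hypothetical name and the code is unoptimized by design.

```c
#include <stddef.h>

/* Direct evaluation of y[n] = sum over k of h[k] * x[n-k], with
   x[m] taken as 0 for m < 0. */
static float fir_sample(const float *h, size_t ntaps, const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t k = 0; k < ntaps && k <= n; k++) {
        acc += h[k] * x[n - k];   /* one multiply accumulate per tap */
    }
    return acc;
}
```

Feeding a unit impulse through the filter returns the coefficients in order, which is a convenient check for any optimized variant.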




Computing coefficients

The coefficients h[0], h[1], …, h[N-1] can either be pre-computed using tools like MATLAB and stored

or be computed on the fly by the processor. A good example of the latter is a tone control knob on

an audio system where turning the knob results in a new set of coefficients. These could have been

pre-computed for certain settings and stored in device memory or could be computed by the device

for higher granularity.

Software optimization strategies

Typical FIR code would use block based processing and the inner loop would consist of dual memory

fetches, MAC and pointer updates with circular addressing to make the most of the processor’s

capabilities. There are multiple strategies to optimize such code for the Cortex-M4 processor and

this section will apply each one and show the benefits at each stage. Let us start with the inner loop

code for an FIR filter as below that takes 12 cycles to complete on a Cortex-M4 processor.

Original inner loop code (total of 12 cycles):

    for (k = 0; k < filtLen; k++) {
        sum += coeffs[k] * state[stateIndex];
        stateIndex--;
        if (stateIndex < 0) {
            stateIndex = filtLen - 1;
        }
    }

Cycle count per tap:

    Fetch coeffs[k]            2 cycles
    Fetch state[stateIndex]    1 cycle
    MAC                        1 cycle
    stateIndex--               1 cycle
    Circular wrap              4 cycles
    Loop overhead              3 cycles
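For context, the inner loop above can be placed in a complete per-sample routine. The following is a minimal sketch, assuming 32-bit integer data and a state buffer of exactly filtLen entries; the structure fir_circ_t, its field newest and the function fir_circ_step are hypothetical and not part of the original code.

```c
#include <stdint.h>

typedef struct {
    const int32_t *coeffs;   /* h[0] .. h[filtLen-1] */
    int32_t *state;          /* circular buffer of filtLen samples */
    int filtLen;
    int newest;              /* index of the most recent sample */
} fir_circ_t;

/* Process one input sample and return one output sample. */
static int32_t fir_circ_step(fir_circ_t *f, int32_t in)
{
    int32_t sum = 0;
    int stateIndex;

    /* Advance the write position with wrap-around and store the sample. */
    f->newest++;
    if (f->newest >= f->filtLen) {
        f->newest = 0;
    }
    f->state[f->newest] = in;

    /* Walk backwards in time from the newest sample, as in the loop above. */
    stateIndex = f->newest;
    for (int k = 0; k < f->filtLen; k++) {
        sum += f->coeffs[k] * f->state[stateIndex];
        stateIndex--;
        if (stateIndex < 0) {
            stateIndex = f->filtLen - 1;
        }
    }
    return sum;
}
```

The wrap test inside the loop is what costs 4 of the 12 cycles, which motivates the first optimization.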

Optimization 1 - Circular addressing alternative

Instead of circular addressing, we can create a circular buffer of length N + blockSize-1 and shift this

once per block. Example. N = 6, blockSize = 4. Size of state buffer = 9.

Figure 7 : Circular buffering

Code with this change (total of 8 cycles):

    for (k = 0; k < filtLen; k++) {
        sum += coeffs[k] * state[stateIndex];
        stateIndex++;
    }

Cycle count per tap:

    Fetch coeffs[k]            2 cycles
    Fetch state[stateIndex]    1 cycle
    MAC                        1 cycle
    stateIndex++               1 cycle
    Loop overhead              3 cycles

Now the inner loop has been reduced to 8 cycles.
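A complete block routine built around this loop could look like the sketch below. It assumes single-precision data, a state buffer of filtLen + blockSize - 1 entries, and coefficients stored in time-reversed order (coeffsRev[k] = h[filtLen-1-k]) so that the inner loop can simply increment stateIndex; the name fir_block and the reversed storage are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Filter one block of blockSize samples.
   state must hold filtLen + blockSize - 1 floats; its first filtLen - 1
   entries carry the history from the previous block. */
static void fir_block(const float *coeffsRev, int filtLen, float *state,
                      const float *in, float *out, int blockSize)
{
    /* Shift in the new samples after the filtLen - 1 old ones. */
    memcpy(&state[filtLen - 1], in, (size_t)blockSize * sizeof(float));

    for (int i = 0; i < blockSize; i++) {
        float sum = 0.0f;
        int stateIndex = i;
        for (int k = 0; k < filtLen; k++) {   /* no wrap test needed */
            sum += coeffsRev[k] * state[stateIndex];
            stateIndex++;
        }
        out[i] = sum;
    }

    /* Copy the last filtLen - 1 samples to the front, once per block. */
    memmove(state, &state[blockSize], (size_t)(filtLen - 1) * sizeof(float));
}
```

The copy costs filtLen - 1 moves per block rather than a wrap test per tap, so its cost is amortized over blockSize outputs.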

(Figure 7 labels: coefficients h[0] to h[5]; state samples x[0] to x[8]; "Copy old samples"; "Shift in 4 new samples".)




Optimization 2 - Loop unrolling

In order to overcome any overheads involved in frequently run loops, loop unrolling is a technique

used by compilers and can also be applied to code manually to improve performance. This is an

efficient language-independent optimization technique. There is overhead inherent in every loop for

checking the loop counter and incrementing it for every iteration (3 cycles on the Cortex-M4). Loop

unrolling processes n loop indexes in one loop iteration, reducing the per-tap loop overhead by a factor of n.

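Unroll loop by 4 (total of 5.75 cycles per tap):

    /* filtLen is assumed to be a multiple of 4 */
    for (k = 0; k < filtLen; k += 4) {
        sum += coeffs[k] * state[stateIndex];
        stateIndex++;
        sum += coeffs[k + 1] * state[stateIndex];
        stateIndex++;
        sum += coeffs[k + 2] * state[stateIndex];
        stateIndex++;
        sum += coeffs[k + 3] * state[stateIndex];
        stateIndex++;
    }

Cycle count per loop iteration (4 taps):

    Fetch coeffs[k]            2 x 4 = 8 cycles
    Fetch state[stateIndex]    1 x 4 = 4 cycles
    MAC                        1 x 4 = 4 cycles
    stateIndex++               1 x 4 = 4 cycles
    Loop overhead              3 x 1 = 3 cycles

    TOTAL = 23 cycles for 4 taps = 5.75 cycles per tap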

The inner loop now runs at 5.75 cycles per filter tap
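When the filter length is not a multiple of the unroll factor, the unrolled loop needs a short remainder loop. The sketch below shows one way to arrange this for a single output; dot_unroll4 is a hypothetical name and single-precision data is assumed.

```c
/* Dot product of coeffs and state, unrolled by 4 with a remainder loop. */
static float dot_unroll4(const float *coeffs, const float *state, int filtLen)
{
    float sum = 0.0f;
    int k = 0;

    /* Main loop: 4 taps per iteration, so the loop overhead is paid
       once per 4 multiply accumulates. */
    for (; k + 4 <= filtLen; k += 4) {
        sum += coeffs[k]     * state[k];
        sum += coeffs[k + 1] * state[k + 1];
        sum += coeffs[k + 2] * state[k + 2];
        sum += coeffs[k + 3] * state[k + 3];
    }

    /* Remainder: 0 to 3 leftover taps. */
    for (; k < filtLen; k++) {
        sum += coeffs[k] * state[k];
    }
    return sum;
}
```

Padding the coefficient array with zeros up to a multiple of 4 is an alternative that removes the remainder loop at the cost of a few wasted MACs.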

Optimization 3 - Extensive use of SIMD and intrinsics

Many image and video processing, and communications applications use 8- or 16-bit data types.

SIMD usage on 16-bit data yields a 2x speed improvement over 32-bit and on 8-bit data yields a 4x

speed improvement. Access to SIMD is via compiler intrinsics. For example, a dual 16-bit MAC can be performed by this C code: sum = __SMLALD(c, s, sum).

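Usage of SIMD intrinsic (total of 2.375 cycles per tap):

    /* coeffs and state point to 32-bit words, each holding two 16-bit values. */
    /* Each iteration handles 8 taps (4 dual MACs); filtLen is assumed to be a multiple of 8. */
    filtLen = filtLen >> 3;
    for (k = 0; k < filtLen; k++) {
        c = *coeffs++;                 /* 2 cycles */
        s = *state++;                  /* 1 cycle  */
        sum = __SMLALD(c, s, sum);     /* 1 cycle  */
        c = *coeffs++;                 /* 2 cycles */
        s = *state++;                  /* 1 cycle  */
        sum = __SMLALD(c, s, sum);     /* 1 cycle  */
        c = *coeffs++;                 /* 2 cycles */
        s = *state++;                  /* 1 cycle  */
        sum = __SMLALD(c, s, sum);     /* 1 cycle  */
        c = *coeffs++;                 /* 2 cycles */
        s = *state++;                  /* 1 cycle  */
        sum = __SMLALD(c, s, sum);     /* 1 cycle  */
    }                                  /* loop overhead: 3 cycles */

    TOTAL = 19 cycles for 8 taps = 2.375 cycles per tap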

We have now reduced the inner loop to 2.375 cycles per tap.
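The same loop structure can be tried on a host by substituting a portable model for the intrinsic. This is a sketch under stated assumptions: dual_mac16 stands in for __SMLALD and only models its arithmetic, load_pair and dot_q15_pairs are hypothetical helpers, and filtLen is assumed to be even.

```c
#include <stdint.h>
#include <string.h>

/* Sign-extend one 16-bit lane of a packed word (hi != 0 selects bits 16-31). */
static int32_t lane16(uint32_t w, int hi)
{
    uint32_t v = hi ? (w >> 16) : (w & 0xFFFFu);
    return (v >= 0x8000u) ? (int32_t)v - 65536 : (int32_t)v;
}

/* Portable stand-in for the dual 16-bit MAC with 64-bit accumulate. */
static int64_t dual_mac16(uint32_t x, uint32_t y, int64_t acc)
{
    return acc + lane16(x, 0) * lane16(y, 0) + lane16(x, 1) * lane16(y, 1);
}

/* Load two adjacent 16-bit samples as one 32-bit word. */
static uint32_t load_pair(const int16_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Dot product of two 16-bit arrays, two taps per MAC. */
static int64_t dot_q15_pairs(const int16_t *coeffs, const int16_t *state, int filtLen)
{
    int64_t sum = 0;
    for (int k = 0; k < filtLen; k += 2) {
        uint32_t c = load_pair(&coeffs[k]);
        uint32_t s = load_pair(&state[k]);
        sum = dual_mac16(c, s, sum);
    }
    return sum;
}
```

Because both arrays are loaded the same way and the two lane products are simply summed, the result does not depend on the byte order of the host.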




Optimization 4 - Caching of intermediate variables

An FIR filter is extremely memory intensive. 12 out of 19 cycles in the last code portion deal with

memory accesses. For example, 2 consecutive loads take 3 cycles on Cortex-M4 and a MAC takes

only 1 cycle on Cortex-M4. When operating on a block of data, memory bandwidth can be reduced

by simultaneously computing multiple outputs and caching several coefficients and state variables.

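Figure 8 : Caching of intermediate variables (a coefficient c0 is applied to the state samples x0 to x3; statePtr++ increments by 16 bits and coeffsPtr++ increments by 32 bits)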
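One way to realize this idea in C is to compute four outputs at once, so that each coefficient and each state sample is loaded once and reused from local variables. The following is a minimal sketch, assuming single-precision data, coefficients in time-reversed order and a state buffer of at least filtLen + 3 samples; fir_four_outputs is a hypothetical name.

```c
/* Compute out[i] = sum over k of coeffsRev[k] * state[i + k] for i = 0..3.
   state must hold at least filtLen + 3 samples. */
static void fir_four_outputs(const float *coeffsRev, int filtLen,
                             const float *state, float *out)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    float x0 = state[0], x1 = state[1], x2 = state[2], x3;

    for (int k = 0; k < filtLen; k++) {
        float c = coeffsRev[k];   /* one coefficient load serves 4 MACs */
        x3 = state[k + 3];        /* one new state load per tap */

        acc0 += c * x0;
        acc1 += c * x1;
        acc2 += c * x2;
        acc3 += c * x3;

        /* Slide the cached state window by one sample. */
        x0 = x1;
        x1 = x2;
        x2 = x3;
    }

    out[0] = acc0;
    out[1] = acc1;
    out[2] = acc2;
    out[3] = acc3;
}
```

Each tap now needs two loads for four multiply accumulates instead of two loads for one, which is where the reduction in memory bandwidth comes from.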

After applying this technique, we can reduce the cycles to just 1.6 cycles per filter tap. To recap, we

started with the Cortex-M4 standard C code taking 12 cycles and we improved performance by going

through a series of optimization techniques -

Using circular addressing alternative = 8 cycles

After loop unrolling < 6 cycles

After using SIMD instructions < 2.5 cycles

After caching intermediate values ~ 1.6 cycles

So in summary, basic C code written for the Cortex-M4 can, through simple optimizations, lead to

very high performance DSP algorithms. The FIR filter performance on the Cortex-M4 is now

comparable to high performance DSPs which can run this FIR filter at 1 cycle per filter tap. These

DSPs need optimized assembly to achieve this though, which requires a steep learning curve and

also removes portability. All of these optimizations on the Cortex-M4 can be done completely in a C

environment retaining all the benefits of writing and maintaining code in C.

A good example of digital signal control performance is audio playback (decode), which employs a

good mix of both control and DSP processor features. Figure 9 shows that the Cortex-M4 enables

users to reach close to optimized audio DSP performance in the area and power footprint of a

microcontroller CPU. Decode of a typical MP3 stream can be performed on the Cortex-M4


consuming less than 10MHz, which translates to less than 0.5mW processor dynamic power

consumption.

Figure 9 : High performance MP3 decode on the Cortex-M4

7. Conclusion

The MCU and DSP worlds are rapidly converging as users demand efficient and easy-to-use signal

processing technologies. The ARM Cortex-M4 processor presents an excellent option for digital

signal control devices aimed at markets like motor control, industrial automation, embedded audio,

digital power management and automotive.

The Cortex-M4 processor extends the Cortex-M processor family into signal processing markets by

introducing DSP specific features like a high performance single cycle MAC, SIMD arithmetic,

saturating math and single precision floating point hardware.

Developing applications on the Cortex-M4 is easy and can be done fully in C. Using simple

techniques, highly optimized programs can be developed with fast learning curves and minimal

effort.

The Cortex-M4 processor is an excellent option for next-generation microcontroller and digital signal

controller designs as they start to target applications requiring higher signal processing capabilities.
