
Technische Universität Dortmund

Automatic mapping to tightly coupled memories and cache locking

Peter Marwedel1,2, Heiko Falk1, Robert Pyka1, Lars Wehmeyer2

1 TU Dortmund, 2 Informatik Centrum Dortmund (ICD)

http://ls12-www.cs.uni-dortmund.de, http://www.icd.de

P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007

TU Dortmund


Problems with memory speeds

Speed gap between processor and main DRAM increases

[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]

[Figure: speed vs. years. CPU performance grows by a factor of 1.5-2 p.a., DRAM by a factor of 1.07 p.a.; annotation: 2x every 2 years]

Similar problems also for embedded systems & MPSoCs

In the future: memory access times >> processor cycle times

“Memory wall” problem
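The growth factors quoted on this slide can be turned into a doubling time for the speed gap. The following is a minimal sketch, assuming constant annual growth factors; the function names are hypothetical and not part of the original slides.

```python
import math

def gap_after(years: float, cpu_rate: float, dram_rate: float) -> float:
    """Ratio of CPU speed to DRAM speed after `years`, normalised to 1 at year 0."""
    return (cpu_rate / dram_rate) ** years

def gap_doubling_time(cpu_rate: float, dram_rate: float) -> float:
    """Years until the CPU/DRAM speed ratio doubles (assumes constant growth factors)."""
    return math.log(2) / math.log(cpu_rate / dram_rate)

if __name__ == "__main__":
    # Lower bound of the CPU growth range from the slide (1.5 p.a.) vs. DRAM (1.07 p.a.)
    print(round(gap_doubling_time(1.5, 1.07), 2))
```

With a CPU factor of 1.5 and a DRAM factor of 1.07, the doubling time comes out at roughly two years, which matches the "2x every 2 years" annotation.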



Problems with memory energy

[Segars 01 according to Vahid@ISSS01]

Caches consume much of the available energy

[O. Vargas (Infineon): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design, September 2005] Thanks to Thorsten Koch (Nokia/TU Dortmund) for providing this source.

Memories are a major consumer of energy. Example: mobile phone



Tightly coupled memories/Scratch pad memories (SPM): Fast, energy-efficient, timing-predictable

Example: ARM7TDMI cores, well-known for low power consumption

[Figure: address space from 0 to FFF.., with the scratch pad memory mapped into part of it]

Small; no tag memory

TCM/SPMs are small, physically separate memories mapped into the address space;

Selection is by an appropriate address decoder (simple!)

[Figure: address decoder generating the SPM select signal]
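The address decoding described above can be illustrated with a small model. This is a sketch only: the base address and the 4 KiB capacity are assumed values chosen for illustration, not figures from the slides, and `spm_select` is a hypothetical name.

```python
SPM_BASE = 0x0000_0000  # assumed start address of the SPM window
SPM_SIZE = 0x1000       # assumed capacity: 4 KiB (power of two, size-aligned)

def spm_select(addr: int) -> bool:
    """Return True if the decoder would assert the SPM select line for `addr`."""
    # For a power-of-two, size-aligned window, decoding is a mask-and-compare
    # on the upper address bits; no tag lookup is involved.
    return (addr & ~(SPM_SIZE - 1)) == SPM_BASE

if __name__ == "__main__":
    for a in (0x0000, 0x0FFF, 0x1000):
        print(hex(a), "SPM" if spm_select(a) else "main memory")
```

Because the decision depends only on the address, an access to the SPM always takes the same time, which is the source of the timing predictability claimed above.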



Migration of data and instructions to TCM/SPM

Which objects (arrays, loops, etc.) should be stored in the SPM?

Non-overlaying (“static”) memory allocation:

Objects always in TCM while application is running

Overlaying (“dynamic”) allocation:

Moving objects back and forth between hierarchy levels

[Figure: Example. A processor with a scratch pad memory of capacity S_SP and a main memory; candidate objects (for loops, while and repeat loops, functions, arrays, integer variables) have to be assigned to one of the two memories]



A survey of algorithms for scratchpad allocation

Non-overlaying (“static”) approaches
Gain $g_k$ and size $s_k$ for each object $k$. Maximise gain $G = \sum_k g_k$, respecting $S_{SP} \ge \sum_k s_k$, where both sums run over the objects placed in the SPM. This is a knapsack problem.
• Code, static data, stack, heap
• Partitioning, handling large arrays

Overlaying (“dynamic”) approaches
• single/multiple hierarchy levels
• for single process
• for static number of multiple processes
• for dynamic number of multiple processes
• not using/using MMU

Survey of algorithms: http://ls12-www.cs.uni-dortmund.de/publications/papers/2007-marwedel-acaces.zip
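The non-overlaying formulation above is a 0/1 knapsack problem, which can be sketched with a textbook dynamic program. This is an illustration under stated assumptions, not the authors' implementation: the object names, gains and sizes are invented, and sizes are assumed to be positive integers (bytes).

```python
def allocate_static(objects, capacity):
    """0/1 knapsack: objects is a list of (name, gain, size); returns (gain, names)."""
    # best[c] = (total gain, chosen names) achievable with at most c bytes of SPM
    best = [(0, [])] * (capacity + 1)
    for name, gain, size in objects:
        # Walk capacities downwards so that each object is placed at most once
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + gain
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    return best[capacity]

if __name__ == "__main__":
    # Hypothetical objects: (name, gain g_k, size s_k in bytes)
    objs = [("loop_i", 60, 100), ("array_a", 100, 200), ("func_f", 120, 300)]
    print(allocate_static(objs, capacity=500))
```

The dynamic program is only practical for small capacities; it is used here because it fits in a few lines and makes the constraint on the SPM capacity explicit.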



Partitioning

# of          number of partitions of size:
partitions    4k    2k    1k    512   256   128   64
7             0     1     1     1     1     1     2
6             0     1     1     1     1     2     0
5             0     1     1     1     2     0     0
4             0     1     1     2     0     0     0
3             0     1     2     0     0     0     0
2             0     2     0     0     0     0     0
1             1     0     0     0     0     0     0

Example of all considered memory partitions for a total capacity of 4096 bytes
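The rows of the table follow a simple pattern: each configuration is obtained from the previous one by splitting one of the smallest partitions in half. The sketch below reproduces the table under that assumption; the function name is hypothetical and the slides do not state how the configurations were generated.

```python
def partition_schemes(total=4096, max_parts=7):
    """Return partition size lists for 1..max_parts partitions of `total` bytes."""
    parts = [total]            # kept in descending order
    schemes = [list(parts)]
    while len(parts) < max_parts:
        smallest = parts.pop()                    # last element is a smallest partition
        parts += [smallest // 2, smallest // 2]   # split it into two halves
        schemes.append(list(parts))
    return schemes

if __name__ == "__main__":
    for scheme in partition_schemes():
        print(len(scheme), scheme)
```

Every configuration sums to the full capacity of 4096 bytes, as in the table.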



Results for a non-overlaying approach: parts of GSM coder/decoder

Energy model: based on ARM evaluation board



Using these ideas with a GCC-based tool flow

The source is split into two different files by a specially developed memory optimizer tool*.

[Figure: tool flow. The application source (.c) and profile info (.txt) are input to the memory optimizer (incl. ICD-C*), which produces a main memory source (.c), an SPM source (.c) and a linker script (.ld); the ARM-GCC compiler and the linker build the executable (.exe)]

*Built with new tool design suite ICD-C available from ICD (see www.icd.de/es)



Hybrid Context Switch

Hybrid Context Switch (Hybrid)
• Disjoint + shared SPM regions
• Good for all scratchpads

[Figure: scratchpad with one disjoint region each for Process P1, Process P2 and Process P3, plus a region shared by P1, P2 and P3; annotation: saving/restoring at context switch]
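The hybrid scheme can be modelled in a few lines. The sketch below assumes that only the shared region is saved and restored at a context switch while the disjoint regions stay resident; the class, its fields and the region sizes are hypothetical and chosen for illustration.

```python
class HybridScratchpad:
    """Toy model: per-process disjoint regions plus one shared, swapped region."""

    def __init__(self, private_sizes, shared_size):
        # Disjoint regions stay resident for the whole run
        self.private = {p: bytearray(n) for p, n in private_sizes.items()}
        # Shared region holds the data of the currently running process only
        self.shared = bytearray(shared_size)
        # Main-memory copies of each process's view of the shared region
        self.saved = {p: bytes(shared_size) for p in private_sizes}
        self.current = None
        self.bytes_copied = 0  # proxy for the copy cost of context switches

    def switch_to(self, proc):
        if proc == self.current:
            return
        if self.current is not None:
            self.saved[self.current] = bytes(self.shared)  # save outgoing process
            self.bytes_copied += len(self.shared)
        self.shared[:] = self.saved[proc]                  # restore incoming process
        self.bytes_copied += len(self.shared)
        self.current = proc
```

The copy cost grows with the size of the shared region only, which is one way to read the trade-off between the saving and non-saving strategies compared on the next slide.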



Multi-process Scratchpad Allocation: Results

• For small SPMs (64 B to 512 B), Saving is better
• For large SPMs (1 kB to 4 kB), Non-Saving is better
• Hybrid is the best for all SPM sizes
• Energy reduction at 4 kB SPM is 27% for the Hybrid approach

[Figure: energy consumption (mJ) vs. scratchpad size (64 to 4096 bytes); series: Energy (SPA), Energy (Non-Saving), Energy (Saving), CopyEnergy (Saving), Energy (Hybrid), CopyEnergy (Hybrid); a 27% reduction is marked]

SPA: Single Process Approach

edge detection, adpcm, g721, mpeg



Approach overview

[Figure: approach overview showing App. 1, App. 2, ..., App. n, compile-time transformations, a standard compiler (GCC), profit values / allocation hints, the allocation manager and the operating system RTEMS]

2 steps: compile-time analysis & runtime decisions

No need to know all applications at compile-time

Capable of managing runtime allocated memory objects

Integrated into an embedded operating system

Using MPArm simulator from U. Bologna
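The split between compile-time hints and runtime decisions can be sketched as follows. This is a deliberately simple first-come policy assumed for illustration; the slides do not describe the manager's actual policy, and the class and method names are hypothetical.

```python
class AllocationManager:
    """Toy runtime manager: places objects in the SPM while capacity remains."""

    def __init__(self, spm_capacity):
        self.free = spm_capacity
        self.in_spm = {}  # object name -> size in bytes

    def allocate(self, name, size, profit):
        # `profit` stands for the compile-time hint; objects without benefit
        # or without room go to main memory
        if profit > 0 and size <= self.free:
            self.free -= size
            self.in_spm[name] = size
            return "SPM"
        return "MAIN"

    def release(self, name):
        # Returns the space of an SPM-resident object; no-op for other objects
        self.free += self.in_spm.pop(name, 0)
```

Because decisions are taken per request, the manager does not need to know the full set of applications in advance.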



Comparison of SPMM to Caches for SORT

Baseline: main memory only

SPMM peak energy reduction by 83% at 4k bytes scratchpad

Cache peak: 75% at 2k 2-way cache

SPMM capable of outperforming caches

OS and libraries are not considered yet

Chunk allocation results:

SPM size (bytes)    Δ 4-way
1024                74.81%
2048                65.35%
4096                64.39%
8192                65.64%
16384               63.73%



Worst case timing analysis using aiT

[Figure: a C program and the SPM size are input to the memory-aware compiler, which produces an executable; ARMulator yields the actual performance, aiT the worst-case execution time]



Results for G.721

L. Wehmeyer, P. Marwedel: Influence of Onchip Scratchpad Memories on WCET, 4th Intl. Workshop on Worst-Case Execution Time Analysis (WCET), 2004

L. Wehmeyer, P. Marwedel: Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software, Design Automation and Test in Europe (DATE), 2005

[Figures: results using a scratchpad and using a unified cache, obtained with the aiT timing analyzer]



Locking of I-Caches

Many caches allow “locking” or “freezing” (no replacements).

Can be used to improve timing predictability: load promising functions into the cache

Optimizations:
• Worst-case paths can change during optimization
• Requires coupling of timing analysis tool and compiler

[Heiko Falk, Sascha Plazar, Henrik Theiling: Compile-Time Decided Instruction Cache Locking Using Worst-Case Execution Paths, CODES/ISSS, 2007]

[Figure: tool flow from ANSI-C source through compiler and linker to an exe; aiT WCET analysis and the I-cache optimizer, together with startup code and a further linker run, yield the WCET-optimized exe]
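Why the timing analysis has to be coupled with the optimizer can be shown on a toy model: after a function on the worst-case path is locked, a different path may become the worst case. The sketch below is a greedy illustration with invented costs and sizes; it is not the algorithm of the cited paper, and `wcet` here is a stand-in for a real analysis such as aiT.

```python
def wcet(paths, cost):
    """Toy WCET: maximum over alternative paths of the summed function costs."""
    return max((sum(cost[f] for f in path), path) for path in paths)

def lock_functions(paths, miss_cost, hit_cost, size, cache_size):
    """Greedily lock functions on the current worst-case path, re-analysing each time."""
    cost = dict(miss_cost)
    locked, free = [], cache_size
    while True:
        _, worst = wcet(paths, cost)  # re-run the analysis after every decision
        candidates = [f for f in worst if f not in locked and size[f] <= free]
        if not candidates:
            break
        # Pick the function with the largest cycle saving per byte of cache
        best = max(candidates, key=lambda f: (miss_cost[f] - hit_cost[f]) / size[f])
        if miss_cost[best] <= hit_cost[best]:
            break
        locked.append(best)
        free -= size[best]
        cost[best] = hit_cost[best]
    return locked, wcet(paths, cost)[0]

if __name__ == "__main__":
    # Hypothetical program: main calls either f or g
    paths = [["main", "f"], ["main", "g"]]
    miss = {"main": 10, "f": 100, "g": 90}
    hit = {"main": 5, "f": 40, "g": 30}
    size = {"main": 64, "f": 128, "g": 128}
    print(lock_functions(paths, miss, hit, size, cache_size=256))
```

In this example the worst-case path initially runs through `f`; once `f` is locked, the path through `g` becomes the worst case, so a selection based on the initial analysis alone would miss it.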



Relative WCETs after I-Cache Locking

[Figure: relative WCET [%] vs. cache size [bytes] (64 to 16384) on an ARM920T; benchmarks: ADPCM, G723, Statemate, Compress, MPEG2]



More information


• http://ls12-www.cs.uni-dortmund.de/publications/papers/2007-marwedel-acaces.zip

• http://ls12-www.cs.uni-dortmund.de/publications/global_index.html




Conclusion

Major impact of the memory system on system speed, energy consumption and timing predictability.

Memory hierarchies comprising TCMs/SPMs are fast, energy-efficient and timing predictable.

Algorithms have been designed for
• Code, static data, stack, heap
• Single and multiple memory hierarchy levels
• Non-overlaying and overlaying allocation
• Saving, non-saving and hybrid context switches
• Mono- and multiprocessor systems

Very large improvements in terms of the considered figures of merit.

Compatible with existing tool flows
