Technische Universität Dortmund

Automatic mapping to tightly coupled memories and cache locking

Peter Marwedel (1,2), Heiko Falk (1), Robert Pyka (1), Lars Wehmeyer (2)

(1) TU Dortmund, (2) Informatik Centrum Dortmund (ICD)

http://ls12-www.cs.uni-dortmund.de, http://www.icd.de

P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007


Problems with memory speeds

Speed gap between processor and main DRAM increases

[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]

[Figure: speed versus years. Curves: CPU performance (1.5-2 p.a.) and DRAM (1.07 p.a.); annotation: 2x every 2 years]

Similar problems also for embedded systems & MPSoCs

In the future: memory access times >> processor cycle times

“Memory wall” problem
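The growth figures quoted on this slide can be turned into a small calculation. The sketch below assumes that "1.5-2 p.a." and "1.07 p.a." are annual growth factors and uses the lower CPU bound of 1.5; the function names are illustrative only.

```python
import math


def speed_gap(years, cpu_rate=1.5, dram_rate=1.07):
    """Relative CPU/DRAM speed gap after `years`, starting from a gap of 1."""
    return (cpu_rate / dram_rate) ** years


def gap_doubling_time(cpu_rate=1.5, dram_rate=1.07):
    """Number of years after which the CPU/DRAM speed gap has doubled."""
    return math.log(2) / math.log(cpu_rate / dram_rate)


if __name__ == "__main__":
    print(f"gap after 5 years: {speed_gap(5):.2f}x")
    print(f"gap doubles every {gap_doubling_time():.2f} years")
```

Under these assumptions the gap doubles roughly every two years, which is consistent with the "2x every 2 years" annotation in the figure.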


Problems with memory energy

[Segars 01 according to Vahid@ISSS01]

Caches consume much of the available energy

[O. Vargas (Infineon): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design - September 2005;] Thanks to Thorsten Koch (Nokia/ TU Dortmund) for providing this source.

Memories are a major consumer of energy. Example: mobile phone


Tightly coupled memories/Scratch pad memories (SPM): Fast, energy-efficient, timing-predictable

Example: ARM7TDMI cores, well-known for low power consumption

Small; no tag memory

TCM/SPMs are small, physically separate memories mapped into the address space.

Selection is by an appropriate address decoder (simple!)

[Figure: address space from 0 to FFF.. with the scratch pad memory mapped into it; the address decoder generates the SPM select signal]
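The address-decoder selection described on this slide can be modelled in a few lines. This is a minimal sketch: the base address, the SPM size and the function name are hypothetical, and it assumes a power-of-two SPM size aligned to its own size, so that the decoder only needs to compare the upper address bits.

```python
SPM_BASE = 0x0000_0000  # hypothetical base address of the scratch pad
SPM_SIZE = 4096         # hypothetical capacity in bytes (power of two)


def spm_select(addr):
    """Return True if `addr` falls into the scratch pad region.

    Masking off the low-order offset bits and comparing the rest with the
    base address mirrors what a simple hardware address decoder does.
    """
    return (addr & ~(SPM_SIZE - 1)) == SPM_BASE


if __name__ == "__main__":
    for addr in (0x0000, 0x0FFF, 0x1000):
        target = "SPM" if spm_select(addr) else "main memory"
        print(f"{addr:#06x} -> {target}")
```

No tag comparison is involved, which is why an SPM access is cheaper than a cache access of the same capacity.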


Migration of data and instructions to TCM/SPM

Which objects (arrays, loops, etc.) should be stored in the SPM?

Non-overlaying (“static”) memory allocation:

Objects always in TCM while application is running

Overlaying (“dynamic”) allocation:

Moving objects back and forth between hierarchy levels

[Figure: processor with scratch pad memory (capacity $S_{SP}$) and main memory. Example candidate objects: loops (for i, for j, while, repeat), functions, arrays and int variables]


A survey of algorithms for scratchpad allocation

Non-overlaying (“static”) approaches

Gain $g_k$ and size $s_k$ for each object $k$. Maximise the gain $G = \sum_k g_k$ over the objects placed in the SPM, respecting $S_{SP} \ge \sum_k s_k$. This is a knapsack problem.

• Code, static data, stack, heap
• Partitioning, handling large arrays

Overlaying (“dynamic”) approaches

• single/multiple hierarchy levels
• for single process
• for static number of multiple processes
• for dynamic number of multiple processes
• not using/using MMU
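The knapsack formulation of the non-overlaying case can be illustrated with a textbook dynamic program. This is only a sketch of the generic 0/1 knapsack technique, not necessarily the solver used by the authors; the object names, gains and sizes are made up for illustration.

```python
def allocate_spm(objects, capacity):
    """Select objects for the SPM so that the total gain is maximal.

    objects:  list of (name, gain, size_in_bytes) tuples
    capacity: SPM capacity in bytes
    Returns (total_gain, list_of_selected_names).
    """
    # best[c] holds the best (gain, names) found for a capacity of c bytes.
    best = [(0, [])] * (capacity + 1)
    for name, gain, size in objects:
        # Walk capacities downwards so that each object is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + gain
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    return best[capacity]


if __name__ == "__main__":
    # Hypothetical objects: (name, gain g_k, size s_k in bytes)
    objects = [("loop_a", 60, 100), ("array_b", 100, 200), ("func_c", 120, 300)]
    print(allocate_spm(objects, capacity=500))
```

With the hypothetical values above, the two larger objects exactly fill the 500-byte SPM and yield a higher gain than any combination that includes `loop_a`.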

Survey of algorithms: http://ls12-www.cs.uni-dortmund.de/publications/papers/2007-marwedel-acaces.zip


Partitioning

# of partitions   number of partitions of size:
                  4k    2k    1k    512   256   128   64
7                 0     1     1     1     1     1     2
6                 0     1     1     1     1     2     0
5                 0     1     1     1     2     0     0
4                 0     1     1     2     0     0     0
3                 0     1     2     0     0     0     0
2                 0     2     0     0     0     0     0
1                 1     0     0     0     0     0     0

Example of all considered memory partitions for a total capacity of 4096 bytes
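The rows of the table follow a regular pattern: each additional partition is obtained by halving the smallest partition of the previous row. The sketch below reproduces the table under that reading; the pattern is inferred from the table itself, and the function name is illustrative.

```python
def partitionings(total=4096, max_parts=7):
    """Return {number_of_partitions: list_of_partition_sizes_in_bytes}.

    For n partitions the sizes are total/2, total/4, ..., total/2**(n-1),
    with the smallest size occurring twice so that they add up to `total`.
    """
    result = {}
    for n in range(1, max_parts + 1):
        sizes = [total >> i for i in range(1, n)] + [total >> (n - 1)]
        result[n] = sizes
    return result


if __name__ == "__main__":
    for n, sizes in sorted(partitionings().items(), reverse=True):
        print(n, sizes)
```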


Results for a non-overlaying approach: parts of GSM coder/decoder

Energy model: based on ARM evaluation board


Using these ideas with a gcc-based tool flow

Source is split into 2 different files by a specially developed memory optimizer tool*.

[Figure: tool flow. The memory optimizer (incl. ICD-C*) reads the application source (.c) and profile info (.txt) and produces a main memory source (.c), an SPM source (.c) and a linker script (.ld); the ARM-GCC compiler and the linker turn these into the .exe]

*Built with new tool design suite ICD-C available from ICD (see www.icd.de/es)


Hybrid Context Switch

Hybrid Context Switch (Hybrid)

• Disjoint + shared SPM regions
• Good for all scratchpads

[Figure: scratchpad with disjoint regions for processes P1, P2 and P3 plus a region shared by P1, P2 and P3; saving/restoring at context switch]
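The hybrid scheme can be pictured with a toy model: every process keeps a private SPM region that is never copied, while one shared region is saved and restored at each context switch. The class below is an illustrative sketch only, not the authors' implementation; the class name, the region sizes and the copy counter are assumptions.

```python
class HybridSpm:
    """Toy model of an SPM with private regions and one shared region."""

    def __init__(self, private_sizes, shared_size):
        # Private regions stay resident in the SPM for the whole run.
        self.private = {pid: bytearray(n) for pid, n in private_sizes.items()}
        # The shared region is reused by whichever process is running.
        self.shared = bytearray(shared_size)
        # Main-memory backup copy of the shared region, one per process.
        self.saved = {pid: bytearray(shared_size) for pid in private_sizes}
        self.current = None
        self.bytes_copied = 0  # copy overhead caused by context switches

    def switch_to(self, pid):
        """Save the shared region of the old process, restore the new one."""
        if self.current is not None:
            self.saved[self.current][:] = self.shared
            self.bytes_copied += len(self.shared)
        self.shared[:] = self.saved[pid]
        self.bytes_copied += len(self.shared)
        self.current = pid


if __name__ == "__main__":
    spm = HybridSpm({"P1": 64, "P2": 64, "P3": 64}, shared_size=32)
    spm.switch_to("P1")
    spm.shared[0] = 7
    spm.switch_to("P2")
    spm.switch_to("P1")
    print(spm.shared[0], spm.bytes_copied)
```

The copy counter grows with the size of the shared region only, which illustrates the trade-off between the copy cost of saving and the smaller per-process share of a purely disjoint split.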


Multi-process Scratchpad Allocation: Results

• For small SPMs (64 B to 512 B), Saving is better
• For large SPMs (1 kB to 4 kB), Non-Saving is better
• Hybrid is the best for all SPM sizes
• Energy reduction at 4 kB SPM is 27% for the Hybrid approach

[Figure: energy consumption (mJ) versus scratchpad size (64 to 4096 bytes) for SPA, Non-Saving, Saving and Hybrid, with the copy energy shown separately for Saving and Hybrid. Benchmarks: edge detection, adpcm, g721, mpeg]

SPA: Single Process Approach


Approach overview

[Figure: approach overview. Applications App. 1, App. 2, ..., App. n pass through compile-time transformations and a standard compiler (GCC); profit values / allocation hints are passed to the allocation manager in the operating system (RTEMS)]

2 steps: compile-time analysis & runtime decisions

No need to know all applications at compile-time

Capable of managing runtime allocated memory objects

Integrated into an embedded operating system

Using MPArm simulator from U. Bologna


Comparison of SPMM to Caches for SORT

• Baseline: main memory only
• SPMM peak energy reduction by 83% at 4k bytes scratchpad
• Cache peak: 75% at 2k 2-way cache
• SPMM capable of outperforming caches

OS and libraries are not considered yet

Chunk allocation results:

SPM size   Δ 4-way
1024       74.81%
2048       65.35%
4096       64.39%
8192       65.64%
16384      63.73%


Worst case timing analysis using aiT

[Figure: a memory-aware compiler translates the C program for a given SPM size into an executable; ARMulator reports the actual performance and aiT the worst case execution time]


Results for G.721

L. Wehmeyer, P. Marwedel: Influence of Onchip Scratchpad Memories on WCET: 4th Intl Workshop on worst-case execution time analysis, (WCET), 2004

L. Wehmeyer, P. Marwedel: Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software, Design Automation and Test in Europe (DATE), 2005

[Figures: results using scratchpad and results using unified cache, obtained with the aiT timing analyzer]


Locking of I-Caches

Many caches allow “locking” or “freezing” (no replacements).

Can be used to improve timing predictability: load promising functions into the cache.

Optimizations: worst case paths can change during optimization. This requires coupling of the timing analysis tool and the compiler.

[Heiko Falk, Sascha Plazar, Henrik Theiling: Compile-Time Decided Instruction Cache Locking Using Worst-Case Execution Paths, CODES/ISSS, 2007]

[Figure: the ANSI-C source is compiled and linked into an exe; aiT WCET analysis of the exe feeds the I-cache optimizer, whose output is linked with the startup code into the WCET-optimized exe]


Relative WCETs after I-Cache Locking

[Figure: relative WCET (%) versus cache size (64 to 16384 bytes) on the ARM920T for ADPCM, G723, Statemate, Compress and MPEG2]


More information


• http://ls12-www.cs.uni-dortmund.de/publications/papers/2007-marwedel-acaces.zip

• http://ls12-www.cs.uni-dortmund.de/publications/global_index.html


Conclusion

Major impact of the memory system on system speed, energy consumption and timing predictability.

Memory hierarchies comprising TCMs/SPMs are fast, energy-efficient and timing predictable.

Algorithms have been designed for

• code, static data, stack, heap
• single and multiple memory hierarchy levels
• non-overlaying and overlaying allocation
• saving, non-saving and hybrid context switches
• mono- and multiprocessor systems.

Very large improvements in terms of the considered figures of merit.

Compatible with existing tool flows