Real-Time Multi/Many-Core Architecture
Heechul Yun
Real-Time Multi/Many-Core Architecture
• Projects on Real-Time CPU Architectures
• Assigned Papers
– Shedding the Shackles of Time-Division Multiplexing, RTSS, 2018
– Deterministic Memory Abstraction and Supporting Multicore System Architecture. ECRTS, 2018
Trends in Automotive E/E Systems
Source: A. Hamann (Bosch). “Industrial Challenge: Moving from Classical to High-Performance Real-Time Systems.” WATER, 2018.
Centralization & High-Performance HW
Modern System-on-a-Chip (SoC)
[Diagram: Core1, Core2, GPU, NPU, … sharing a cache and a memory controller (MC) in front of DRAM]
• Integrates multiple cores, GPUs, and accelerators on one chip
• Good performance, size, weight, and power
• Challenge: time predictability
Worst-Case Execution Time (WCET)
• Real-time scheduling theory is based on the assumption of known WCETs of real-time tasks
Image source: [Wilhelm et al., 2008]
Computing WCET
• Static analysis
– Input: program code, architecture model
– Output: WCET bound
– Problem: accurate architecture models are hard to build, and the resulting bounds are pessimistic
• Measurement
– No guarantee of covering the true worst case
– But widely used in practice
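The measurement approach can be sketched as follows. This is a minimal illustration of the idea, not a safe WCET method; `task` is a hypothetical workload and the 20% margin is an arbitrary assumption:

```python
import time

def measure_wcet(task, n_runs=1_000, margin=1.2):
    """Measurement-based estimate: time many runs, keep the observed
    maximum, and apply a safety margin. There is no guarantee this
    covers the true worst case (the key weakness noted above)."""
    worst = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        task()
        worst = max(worst, time.perf_counter() - start)
    return worst * margin

def task():  # hypothetical task under analysis
    sum(i * i for i in range(1_000))

print(f"estimated WCET: {measure_wcet(task):.6f} s")
```

In practice, inputs and initial hardware states are also varied across runs, since the worst-case path may never be exercised otherwise.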
Memory Hierarchies, Pipelines, and Buses for Future Architectures in Time-Critical Embedded Systems
IEEE TCAD, 2009
“Problematic” CPU Features
• Architectures are optimized to improve average-case performance
• WCET estimation is hard because of:
– Pipelining
– TLBs/caches
– Superscalar execution
– Out-of-order scheduling
– Branch predictors
– Hardware prefetchers
– Basically, anything that affects processor state
Static Timing Analysis
968 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 28, NO. 7, JULY 2009
Fig. 1. Main components of a timing-analysis framework and their interaction.
A. Timing-Analysis Framework
Over the last several years, a more or less standard architecture for timing-analysis tools has emerged [11]–[13]. Fig. 1 shows a general view of this architecture. First, one can distinguish three major building blocks:
1) control-flow reconstruction and static analyses for control and data flow;
2) microarchitectural analysis, which computes upper and lower bounds on execution times of basic blocks;
3) global bound analysis, which computes upper and lower bounds for the whole program.
The following list presents the individual phases and describes their objectives and problems. Note that the first four phases are part of the first building block.
1) Control-flow reconstruction [14] takes a binary executable to be analyzed, reconstructs the program’s control flow, and transforms the program into a suitable intermediate representation. Problems encountered are dynamically computed control-flow successors, e.g., stemming from switch statements, function pointers, etc.
2) Value analysis [15], [16] computes an overapproximation of the set of possible values in registers and memory locations by an interval analysis and/or congruence analysis. This information is, among others, used for a precise data-cache analysis.
3) Loop bound analysis [17], [18] identifies loops in the program and tries to determine bounds on the number of loop iterations, information which is indispensable to bound the execution time. Problems are the analysis of arithmetic on loop counters and loop-exit conditions, as well as dependencies in nested loops.
4) Control-flow analysis [17], [19] narrows down the set of possible paths through the program by eliminating infeasible paths or determining correlations between the number of executions of different blocks, using the results of the value analysis. These constraints will tighten the obtained timing bounds.
5) Microarchitectural analysis [10], [20], [21] determines bounds on the execution time of basic blocks by performing an abstract interpretation of the program, taking into account the processor’s pipeline, caches, and speculation concepts. Static cache analyses determine safe approximations to the contents of caches at each program point. Pipeline analysis analyzes how instructions pass through the pipeline, accounting for occupancy of shared resources like queues, functional units, etc. Ignoring these average-case-enhancing features would result in imprecise bounds.
6) Global bound analysis [22], [23] finally determines bounds on the execution time for the whole program. Information about the execution time of basic blocks is combined to compute the shortest and the longest paths through the program. This phase takes into account information provided by the loop bound and control-flow analyses.
The commercially available tool aiT by AbsInt (cf. http://www.absint.de/wcet.htm) implements this architecture. It is used in the aeronautics and automotive industries and has been successfully used to determine precise bounds on execution times of real-time programs [6], [7], [10], [24].
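The static cache analysis mentioned in phase 5 can be illustrated with a toy must-analysis for a single LRU cache set, in the spirit of abstract-interpretation-based cache analysis. This is a sketch only; real analyses track all sets and integrate with the pipeline model:

```python
def must_update(state, block, assoc=4):
    """Abstract must-cache update for one LRU set.
    'state' maps block -> upper bound on its LRU age (0 = youngest).
    Accessing 'block' makes it youngest; blocks that were possibly
    younger than it age by one and fall out at age 'assoc'."""
    old_age = state.get(block, assoc)   # assoc means "not guaranteed cached"
    new = {}
    for b, age in state.items():
        if b == block:
            continue
        new_age = age + 1 if age < old_age else age
        if new_age < assoc:
            new[b] = new_age
    new[block] = 0
    return new

def must_join(s1, s2, assoc=4):
    """Join at a CFG merge: a block is definitely cached only if it is
    cached on both incoming paths; keep the worse (larger) age bound."""
    return {b: max(s1[b], s2[b])
            for b in s1.keys() & s2.keys()
            if max(s1[b], s2[b]) < assoc}

s = {}
for b in ["a", "b", "a", "c"]:
    s = must_update(s, b)
print(s)  # 'a' and 'b' are guaranteed cached; 'c' is youngest
```

Any block present in the must state at a program point is guaranteed to hit, so its access can be classified as "always hit" when bounding the enclosing basic block's execution time.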
III. PIPELINES
For nonpipelined architectures, one can simply add up the execution times of individual instructions to obtain a bound on the execution time of a basic block. Pipelines increase performance by overlapping the executions of different instructions. Hence, a timing analysis cannot consider individual instructions in isolation. Instead, they have to be considered collectively, together with their mutual interactions, to obtain tight timing bounds.
The analysis of a given program for its pipeline behavior is based on an abstract model of the pipeline. All components that contribute to the timing of instructions have to be modeled conservatively. Depending on the employed pipeline features, the number of states the analysis has to consider varies greatly.
A. Contributions to Complexity
Since most parts of the pipeline state influence timing, the abstract model needs to closely resemble the concrete hardware. The more performance-enhancing features a pipeline has, the larger is the search space. Superscalar and out-of-order executions increase the number of possible interleavings. The larger the buffers (e.g., fetch buffers, retirement queues, etc.), the longer the influence of past events lasts. Dynamic branch prediction, cachelike structures, and branch history tables increase history dependence even more.
All these features influence execution time. To compute a precise bound on the execution time of a basic block, the analysis needs to exclude as many timing accidents as possible.
Control Flow Graph (CFG)
• Analyze the code
• Split it into basic blocks
• Compute per-block WCETs
– using an abstract CPU model
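Combining per-block WCETs over the CFG amounts to a longest-path computation. Below is a minimal sketch for an acyclic CFG; the block WCETs and edges are made-up numbers, and production tools instead use an ILP-based (IPET) formulation and require loop bounds:

```python
# Hypothetical per-basic-block WCETs (cycles) and CFG edges for an
# if/else; real tools derive block WCETs from microarchitectural analysis.
block_wcet = {"entry": 5, "then": 20, "else": 8, "exit": 3}
succs = {"entry": ["then", "else"], "then": ["exit"],
         "else": ["exit"], "exit": []}

def program_wcet(block="entry"):
    """Longest path through an acyclic CFG: each block contributes its
    WCET plus the worst of its successors. Loops must be bounded first
    (e.g., via loop bound analysis feeding an ILP)."""
    return block_wcet[block] + max(
        (program_wcet(s) for s in succs[block]), default=0)

print(program_wcet())  # entry + then + exit = 5 + 20 + 3 = 28
```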
Timing Anomalies
• Locally faster != globally faster
Image source: [Wilhelm et al., 2008]
Challenge: Shared Memory Hierarchy
• Memory performance varies widely due to interference
• Task WCET can be extremely pessimistic
[Diagram: Tasks 1–4 on Cores 1–4, each with private I/D caches, contending for the shared cache, memory controller (MC), and DRAM]
Effect of Memory Interference
• A DNN control task suffers a >10X slowdown
– when different tasks are co-scheduled on otherwise idle cores
[Chart: normalized execution time, solo vs. co-run, for DNN (Cores 0–1) and BwWrite (Cores 2–3); diagram shows the four cores sharing the LLC and DRAM]
Waqar Ali and Heechul Yun. “RT-Gang: Real-Time Gang Scheduling Framework for Safety-Critical Systems.” RTAS, 2019 (to appear)
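The BwWrite co-runner is, in spirit, a loop that writes one word per cache line across a buffer larger than the shared LLC. A rough Python sketch of the idea follows; the real benchmark is a tight C loop, and the 64-byte line size and 16 MiB buffer are assumptions about the target platform:

```python
# Assumed platform parameters: 64-byte cache lines, LLC smaller than 16 MiB.
LINE = 64
BUF_SIZE = 16 * 1024 * 1024

def bw_write(iters=10):
    """Touch one byte per cache line across a buffer larger than the
    shared LLC, so (nearly) every write misses in cache and consumes
    shared memory bandwidth, interfering with co-running tasks."""
    buf = bytearray(BUF_SIZE)
    for _ in range(iters):
        for i in range(0, BUF_SIZE, LINE):
            buf[i] = 0xFF   # one write per cache line
    return buf

buf = bw_write(iters=1)
```

Running one such loop per "attacker" core while timing the DNN task on the remaining cores reproduces the solo-vs-corun comparison in the figure.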
Cache Denial-of-Service Attacks
Michael G. Bechtel and Heechul Yun. “Denial-of-Service Attacks on Shared Cache in Multicore: Analysis and Prevention.” In RTAS, 2019 (to appear, Outstanding Paper Award)
[Diagram: victim on Core1; attackers on Cores 2–4, all sharing the LLC]
• Observed worst case: >300X slowdown
– on simple in-order multicores (Raspberry Pi 3, Odroid C2)
• Difficult to guarantee predictable timing
Real-Time CPU Architectures
• PRET (UC Berkeley)
• MERASA/parMERASA (EU)
• ACROSS (EU)
• ARAMIS (Germany)
• EMC2 (EU)
FlexPRET: A Processor Platform for Mixed-Criticality Systems
RTAS, 2014
PRET Pipeline
[Diagram: thread-interleaved pipeline with stages FETCH, DECODE, REGACC, EXECUTE, MEM, EXCEPT; each clock cycle, the next of Threads #1–#6 enters the pipeline in round-robin order, so Thread 1’s Instruction 2 issues only after the other five threads have each issued an instruction]
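The round-robin thread interleaving above can be sketched as a schedule generator. This is an illustration of the idea only, not the actual PRET/FlexPRET scheduler (FlexPRET, for instance, also supports flexible soft-thread scheduling):

```python
from collections import deque

def interleaved_schedule(n_threads=6, cycles=12):
    """Each clock, the fetch stage takes an instruction from the next
    hardware thread in round-robin order. With at least as many threads
    as pipeline stages, two instructions of the same thread never
    overlap in the pipeline, so hazard/forwarding logic disappears and
    per-thread timing becomes constant and analyzable."""
    threads = deque(range(1, n_threads + 1))
    schedule = []
    for _ in range(cycles):
        schedule.append(threads[0])
        threads.rotate(-1)
    return schedule

sched = interleaved_schedule()
print(sched)  # thread 1's next instruction issues 6 cycles after its last
```

The cost of this design is single-thread latency: each hardware thread sees only one issue slot every `n_threads` cycles, trading throughput for predictability.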
FlexPRET Pipeline
Hardware Support for WCET Analysis of Hard Real-Time
Multicore Systems
ISCA 2009
Analyzable Multicore Architecture
• Idea 1: Bound interference on shared resources
– On-chip shared bus
– (Shared) L2 cache
• Idea 2: WCET computation mode
Architecture
Round-Robin Bus Arbitration
• UBD = (N_HRT − 1) × L_bus
– Worst-case per-request delay under round-robin: each of the other N_HRT − 1 hard real-time cores may be granted the bus once first, holding it for up to L_bus cycles
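As a worked example of the bound, assuming the round-robin argument above:

```python
def ubd(n_hrt, l_bus):
    """Upper-bound delay (UBD) for one bus request under round-robin
    arbitration: each of the other hard real-time cores may be served
    exactly once first, each occupying the bus for L_bus cycles."""
    return (n_hrt - 1) * l_bus

print(ubd(4, 8))  # 4 HRT cores, 8-cycle transactions -> 24 cycles
```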
Request vs. Job-level WCET Analysis
• Request-level analysis
– Assume worst-case interference for each access of the task under analysis
– Pessimistic as not all accesses will get interference
• Job-level analysis
– Assume the total number of competing memory accesses is known
– Can reduce pessimism
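The difference between the two analyses can be shown with made-up numbers. This is a sketch of the accounting only; the paper's actual analysis is more detailed about where in the job the competing accesses can land:

```python
def request_level_bound(solo_wcet, n_accesses, ubd):
    """Request-level: every one of the task's accesses is assumed to
    suffer the full worst-case arbitration delay."""
    return solo_wcet + n_accesses * ubd

def job_level_bound(solo_wcet, n_accesses, competing, l_bus, ubd):
    """Job-level: competitors issue only 'competing' requests in total,
    so total interference is also capped by that budget."""
    return solo_wcet + min(n_accesses * ubd, competing * l_bus)

# Made-up numbers: 10,000-cycle solo WCET, 1,000 accesses, UBD = 24,
# but competitors issue only 500 requests of 8 cycles each.
print(request_level_bound(10_000, 1_000, 24))      # 34000
print(job_level_bound(10_000, 1_000, 500, 8, 24))  # 14000
```

With few competing accesses, the job-level bound is far tighter, because not every access of the task under analysis can actually be delayed.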
Summary
• Timing anomalies
– Locally fast != globally fast on non-timing compositional architectures (i.e., most architectures)
• Timing compositional architecture
– Free of timing anomalies
Discussion
• Why is this interesting?
• Are assumptions realistic?
– Task model
– Cache model
– Memory model
– CPU (pipeline) model
Atomic vs. Split-Transaction Bus
• …
J. P. Shen and M. H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. Waveland Press, 2013.
Announcement
• Mini Project #1
• DeepPicar Competition
– Build a self-driving car
– Based on DeepPicar
– Competition format
Acknowledgement
• Some slides are from:
– Prof. Rodolfo Pellizzoni, University of Waterloo
– Prof. Edward A. Lee, UC Berkeley