Real-Time Multi/Many-Core Architecture
Heechul Yun
Real-Time Multi/Many-Core Architecture
• Projects on Real-Time CPU Architectures
• Assigned Papers
– Shedding the Shackles of Time-Division Multiplexing, RTSS, 2018
– Deterministic Memory Abstraction and Supporting Multicore System Architecture. ECRTS, 2018
Trends in Automotive E/E Systems
Source: A. Hamann (Bosch). “Industrial Challenge: Moving from Classical to High-Performance Real-Time Systems.” WATER, 2018.
Centralization & High-Performance HW
Modern System-on-a-Chip (SoC)
[Diagram: Core1, Core2, GPU, NPU, … sharing a cache and a memory controller (MC) in front of DRAM]
• Integrates multiple cores, GPUs, and accelerators on one chip
• Good performance, size, weight, and power
• Challenge: time predictability
Worst-Case Execution Time (WCET)
• Real-time scheduling theory is based on the assumption of known WCETs of real-time tasks
Image source: [Wilhelm et al., 2008]
Computing WCET
• Static analysis
– Input: program code, architecture model
– Output: WCET bound
– Problem: accurate architecture models are hard to build, and the resulting bounds are pessimistic
• Measurement
– No guarantee of covering the true worst case
– But widely used in practice
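The measurement approach can be sketched as follows. This is a minimal illustration of the idea, not a safe WCET method; `task` is a hypothetical workload and the 20% margin is an arbitrary assumption:

```python
import time

def measure_wcet(task, n_runs=1_000, margin=1.2):
    """Measurement-based estimate: time many runs, keep the observed
    maximum, and apply a safety margin. There is no guarantee this
    covers the true worst case (the key weakness noted above)."""
    worst = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        task()
        worst = max(worst, time.perf_counter() - start)
    return worst * margin

def task():  # hypothetical task under analysis
    sum(i * i for i in range(1_000))

print(f"estimated WCET: {measure_wcet(task):.6f} s")
```

In practice, inputs and initial hardware states are also varied across runs, since the worst-case path may never be exercised otherwise.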
Memory Hierarchies, Pipelines, and Buses for Future Architectures in Time-Critical Embedded Systems
IEEE TCAD, 2009
“Problematic” CPU Features
• Architectures are optimized to improve average-case performance
• WCET estimation is hard because of:
– Pipelining
– TLBs/caches
– Superscalar execution
– Out-of-order scheduling
– Branch predictors
– Hardware prefetchers
– Basically, anything that affects processor state
Static Timing Analysis
968 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 28, NO. 7, JULY 2009
Fig. 1. Main components of a timing-analysis framework and their interaction.
A. Timing-Analysis Framework
Over the last several years, a more or less standard architecture for timing-analysis tools has emerged [11]–[13]. Fig. 1 shows a general view of this architecture. First, one can distinguish three major building blocks:
1) control-flow reconstruction and static analyses for control and data flow;
2) microarchitectural analysis, which computes upper and lower bounds on execution times of basic blocks;
3) global bound analysis, which computes upper and lower bounds for the whole program.
The following list presents the individual phases and describes their objectives and problems. Note that the first four phases are part of the first building block.
1) Control-flow reconstruction [14] takes a binary executable to be analyzed, reconstructs the program’s control flow, and transforms the program into a suitable intermediate representation. Problems encountered are dynamically computed control-flow successors, e.g., stemming from switch statements, function pointers, etc.
2) Value analysis [15], [16] computes an overapproximation of the set of possible values in registers and memory locations by an interval analysis and/or congruence analysis. This information is, among others, used for a precise data-cache analysis.
3) Loop bound analysis [17], [18] identifies loops in the program and tries to determine bounds on the number of loop iterations, information which is indispensable to bound the execution time. Problems are the analysis of arithmetic on loop counters and loop-exit conditions, as well as dependencies in nested loops.
4) Control-flow analysis [17], [19] narrows down the set of possible paths through the program by eliminating infeasible paths or determining correlations between the number of executions of different blocks, using the results of the value analysis. These constraints will tighten the obtained timing bounds.
5) Microarchitectural analysis [10], [20], [21] determines bounds on the execution time of basic blocks by performing an abstract interpretation of the program, taking into account the processor’s pipeline, caches, and speculation concepts. Static cache analyses determine safe approximations to the contents of caches at each program point. Pipeline analysis analyzes how instructions pass through the pipeline, accounting for occupancy of shared resources like queues, functional units, etc. Ignoring these average-case-enhancing features would result in imprecise bounds.
6) Global bound analysis [22], [23] finally determines bounds on the execution time for the whole program. Information about the execution time of basic blocks is combined to compute the shortest and the longest paths through the program. This phase takes into account information provided by the loop bound and control-flow analyses.
The commercially available tool aiT by AbsInt (cf. http://www.absint.de/wcet.htm) implements this architecture. It is used in the aeronautics and automotive industries and has been successfully used to determine precise bounds on execution times of real-time programs [6], [7], [10], [24].
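The static cache analysis mentioned in phase 5 can be illustrated with a toy must-analysis for a single LRU cache set, in the spirit of abstract-interpretation-based cache analysis. This is a sketch only; real analyses track all sets and integrate with the pipeline model:

```python
def must_update(state, block, assoc=4):
    """Abstract must-cache update for one LRU set.
    'state' maps block -> upper bound on its LRU age (0 = youngest).
    Accessing 'block' makes it youngest; blocks that were possibly
    younger than it age by one and fall out at age 'assoc'."""
    old_age = state.get(block, assoc)   # assoc means "not guaranteed cached"
    new = {}
    for b, age in state.items():
        if b == block:
            continue
        new_age = age + 1 if age < old_age else age
        if new_age < assoc:
            new[b] = new_age
    new[block] = 0
    return new

def must_join(s1, s2, assoc=4):
    """Join at a CFG merge: a block is definitely cached only if it is
    cached on both incoming paths; keep the worse (larger) age bound."""
    return {b: max(s1[b], s2[b])
            for b in s1.keys() & s2.keys()
            if max(s1[b], s2[b]) < assoc}

s = {}
for b in ["a", "b", "a", "c"]:
    s = must_update(s, b)
print(s)  # 'a' and 'b' are guaranteed cached; 'c' is youngest
```

Any block present in the must state at a program point is guaranteed to hit, so its access can be classified as "always hit" when bounding the enclosing basic block's execution time.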
III. PIPELINES
For nonpipelined architectures, one can simply add up the execution times of individual instructions to obtain a bound on the execution time of a basic block. Pipelines increase performance by overlapping the executions of different instructions. Hence, a timing analysis cannot consider individual instructions in isolation. Instead, they have to be considered collectively, together with their mutual interactions, to obtain tight timing bounds.
The analysis of a given program for its pipeline behavior is based on an abstract model of the pipeline. All components that contribute to the timing of instructions have to be modeled conservatively. Depending on the employed pipeline features, the number of states the analysis has to consider varies greatly.
A. Contributions to Complexity
Since most parts of the pipeline state influence timing, the abstract model needs to closely resemble the concrete hardware. The more performance-enhancing features a pipeline has, the larger is the search space. Superscalar and out-of-order executions increase the number of possible interleavings. The larger the buffers (e.g., fetch buffers, retirement queues, etc.), the longer the influence of past events lasts. Dynamic branch prediction, cachelike structures, and branch history tables increase history dependence even more.
All these features influence execution time. To compute a precise bound on the execution time of a basic block, the analysis needs to exclude as many timing accidents as possible.
Control Flow Graph (CFG)
• Analyze the code
• Split it into basic blocks
• Compute per-block WCETs
– using an abstract CPU model
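Combining per-block WCETs over the CFG amounts to a longest-path computation. Below is a minimal sketch for an acyclic CFG; the block WCETs and edges are made-up numbers, and production tools instead use an ILP-based (IPET) formulation and require loop bounds:

```python
# Hypothetical per-basic-block WCETs (cycles) and CFG edges for an
# if/else; real tools derive block WCETs from microarchitectural analysis.
block_wcet = {"entry": 5, "then": 20, "else": 8, "exit": 3}
succs = {"entry": ["then", "else"], "then": ["exit"],
         "else": ["exit"], "exit": []}

def program_wcet(block="entry"):
    """Longest path through an acyclic CFG: each block contributes its
    WCET plus the worst of its successors. Loops must be bounded first
    (e.g., via loop bound analysis feeding an ILP)."""
    return block_wcet[block] + max(
        (program_wcet(s) for s in succs[block]), default=0)

print(program_wcet())  # entry + then + exit = 5 + 20 + 3 = 28
```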
Timing Anomalies
• Locally faster != globally faster
Image source: [Wilhelm et al., 2008]
Challenge: Shared Memory Hierarchy
• Memory performance varies widely due to interference
• Task WCET can be extremely pessimistic
[Diagram: Tasks 1–4 on Cores 1–4, each with private I/D caches, contending for the shared cache, memory controller (MC), and DRAM]
Effect of Memory Interference
• A DNN control task suffers a >10X slowdown
– when different tasks are co-scheduled on otherwise idle cores
[Chart: normalized execution time, solo vs. co-run, for DNN (Cores 0–1) and BwWrite (Cores 2–3); diagram shows the four cores sharing the LLC and DRAM]
Waqar Ali and Heechul Yun. “RT-Gang: Real-Time Gang Scheduling Framework for Safety-Critical Systems.” RTAS, 2019 (to appear)
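The BwWrite co-runner is, in spirit, a loop that writes one word per cache line across a buffer larger than the shared LLC. A rough Python sketch of the idea follows; the real benchmark is a tight C loop, and the 64-byte line size and 16 MiB buffer are assumptions about the target platform:

```python
# Assumed platform parameters: 64-byte cache lines, LLC smaller than 16 MiB.
LINE = 64
BUF_SIZE = 16 * 1024 * 1024

def bw_write(iters=10):
    """Touch one byte per cache line across a buffer larger than the
    shared LLC, so (nearly) every write misses in cache and consumes
    shared memory bandwidth, interfering with co-running tasks."""
    buf = bytearray(BUF_SIZE)
    for _ in range(iters):
        for i in range(0, BUF_SIZE, LINE):
            buf[i] = 0xFF   # one write per cache line
    return buf

buf = bw_write(iters=1)
```

Running one such loop per "attacker" core while timing the DNN task on the remaining cores reproduces the solo-vs-corun comparison in the figure.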
Cache Denial-of-Service Attacks
Michael G. Bechtel and Heechul Yun. “Denial-of-Service Attacks on Shared Cache in Multicore: Analysis and Prevention.” In RTAS, 2019 (to appear, Outstanding Paper Award)
[Diagram: victim on Core1; attackers on Cores 2–4, all sharing the LLC]
• Observed worst case: >300X slowdown
– on simple in-order multicores (Raspberry Pi 3, Odroid C2)
• Difficult to guarantee predictable timing
Real-Time CPU Architectures
• PRET (UC Berkeley)
• MERASA/parMERASA (EU)
• ACROSS (EU)
• ARAMIS (Germany)
• EMC2 (EU)
FlexPRET: A Processor Platform for Mixed-Criticality Systems
RTAS, 2014
PRET Pipeline
[Diagram: thread-interleaved pipeline with stages FETCH, DECODE, REGACC, EXECUTE, MEM, EXCEPT; each clock cycle, the next of Threads #1–#6 enters the pipeline in round-robin order, so Thread 1’s Instruction 2 issues only after the other five threads have each issued an instruction]
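The round-robin thread interleaving above can be sketched as a schedule generator. This is an illustration of the idea only, not the actual PRET/FlexPRET scheduler (FlexPRET, for instance, also supports flexible soft-thread scheduling):

```python
from collections import deque

def interleaved_schedule(n_threads=6, cycles=12):
    """Each clock, the fetch stage takes an instruction from the next
    hardware thread in round-robin order. With at least as many threads
    as pipeline stages, two instructions of the same thread never
    overlap in the pipeline, so hazard/forwarding logic disappears and
    per-thread timing becomes constant and analyzable."""
    threads = deque(range(1, n_threads + 1))
    schedule = []
    for _ in range(cycles):
        schedule.append(threads[0])
        threads.rotate(-1)
    return schedule

sched = interleaved_schedule()
print(sched)  # thread 1's next instruction issues 6 cycles after its last
```

The cost of this design is single-thread latency: each hardware thread sees only one issue slot every `n_threads` cycles, trading throughput for predictability.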
FlexPRET Pipeline
Hardware Support for WCET Analysis of Hard Real-Time
Multicore Systems
ISCA 2009
Analyzable Multicore Architecture
• Idea 1: Bound interference on shared resources
– On-chip shared bus
– (Shared) L2 cache
• Idea 2: WCET computation mode
Architecture
Round-Robin Bus Arbitration
• UBD = (N_HRT − 1) × L_bus
– Worst-case per-request delay under round-robin: each of the other N_HRT − 1 hard real-time cores may be granted the bus once first, holding it for up to L_bus cycles
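As a worked example of the bound, assuming the round-robin argument above:

```python
def ubd(n_hrt, l_bus):
    """Upper-bound delay (UBD) for one bus request under round-robin
    arbitration: each of the other hard real-time cores may be served
    exactly once first, each occupying the bus for L_bus cycles."""
    return (n_hrt - 1) * l_bus

print(ubd(4, 8))  # 4 HRT cores, 8-cycle transactions -> 24 cycles
```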
Request vs. Job-level WCET Analysis
• Request-level analysis
– Assume worst-case interference for each access of the task under analysis
– Pessimistic as not all accesses will get interference
• Job-level analysis
– Assume the total number of competing memory accesses is known
– Can reduce pessimism
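The difference between the two analyses can be shown with made-up numbers. This is a sketch of the accounting only; the paper's actual analysis is more detailed about where in the job the competing accesses can land:

```python
def request_level_bound(solo_wcet, n_accesses, ubd):
    """Request-level: every one of the task's accesses is assumed to
    suffer the full worst-case arbitration delay."""
    return solo_wcet + n_accesses * ubd

def job_level_bound(solo_wcet, n_accesses, competing, l_bus, ubd):
    """Job-level: competitors issue only 'competing' requests in total,
    so total interference is also capped by that budget."""
    return solo_wcet + min(n_accesses * ubd, competing * l_bus)

# Made-up numbers: 10,000-cycle solo WCET, 1,000 accesses, UBD = 24,
# but competitors issue only 500 requests of 8 cycles each.
print(request_level_bound(10_000, 1_000, 24))      # 34000
print(job_level_bound(10_000, 1_000, 500, 8, 24))  # 14000
```

With few competing accesses, the job-level bound is far tighter, because not every access of the task under analysis can actually be delayed.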
Summary
• Timing anomalies
– Locally fast != globally fast on non-timing compositional architectures (i.e., most architectures)
• Timing compositional architecture
– Free of timing anomalies
Discussion
• Why is this interesting?
• Are assumptions realistic?
– Task model
– Cache model
– Memory model
– CPU (pipeline) model
Atomic vs. Split-Transaction Bus
• …
J. P. Shen and M. H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. Waveland Press, 2013.
Announcement
• Mini Project #1
• DeepPicar Competition
– Build a self-driving car
– Based on DeepPicar
– Competition format
Acknowledgement
• Some slides are from:
– Prof. Rodolfo Pellizzoni, University of Waterloo
– Prof. Edward A. Lee, UC Berkeley