Marco Caccamo
Department of Computer Science, University of Illinois at Urbana-Champaign
Toward the Predictable Integration of Real-Time COTS Based Systems
Part of this research is joint work with Prof. Lui Sha
This presentation is from selected research sponsored by:
◦ National Science Foundation
◦ Lockheed Martin Corporation
Graduate students who led these research efforts were:
◦ Rodolfo Pellizzoni
◦ Bach D. Bui
References
R. Pellizzoni, B.D. Bui, M. Caccamo and L. Sha, "Coscheduling of CPU and I/O Transactions in COTS-based Embedded Systems," to appear in Proceedings of the IEEE Real-Time Systems Symposium, Barcelona, December 2008.
R. Pellizzoni and M. Caccamo, "Toward the Predictable Integration of Real-Time COTS Based Systems," Proceedings of the IEEE Real-Time Systems Symposium, Tucson, Arizona, December 2007.
Acknowledgement
Embedded systems are increasingly built by using Commercial Off-The-Shelf (COTS) components to reduce costs and time-to-market
This trend is true even for companies in the safety-critical avionic market such as Lockheed Martin Aeronautics, Boeing and Airbus
COTS components usually provide better performance:
◦ SAFEbus used in the Boeing 777 transfers data at up to 60 Mbps, while a COTS interconnect such as PCI Express can reach transfer speeds over three orders of magnitude higher
COTS components are mainly optimized for the average case performance and not for the worst-case scenario.
COTS HW & RT Embedded Systems
Experiment based on an Intel Platform, typical embedded system speed.
PCI-X 133Mhz, 64 bit fully loaded. Task suffers continuous cache misses. Up to 44% wcet increase.
This is a big problem!!!
I/O Bus Transactions & WCETs
According to ARINC 653 avionic standard, different computational components should be put into isolated partitions (cyclic time slices of the CPU).
ARINC 653 does not provide any isolation from the effects of I/O bus traffic. A peripheral is free to interfere with cache fetches while any partition (not requiring that peripheral) is executing on the CPU.
To provide true temporal partitioning, enforceable specifications must address the complex dependencies among all interacting resources.
See Aeronautical Radio Inc. ARINC 653 Specification. It defines the Avionics Application Standard Software Interface.
ARINC 653 and unpredictable COTS behaviors
Cache-peripheral conflict:
◦ Master peripheral working for Task B.
◦ Task A suffers a cache miss.
◦ Processor activity can be stalled due to interference at the FSB level.
How relevant is the problem?
◦ Four high performance network cards, saturated bus.
◦ Up to 49% increased wcet for memory intensive tasks.
[Figure: platform diagram — CPU on the Front Side Bus with DDRAM, Host PCI Bridge, and a PCI Bus hosting master and slave peripherals; Task A and Task B share the CPU]
This effect MUST be considered in wcet computation!!
Sebastian Schonberg, "Impact of PCI-Bus Load on Applications in a PC Architecture," RTSS 2003
Peripheral Integration: Problem Scenario
To achieve end-to-end temporal isolation, shared resources (CPU, bus, cache, peripherals, etc.) should either support strong isolation or temporal interference should be quantifiable.
Highly pessimistic assumptions are often made to compensate for the lack of end-to-end temporal isolation on COTS
◦ An example is to account for the effect of all peripheral traffic in the wcet of real-time tasks (up to 44% increment in task wcet)!
Lack of end-to-end temporal isolation dramatically raises integration costs and is a source of serious concerns during the development of safety critical embedded systems
◦ At integration time (the last phase of the design cycle), testing can reveal unexpected deadline misses, causing expensive design rollbacks
Goal: End-to-End Temporal Isolation on COTS
It is mandatory to take a closer look at HW behavior and its integration with the OS, middleware, and applications
We aim at analyzing the temporal interference caused by COTS integration
◦ if the analyzed performance is not satisfactory, we search for alternative (non-intrusive) HW solutions (see Peripheral Gate)
Goal: End-to-End Temporal Isolation on COTS
We introduced an analytical technique that computes safe bounds on the I/O-induced task delay (D).
To control I/O interference over task execution, we introduced a coscheduling technique for CPU & I/O Peripherals
We designed a COTS-compatible peripheral gate and hardware server to enable/disable I/O peripherals (hw server is in progress!)
Main Contributions
COTS are inherently unpredictable due to:
o Pipelined, cached CPUs.
o Master (DMA) peripherals.
o Etc.
[Figure: example COTS embedded architecture — a PowerPC clocked at 1000 MHz with multi-level cache; a 64-bit memory bus with 256 MB DDR SDRAM clocked at 125 MHz and a system controller; a 64-bit PCI-X bus clocked at 100 MHz and 32-bit PCI buses clocked at 33/66 MHz, connected through PCI-to-PCI and PCI-X-to-PCI bridges; peripherals include Ethernet, copper Fibre Channel, IEEE 1394, and RS-485 network interfaces, a graphics processor, MPEG compression, digital video, shared memory, and discrete I/O]
Modern COTS-based embedded architectures are multi-master platforms
We assume a shared memory architecture with single-port RAM
We will show safe bounds for cache-peripheral interference at the main memory level.
The cache-peripheral interference problem
Similar to the network calculus approach.
E(t): maximum cumulative bus time required in any interval of length t.
How to compute:
◦ Measurement.
◦ Knowledge of distributed traffic.
Assumptions:
◦ Maximum non-preemptive transaction length: L'
◦ No buffering in bridges (the analysis was extended to the case with buffering too!).
Peripheral Burstiness Bound
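As a sketch of how E(t) could be obtained by measurement, the following Python computes the burstiness bound from a recorded trace of bus transactions. The (start, length) trace format and the function name are illustrative, not from the original work:

```python
def burstiness_bound(transactions, t):
    """E(t): maximum cumulative bus time requested by peripherals in any
    window of length t, computed from a measured trace of non-preemptive
    transactions given as (start, length) pairs."""
    # The maximum is attained when the window is aligned with either a
    # transaction start or a transaction end, so only those window
    # positions need to be checked.
    candidates = [s for s, _ in transactions] + \
                 [s + l - t for s, l in transactions]
    best = 0.0
    for w0 in candidates:
        w1 = w0 + t
        # Sum the overlap of each transaction with the window [w0, w1].
        busy = sum(min(s + l, w1) - max(s, w0)
                   for s, l in transactions
                   if s < w1 and s + l > w0)
        best = max(best, busy)
    return best

trace = [(0, 2), (3, 2), (10, 2), (12, 2)]
# E(5) = 4: e.g. the window [0, 5] contains two full 2-unit transactions.
```

Running the bound over a family of window lengths t yields the whole curve E(t) needed by the delay analysis.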
c(t): cumulative bus time required to fetch/replace cache lines in [0, t].
Note: not an upper bound! Assumptions:
◦ CPU is stalled while waiting for a lvl2 cache line fetch (no hyperthreading).
How to compute:
◦ Static analysis.
◦ Profiling.
Profiling yields multiple traces; run the delay analysis on all of them.
[Figure: c(t) as bus time vs. t up to the wcet — flat curve: CPU executing; increasing curve (slope 1): CPU stalled during a cache line fetch]
Cache Miss Profile
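A minimal sketch of evaluating c(t) from a profiled trace (the trace format is an assumption for illustration): each fetch contributes slope 1 while it is in progress, and the curve is flat in between while the CPU executes.

```python
def cache_access_function(fetches, t):
    """c(t): cumulative bus time spent fetching/replacing cache lines in
    [0, t], from a profiled trace of (fetch_start, fetch_length) pairs.
    Flat while the CPU executes; slope 1 while it stalls on a fetch."""
    # Each fetch [s, s+l] contributes its overlap with [0, t].
    return sum(max(0.0, min(t, s + l) - s) for s, l in fetches)

profile = [(0, 2), (5, 2), (9, 1)]
# c(1) = 1 (mid-fetch), c(6) = 3, c(20) = 5 (all fetches complete).
```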
The proposed analysis computes worst case increase (D) on task computation time due to cache delays caused by FSB interference.
Main idea: treat the FSB + CPU cache logic as a switch that multiplexes accesses to system memory. ◦ Inputs: Cache line misses over time and peripheral bandwidth.◦ Output: Curve representing the delayed cache misses.
Bus arbitration is assumed RR or FP; transactions are non-preemptive.
[Figure: input curves — cache misses over time and peripheral bandwidth — and the output curve of delayed cache misses, showing the wcet increment (D) on the CPU/PCI timelines relative to the wcet with no I/O interference]
Cache Delay Analysis
Worst case situation: PCI transaction accepted just before CPU cache miss.
Worst case interference: min(CM, PT/L') · L'
◦ CM: number of cache misses
◦ PT: total peripheral traffic during task execution
◦ Assuming RR bus arbitration
[Figure: CPU and PCI timelines — l: cache line length, L': max transaction length; dots mark cache misses]
Analysis: Intuition (1/2)
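The intuition above translates directly into code: under round-robin arbitration each cache miss waits for at most one in-flight non-preemptive transaction of length at most L', and the total delay can never exceed the total peripheral traffic. A minimal sketch (names are illustrative):

```python
def simple_interference_bound(cm, pt, l_max):
    """Worst-case I/O-induced delay under RR arbitration:
    min(CM, PT/L') * L' — at most one non-preemptive transaction
    (length <= L') per cache miss, capped by the total traffic PT."""
    return min(cm, pt / l_max) * l_max

# 10 misses but only 6 units of peripheral traffic: the traffic caps the delay.
# 2 misses with plenty of traffic: the miss count caps the delay at 2 * L'.
```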
The analysis shown is pessimistic; cache misses exhibit burst behavior.
Example: assume 1 peripheral transaction every T time units.
Real analysis: compute exact interference pattern based on burstiness of cache misses and peripheral transactions.
[Figure: CPU and PCI timelines with one peripheral transaction every T time units — some CPU memory accesses cannot be delayed, and some peripheral transactions cannot delay the CPU]
Analysis: Intuition (2/2)
Worst case situation: peripheral transaction of length L’ accepted just before CPU cache miss.
[Figure: cache access function c(t) over [0, 55], with fetch start times t1 … t5 and the wcet marked — the fetch start times in c(t) are unmodified by peripheral activity]
Worst Case Interference Scenario
Cache Bound: the max number of interfering peripheral transactions equals the number of cache misses.
Let CM be the number of cache misses. Then D ≤ CM · L'.
[Figure: CACHE, CPU, and PERIPHERAL timelines over [0, 55] — each fetch is delayed by at most one transaction, shifting t5 and the wcet by D]
Bound: Cache Misses
Peripheral Bound: the max interference D is bounded by the max bus time requested by peripherals in the interval [t1, t5 + D].
So D ≤ E(t5 − t1 + D). Let Ẽ(t) = max{x | x ≤ E(t + x)}.
Then, equivalently: D ≤ Ẽ(t5 − t1).
[Figure: CACHE, CPU, and PERIPHERAL timelines over [0, 55] — peripheral traffic inside [t1, t5 + D] delays the fetches, stretching the wcet by D]
In general, given a set of fetches {fi,…,fj} with start times {ti,…,tj}: D ≤ Ẽ(tj − ti)
Bound: Peripheral Load
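One way to evaluate a bound of the form D ≤ min(CM·L', E(span + D)), which contains D on both sides, is fixed-point iteration from below: since E is non-decreasing and the expression is capped by CM·L', the iterates grow monotonically and converge. A hedged sketch (the concrete E curve here is invented for illustration):

```python
def delay_bound(cm, l_max, E, span, eps=1e-9, max_iter=10_000):
    """Solve D = min(CM * L', E(span + D)) by fixed-point iteration.
    E must be non-decreasing; starting from D = 0 the iteration
    converges monotonically to the least fixed point."""
    d = 0.0
    for _ in range(max_iter):
        nxt = min(cm * l_max, E(span + d))
        if nxt - d <= eps:
            return nxt
        d = nxt
    raise RuntimeError("fixed point did not converge")

# Illustrative burstiness curve: burst of 4 bus-time units plus rate 1/2.
E = lambda t: min(4 + 0.5 * t, 20)
# With CM = 7, L' = 2 and a fetch interval of length 12,
# D iterates 0 -> 10 -> 14 and settles at min(CM*L', E(span+D)) = 14.
```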
There is a circular dependency between the amount of peripheral load that interferes with {fi,…,fj} and the delay D(fi, fj).
When peripheral traffic is injected on the FSB, the start time of each fetch is delayed. In turn, this increases the time interval between fi and fj and therefore more peripheral traffic can now interfere with those fetches.
Our key idea is that we do not need to modify the start times {ti,…,tj} of the fetches when we take the I/O traffic injected on the FSB into account. Instead, we account for it in the equation that defines E(t).
Some Insights about Peripheral Bound
E(36) represents both the maximum delay suffered by fetches within [0, 36] and the increase in the time interval available for interfering traffic.
Fetches in interval [0, 36]: max interference D ≤ E(36)
[Figure: CACHE and PERIPHERAL timelines with c(t) and E(t) plotted over [0, 55] — one marked peripheral transaction cannot interfere, because the cache is not in use when it occurs]
E(t5 − t1 + D) = 14, so the single-interval bound gives D = min(CM · L', E(t5 − t1 + D)) = 14.
The real worst case delay is 13! Reason: the cache is too bursty; interference from one peripheral transaction is "lost" while the cache is not used.
The Intersection is not Tight!
[Figure: CACHE and PERIPHERAL timelines with c(t) and E(t) over [0, 55], with the misses split at t3]
Solution: split into multiple intervals. How many intervals do we need to consider?
D_{1,3} ≤ E(t3 − t1 + D_{1,3}) = 7 and D_{4,5} ≤ 2L' = 6, so D_{1,3} + D_{4,5} = 13.
The Intersection is not Tight!
Iterative algorithm evaluates N(N+1)/2 intervals.
Each interval computed in O(1), overall complexity O(N2).
Bound is tight (see RTSS’07).
[Figure: triangular table of the intervals [t1,t1]; [t1,t2], [t2,t2]; [t1,t3], [t2,t3], [t3,t3]; [t1,t4], …, [t4,t4]; … — yielding the max delay u1, u2, u3, u4, … for each miss]
Delay Algorithm
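A rough Python sketch of the interval-splitting idea (the exact O(1)-per-interval recurrence and its tightness proof are in the RTSS'07 paper; this naive version just minimizes the summed per-interval bounds over all decompositions, evaluating the N(N+1)/2 intervals explicitly, and assumes — as in the slides' example — that summing per-interval bounds over any decomposition is safe):

```python
def interval_bound(cm, span, l_max, E):
    """Per-interval delay bound: least fixed point of
    D = min(cm * L', E(span + D)), with E non-decreasing."""
    d, prev = 0.0, -1.0
    while d - prev > 1e-9:
        prev, d = d, min(cm * l_max, E(span + d))
    return d

def worst_case_delay(miss_times, l_max, E):
    """Tightest delay bound obtained by splitting the misses into
    consecutive intervals and summing each interval's bound.
    best[k] covers misses 0..k-1; all N(N+1)/2 intervals are tried."""
    n = len(miss_times)
    best = [0.0] * (n + 1)
    for k in range(1, n + 1):
        # choose the best start i for the last interval covering i..k-1
        best[k] = min(
            best[i] + interval_bound(
                k - i, miss_times[k - 1] - miss_times[i], l_max, E)
            for i in range(k)
        )
    return best[n]

E = lambda t: min(2 + 0.5 * t, 20)  # illustrative curve, burst = L' = 2
```

With two far-apart misses this correctly charges each miss at most one transaction rather than stretching a single burst across both.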
Multitasking analysis using a cyclic executive (it was extended to EDF with a restricted-preemption model).
1. Analyze task Control Flow Graph.
2. Build a set of sequential superblocks.
3. Schedule is interleaving of slots composed of superblocks.
4. Algorithm: compute number of superblocks in each slot.
5. Account for additional cache misses due to inter-task cache interference.
[Figure: task Control Flow Graph reduced to sequential superblocks S1 … S6]
Multitasking analysis
The proposed analysis makes a fairly restrictive assumption: it must know the exact time of each cache miss.
I/O interference is significant: when added to the wcet of all tasks, the system can suffer a huge waste of bandwidth!
Key idea: let’s coschedule CPU & I/O Peripherals
Goal: allow as much peripheral traffic as possible at run-time while using CPU reservations that do NOT include I/O interference (D).
Great! But c(t) is hard to get... and 44% is awful
Problem: obtaining an exact cache miss pattern is very hard.
◦ CPU simulation requires simulating all peripherals.
◦ Static analysis scales poorly.
◦ In practice, testing is often the preferred way.
Our solution:
◦ Split the tasks into intervals.
◦ Insert a checkpoint at the end of each interval.
◦ Measure the wcet and worst case # of cache misses for each interval (with no peripheral traffic).
◦ Checkpoints should not break loops or branches (sequential macroblock boundaries).
[Figure: superblocks S1 … S6 with a checkpoint at the end of each, starting from the task entry point]
Cache Miss Profile is Hard to Get
A coscheduling technique for COTS peripherals
1. Divide each task into a series of sequential superblocks;
2. Run off-line profiling for each task, collecting information on the wcet and # of cache misses in each superblock (without I/O interference);
3. Compute a safe (wcet+D) bound for each superblock (it includes I/O interference) by assuming a "critical cache miss pattern";
4. Design a peripheral gate (p-gate) to enable/disable I/O peripherals;
5. Design a new peripheral (on an FPGA board), the reservation controller, which executes the coscheduling algorithm and controls all p-gates;
6. Use the profiling information at run-time to coschedule tasks and I/O transactions.
CPU & I/O coscheduling: HOW TO
Input: a set of intervals with wcet and cache misses.
Since we do not know when each cache miss happens within each interval, we need to identify a worst case pattern.
[Figure: a task divided into five intervals with per-interval (wcet_i, CM_i); bus-time plots compare candidate cache miss patterns for an interval with wcet_i and CM_i = 4]
If the Peripheral Load Curve is concave, then we obtain a tight bound for delay D (details are in a technical report).
If the Peripheral Load Curve is not concave, the bound for delay D is not tight. Simulations showed that the upper bound is within 0.2% of the real worst case delay.
This is actually the worst case pattern!
Analysis with Interval Information
[Figure: task timeline divided into intervals wcet1 … wcet5; their sum is the task's total wcet]
The on-line algorithm:
◦ Non-safety critical tasks have CPU reservation = wcet (D NOT included!)
◦ At the beginning of each job, the p-gates are closed.
◦ At run time, at each checkpoint the OS sends the APMC # of CPU cycles (exec_i) to the reservation controller.
◦ The reservation controller keeps track of the accumulated slack time. If the slack time Σ_i (wcet_i − exec_i) is greater than the delay D for the next interval, open the p-gate.
On-line Coscheduling Algorithm
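The checkpoint rule above can be sketched as follows (class and field names are illustrative; in the real system this decision runs in the reservation controller hardware, fed by APMC readings from the kernel):

```python
class ReservationController:
    def __init__(self, wcets, delays):
        self.wcets = wcets        # wcet_i per superblock (no I/O interference)
        self.delays = delays      # worst-case I/O delay bound D_i per superblock
        self.slack = 0.0
        self.p_gate_open = False  # p-gates start closed at each job release

    def checkpoint(self, i, exec_time):
        """Called when superblock i finishes with measured exec_time
        (from the APMC cycle counter). Accumulate slack, then open the
        p-gate only if the next superblock can absorb its worst-case
        I/O-induced delay without exceeding the task's reservation."""
        self.slack += self.wcets[i] - exec_time
        nxt = i + 1
        self.p_gate_open = (
            nxt < len(self.wcets) and self.slack >= self.delays[nxt]
        )
        return self.p_gate_open

# Example run: after superblock 0 the slack (2) cannot cover D_1 = 4,
# so the gate stays closed; after superblock 1 the slack (5) covers D_2 = 2.
```

Note that when the gate is open, any I/O-induced delay shows up in the next measured exec_time, so the slack account shrinks automatically.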
wcet1 wcet2 wcet3 wcet4 wcet5
Initial slack = 0 => p-gate closed
Coscheduling algorithm: an example
[Figure: timeline wcet1 … wcet5 — exec1 finishes early; Slack += wcet1 − exec1, but Slack < D2, so the p-gate stays closed]
Coscheduling algorithm: an example
[Figure: timeline wcet1 … wcet5 — after exec2, Slack += wcet2 − exec2; now Slack ≥ D3, so the p-gate opens and interval 3 is budgeted wcet3 + D3]
Coscheduling algorithm: an example
System composed of tasks/partitions with different criticalities: each task/partition uses different I/O peripherals.
The right action depends on the task/partition criticality
◦ Class A: block all non relevant peripheral traffic (Reservation=wcet+D)
◦ Class B: coschedule tasks and peripherals to maximize I/O traffic (Reservation=wcet).
◦ Class C: all I/O peripherals are enabled
[Figure: partition timeline — Class A: safety critical (e.g., flight control); Class B: mission critical (e.g., radar processing); Class C: non critical (e.g., display)]
System Integration: example for avionic domain
We designed the peripheral gate (or p-gate for short) for the PCI/PCI-X bus: it allows us to control peripheral access to the bus.
The peripheral gate is compatible with COTS devices: its use does not require any modifications to either the peripheral or the motherboard.
Reservation controller commands Peripheral Gate (p-gate).
Kernel sends scheduling information to Reservation Controller.
Minimal kernel modification (send PID and exec of executing process).
Class A task: block all non relevant peripheral traffic
Class B task: reservation controller implements coscheduling algorithm.
[Figure: architecture — CPU, RAM, and FSB logic; the Reservation Controller commands the Peripheral Gates sitting on the peripheral bus. The CPU schedule over time shows class A processes P#1, P#2, P#3 executing in turn, with each p-gate following the schedule]
Peripheral Gate
Testbed uses standard Intel platform.
Reservation controller implemented on FPGA, p-gate uses PCI extender card + discrete logic.
Logic analyzer for debugging and measurement.
[Photo: prototype — p-gate on a Gigabit Ethernet NIC, Reservation Controller on a Xilinx FPGA]
Current Prototype
Getting this information requires support from the CPU and the OS.
We used the Architectural Performance Monitor Counters of the Intel Core 2 microarchitecture, but other manufacturers (e.g., IBM) have similar support (the implementation is specific, the lesson is general).
Two APMCs configured to count cache misses and CPU cycles in user space.
Task descriptor extended with exec. time and cache miss fields.
At context switch, the APMCs are saved/restored in descriptors like any other task-specific CPU registers.
Implemented under Linux/RK.
Kernel Implementation
We compared our adaptive heuristic with other algorithms.
Assumption: at the beginning of each interval the algorithm chooses whether to open or close the switch for that interval.
1. Slack-only: baseline comparison, uses only the slack time remaining when a task has finished.
2. Predictive:
◦ Also uses measured average exec times.
◦ "Predicts" slack time in the future and optimizes open intervals at each step.
◦ Computing an optimal allocation is NP-hard; instead it uses a fast greedy heuristic.
3. Optimal:
◦ Clairvoyant (not implementable).
◦ Provides an upper bound on the performance of any run-time, predictive algorithm.
Other Coscheduling Algorithms
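To convey the flavor of the predictive approach, here is a rough, invented sketch (this greedy walk is an illustration, not the paper's actual heuristic): assume the remaining superblocks take their measured average time and plan gate openings wherever the predicted slack covers the delay bound.

```python
def predictive_plan(wcets, avg_execs, delays):
    """Greedy, illustrative sketch of a predictive schedule: walk the
    superblocks assuming average execution times; plan to open the
    p-gate for every interval whose predicted slack covers its
    worst-case I/O delay bound. At run time the safe slack check of the
    on-line algorithm still gates the actual decision."""
    slack, plan = 0.0, []
    for wcet, avg, d in zip(wcets, avg_execs, delays):
        open_gate = slack >= d
        plan.append(open_gate)
        # If open, budget the interval for its average time plus the
        # delay bound; otherwise just the average time.
        slack += wcet - (avg + d if open_gate else avg)
    return plan

# With generous slack from early intervals, later intervals get opened.
```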
All run-time algorithms implemented on Xilinx ML505 FPGA.
Optimal computed using Matlab optimization tool.
We used an MPEG decoder as a benchmark.
◦ As a trend, video processing is increasingly used in the avionic domain for mission control.
◦ It simulates a Class B application subject to heavy I/O traffic.
The task misses its deadline by up to 30% if I/O traffic is always allowed!
The run-time algorithm is already close to the optimal; not much to gain with the improved heuristic.
Results in terms of % of time the p-gate is open:
Slack-only: 4.89%, Run-time: 31.21%, Predictive: 36.65%, Optimal: 40.85%
The Test
We performed synthetic simulations to better understand the performance of the run-time algorithm.
20 superblocks per task:
◦ α is the variation between the wcet and the avg computation time.
◦ β is the % of time the task is stalled due to cache misses.
[Figure: simulation results as a function of β and α]
Simulation Results
Problem: blocking the peripheral reduces maximum throughput.
◦ OK only if critical tasks/partitions run for a limited amount of time.
Better solution: implement a hardware server with buffering on a SoC.
◦ Transactions are queued in the hw server's memory during non-relevant partitions.
◦ Interrupts/DMA transfers are delivered only during execution of the interested tasks/partitions.
◦ Similar to real-time aperiodic servers: a hw server permits aperiodic I/O requests to be analyzed as if they followed a predictable (periodic) pattern.
[Figure: FPGA-based SoC — on the Xilinx FPGA: CPU, memory bridge, DRAM, and an OPB with PCI interface and interrupt controller; connected through the PCI host bridge to the peripheral and DDRAM]
FPGA-based SoC design with Linux device drivers.
Currently in development.
Improving the P-Gate: Hardware Server (in progress)
A major issue in peripheral integration is task delay due to cache-peripheral contention at the main memory level.
We proposed a framework to: 1) analyze the delay due to cache-peripheral contention; 2) control task execution times.
The proposed co-scheduling technique was tested with PCI/PCI-X bus; hw server will be ready soon.
Future work:
Extend to multi-processor and distributed systems
Conclusions