Marco Caccamo
Department of Computer Science, University of Illinois at Urbana-Champaign
Toward the Predictable Integration of Real-Time COTS Based Systems
Part of this research is joint work with Prof. Lui Sha
This presentation is from selected research sponsored by:
◦ National Science Foundation
◦ Lockheed Martin Corporation
Graduate students who led these research efforts were:
◦ Rodolfo Pellizzoni
◦ Bach D. Bui
References
R. Pellizzoni, B.D. Bui, M. Caccamo and L. Sha, "Coscheduling of CPU and I/O Transactions in COTS-based Embedded Systems," to appear in Proceedings of the IEEE Real-Time Systems Symposium, Barcelona, December 2008.
R. Pellizzoni and M. Caccamo, "Toward the Predictable Integration of Real-Time COTS Based Systems," Proceedings of the IEEE Real-Time Systems Symposium, Tucson, Arizona, December 2007.
Acknowledgement
Embedded systems are increasingly built by using Commercial Off-The-Shelf (COTS) components to reduce costs and time-to-market
This trend is true even for companies in the safety-critical avionic market such as Lockheed Martin Aeronautics, Boeing and Airbus
COTS components usually provide better performance:
◦ SAFEbus used in the Boeing 777 transfers data at up to 60 Mbps, while a COTS interconnect such as PCI Express can reach transfer speeds over three orders of magnitude higher
COTS components are mainly optimized for the average case performance and not for the worst-case scenario.
COTS HW & RT Embedded Systems
Experiment based on an Intel Platform, typical embedded system speed.
PCI-X 133Mhz, 64 bit fully loaded. Task suffers continuous cache misses. Up to 44% wcet increase.
This is a big problem!!!
I/O Bus Transactions & WCETs
According to ARINC 653 avionic standard, different computational components should be put into isolated partitions (cyclic time slices of the CPU).
ARINC 653 does not provide any isolation from the effects of I/O bus traffic. A peripheral is free to interfere with cache fetches while any partition (not requiring that peripheral) is executing on the CPU.
To provide true temporal partitioning, enforceable specifications must address the complex dependencies among all interacting resources.
See Aeronautical Radio Inc. ARINC 653 Specification. It defines the Avionics Application Standard Software Interface.
ARINC 653 and unpredictable COTS behaviors
Cache-peripheral conflict:
◦ Master peripheral working for Task B.
◦ Task A suffers a cache miss.
◦ Processor activity can be stalled due to interference at the FSB level.
How relevant is the problem?
◦ Four high performance network cards, saturated bus.
◦ Up to 49% increased wcet for memory intensive tasks.
[Figure: platform diagram — CPU on the Front Side Bus with DDRAM, Host PCI Bridge, and a PCI Bus hosting master and slave peripherals; Task A and Task B share the CPU]
This effect MUST be considered in wcet computation!!
Sebastian Schonberg, "Impact of PCI-Bus Load on Applications in a PC Architecture," RTSS 2003
Peripheral Integration: Problem Scenario
To achieve end-to-end temporal isolation, shared resources (CPU, bus, cache, peripherals, etc.) should either support strong isolation or temporal interference should be quantifiable.
Highly pessimistic assumptions are often made to compensate for the lack of end-to-end temporal isolation on COTS
◦ An example is to account for the effect of all peripheral traffic in the wcet of real-time tasks (up to 44% increment in task wcet)!
Lack of end-to-end temporal isolation dramatically raises integration costs and is a source of serious concerns during the development of safety critical embedded systems
◦ At integration time (the last phase of the design cycle), testing can reveal unexpected deadline misses, causing expensive design rollbacks
Goal: End-to-End Temporal Isolation on COTS
It is mandatory to take a closer look at HW behavior and its integration with the OS, middleware, and applications
We aim at analyzing the temporal interference caused by COTS integration
◦ if the analyzed performance is not satisfactory, we search for alternative (non-intrusive) HW solutions (see Peripheral Gate)
Goal: End-to-End Temporal Isolation on COTS
We introduced an analytical technique that computes safe bounds on the I/O-induced task delay (D).
To control I/O interference over task execution, we introduced a coscheduling technique for CPU & I/O Peripherals
We designed a COTS-compatible peripheral gate and hardware server to enable/disable I/O peripherals (hw server is in progress!)
Main Contributions
COTS are inherently unpredictable due to:
o Pipelined, cached CPUs.
o Master (DMA) peripherals.
o Etc.
[Figure: example COTS embedded architecture — a PowerPC clocked at 1000 MHz with multi-level cache; a 64-bit memory bus with 256 MB DDR SDRAM clocked at 125 MHz and a system controller; a 64-bit PCI-X bus clocked at 100 MHz and 32-bit PCI buses clocked at 33/66 MHz, connected through PCI-to-PCI and PCI-X-to-PCI bridges; peripherals include Ethernet, copper Fibre Channel, IEEE 1394, and RS-485 network interfaces, a graphics processor, MPEG compression, digital video, shared memory, and discrete I/O]
Modern COTS-based embedded architectures are multi-master platforms
We assume a shared memory architecture with single-port RAM
We will show safe bounds for cache-peripheral interference at the main memory level.
The cache-peripheral interference problem
Similar to the network calculus approach.
E(t): maximum cumulative bus time required in any interval of length t.
How to compute:
◦ Measurement.
◦ Knowledge of distributed traffic.
Assumptions:
◦ Maximum non-preemptive transaction length: L'
◦ No buffering in bridges (the analysis was extended to the case with buffering too!).
Peripheral Burstiness Bound
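As a sketch of how E(t) could be obtained by measurement, the following Python computes the burstiness bound from a recorded trace of bus transactions. The (start, length) trace format and the function name are illustrative, not from the original work:

```python
def burstiness_bound(transactions, t):
    """E(t): maximum cumulative bus time requested by peripherals in any
    window of length t, computed from a measured trace of non-preemptive
    transactions given as (start, length) pairs."""
    # The maximum is attained when the window is aligned with either a
    # transaction start or a transaction end, so only those window
    # positions need to be checked.
    candidates = [s for s, _ in transactions] + \
                 [s + l - t for s, l in transactions]
    best = 0.0
    for w0 in candidates:
        w1 = w0 + t
        # Sum the overlap of each transaction with the window [w0, w1].
        busy = sum(min(s + l, w1) - max(s, w0)
                   for s, l in transactions
                   if s < w1 and s + l > w0)
        best = max(best, busy)
    return best

trace = [(0, 2), (3, 2), (10, 2), (12, 2)]
# E(5) = 4: e.g. the window [0, 5] contains two full 2-unit transactions.
```

Running the bound over a family of window lengths t yields the whole curve E(t) needed by the delay analysis.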
c(t): cumulative bus time required to fetch/replace cache lines in [0, t].
Note: not an upper bound! Assumptions:
◦ CPU is stalled while waiting for a lvl2 cache line fetch (no hyperthreading).
How to compute:
◦ Static analysis.
◦ Profiling.
Profiling yields multiple traces; run the delay analysis on all of them.
[Figure: c(t) as bus time vs. t up to the wcet — flat curve: CPU executing; increasing curve (slope 1): CPU stalled during a cache line fetch]
Cache Miss Profile
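A minimal sketch of evaluating c(t) from a profiled trace (the trace format is an assumption for illustration): each fetch contributes slope 1 while it is in progress, and the curve is flat in between while the CPU executes.

```python
def cache_access_function(fetches, t):
    """c(t): cumulative bus time spent fetching/replacing cache lines in
    [0, t], from a profiled trace of (fetch_start, fetch_length) pairs.
    Flat while the CPU executes; slope 1 while it stalls on a fetch."""
    # Each fetch [s, s+l] contributes its overlap with [0, t].
    return sum(max(0.0, min(t, s + l) - s) for s, l in fetches)

profile = [(0, 2), (5, 2), (9, 1)]
# c(1) = 1 (mid-fetch), c(6) = 3, c(20) = 5 (all fetches complete).
```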
The proposed analysis computes worst case increase (D) on task computation time due to cache delays caused by FSB interference.
Main idea: treat the FSB + CPU cache logic as a switch that multiplexes accesses to system memory. ◦ Inputs: Cache line misses over time and peripheral bandwidth.◦ Output: Curve representing the delayed cache misses.
Bus arbitration is assumed RR or FP; transactions are non-preemptive.
[Figure: input curves — cache misses over time and peripheral bandwidth — and the output curve of delayed cache misses, showing the wcet increment (D) on the CPU/PCI timelines relative to the wcet with no I/O interference]
Cache Delay Analysis
Worst case situation: PCI transaction accepted just before CPU cache miss.
Worst case interference: min(CM, PT/L') · L'
◦ CM: number of cache misses
◦ PT: total peripheral traffic during task execution
◦ Assuming RR bus arbitration
[Figure: CPU and PCI timelines — l: cache line length, L': max transaction length; dots mark cache misses]
Analysis: Intuition (1/2)
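The intuition above translates directly into code: under round-robin arbitration each cache miss waits for at most one in-flight non-preemptive transaction of length at most L', and the total delay can never exceed the total peripheral traffic. A minimal sketch (names are illustrative):

```python
def simple_interference_bound(cm, pt, l_max):
    """Worst-case I/O-induced delay under RR arbitration:
    min(CM, PT/L') * L' — at most one non-preemptive transaction
    (length <= L') per cache miss, capped by the total traffic PT."""
    return min(cm, pt / l_max) * l_max

# 10 misses but only 6 units of peripheral traffic: the traffic caps the delay.
# 2 misses with plenty of traffic: the miss count caps the delay at 2 * L'.
```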
The analysis shown is pessimistic; cache misses exhibit burst behavior.
Example: assume 1 peripheral transaction every T time units.
Real analysis: compute exact interference pattern based on burstiness of cache misses and peripheral transactions.
[Figure: CPU and PCI timelines with one peripheral transaction every T time units — some CPU memory accesses cannot be delayed, and some peripheral transactions cannot delay the CPU]
Analysis: Intuition (2/2)
Worst case situation: peripheral transaction of length L’ accepted just before CPU cache miss.
[Figure: cache access function c(t) over [0, 55], with fetch start times t1 … t5 and the wcet marked — the fetch start times in c(t) are unmodified by peripheral activity]
Worst Case Interference Scenario
Cache Bound: the max number of interfering peripheral transactions equals the number of cache misses.
Let CM be the number of cache misses. Then D ≤ CM · L'.
[Figure: CACHE, CPU, and PERIPHERAL timelines over [0, 55] — each fetch is delayed by at most one transaction, shifting t5 and the wcet by D]
Bound: Cache Misses
Peripheral Bound: the max interference D is bounded by the max bus time requested by peripherals in the interval [t1, t5 + D].
So D ≤ E(t5 − t1 + D). Let Ẽ(t) = max{x | x ≤ E(t + x)}.
Then, equivalently: D ≤ Ẽ(t5 − t1).
[Figure: CACHE, CPU, and PERIPHERAL timelines over [0, 55] — peripheral traffic inside [t1, t5 + D] delays the fetches, stretching the wcet by D]
In general, given a set of fetches {fi,…,fj} with start times {ti,…,tj}: D ≤ Ẽ(tj − ti)
Bound: Peripheral Load
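One way to evaluate a bound of the form D ≤ min(CM·L', E(span + D)), which contains D on both sides, is fixed-point iteration from below: since E is non-decreasing and the expression is capped by CM·L', the iterates grow monotonically and converge. A hedged sketch (the concrete E curve here is invented for illustration):

```python
def delay_bound(cm, l_max, E, span, eps=1e-9, max_iter=10_000):
    """Solve D = min(CM * L', E(span + D)) by fixed-point iteration.
    E must be non-decreasing; starting from D = 0 the iteration
    converges monotonically to the least fixed point."""
    d = 0.0
    for _ in range(max_iter):
        nxt = min(cm * l_max, E(span + d))
        if nxt - d <= eps:
            return nxt
        d = nxt
    raise RuntimeError("fixed point did not converge")

# Illustrative burstiness curve: burst of 4 bus-time units plus rate 1/2.
E = lambda t: min(4 + 0.5 * t, 20)
# With CM = 7, L' = 2 and a fetch interval of length 12,
# D iterates 0 -> 10 -> 14 and settles at min(CM*L', E(span+D)) = 14.
```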
There is a circular dependency between the amount of peripheral load that interferes with {fi,…,fj} and the delay D(fi, fj).
When peripheral traffic is injected on the FSB, the start time of each fetch is delayed. In turn, this increases the time interval between fi and fj and therefore more peripheral traffic can now interfere with those fetches.
Our key idea is that we do not need to modify the start times {ti,…,tj} of the fetches when we take the I/O traffic injected on the FSB into account. Instead, we account for it in the equation that defines E(t).
Some Insights about Peripheral Bound
E(36) represents both the maximum delay suffered by fetches within [0, 36] and the increase in the time interval available for interfering traffic.
Fetches in interval [0, 36]: max interference D ≤ E(36)
[Figure: CACHE and PERIPHERAL timelines with c(t) and E(t) plotted over [0, 55] — one marked peripheral transaction cannot interfere, because the cache is not in use when it occurs]
E(t5 − t1 + D) = 14, so the single-interval bound gives D = min(CM · L', E(t5 − t1 + D)) = 14.
The real worst case delay is 13! Reason: the cache is too bursty; interference from one peripheral transaction is "lost" while the cache is not used.
The Intersection is not Tight!
[Figure: CACHE and PERIPHERAL timelines with c(t) and E(t) over [0, 55], with the misses split at t3]
Solution: split into multiple intervals. How many intervals do we need to consider?
D_{1,3} ≤ E(t3 − t1 + D_{1,3}) = 7 and D_{4,5} ≤ 2L' = 6, so D_{1,3} + D_{4,5} = 13.
The Intersection is not Tight!
Iterative algorithm evaluates N(N+1)/2 intervals.
Each interval computed in O(1), overall complexity O(N2).
Bound is tight (see RTSS’07).
[Figure: triangular table of the intervals [t1,t1]; [t1,t2], [t2,t2]; [t1,t3], [t2,t3], [t3,t3]; [t1,t4], …, [t4,t4]; … — yielding the max delay u1, u2, u3, u4, … for each miss]
Delay Algorithm
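A rough Python sketch of the interval-splitting idea (the exact O(1)-per-interval recurrence and its tightness proof are in the RTSS'07 paper; this naive version just minimizes the summed per-interval bounds over all decompositions, evaluating the N(N+1)/2 intervals explicitly, and assumes — as in the slides' example — that summing per-interval bounds over any decomposition is safe):

```python
def interval_bound(cm, span, l_max, E):
    """Per-interval delay bound: least fixed point of
    D = min(cm * L', E(span + D)), with E non-decreasing."""
    d, prev = 0.0, -1.0
    while d - prev > 1e-9:
        prev, d = d, min(cm * l_max, E(span + d))
    return d

def worst_case_delay(miss_times, l_max, E):
    """Tightest delay bound obtained by splitting the misses into
    consecutive intervals and summing each interval's bound.
    best[k] covers misses 0..k-1; all N(N+1)/2 intervals are tried."""
    n = len(miss_times)
    best = [0.0] * (n + 1)
    for k in range(1, n + 1):
        # choose the best start i for the last interval covering i..k-1
        best[k] = min(
            best[i] + interval_bound(
                k - i, miss_times[k - 1] - miss_times[i], l_max, E)
            for i in range(k)
        )
    return best[n]

E = lambda t: min(2 + 0.5 * t, 20)  # illustrative curve, burst = L' = 2
```

With two far-apart misses this correctly charges each miss at most one transaction rather than stretching a single burst across both.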
Multitasking analysis using a cyclic executive (it was extended to EDF with a restricted-preemption model).
1. Analyze task Control Flow Graph.
2. Build a set of sequential superblocks.
3. Schedule is interleaving of slots composed of superblocks.
4. Algorithm: compute number of superblocks in each slot.
5. Account for additional cache misses due to inter-task cache interference.
[Figure: task Control Flow Graph reduced to sequential superblocks S1 … S6]
Multitasking analysis
The proposed analysis makes a fairly restrictive assumption: it must know the exact time of each cache miss.
I/O interference is significant: when added to the wcet of all tasks, the system can suffer a huge waste of bandwidth!
Key idea: let’s coschedule CPU & I/O Peripherals
Goal: allow as much peripheral traffic as possible at run-time while using CPU reservations that do NOT include I/O interference (D).
Great! But c(t) is hard to get... and 44% is awful
Problem: obtaining an exact cache miss pattern is very hard.
◦ CPU simulation requires simulating all peripherals.
◦ Static analysis scales poorly.
◦ In practice, testing is often the preferred way.
Our solution:
◦ Split the tasks into intervals.
◦ Insert a checkpoint at the end of each interval.
◦ Measure the wcet and worst case # of cache misses for each interval (with no peripheral traffic).
◦ Checkpoints should not break loops or branches (sequential macroblock boundaries).
[Figure: superblocks S1 … S6 with a checkpoint at the end of each, starting from the task entry point]
Cache Miss Profile is Hard to Get
A coscheduling technique for COTS peripherals
1. Divide each task into a series of sequential superblocks;
2. Run off-line profiling for each task, collecting information on the wcet and # of cache misses in each superblock (without I/O interference);
3. Compute a safe (wcet+D) bound for each superblock (it includes I/O interference) by assuming a "critical cache miss pattern";
4. Design a peripheral gate (p-gate) to enable/disable I/O peripherals;
5. Design a new peripheral (on an FPGA board), the reservation controller, which executes the coscheduling algorithm and controls all p-gates;
6. Use the profiling information at run-time to coschedule tasks and I/O transactions.
CPU & I/O coscheduling: HOW TO
Input: a set of intervals with wcet and cache misses.
Since we do not know when each cache miss happens within each interval, we need to identify a worst case pattern.
[Figure: a task divided into five intervals with per-interval (wcet_i, CM_i); bus-time plots compare candidate cache miss patterns for an interval with wcet_i and CM_i = 4]
If the Peripheral Load Curve is concave, then we obtain a tight bound for delay D (details are in a technical report).
If the Peripheral Load Curve is not concave, the bound for delay D is not tight. Simulations showed that the upper bound is within 0.2% of the real worst case delay.
This is actually the worst case pattern!
Analysis with Interval Information
[Figure: task timeline divided into intervals wcet1 … wcet5; their sum is the task's total wcet]
The on-line algorithm:
◦ Non-safety critical tasks have CPU reservation = wcet (D NOT included!)
◦ At the beginning of each job, the p-gates are closed.
◦ At run time, at each checkpoint the OS sends the APMC # of CPU cycles (exec_i) to the reservation controller.
◦ The reservation controller keeps track of the accumulated slack time. If the slack time Σ_i (wcet_i − exec_i) is greater than the delay D for the next interval, open the p-gate.
On-line Coscheduling Algorithm
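The checkpoint rule above can be sketched as follows (class and field names are illustrative; in the real system this decision runs in the reservation controller hardware, fed by APMC readings from the kernel):

```python
class ReservationController:
    def __init__(self, wcets, delays):
        self.wcets = wcets        # wcet_i per superblock (no I/O interference)
        self.delays = delays      # worst-case I/O delay bound D_i per superblock
        self.slack = 0.0
        self.p_gate_open = False  # p-gates start closed at each job release

    def checkpoint(self, i, exec_time):
        """Called when superblock i finishes with measured exec_time
        (from the APMC cycle counter). Accumulate slack, then open the
        p-gate only if the next superblock can absorb its worst-case
        I/O-induced delay without exceeding the task's reservation."""
        self.slack += self.wcets[i] - exec_time
        nxt = i + 1
        self.p_gate_open = (
            nxt < len(self.wcets) and self.slack >= self.delays[nxt]
        )
        return self.p_gate_open

# Example run: after superblock 0 the slack (2) cannot cover D_1 = 4,
# so the gate stays closed; after superblock 1 the slack (5) covers D_2 = 2.
```

Note that when the gate is open, any I/O-induced delay shows up in the next measured exec_time, so the slack account shrinks automatically.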
wcet1 wcet2 wcet3 wcet4 wcet5
Initial slack = 0 => p-gate closed
Coscheduling algorithm: an example
[Figure: timeline wcet1 … wcet5 — exec1 finishes early; Slack += wcet1 − exec1, but Slack < D2, so the p-gate stays closed]
Coscheduling algorithm: an example
[Figure: timeline wcet1 … wcet5 — after exec2, Slack += wcet2 − exec2; now Slack ≥ D3, so the p-gate opens and interval 3 is budgeted wcet3 + D3]
Coscheduling algorithm: an example
System composed of tasks/partitions with different criticalities: each task/partition uses different I/O peripherals.
The right action depends on the task/partition criticality
◦ Class A: block all non relevant peripheral traffic (Reservation=wcet+D)
◦ Class B: coschedule tasks and peripherals to maximize I/O traffic (Reservation=wcet).
◦ Class C: all I/O peripherals are enabled
[Figure: partition timeline — Class A: safety critical (e.g., flight control); Class B: mission critical (e.g., radar processing); Class C: non critical (e.g., display)]
System Integration: example for avionic domain
We designed the peripheral gate (or p-gate for short) for the PCI/PCI-X bus: it allows us to control peripheral access to the bus.
The peripheral gate is compatible with COTS devices: its use does not require any modifications to either the peripheral or the motherboard.
Reservation controller commands Peripheral Gate (p-gate).
Kernel sends scheduling information to Reservation Controller.
Minimal kernel modification (send PID and exec of executing process).
Class A task: block all non relevant peripheral traffic
Class B task: reservation controller implements coscheduling algorithm.
[Figure: architecture — CPU, RAM, and FSB logic; the Reservation Controller commands the Peripheral Gates sitting on the peripheral bus. The CPU schedule over time shows class A processes P#1, P#2, P#3 executing in turn, with each p-gate following the schedule]
Peripheral Gate
Testbed uses standard Intel platform.
Reservation controller implemented on FPGA, p-gate uses PCI extender card + discrete logic.
Logic analyzer for debugging and measurement.
[Photo: prototype — p-gate on a Gigabit Ethernet NIC, Reservation Controller on a Xilinx FPGA]
Current Prototype
Getting this information requires support from the CPU and the OS.
We used the Architectural Performance Monitor Counters of the Intel Core 2 microarchitecture, but other manufacturers (e.g., IBM) have similar support (the implementation is specific, the lesson is general).
Two APMCs configured to count cache misses and CPU cycles in user space.
Task descriptor extended with exec. time and cache miss fields.
At context switch, the APMCs are saved/restored in descriptors like any other task-specific CPU registers.
Implemented under Linux/RK.
Kernel Implementation
We compared our adaptive heuristic with other algorithms.
Assumption: at the beginning of each interval the algorithm chooses whether to open or close the switch for that interval.
1. Slack-only: baseline comparison, uses only the slack time remaining when a task has finished.
2. Predictive:
◦ Also uses measured average exec times.
◦ "Predicts" slack time in the future and optimizes open intervals at each step.
◦ Computing an optimal allocation is NP-hard; instead it uses a fast greedy heuristic.
3. Optimal:
◦ Clairvoyant (not implementable).
◦ Provides an upper bound on the performance of any run-time, predictive algorithm.
Other Coscheduling Algorithms
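To convey the flavor of the predictive approach, here is a rough, invented sketch (this greedy walk is an illustration, not the paper's actual heuristic): assume the remaining superblocks take their measured average time and plan gate openings wherever the predicted slack covers the delay bound.

```python
def predictive_plan(wcets, avg_execs, delays):
    """Greedy, illustrative sketch of a predictive schedule: walk the
    superblocks assuming average execution times; plan to open the
    p-gate for every interval whose predicted slack covers its
    worst-case I/O delay bound. At run time the safe slack check of the
    on-line algorithm still gates the actual decision."""
    slack, plan = 0.0, []
    for wcet, avg, d in zip(wcets, avg_execs, delays):
        open_gate = slack >= d
        plan.append(open_gate)
        # If open, budget the interval for its average time plus the
        # delay bound; otherwise just the average time.
        slack += wcet - (avg + d if open_gate else avg)
    return plan

# With generous slack from early intervals, later intervals get opened.
```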
All run-time algorithms implemented on Xilinx ML505 FPGA.
Optimal computed using Matlab optimization tool.
We used an MPEG decoder as a benchmark.
◦ As a trend, video processing is increasingly used in the avionic domain for mission control.
◦ It simulates a Class B application subject to heavy I/O traffic.
The task misses its deadline by up to 30% if I/O traffic is always allowed!
The run-time algorithm is already close to the optimal; not much to gain with the improved heuristic.
Results in terms of % of time the p-gate is open:
Slack-only: 4.89%, Run-time: 31.21%, Predictive: 36.65%, Optimal: 40.85%
The Test
We performed synthetic simulations to better understand the performance of the run-time algorithm.
20 superblocks per task:
◦ α is the variation between the wcet and the avg computation time.
◦ β is the % of time the task is stalled due to cache misses.
[Figure: simulation results as a function of β and α]
Simulation Results
Problem: blocking the peripheral reduces maximum throughput.
◦ OK only if critical tasks/partitions run for a limited amount of time.
Better solution: implement a hardware server with buffering on a SoC.
◦ Transactions are queued in the hw server's memory during non-relevant partitions.
◦ Interrupts/DMA transfers are delivered only during execution of the interested tasks/partitions.
◦ Similar to real-time aperiodic servers: a hw server permits aperiodic I/O requests to be analyzed as if they followed a predictable (periodic) pattern.
[Figure: FPGA-based SoC — on the Xilinx FPGA: CPU, memory bridge, DRAM, and an OPB with PCI interface and interrupt controller; connected through the PCI host bridge to the peripheral and DDRAM]
FPGA-based SoC design with Linux device drivers.
Currently in development.
Improving the P-Gate: Hardware Server (in progress)
A major issue in peripheral integration is task delay due to cache-peripheral contention at the main memory level.
We proposed a framework to: 1) analyze the delay due to cache-peripheral contention; 2) control task execution times.
The proposed co-scheduling technique was tested with PCI/PCI-X bus; hw server will be ready soon.
Future work:
Extend to multi-processor and distributed systems
Conclusions