

Page 1

http://ls12-www.cs.tu-dortmund.de/daes/

OS and Application Reliability ASTEROID, DanceOS and FEHLER Projects

NSF Variability Expedition – DFG SPP1500 Workshop

Irvine, CA, November 23rd, 2013

Peter Marwedel and Michael Engel, FEHLER Project, TU Dortmund

Page 2


Peter Marwedel – OS and Application Reliability Variability – SPP1500 Workshop

Irvine, CA, November 23rd, 2013

Trends in Hardware Reliability

Shrinking structure sizes, reduced supply voltages

⇒ more and new errors

The idea of reliable HW is (an unrealistic) fiction

[Figure: failure-rate trends. Caused by environment: neutron flux, secondary radiation (factor) vs. altitude (km above sea level), with commercial aircraft altitude marked; failure rate at flight level, Boeing E-3. Caused by technology: failure rate (/10^9 h) at sea level vs. feature size.]

Based on slide by D. Lohmann; references: see last slide

Page 3


SW-Based Reliability Techniques

What is needed in SW to reflect unreliable HW?

SW-based reliability techniques enable profitable scaling

Focus on the most beneficial error corrections

Idea: apply cross-layer knowledge on error impact

Application knowledge to determine relevant errors

[Figure: layers linked by cross-layer knowledge: Semiconductor Layout; Transistor and Gate Level; Microarchitecture; ISA effects and relation to app. source code; Discernible (visible, audible) effects]

Page 4


Additional Advantages of Software-Based FT

Error handling tailored to system requirements

Flexibility

Choice of multiple error correction methods for a given error

Methods differ in resulting output quality, timeliness, …

Decisions based on static analysis results

Enables adaptation to available resources at runtime

Adaptability

Adapt to different system load and resource utilization conditions

React to changing external conditions (temperature, energy, etc.)

Adapt to permanent changes due to device aging etc.
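The flexibility and adaptability points above can be sketched as a small selection routine: several handling methods exist for one error, and the system picks one that fits the time still available. This is only an illustration. The method names, costs, and quality scores are invented here and do not come from the projects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Method:
    name: str        # hypothetical handler name
    cost_us: int     # assumed worst-case handling time in microseconds
    quality: float   # assumed output quality after handling, 0.0 to 1.0


# Hypothetical catalogue; per the slides, such data would stem from static analysis.
METHODS = [
    Method("ignore", 0, 0.60),
    Method("reuse_previous_value", 20, 0.85),
    Method("recompute_from_checkpoint", 400, 1.00),
]


def choose_method(methods: List[Method], slack_us: int,
                  min_quality: float) -> Optional[Method]:
    """Pick the highest-quality method that fits the remaining time budget."""
    feasible = [m for m in methods
                if m.cost_us <= slack_us and m.quality >= min_quality]
    if not feasible:
        return None  # no SW method fits; the error has to be handled elsewhere
    # Prefer higher quality; among equals, prefer the cheaper method.
    return max(feasible, key=lambda m: (m.quality, -m.cost_us))
```

With plenty of slack the routine selects the most thorough method; when the system is loaded and slack shrinks, it falls back to a cheaper method with lower output quality.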

Page 5


Software-Based Reliability

Combine static & dynamic approaches

Efficient approach for embedded systems

Compile-time analyses & transformations

Control/data flow and timing analyses

Source-to-source transformations, AOP

Binary-level

Selective task replication

Run-time adaptation (OS functionality)

Adaptive methods for error correction

Use of application semantics for OS-level resource management

[Figure: HW/SW stack with layers Hardware, µkernel, Virtualization, Operating System, Application Binaries (0111011), Application Source Code (int foo();); annotations: Resource abstraction, Res. management, Error handling, Analyses, Replication, Analyses, Transformations]

Page 6


SPP1500 Software Projects

ASTEROID – An Analyzable, Resilient, Embedded Real-Time Operating System Design

Task replication

Error detection using HW-assisted fingerprinting

DanceOS: Dependability Aspects in Configurable Embedded Operating Systems

Novel dependability measures with SW techniques

Build dependable operating systems

FEHLER – Flexible Error Handling for Embedded Real-Time Systems

Application knowledge & static analyses ⇒ determine error impact, incl. timing

Base run-time error handling on analysis results
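The FEHLER idea of basing run-time error handling on analysis results can be sketched as a table lookup. This is a minimal sketch under assumptions: the address ranges, the class names "critical" and "tolerable", and the three possible decisions are hypothetical and only stand in for whatever the static analysis actually produces.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical result of a compile-time analysis: (start, end, error_class)
# per address range, sorted by start, non-overlapping, end exclusive.
CLASSIFICATION: List[Tuple[int, int, str]] = [
    (0x1000, 0x1100, "critical"),   # e.g. control state: an error may crash the task
    (0x2000, 0x6000, "tolerable"),  # e.g. pixel buffers: an error only degrades output
]


def classify(address: int) -> str:
    """Return the error class of the range containing the address."""
    starts = [start for start, _, _ in CLASSIFICATION]
    i = bisect_right(starts, address) - 1
    if i >= 0:
        start, end, error_class = CLASSIFICATION[i]
        if start <= address < end:
            return error_class
    return "unknown"  # unclassified data is treated like critical data below


def handle_error(address: int, slack_us: int, correction_cost_us: int) -> str:
    """Decide at run time whether and when to correct a detected error."""
    if classify(address) == "tolerable":
        return "ignore"          # the application can live with this error
    if correction_cost_us <= slack_us:
        return "correct_now"     # correction fits before the deadline
    return "correct_later"       # assumed policy: defer rather than miss the deadline
```

The timing input (`slack_us`) reflects the point that the analyses determine error impact including timing: the same error may be corrected immediately in one situation and deferred in another.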


Page 7


SPP1500 Software Projects

Software projects cover different layers in HW/SW stack

ASTEROID

Microkernel

Application binaries

DanceOS

Source code: analyses and transformations

Operating system

FEHLER

Source code: analyses and transformations

Virtualization for dependability

On microkernel and OS layer

[Figure: HW/SW stack with layers Hardware, µkernel, Virtualization, Operating System, Application Binaries (0111011), Application Source Code (int foo();)]

Page 8


ASTEROID

ROMAIN: Redundant execution of tasks as an OS service

Replication on binary executable level

Based on FIASCO L4-family microkernel

Error detection using hardware-assisted fingerprinting

Fingerprint unit in CPU pipeline hashes retired instructions

Unique for given instruction/data sequence: basis for DMR voting
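The fingerprinting idea can be mimicked in software to show what the comparison looks like. The sketch below is not the hardware unit described on the slide: it assumes a record layout of (program counter, opcode, result) per retired instruction and uses CRC-32 as a stand-in hash. With two replicas, a mismatch only shows that the replicas diverged, not which one is faulty, and any finite hash can in principle collide.

```python
import zlib
from typing import Iterable, Tuple

# Assumed record per retired instruction: (program counter, opcode, result value).
# All fields must be non-negative and fit into 64 bits.
Retired = Tuple[int, int, int]


def fingerprint(trace: Iterable[Retired]) -> int:
    """Fold a sequence of retired instructions into one 32-bit CRC."""
    crc = 0
    for pc, opcode, result in trace:
        record = b"".join(v.to_bytes(8, "little") for v in (pc, opcode, result))
        crc = zlib.crc32(record, crc)  # continue the running CRC
    return crc


def replicas_agree(trace_a: Iterable[Retired], trace_b: Iterable[Retired]) -> bool:
    """DMR check: compare one fingerprint per replica instead of full state."""
    return fingerprint(trace_a) == fingerprint(trace_b)
```

Comparing a single word per replica keeps the check cheap, which is the reason for hashing the instruction stream rather than comparing complete register and memory state.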

[Figure label: Virtual address space]

Axer, Döbel, Härtig: Designing an Analyzable and Resilient Embedded Operating System, SOBRES 2012

Axer et al.: Response-time analysis of parallel fork-join workloads with real-time constraints, ECRTS 2013

Döbel, Härtig, Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012

Page 9


DanceOS

Static and dynamic analysis and evaluation of errors

Development of novel dependability measures

SW techniques implementing dependability measures

Especially aspect-oriented approaches for OS dependability

Borchert et al.: Generative SW-based memory error detection and correction for operating system data structures, DSN’13

Hoffmann, Dietrich, Lohmann: dOSEK: A dependable RTOS for automotive applications, PRDC 2013

Stilkerich et al.: A JVM for Soft-Error-Prone Embedded Systems, LCTES 2013

Page 10


FEHLER

Schmoll et al.: Improving the Fault Resilience of an H.264 Decoder using Static Analysis Methods. ACM TECS, 2013

Heinig et al.: Classification-based Improvement of App. Robustness and QoS in Probabilistic Computer Systems, ARCS’12

Heinig et al.: Using Application Knowledge to Improve Embedded Systems Dependability, HotDep 2011

Page 11


Synergetic Effects

Comparison of different metrics

New system models for dependability

High-level models for impact on OS and application SW

Fault injection techniques

Reliable computing base

Source code analysis and transformation

Appropriate OS concepts

Impact of application area

Implementation/Evaluation in different environments

Different microkernel, OS environments, application scenarios, hardware platforms

Page 12


Common Concepts, Methods, Tools

PVF Assessment

Program Vulnerability Factor

Approximation of error effects instead of complex fault injection

RCB: Reliable Computing Base

Required reliable components for SW FT

FAIL*

Framework for performing fault injection campaigns

Provide realistic distribution of errors for a given platform

JTAG Error Injection

Predictable, application-aware error injection into real hardware
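To make the notion of a fault injection campaign concrete, here is a toy version. It does not use FAIL* or JTAG and does not reflect their interfaces: the workload, the memory image, and the outcome names are all assumptions for illustration. Every bit of a small memory image is flipped once, the workload is rerun, and the outcome is compared with the fault-free ("golden") run.

```python
from collections import Counter
from typing import Callable


def checksum_of_payload(memory: bytes) -> int:
    """Toy workload: byte 0 holds a length n; the result is the sum of the next n bytes."""
    n = memory[0]
    if n > len(memory) - 1:
        raise ValueError("length field out of range")  # stands in for a trap
    return sum(memory[1:1 + n])


def run_campaign(workload: Callable[[bytes], int], golden_memory: bytes) -> Counter:
    """Flip every bit of the memory image once and classify each outcome."""
    golden_result = workload(golden_memory)
    outcomes: Counter = Counter()
    for bit in range(len(golden_memory) * 8):
        faulty = bytearray(golden_memory)
        faulty[bit // 8] ^= 1 << (bit % 8)  # inject a single bit flip
        try:
            result = workload(bytes(faulty))
        except ValueError:
            outcomes["detected"] += 1
            continue
        if result == golden_result:
            outcomes["benign"] += 1
        else:
            outcomes["silent_data_corruption"] += 1
    return outcomes
```

Even this tiny example shows why such campaigns matter: flips in unused bytes are benign, flips in the payload silently corrupt the result, and most flips in the length field are caught by the range check.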

Döbel, Schirmeier, Engel: Investigating the Limitations of PVF for Realistic Program Vulnerability Assessment, DFR 2013

Heinig, Korb, Schmoll, Marwedel, Engel: Fast and Low-Cost Instruction-Aware Fault Injection, SOBRES 2013

Schirmeier et al.: FAIL*: Towards a versatile fault-injection experiment framework, ARCS 2012

Engel, Döbel: The Reliable Computing Base – A Paradigm for Software-based Reliability, SOBRES 2012

Page 13


Tradeoff of SW-based FT

Decide which error handling method to apply

HW- vs. SW-based error handling

Design decisions for efficient fault-tolerant systems

Which errors have to be corrected at or below a certain layer?

Especially: which errors prevent application of SW-based methods for error correction?

What are the tradeoffs involved for different errors?

Provide cost models for error detection and correction, e.g., hardware cost and runtime, memory, energy overhead

What is the best location in the HW/SW stack to correct errors?

Page 14


The Reliable Computing Base

Definition of the RCB

Subset of HW and SW components which have to be reliable for error correction to work

What are the components of the RCB?

Which HW and SW components have to be reliable?

Develop methods to determine these components

What are the dependencies between these components?

Can we provide a constructive approach to determine the RCB?
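The slide leaves the constructive approach as an open question. One possible reading, shown here purely as an assumption and not as the authors' method, is to model the dependencies between components as a graph and take everything reachable from the error-handling mechanisms as the RCB. All component names below are hypothetical.

```python
from typing import Dict, Set

# Hypothetical "relies on" graph: component -> components it needs to work correctly.
DEPENDS_ON: Dict[str, Set[str]] = {
    "replica_voter": {"microkernel_ipc", "fingerprint_unit"},
    "checkpoint_restore": {"memory_manager"},
    "microkernel_ipc": {"scheduler"},
    "memory_manager": {"mmu"},
    "video_decoder": {"memory_manager"},  # application code: protected, not trusted
}


def reliable_computing_base(error_handlers: Set[str],
                            depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Collect every component reachable from the error-handling mechanisms."""
    rcb: Set[str] = set()
    worklist = list(error_handlers)
    while worklist:
        component = worklist.pop()
        if component in rcb:
            continue  # already visited; also guards against dependency cycles
        rcb.add(component)
        worklist.extend(depends_on.get(component, ()))
    return rcb
```

In this reading, the application (`video_decoder`) stays outside the RCB because no error-handling mechanism depends on it, while hardware units such as the MMU end up inside.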

Page 15


Exemplary results: ASTEROID

Overhead of redundant multithreading using redundant execution in Romain < 30%, often < 5%

Axer, Döbel, Härtig: Designing an Analyzable and Resilient Embedded Operating System, SOBRES 2012

Axer et al.: Response-time analysis of parallel fork-join workloads with real-time constraints, ECRTS 2013

Döbel, Härtig, Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012

Page 16


Exemplary results: DanceOS

Generic Object Protection for embedded OS

Protect only critical objects (based on hot-spot analysis) while they are inactive

99.9% of errors in OS detected or corrected, at only 0.1% average runtime overhead
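Protecting an object "while it is inactive" can be pictured as a checksum that is verified when the object is entered and refreshed when it is left. The sketch below only mimics that idea with a Python context manager and CRC-32; the slides mention aspect-oriented approaches for the real systems, and this class, its name, and its detection-only behaviour are assumptions.

```python
import zlib


class ProtectedObject:
    """Keeps a CRC-32 over an object's state while no code is using it."""

    def __init__(self, state: bytes) -> None:
        self._state = bytearray(state)
        self._crc = zlib.crc32(self._state)

    def __enter__(self) -> bytearray:
        # Object becomes active: verify that it was not corrupted while idle.
        if zlib.crc32(self._state) != self._crc:
            raise RuntimeError("memory error detected in inactive object")
        return self._state

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Object becomes inactive again: refresh the checksum over the new state.
        self._crc = zlib.crc32(self._state)
        return False  # do not suppress exceptions from the with-body
```

Legitimate updates happen inside a `with obj as state:` block and refresh the checksum on exit; a bit flip that hits the state between two such blocks is reported on the next entry. The cost is paid only at entry and exit, which fits the low overhead reported on the slide.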

Borchert et al.: Generative SW-based memory error detection and correction for operating system data structures, DSN’13

Hoffmann, Dietrich, Lohmann: dOSEK: A dependable RTOS for automotive applications, PRDC 2013

Stilkerich et al.: A JVM for Soft-Error-Prone Embedded Systems, LCTES 2013

Page 17


Exemplary results: FEHLER

Flexible error handling allows significant jitter reduction even at high fault injection rates

Schmoll et al.: Improving the Fault Resilience of an H.264 Decoder using Static Analysis Methods. ACM TECS, 2013

Heinig et al.: Classification-based Improvement of App. Robustness and QoS in Probabilistic Computer Systems, ARCS’12

Heinig et al.: Using Application Knowledge to Improve Embedded Systems Dependability, HotDep 2011

[Figure: plot; surviving labels: 10^-5, 10^-6]

Page 18


Cooperation/Complementarity

Metrics & analysis techniques, linking layers

Fault injection techniques

Which SW techniques can cope with unreliable HW?

Source code, exception annotations, representation in files

Which mechanisms at run time, which OS concepts?

No focus on variability effects in SPP1500

Current focus (mostly) on transient errors

Handle permanent errors, aging, degradation

Platforms

HW reconfiguration vs. software adaptation

HW/SW-Codesign approach feasible?

Page 19


Conclusions

Future SW has to incorporate mechanisms for handling possible HW errors

SW mechanisms need to be found

Cooperation between static and dynamic techniques

Application knowledge helps to reduce error handling overhead

OS/system software plays an important role

Infrastructure is needed

Evaluation for metrics, fault injection, simulation, cross-layer …

Some errors are not correctable in SW

RCB: Set of system components that have to be reliable

Codesigned HW/SW error handling approaches?

Page 20


References for slide 2

Premkishore Shivakumar, Michael Kistler, Stephen W. Keckler, Doug Burger, and Lorenzo Alvisi. “Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic”. In: Proceedings of the 32nd International Conference on Dependable Systems and Networks (DSN ’02). Washington, DC, USA: IEEE Computer Society Press, June 2002, pp. 389–398. DOI: 10.1109/DSN.2002.1028924.

A. Taber and E. Normand. “Single event upset in avionics”. In: IEEE Transactions on Nuclear Science 40.2 (1993), pp. 120–126. ISSN: 0018-9499. DOI: 10.1109/23.212327.

James Ziegler and Helmut Puchner. SER – History, Trends and Challenges: A guide for designing with Memory ICs. Cypress Semiconductor Corporation, 2004.