http://ls12-www.cs.tu-dortmund.de/daes/
OS and Application Reliability ASTEROID, DanceOS and FEHLER Projects
NSF Variability Expedition – DFG SPP1500 Workshop
Irvine, CA, November 23rd, 2013
Peter Marwedel and Michael Engel, FEHLER Project, TU Dortmund
2
Peter Marwedel – OS and Application Reliability Variability – SPP1500 Workshop
Irvine, CA, November 23rd, 2013
Trends in Hardware Reliability
Shrinking structure sizes, reduced supply voltages
⇒ more and new errors
The idea of reliable HW is (an unrealistic) fiction
[Figure: two plots. Failure rate (/10^9 h) at sea level by feature size (caused by technology); neutron flux and secondary radiation (factor) by altitude (km above sea level), with commercial aircraft and the Boeing E-3 flight level marked (caused by environment).]
Based on slide by D. Lohmann; references: see last slide
SW-Based Reliability Techniques
What is needed in SW to reflect unreliable HW?
SW-based reliability techniques enable profitable scaling
Focus on the most beneficial error corrections
Idea: apply cross-layer knowledge on error impact
Application knowledge to determine relevant errors
[Figure: abstraction layers, from semiconductor layout, transistor and gate level, and microarchitecture, through ISA effects and their relation to application source code, to discernible (visible, audible) effects]
Additional Advantages of Software-Based FT
Error handling tailored to system requirements
Flexibility
Choice of multiple error correction methods for a given error
Methods differ in resulting output quality, timeliness, …
Decisions based on static analysis results
Enables adaptation to available resources at runtime
Adaptability
Adapt to different system load and resource utilization conditions
React to changing external conditions (temperature, energy, etc.)
Adapt to permanent changes due to device aging etc.
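The flexibility and adaptability points above can be illustrated with a minimal sketch: several handling methods exist for one detected error, and the run-time system picks the one with the best output quality that still fits the time left before the deadline. The method names, timings, and quality scores are hypothetical values chosen for illustration; they are not taken from the projects presented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    runtime_us: int  # assumed worst-case handling time in microseconds
    quality: float   # assumed output quality after handling, 0..1


# Hypothetical candidate methods for one detected error.
METHODS = [
    Method("ignore", 0, 0.60),
    Method("conceal", 40, 0.85),
    Method("recompute", 400, 1.00),
]


def choose_method(slack_us, min_quality=0.0):
    """Return the best-quality method that fits the remaining slack, or None."""
    feasible = [
        m for m in METHODS
        if m.runtime_us <= slack_us and m.quality >= min_quality
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda m: m.quality)
```

With 500 µs of slack the sketch selects "recompute"; with 100 µs it falls back to "conceal". In a real system the table would come from static analysis results rather than constants.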
Software-Based Reliability
Combine static & dynamic approaches
Efficient approach for embedded systems
Compile-time analyses & transformations
Control/data flow and timing analyses
Source-to-source transformations, AOP
Binary-level
Selective task replication
Run-time adaptation (OS functionality)
Adaptive methods for error correction
Use of application semantics for OS-level resource management
[Figure: system stack of hardware, µkernel with virtualization, operating system (resource management, error handling, resource abstraction), application binaries (analyses, replication), and application source code (analyses, transformations)]
SPP1500 Software Projects
ASTEROID – An Analyzable, Resilient, Embedded Real-Time Operating System Design
Task replication
Error detection using HW-assisted fingerprinting
DanceOS: Dependability Aspects in Configurable Embedded Operating Systems
Novel dependability measures with SW techniques
Build dependable operating systems
FEHLER – Flexible Error Handling for Embedded Real-Time Systems
Application knowledge & static analyses
⇒ determine error impact, incl. timing
Base run-time error handling on analysis results
SPP1500 Software Projects
Software projects cover different layers in HW/SW stack
ASTEROID
Microkernel
Application binaries
DanceOS
Source code: analyses and transformations
Operating system
FEHLER
Source code: analyses and transformations
Virtualization for dependability
On microkernel and OS layer
[Figure: system stack of hardware, µkernel with virtualization, operating system, application binaries, and application source code]
ASTEROID ROMAIN: Redundant execution of tasks as an OS service
Replication on binary executable level
Based on FIASCO L4-family microkernel
Error detection using hardware-assisted fingerprinting
Fingerprint unit in CPU pipeline hashes retired instructions
Unique for given instruction/data sequence: basis for DMR voting
[Figure: virtual address space]
Axer, Döbel, Härtig: Designing an Analyzable and Resilient Embedded Operating System, SOBRES 2012
Axer et al.: Response-time analysis of parallel fork-join workloads with real-time constraints, ECRTS 2013
Döbel, Härtig, Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012
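The fingerprint comparison behind DMR on this slide can be sketched in software. This is only an illustration under stated assumptions: the slide describes a hardware unit in the CPU pipeline, while the sketch below folds a hypothetical trace of (opcode, operand) pairs into a CRC-32 value. A CRC is not strictly unique per sequence, so equal fingerprints indicate agreement only with high probability.

```python
import zlib


def fingerprint(retired):
    """Fold a sequence of (opcode, operand) pairs into a 32-bit CRC."""
    fp = 0
    for opcode, operand in retired:
        word = f"{opcode}:{operand}".encode()
        fp = zlib.crc32(word, fp)  # chain the running CRC value
    return fp


def dmr_check(replica_a, replica_b):
    """Return True if both replicas produced the same fingerprint."""
    return fingerprint(replica_a) == fingerprint(replica_b)


# Hypothetical instruction traces of two replicas of the same task.
trace = [("add", 1), ("ld", 0x1000), ("st", 0x2000)]
faulty = [("add", 1), ("ld", 0x1001), ("st", 0x2000)]  # one flipped address bit
```

With two replicas a mismatch can only be detected, not attributed to one replica; correcting it needs a third replica or re-execution.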
DanceOS
Static and dynamic analysis and evaluation of errors
Development of novel dependability measures
SW techniques implementing dependability measures
Especially aspect-oriented approaches for OS dependability
Borchert et al.: Generative SW-based memory error detection and correction for operating system data structures, DSN’13
Hoffmann, Dietrich, Lohmann: dOSEK: A dependable RTOS for automotive applications, PRDC 2013
Stilkerich et al.: A JVM for Soft-Error-Prone Embedded Systems, LCTES 2013
FEHLER
Schmoll et al.: Improving the Fault Resilience of an H.264 Decoder using Static Analysis Methods. ACM TECS, 2013
Heinig et al.: Classification-based Improvement of App. Robustness and QoS in Probabilistic Computer Systems, ARCS’12
Heinig et al.: Using Application Knowledge to Improve Embedded Systems Dependability, HotDep 2011
Synergetic Effects
Comparison of different metrics
New system models for dependability
High-level models for impact on OS and application SW
Fault injection techniques
Reliable computing base
Source code analysis and transformation
Appropriate OS concepts
Impact of application area
Implementation/Evaluation in different environments
Different microkernel, OS environments, application scenarios, hardware platforms
Common Concepts, Methods, Tools
PVF Assessment
Program Vulnerability Factor
Approximation of error effects instead of complex fault injection
RCB: Reliable Computing Base
Required reliable components for SW FT
FAIL*
Framework for performing fault injection campaigns
Provide realistic distribution of errors for a given platform
JTAG Error Injection
Predictable, application-aware error injection into real hardware
Döbel, Schirmeier, Engel: Investigating the Limitations of PVF for Realistic Program Vulnerability Assessment, DFR 2013
Heinig, Korb, Schmoll, Marwedel, Engel: Fast and Low-Cost Instruction-Aware Fault Injection, SOBRES 2013
Schirmeier et al.: FAIL*: Towards a versatile fault-injection experiment framework, ARCS 2012
Engel, Döbel: The Reliable Computing Base – A Paradigm for Software-based Reliability, SOBRES 2012
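A fault injection campaign of the kind listed above can be sketched as an exhaustive single-bit-flip experiment. The sketch does not use the FAIL* API or JTAG; the program under test (`count_bright`), the 8-bit data width, and the outcome labels are assumptions made for illustration.

```python
def flip_bit(value, bit, width=8):
    """Return value with one bit inverted, truncated to the given width."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def count_bright(pixels):
    """Hypothetical program under test: count pixels at or above 128."""
    return sum(1 for p in pixels if p >= 128)


def campaign(pixels, width=8):
    """Inject every single-bit flip into the input and classify the outcome."""
    golden = count_bright(pixels)  # fault-free reference result
    outcomes = {"benign": 0, "sdc": 0}
    for i in range(len(pixels)):
        for bit in range(width):
            faulty = list(pixels)
            faulty[i] = flip_bit(faulty[i], bit, width)
            changed = count_bright(faulty) != golden
            outcomes["sdc" if changed else "benign"] += 1
    return outcomes
```

For the input [0, 200, 127, 128] only flips of the most significant bit change the result, so 4 of 32 injections cause silent data corruption and 28 are benign. This is the kind of application knowledge that lets error handling focus on the relevant bits.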
Tradeoff of SW-based FT
Decide which error handling method to apply
HW- vs. SW-based error handling
Design decisions for efficient fault-tolerant systems
Which errors have to be corrected at or below a certain layer?
Especially: which errors prevent application of SW-based methods for error correction?
What are the tradeoffs involved for different errors?
Provide cost models for error detection and correction, e.g., hardware cost and runtime, memory, energy overhead
What is the best location in the HW/SW stack to correct errors?
The Reliable Computing Base
Definition of the RCB
Subset of HW and SW components which have to be reliable for error correction to work
What are the components of the RCB?
Which HW and SW components have to be reliable?
Develop methods to determine these components
What are the dependencies between these components?
Can we provide a constructive approach to determine the RCB?
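One possible reading of a constructive approach, shown only as a sketch: if the dependencies between components are known, the RCB is everything the error correction mechanism transitively depends on. The component names and the dependency graph below are hypothetical, and the slide leaves open whether such a graph can be obtained in practice.

```python
def reliable_computing_base(depends_on, roots):
    """Collect all components reachable from the error-correction roots."""
    rcb = set()
    stack = list(roots)
    while stack:
        component = stack.pop()
        if component in rcb:
            continue  # already visited
        rcb.add(component)
        stack.extend(depends_on.get(component, ()))
    return rcb


# Hypothetical dependency graph: component -> components it relies on.
DEPENDS_ON = {
    "checkpoint_restore": ["os_kernel", "stable_memory"],
    "os_kernel": ["cpu_core", "stable_memory"],
    "video_decoder": ["os_kernel"],
}

rcb = reliable_computing_base(DEPENDS_ON, ["checkpoint_restore"])
```

Here the video decoder stays outside the RCB: nothing the correction mechanism needs depends on it, so errors in it remain candidates for software-based handling.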
Exemplary results: ASTEROID
Overhead of redundant multithreading using redundant execution in ROMAIN: < 30%, often < 5%
Axer, Döbel, Härtig: Designing an Analyzable and Resilient Embedded Operating System, SOBRES 2012
Axer et al.: Response-time analysis of parallel fork-join workloads with real-time constraints, ECRTS 2013
Döbel, Härtig, Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012
Exemplary results: DanceOS
Generic Object Protection for embedded OS
Protect only critical objects (based on hot-spot analysis) while they are inactive
99.9% of errors in OS detected or corrected, with only 0.1% average runtime overhead
Borchert et al.: Generative SW-based memory error detection and correction for operating system data structures, DSN’13
Hoffmann, Dietrich, Lohmann: dOSEK: A dependable RTOS for automotive applications, PRDC 2013
Stilkerich et al.: A JVM for Soft-Error-Prone Embedded Systems, LCTES 2013
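The idea of protecting an object only while it is inactive can be sketched with a checksum that is verified when the object is entered and refreshed when it is left. This is an illustration, not the DanceOS implementation: the class name, the use of CRC-32, and the context-manager interface are assumptions, and the sketch only detects errors rather than correcting them.

```python
import zlib


class ProtectedObject:
    """Guard a dict of fields with a checksum while nobody is using it."""

    def __init__(self, fields):
        self._fields = dict(fields)
        self._checksum = self._digest()

    def _digest(self):
        # Deterministic serialization of the fields, then CRC-32.
        return zlib.crc32(repr(sorted(self._fields.items())).encode())

    def __enter__(self):
        # Object becomes active: verify it was not corrupted while inactive.
        if self._digest() != self._checksum:
            raise RuntimeError("memory error detected in inactive object")
        return self._fields

    def __exit__(self, exc_type, exc, tb):
        # Object becomes inactive again: refresh the checksum.
        self._checksum = self._digest()
        return False


task = ProtectedObject({"priority": 3, "state": 1})
with task as fields:
    fields["priority"] = 4  # legitimate update while active
```

Legitimate updates inside the `with` block are covered by the refreshed checksum, while a bit flip that hits the fields between two uses raises an error on the next entry.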
Exemplary results: FEHLER
Flexible error handling allows significant jitter reduction even at high fault injection rates
Schmoll et al.: Improving the Fault Resilience of an H.264 Decoder using Static Analysis Methods. ACM TECS, 2013
Heinig et al.: Classification-based Improvement of App. Robustness and QoS in Probabilistic Computer Systems, ARCS’12
Heinig et al.: Using Application Knowledge to Improve Embedded Systems Dependability, HotDep 2011
Cooperation/Complementarity
Metrics & analysis techniques, linking layers
Fault injection techniques
Which SW techniques cope with unreliable HW?
Source code, exception annotations, representation in files
Which mechanisms at run time, which OS concepts?
No focus on variability effects in SPP1500
Current focus (mostly) on transient errors
Handle permanent errors, aging, degradation
Platforms
HW reconfiguration vs. software adaptation
HW/SW-Codesign approach feasible?
Conclusions
Future SW has to incorporate mechanisms for handling possible HW errors
SW mechanisms need to be found
Cooperation between static and dynamic techniques
Application knowledge helps to reduce error handling overhead
OS and system software play an important role
Infrastructure is needed
Evaluation for metrics, fault injection, simulation, cross-layer …
Some errors are not correctable in SW
RCB: Set of system components that have to be reliable
Codesigned HW/SW error handling approaches?
References for slide 2 (Trends in Hardware Reliability)
Premkishore Shivakumar, Michael Kistler, Stephen W. Keckler, Doug Burger, and Lorenzo Alvisi. “Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic”. In: Proceedings of the 32nd International Conference on Dependable Systems and Networks (DSN ’02). Washington, DC, USA: IEEE Computer Society Press, June 2002, pp. 389–398. DOI: 10.1109/DSN.2002.1028924.
A. Taber and E. Normand. “Single event upset in avionics”. In: IEEE Transactions on Nuclear Science 40.2 (1993), pp. 120–126. ISSN: 0018-9499. DOI: 10.1109/23.212327.
James Ziegler and Helmut Puchner. SER – History, Trends and Challenges: A guide for designing with Memory ICs. Cypress Semiconductor Corporation, 2004.