1

Software Fault Tolerance (SWFT)

SWFT for Wireless Sensor Networks (Lec 1)

Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de

Prof. Neeraj Suri

Abdelmajid Khelil

Dept. of Computer Science

TU Darmstadt, Germany

2

Typical Wireless Sensor Networks (WSN)

Internet, GSM, Satellite, etc.

Sink

User mote

• Large number of cheap sensing devices

• Self-organized

• Multi-hop communication

3

Application Spectrum: Measurement vs. Event

Hazard Detection

Biological Monitoring

Medical Domain

Smart Environment

Wearable Computing

Transportation

Earth Science & Exploration

Context-Aware Computing

Interactive VR Game

Wireless Sensor Networks

Urban Warfare

Military Surveillance

Disaster Recovery

Environmental Monitoring

Src: University of Virginia

4

Typical Network Functionalities

Data Collection (SNs → Sink)
• Convergecast
• Aggregation
• Routing
• ..

Data Dissemination (Sink → SNs)
• Code update (reprogramming)
• Query
• Diagnosis
• ..

Sink

Event

Sink

5

Challenges for Fault-Tolerance (FT)

Cheap components
• Scarcity of resources: processing/storage/communication
• Finite energy supply

In situ in physical environment
• Test environment different from the deployment environment
• Evolvable deployment conditions
• Difficulty of testing after deployment

Failure-prone components

FT is as critical as other performance metrics (energy efficiency, latency & accuracy) in supporting distributed apps.

Earlier work from wired networks does not directly apply to WSNs

6

Outline of Today’s Lecture

• Fault Model
• Fault Tolerance in WSN
• Fault Tolerance in Collaborative Sensor Networks for Target Detection
• Distributed Bayesian Algorithms for Fault-Tolerant Event Region Detection in Wireless Sensor Networks
• Papers to Discuss

© DEEDS Group, SWFT WS ’07

7

Fault, Error and Failure in WSN

A sensor service running on node A is expected to periodically send the measurements of its sensors to an aggregation service running on node B.

However, node A suffers an impact causing a loose connection with one of its sensors (Fault).

Since the code implementing node A’s service is not designed to detect and overcome such situations, an erroneous state is reached (Error) when the sensor service tries to acquire data from the sensor.

Due to this state, the service does not send sensor data (Failure) to the aggregation service within the specified time interval. This results in a crash or omission failure of node A observed by node B.
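The chain from fault to error to failure in this scenario can be traced in a few lines. The following is only a minimal sketch: the names `acquire`, `service_on_a` and `b_observes_failure`, the 10-second period and the placeholder reading are assumptions made for illustration and do not come from the lecture.

```python
from typing import Optional

PERIOD_S = 10.0  # assumed reporting interval that node B expects


class SensorReadError(Exception):
    """Erroneous state reached when the sensor cannot be read."""


def acquire(connected: bool) -> float:
    # Fault: a loose connection between node A and one of its sensors.
    if not connected:
        raise SensorReadError("no response from sensor")
    return 21.5  # placeholder measurement


def service_on_a(connected: bool) -> Optional[float]:
    """Return the value sent to node B, or None if nothing was sent."""
    try:
        return acquire(connected)
    except SensorReadError:
        # Error: the service has no way to overcome this state,
        # so the reporting period ends without a message.
        return None


def b_observes_failure(last_rx_s: float, now_s: float) -> bool:
    # Failure: node B received no data within the specified interval.
    return (now_s - last_rx_s) > PERIOD_S
```

With `connected=False` the service returns `None` in every period, so node B's check eventually reports a failure once more than `PERIOD_S` seconds have passed since the last reception.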

8

Sources of Faults

Fragile sensor nodes
• Node failures
  • Depletion of batteries
  • Harsh environment: damage, short circuits, enclosure etc.
• Incorrect sensor reading or processing
  • Due to low battery level, etc.
• SW bugs (e.g. in routing layer)

Erroneous wireless communication
• Reading transmission failures
  • Temporary/permanent link failures
  • Packet corruption/loss
  • Network partitioning

Sink faults

Security breaches

Byzantine faults: overlap between faults and security breaches, describing arbitrary behaviour

9

Failure Classification

Omission: A service sporadically does not respond to requests (e.g. due to message loss)

Crash: The service at some point stops responding to any request (e.g. after f omissions)

Timing: The service's response is received outside the time interval specified by the application (too early or too late).

Value: A service sends a timely response, but with a lack of accuracy.

Arbitrary: Includes all other failures, e.g. Byzantine failures: failures that are in general caused by a malicious service that behaves erroneously AND inconsistently
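The first four classes can be told apart by an observer with a few simple checks. The sketch below is hypothetical: the function `classify`, its parameters and the idea that the observer knows the absolute error of a response (`abs_error`) are assumptions made for illustration. Arbitrary failures are deliberately not covered, since by definition they follow no fixed pattern.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    NONE = "none"
    OMISSION = "omission"
    CRASH = "crash"
    TIMING = "timing"
    VALUE = "value"


def classify(consecutive_misses: int,
             latency_s: Optional[float],
             abs_error: Optional[float],
             earliest_s: float,
             latest_s: float,
             tolerance: float,
             f: int) -> Failure:
    """Classify the latest observation of a service.

    latency_s is None when the latest request got no response.
    abs_error is |reported - reference|, if a reference value is known.
    """
    if consecutive_misses >= f:
        return Failure.CRASH        # no response for f requests in a row
    if latency_s is None:
        return Failure.OMISSION     # sporadic missing response
    if not (earliest_s <= latency_s <= latest_s):
        return Failure.TIMING       # too early or too late
    if abs_error is not None and abs_error > tolerance:
        return Failure.VALUE        # timely but inaccurate
    return Failure.NONE
```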

10

Node Architecture

[Figure: layered node architecture, with Power Management alongside the layers]

Software
• Application: App1, App2, App3
• Middleware: Middleware management, Algorithms, Modules, Services, Virtual Machine
• Operating System: System, Threads, Address space, Files, Hardware Drivers, Physical, Data Link, Network, Transport, Sensor Driver

Hardware
• Hardware Sensor

Src: Abdul-Halim Jallad and Tanya Vladimirova

11

Fault Classification and Propagation in WSN

Sink

Event

Internet, GSM, Satellite, etc.

User

12

Fault Tolerance (FT)

Fault tolerance: the ability to sustain sensor network functionalities without any interruption due to failures

FT techniques taxonomy
• Fault prevention
• Fault detection
• Fault isolation
• Fault identification
• Fault recovery

FT can be addressed at different layers (HW, SW, network and application)

There is always a trade-off between FT and efficiency

13

FT Protocol Stack

Fault avoidance

Fault detection

Fault recovery

Fault recovery

14

Fault Prevention

Sensor node deployment/placement
• Connectivity
• Coverage
• Relay node placement in two-tiered networks
• Power on-off
• Transmission range adjustment

Sensor network monitoring ("watch the watch-dog!")
• Active
• Passive

15

Fault Detection (Isolation and Identification)

Self-diagnosis
• Node itself can identify faults. Examples:
  • Monitoring the physical environment using accelerometers
  • Measuring the current battery voltage
  • Measuring comm. link quality

Cooperative diagnosis (group detection)
• Several nodes monitor the behavior of another node. Examples:
  • "Sensors from same region should have similar values" (detailed in SWOSFT)
  • Consumer nodes observe the behaviour of the service provider (e.g. next-hop relay nodes).

Hierarchical detection
• Uses a tree
• In general detection is shifted to more powerful nodes (e.g. sink)

Frameworks: Memento, Sympathy and SNIF
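The rule that sensors from the same region should have similar values can be turned into a simple group check. This is one possible sketch, not the scheme of any of the named frameworks: comparing each reading with the median of the other nodes, and the names `suspect_nodes` and `max_dev`, are assumptions made for illustration.

```python
from statistics import median
from typing import Dict, Set


def suspect_nodes(readings: Dict[str, float], max_dev: float) -> Set[str]:
    """Flag nodes whose reading deviates from the median of the other nodes."""
    suspects = set()
    for node, value in readings.items():
        # Exclude the node under test so it cannot mask its own fault.
        others = [v for n, v in readings.items() if n != node]
        if others and abs(value - median(others)) > max_dev:
            suspects.add(node)
    return suspects


# Four nodes in one region; C reports a value far from its neighbours.
region = {"A": 20.1, "B": 19.8, "C": 35.0, "D": 20.4}
flagged = suspect_nodes(region, max_dev=3.0)
```

The median is used here rather than the mean so that a single faulty reading does not shift the reference value for the other nodes.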

16

Fault Recovery in WSN

The most common technique: Replication (redundancy of components)

Active replication (request is processed by all replicas)
• Inherent in WSN
  • Multipath routing
  • Sensor value aggregation
  • Ignore values from faulty nodes

Passive replication (request is processed by a primary replica, backup replicas are then synchronized)
1. Fault detection
2. Election (self, group, hierarchical)
3. Service distribution
  • Pre-copy (dynamic role assignment)
  • Code update (Maté, Agilla, Impala ..)
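The three steps of passive replication can be sketched as a single fail-over routine. The sketch makes several assumptions that are not in the lecture: fault detection is reduced to a ready-made `alive` map (for example from heartbeats), the election rule is "lowest live node ID", and service distribution uses pre-copied state held by each backup.

```python
from typing import Dict, Tuple


def fail_over(alive: Dict[int, bool],
              primary: int,
              replica_state: Dict[int, str]) -> Tuple[int, str]:
    """Return the node that should serve requests and the state it holds."""
    # 1. Fault detection: here simply the liveness map.
    if alive.get(primary, False):
        return primary, replica_state[primary]

    # 2. Election: pick the lowest-ID live node that holds a state copy
    #    (an assumed rule; the slide lists self, group and hierarchical).
    candidates = sorted(n for n, up in alive.items()
                        if up and n in replica_state)
    if not candidates:
        raise RuntimeError("no live replica available")
    new_primary = candidates[0]

    # 3. Service distribution: the backup resumes from its pre-copied state.
    return new_primary, replica_state[new_primary]
```

Because backups are only synchronized after the primary has processed a request, the state taken over may lag behind the failed primary by the updates that were not yet propagated.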

17

Outline of Today’s Lecture

• Fault Model
• Fault Tolerance in WSN
• Fault Tolerance in Collaborative Sensor Networks for Target Detection
• Distributed Bayesian Algorithms for Fault-Tolerant Event Region Detection in Wireless Sensor Networks
• Papers to Discuss

18

Fault Tolerance in Collaborative Sensor Networks for Target Detection

IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 3, MARCH 2004

Thomas Clouqueur, Kewal K. Saluja, and Parameswaran Ramanathan

19

Target Detection Problem

Targets emit signals characterizing their presence (sound, pressure, temperature etc.)

Signal strength decreases with the distance

SNs collaborate by exchanging and fusing their local information to produce a result global to the sensor field.

Detection results need to be available at each node

20

Difference from Agreement Problem

Nodes sharing information may hold local information that differs completely from one node to another. In target detection, nodes close to the target report high signal measurements, while nodes far from the target report low signal measurements.

Thus, in fusion, there is a lack of common truth in the measured values. Yet, it is desirable to arrive at a common value or common values and determine the impact of faults in the methods developed to arrive at consensus.

21

Fault Model

Faults include misbehaviors ranging from simple crash faults, where a node becomes inactive, to Byzantine faults, where the node behaves arbitrarily or maliciously

Faulty nodes are assumed to send inconsistent and arbitrary values to other nodes during information sharing

No communication failures (reliable links)

- Target is outside R
- C is faulty! A and D may conclude the presence of a target!

22

Motivation

The algorithm for target detection needs to be robust to such inconsistent behavior that can jeopardize the collaboration in the sensor network.

For example, if the detection results trigger subsequent actions at each node, then inconsistent detection results can lead each node to operate in a different mode, resulting in the sensor network going out of service.

The performance of fusion is therefore also defined by precision. Precision measures the closeness of decisions from each other, the goal being that all nodes obtain the same decision.

23

Centralized/Decentralized Target Detection

Centralized detection: All local sensors communicate their data to a central processor performing optimal or near-optimal detection
• The correctness of such a scheme relies on the central node’s correctness; therefore, central node-based schemes have low robustness to sensor failure.

Decentralized detection: Some preliminary processing of data is performed at each sensor node so that compressed information is gathered at the fusion center
• Loss of performance, but reduced communication bandwidth
• Improves reliability
• The performance loss may be reduced by optimally processing the information at each sensor node

24

FT Fusion Algorithms

Nodes share their information

Nodes use a fusion rule to arrive at a decision

Algorithms guarantee
• All non-faulty nodes obtain the same set S of data
• S contains all data sent by non-faulty nodes
• Problem: consistent outliers remain in the set. Therefore the largest and smallest data are dropped, and the average is computed over the remaining data.

Different fusion algorithms can be derived by varying the size of the information shared between sensor nodes.

25

Value/Decision Fusion

Two extreme cases:
• Value fusion: Nodes exchange raw measurements
• Decision fusion: Nodes exchange local detection decisions

Value Fusion Algorithm

At each node
• Obtain raw measurements from every node
• Drop the p largest values and the p smallest values (step needed for faulty nodes)
• Compute the AVERAGE of remaining values
• Compare the AVERAGE to THRESHOLD for final decision

Decision Fusion Algorithm

It works in the same way as the value fusion algorithm

2*p < N/3
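The steps of the value fusion algorithm translate almost directly into code. The following is a minimal sketch under stated assumptions: the helper name `fuse`, the use of `>=` for the threshold comparison, and the encoding of local decisions as 1.0 and 0.0 in decision fusion are illustrative choices, not details given on the slide.

```python
from typing import Sequence


def fuse(values: Sequence[float], p: int, threshold: float) -> bool:
    """Drop the p largest and p smallest values, average the rest, compare."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("need 0 <= 2*p < number of values")
    kept = sorted(values)[p:len(values) - p]
    average = sum(kept) / len(kept)
    return average >= threshold


def value_fusion(measurements: Sequence[float], p: int,
                 threshold: float) -> bool:
    # Nodes exchange raw measurements and fuse them directly.
    return fuse(measurements, p, threshold)


def decision_fusion(measurements: Sequence[float], p: int,
                    local_threshold: float, fusion_threshold: float) -> bool:
    # Each node first decides locally (1.0 = target detected), and only
    # these decisions are exchanged and fused in the same way.
    decisions = [1.0 if m >= local_threshold else 0.0 for m in measurements]
    return fuse(decisions, p, fusion_threshold)


# Seven nodes, no target present, one faulty node reporting 9.0.
samples = [0.2, 0.3, 0.25, 0.3, 0.2, 0.25, 9.0]
```

On `samples`, fusing without dropping anything (`p=0`) lets the single faulty value pull the average above a threshold of 1.0, while `p=1` removes it and the decision is "no target".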

26

Evaluation Metrics

Precision
• Measures the closeness of the final decisions obtained by all sensors, the goal being that all non-faulty nodes obtain the same decision
• If f < N/3, precision is guaranteed

Accuracy
• Measures how well the node decisions represent the environment, the goal being that the decision of non-faulty nodes is “object detected” if and only if a target is present
• Measured by false alarm probability and detection probability
• Determined by THRESHOLD, noise, target position, node placement
• If f < N/3, relative accuracy is guaranteed (due to noise)

Communication overhead
• Number and size of messages exchanged

Robustness
• Measured by system failure probability
• System failure occurs when the number of faulty nodes exceeds the bound of tolerable faulty nodes
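As a worked example of the robustness metric, the system failure probability can be computed once a node fault model is fixed. The sketch below assumes, purely for illustration, that each of the N nodes is faulty independently with the same probability `q`; neither this model nor the function names come from the lecture. The bound f < N/3 is taken from the slide, so the largest tolerable number of faulty nodes is (N - 1) // 3.

```python
from math import comb


def max_tolerable_faults(n: int) -> int:
    """Largest integer f with f < n/3."""
    return (n - 1) // 3


def system_failure_probability(n: int, q: float) -> float:
    """P(more than max_tolerable_faults(n) of n nodes are faulty).

    Assumes independent node faults with probability q (binomial model).
    """
    f_max = max_tolerable_faults(n)
    return sum(comb(n, k) * q**k * (1 - q)**(n - k)
               for k in range(f_max + 1, n + 1))
```

For N = 4 and q = 0.5 only one faulty node is tolerable, so the system fails whenever two or more nodes are faulty, which gives 11/16 under this model.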

27

Comparison of Algorithms

Compare metrics in absence/presence of faulty nodes

Due to the reduced amount of information shared in decision fusion, the communication cost is lower in decision fusion than in value fusion.

The system failure probability is identical for value and decision fusion since failures depend on the number of faulty nodes present and not on the algorithm used.

However, the performance measured in terms of precision and accuracy differs.

28

Papers to Discuss: Overview

29

Paper 1: In-Network Fault Tolerance in Networked Sensor Systems

WSN are more prone to failure than other wireless networks. Energy-aware fault tolerant mobile sink node may help WSN meet dependability and QoS requirements while conserving energy.

Approach: apply in-network data checkpointing and recovery in order to achieve high QoS with minimum energy overhead.

[Figure: sensor nodes S1, Si, Sm-1, Sm with a checkpoint, and events E1, E2, E3, En]

(See paper discussion)

30

Paper 2: Detection & Repair of SW Error in Hierarchical Sensornets

Associate INVARIANTs with particular SW execution states
• Local vs. remote
• Stateless vs. stateful

Compiler inserts the checking code

At run time:
• Check invariants during this state
• Send debug messages if needed

If anomaly is detected
• Increase logging details
• Report errors to sink
• Remote reprogramming if necessary

(See paper discussion)
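To make the idea of state-bound invariants concrete, here is a small hypothetical sketch. It is not the paper's mechanism: in the paper the compiler inserts the checking code, whereas here the checks are registered and called by hand, and the class `InvariantMonitor`, the `reports` list standing in for messages to the sink, and the example invariant are all assumptions.

```python
import logging
from typing import Callable, Dict, List

log = logging.getLogger("node")


class InvariantMonitor:
    def __init__(self) -> None:
        self.invariants: Dict[str, List[Callable[[dict], bool]]] = {}
        self.reports: List[str] = []  # stands in for error reports to the sink

    def associate(self, state: str, check: Callable[[dict], bool]) -> None:
        """Bind an invariant to a particular execution state."""
        self.invariants.setdefault(state, []).append(check)

    def check(self, state: str, variables: dict) -> bool:
        """Evaluate all invariants of the given state."""
        ok = all(inv(variables) for inv in self.invariants.get(state, []))
        if not ok:
            log.setLevel(logging.DEBUG)  # increase logging detail
            self.reports.append(f"invariant violated in state {state}")
        return ok


monitor = InvariantMonitor()
# Hypothetical local invariant: the forwarding queue stays within bounds.
monitor.associate("forwarding", lambda v: 0 <= v["queue_len"] <= 16)
```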

31

Paper 3: Declarative Failure Recovery for Sensor Networks

API for checkpointing
• Based on macroprogramming
• Also supports partition recovery

Declarative recovery
• Modular code annotation with checkpoints
• Automatic fault detection and roll back
• Automatic determination of nearest checkpoint

Transparent recovery
• Heuristics for automatic determination of where the checkpoints should be taken

(See paper discussion)
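Checkpointing with rollback to the nearest checkpoint can be illustrated with a very small store. This is a generic sketch, not the API of the paper: the class `CheckpointStore`, its methods and the use of integer step counters are hypothetical.

```python
import copy
from typing import Any, Dict, List, Tuple


class CheckpointStore:
    def __init__(self) -> None:
        self._log: List[Tuple[int, Dict[str, Any]]] = []

    def checkpoint(self, step: int, state: Dict[str, Any]) -> None:
        """Save a private copy of the state at the given step."""
        self._log.append((step, copy.deepcopy(state)))

    def rollback(self, failed_step: int) -> Tuple[int, Dict[str, Any]]:
        """Return the nearest checkpoint taken at or before failed_step."""
        earlier = [(s, st) for s, st in self._log if s <= failed_step]
        if not earlier:
            raise LookupError("no checkpoint before the failure")
        step, state = max(earlier, key=lambda entry: entry[0])
        return step, copy.deepcopy(state)


store = CheckpointStore()
store.checkpoint(0, {"sum": 0})
store.checkpoint(5, {"sum": 12})
```

A failure detected at step 7 would resume from the checkpoint of step 5, while a failure at step 3 would fall back to step 0.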
