1
Software Fault Tolerance (SWFT)
SWFT for Wireless Sensor Networks (Lec 1)
Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de
Prof. Neeraj Suri
Abdelmajid Khelil
Dept. of Computer Science, TU Darmstadt, Germany
2
Typical Wireless Sensor Networks (WSN)
Internet, GSM, Satellite, etc.
Sink
User mote
• Large number of cheap sensing devices
• Self-organized
• Multi-hop communication
3
Application Spectrum: Measurement vs. Event
Hazard Detection
Biological Monitoring
Medical Domain
Smart Environment
Wearable Computing
Transportation
Earth Science & Exploration
Context-Aware Computing
Interactive VR Game
Wireless Sensor Networks
Urban Warfare
Military Surveillance
Disaster Recovery
Environmental Monitoring
Src: University of Virginia
4
Typical Network Functionalities
Data Collection (SNs → Sink)
• Convergecast
• Aggregation
• Routing
• ..
Data Dissemination (Sink → SNs)
• Code update (reprogramming)
• Query
• Diagnosis
• ..
Sink
Event
Sink
5
Challenges for Fault-Tolerance (FT)
Cheap components
• Scarcity of resources: processing/storage/communication
• Finite energy supply
In situ in physical environment
• Test environment different from the deployment environment
• Evolvable deployment conditions
• Difficulty of testing after deployment
Failure-prone components
FT is as critical as other performance metrics (energy efficiency, latency & accuracy) in supporting distributed apps.
Earlier work from wired networks does not directly apply to WSNs
6
Outline of Today’s Lecture
Fault Model
Fault Tolerance in WSN
Papers to Discuss
• Fault Tolerance in Collaborative Sensor Networks for Target Detection
• Distributed Bayesian Algorithms for Fault-Tolerant Event Region Detection in Wireless Sensor Networks
© DEEDS Group, SWFT WS ’07
7
Fault, Error and Failure in WSN
A sensor service running on node A is expected to periodically send the measurements of its sensors to an aggregation service running on node B.
However, node A suffers an impact causing a loose connection with one of its sensors (Fault).
Since the code implementing node A’s service is not designed to detect and overcome such situations, an erroneous state is reached (Error) when the sensor service tries to acquire data from the sensor.
Due to this state, the service does not send sensor data (Failure) to the aggregation service within the specified time interval. This results in a crash or omission failure of node A observed by node B.
8
Sources of Faults
Fragile sensor nodes
Node failures
• Depletion of batteries
• Harsh environment: damage, short circuits, enclosure, etc.
Incorrect sensor reading or processing
• Due to low battery level, etc.
SW bugs (e.g. in routing layer)
Erroneous wireless communication
Reading transmission failures
• Temporary/permanent link failures
• Packet corruption/loss
• Network partitioning
Sink faults
Security breaches
Byzantine faults: overlap between faults and security breaches, describing arbitrary behaviour
9
Failure Classification
Omission: A service is sporadically not responding to requests (e.g. due to msg loss)
Crash: The service at some point stops responding to any request (e.g. after f omissions)
Timing: The service's response is received outside the time interval specified by the application (too early or too late).
Value: A service sends a timely response but with lack of accuracy.
Arbitrary: Includes all other failures, e.g. Byzantine failures: failures that are in general caused by a malicious service that behaves erroneously AND not consistently
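The classes above can be told apart by an observer such as the aggregation service on node B. The following is a minimal sketch, not a method from the lecture: the names `Response`, `classify` and `is_crashed`, the reference value `expected`, and the bounds `min_delay`, `max_delay` and `tolerance` are all assumptions introduced for illustration. Arbitrary (Byzantine) failures are deliberately left out, since they cannot be recognized from a single response.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    """One observed reply to a request; value is None if no reply arrived."""
    value: Optional[float]
    delay: Optional[float]  # seconds between request and reply


def classify(resp: Response, expected: float, min_delay: float,
             max_delay: float, tolerance: float) -> str:
    """Map one observation to 'omission', 'timing', 'value' or 'ok'.

    `expected` is a hypothetical reference value; a real observer would
    have to estimate it, e.g. from neighbouring nodes.
    """
    if resp.value is None:
        return "omission"                      # no response at all
    if resp.delay < min_delay or resp.delay > max_delay:
        return "timing"                        # too early or too late
    if abs(resp.value - expected) > tolerance:
        return "value"                         # timely but inaccurate
    return "ok"


def is_crashed(history: List[str], f: int) -> bool:
    """Suspect a crash after f consecutive omissions."""
    return f > 0 and len(history) >= f and all(
        h == "omission" for h in history[-f:])
```

A crash is only ever suspected here: from the outside, a crashed service and a long run of lost messages look the same.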
10
Node Architecture
[Figure: layered node architecture. Software: Application (App1, App2, App3); Middleware (middleware management, algorithms, modules, services, virtual machine); Operating System (system: threads, address space, files; hardware drivers; sensor driver; physical, data link, network and transport layers). Hardware: hardware sensor. Additional label: Power Management.]
Src: Abdul-Halim Jallad and Tanya Vladimirova
11
Fault Classification and Propagation in WSN
Sink
Event
Internet, GSM, Satellite, etc.
User
12
Fault Tolerance (FT)
Fault tolerance: the ability to sustain sensor network functionalities without any interruption due to failures
FT techniques taxonomy
• Fault prevention
• Fault detection
• Fault isolation
• Fault identification
• Fault recovery
FT can be addressed at different layers (HW, SW, network and application)
There is always a trade-off between FT and efficiency
14
Fault Prevention
Sensor node deployment/placement
• Connectivity
• Coverage
• Relay node placement in two-tiered networks
• Power on-off
• Transmission range adjustment
Sensor network monitoring (“watch the watch-dog!”)
• Active
• Passive
15
Fault Detection (Isolation and Identification)
Self-diagnosis: the node itself can identify faults. Examples:
• Monitor the physical environment using accelerometers
• Measure the current battery voltage
• Measure communication link quality
Cooperative diagnosis (group detection): several nodes monitor the behavior of another node. Examples:
• “Sensors from the same region should have similar values” (detailed in SWOSFT)
• Consumer nodes observe the behaviour of the service provider (e.g. next-hop relay nodes)
Hierarchical detection
• Uses a tree
• In general, detection is shifted to more powerful nodes (e.g. sink)
Frameworks: Memento, Sympathy and SNIF
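The group-detection rule that sensors from the same region should report similar values can be sketched as a comparison against the neighbours' median. This is only an illustration under assumptions: the function name `suspect_nodes`, the use of the median as the reference, and the `max_deviation` parameter are not taken from the lecture or from the frameworks listed.

```python
from statistics import median
from typing import Dict, Set


def suspect_nodes(readings: Dict[str, float], max_deviation: float) -> Set[str]:
    """Return ids of nodes whose reading differs from the median of the
    OTHER nodes in the region by more than max_deviation."""
    suspects = set()
    for node, value in readings.items():
        others = [v for n, v in readings.items() if n != node]
        # A node with no neighbours cannot be cross-checked.
        if others and abs(value - median(others)) > max_deviation:
            suspects.add(node)
    return suspects


region = {"A": 20.1, "B": 19.8, "C": 35.0, "D": 20.4}
print(suspect_nodes(region, max_deviation=3.0))
```

The median is chosen over the mean so that a single faulty reading does not shift the reference value it is compared against.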
16
Fault Recovery in WSN
The most common technique: replication (redundancy of components)
Active replication (request is processed by all replicas)
Inherent in WSN
• Multipath routing
• Sensor value aggregation
• Ignore values from faulty nodes
Passive replication (request is processed by a primary replica; backup replicas are then synchronized)
1. Fault detection
2. Election (self, group, hierarchical)
3. Service distribution
• Pre-copy (dynamic role assignment)
• Code update (Maté, Agilla, Impala ..)
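Passive replication can be sketched in a few lines. The code below is a simplified, single-process illustration under stated assumptions: the classes `Replica` and `PassiveGroup` are hypothetical, fault detection is reduced to reading an `alive` flag, and the election rule (lowest id among the surviving replicas) is chosen only for brevity. Service distribution (pre-copy or code update) is not modelled.

```python
class Replica:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.alive = True
        self.state = []  # log of processed requests


class PassiveGroup:
    def __init__(self, node_ids):
        self.replicas = {i: Replica(i) for i in node_ids}
        self.primary = min(node_ids)

    def _elect(self):
        """Election step: pick the lowest id among alive replicas (assumed rule)."""
        alive = [i for i, r in self.replicas.items() if r.alive]
        if not alive:
            raise RuntimeError("no replica left")
        self.primary = min(alive)

    def handle(self, request):
        """Process a request on the primary, then synchronize the backups."""
        if not self.replicas[self.primary].alive:   # 1. fault detection
            self._elect()                            # 2. election
        primary = self.replicas[self.primary]
        primary.state.append(request)
        for r in self.replicas.values():
            if r.alive and r is not primary:
                r.state = list(primary.state)        # backup synchronization
        return self.primary
```

Because the backups are synchronized after every request, a newly elected primary continues from the last state the failed primary had reached.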
18
Fault Tolerance in Collaborative Sensor Networks for Target Detection
IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 3, MARCH 2004
Thomas Clouqueur, Kewal K. Saluja, and Parameswaran Ramanathan
19
Target Detection Problem
Targets emit signals characterizing their presence (sound, pressure, temperature, etc.)
Signal strength decreases with distance
SNs collaborate by exchanging and fusing their local information to produce a result global to the sensor field.
Detection results need to be available at each node
20
Difference from Agreement Problem
Nodes sharing information may contain local information that can be totally different from one node to another. In target detection, nodes close to the target report high signal measurement, while nodes far from the target report low signal measurements.
Thus, in fusion, there is a lack of common truth in the measured values. Yet, it is desirable to arrive at a common value or common values and determine the impact of faults in the methods developed to arrive at consensus.
21
Fault Model
Faults include misbehaviors ranging from simple crash faults, where a node becomes inactive, to Byzantine faults, where the node behaves arbitrarily or maliciously
Faulty nodes are assumed to send inconsistent and arbitrary values to other nodes during information sharing
No communication failures (reliable links)
Example (figure): the target is outside R and C is faulty; A and D may nevertheless conclude the presence of a target!
22
Motivation
The algorithm for target detection needs to be robust to such inconsistent behavior that can jeopardize the collaboration in the sensor network.
For example, if the detection results trigger subsequent actions at each node, then inconsistent detection results can lead each node to operate in a different mode, resulting in the sensor network going out of service.
The performance of fusion is therefore also defined by precision. Precision measures the closeness of decisions from each other, the goal being that all nodes obtain the same decision.
23
Centralized/Decentralized Target Detection
Centralized detection: all local sensors communicate their data to a central processor performing optimal or near-optimal detection
• The correctness of such a scheme relies on the central node’s correctness; therefore, central node-based schemes have low robustness to sensor failure.
Decentralized detection: some preliminary processing of data is performed at each sensor node so that compressed information is gathered at the fusion center
• Loss of performance, but reduced communication bandwidth
• Improves reliability
• The performance loss may be reduced by optimally processing the information at each sensor node
24
FT Fusion Algorithms
Nodes share their information
Nodes use a fusion rule to arrive at a decision
Algorithms guarantee
• All non-faulty nodes obtain the same set S of data
• S contains all data sent by non-faulty nodes
Problem: consistent outliers remain in the set, so the largest and smallest data are dropped and the average is computed over the remaining data
Different fusion algorithms can be derived by varying the size of the information shared between sensor nodes.
25
Value/Decision Fusion
Two extreme cases:
• Value fusion: nodes exchange raw measurements
• Decision fusion: nodes exchange local detection decisions
Value Fusion Algorithm
At each node
• Obtain raw measurements from every node
• Drop the p largest values and the p smallest values (step needed for faulty nodes)
• Compute the AVERAGE of remaining values
• Compare the AVERAGE to THRESHOLD for final decision
Decision Fusion Algorithm
It works in the same way as the value fusion algorithm
2*p < N/3
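The two fusion algorithms can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the function names, the `>=` comparison against the threshold, and the separate `local_threshold` and `fusion_threshold` used for decision fusion are assumptions. The sketch only checks that at least one value survives the dropping step; the bound on p noted on the slide is not enforced.

```python
from typing import List


def fuse(values: List[float], p: int, threshold: float) -> bool:
    """Drop the p largest and p smallest values, average the rest,
    and compare the average to the threshold."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("need 0 <= 2*p < number of values")
    kept = sorted(values)[p:len(values) - p]
    average = sum(kept) / len(kept)
    return average >= threshold  # True means "target detected"


def value_fusion(measurements: List[float], p: int, threshold: float) -> bool:
    """Nodes exchange raw measurements and fuse them directly."""
    return fuse(measurements, p, threshold)


def decision_fusion(measurements: List[float], p: int,
                    local_threshold: float, fusion_threshold: float) -> bool:
    """Each node first makes a local 0/1 decision; the decisions are then
    fused in the same way as raw values."""
    decisions = [1.0 if m >= local_threshold else 0.0 for m in measurements]
    return fuse(decisions, p, fusion_threshold)


# One faulty node reports 9.0 although no target is present.
readings = [0.2, 0.3, 0.25, 9.0, 0.1, 0.22, 0.28]
print(value_fusion(readings, p=0, threshold=1.0))  # outlier kept
print(value_fusion(readings, p=1, threshold=1.0))  # outlier dropped
```

With p = 0 the single faulty reading pulls the average above the threshold; dropping one value at each end removes its influence.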
26
Evaluation Metrics
Precision
• Measures the closeness of the final decisions obtained by all sensors, the goal being that all non-faulty nodes obtain the same decision
• If f < N/3 (f faulty nodes among N), precision is guaranteed
Accuracy
• Measures how well the node decisions represent the environment, the goal being that the decision of non-faulty nodes is “object detected” if and only if a target is present
• Measured by false alarm probability and detection probability
• Determined by THRESHOLD, noise, target position, node placement
• If f < N/3, relative accuracy is guaranteed (due to noise)
Communication overhead
• Number and size of messages exchanged
Robustness
• Measured by system failure probability
• System failure occurs when the number of faulty nodes exceeds the bound of tolerable faulty nodes
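The robustness metric can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each of the N nodes is faulty independently with the same probability q; this independence model and the function names are not taken from the slides. The tolerable bound is the largest f with f < N/3.

```python
from math import comb


def max_tolerable_faults(n: int) -> int:
    """Largest integer f that satisfies f < n/3."""
    return (n - 1) // 3


def system_failure_probability(n: int, q: float) -> float:
    """Probability that more than the tolerable number of nodes is faulty,
    assuming independent node faults with probability q (binomial model)."""
    t = max_tolerable_faults(n)
    p_within_bound = sum(
        comb(n, k) * q**k * (1 - q)**(n - k) for k in range(t + 1))
    return 1.0 - p_within_bound


for n in (4, 10, 31):
    print(n, max_tolerable_faults(n), system_failure_probability(n, 0.05))
```

The result depends only on N and q, not on which fusion algorithm runs on the nodes.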
27
Comparison of Algorithms
Compare metrics in absence/presence of faulty nodes
Due to the reduced amount of information shared in decision fusion, the communication cost is lower in decision fusion than in value fusion.
The system failure probability is identical for value and decision fusion since failures depend on the number of faulty nodes present and not on the algorithm used.
However, the performance measured in terms of precision and accuracy differs.
29
Paper 1: In-Network Fault Tolerance in Networked Sensor Systems
WSNs are more prone to failure than other wireless networks. An energy-aware, fault-tolerant mobile sink node may help a WSN meet dependability and QoS requirements while conserving energy.
Approach: apply in-network data checkpointing and recovery in order to achieve high QoS with minimum energy overhead.
[Figure: sensor nodes S1 ... Si ... Sm-1, Sm; events E1, E2, E3 ... En; checkpoint]
(See paper discussion)
30
Paper 2: Detection & Repair of SW Error in Hierarchical Sensornets
Associate INVARIANTs with particular SW execution state
• Local vs. remote
• Stateless vs. stateful
Compiler inserts the checking code
At run time:
• Check invariants during this state
• Send debug messages if needed
If an anomaly is detected
• Increase logging details
• Report errors to sink
• Remote reprogramming if necessary
(See paper discussion)
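The idea of attaching invariants to execution state can be illustrated with a small sketch. In the paper the compiler inserts the checking code; here a Python decorator stands in for that step, which is a substitution made only for illustration. The names `invariant`, `monitor` and `enqueue`, and the queue-bound invariant itself, are hypothetical. Reporting to the sink is modelled as appending to a list, and remote reprogramming is not modelled.

```python
import functools

# Stands in for the node's diagnostic state: the current logging level
# and the error reports that would be sent to the sink.
monitor = {"log_level": "normal", "reports": []}


def invariant(check, description):
    """Wrap a function so that `check(state)` is evaluated after each call."""
    def wrap(func):
        @functools.wraps(func)
        def inner(state, *args, **kwargs):
            result = func(state, *args, **kwargs)
            if not check(state):
                monitor["log_level"] = "detailed"       # increase logging details
                monitor["reports"].append(description)  # report error to sink
            return result
        return inner
    return wrap


@invariant(lambda s: 0 <= s["queue_len"] <= s["queue_cap"],
           "queue length out of bounds")
def enqueue(state, n):
    """Hypothetical routing-layer operation with a local, stateless invariant."""
    state["queue_len"] += n


state = {"queue_len": 0, "queue_cap": 4}
enqueue(state, 3)   # invariant holds
enqueue(state, 3)   # invariant violated: 6 > 4
print(monitor)
```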
31
Paper 3: Declarative Failure Recovery for Sensor Networks
API for checkpointing
• Based on macroprogramming
• Also supports partition recovery
Declarative recovery
• Modular code annotation with checkpoints
• Automatic fault detection and roll back
• Automatic determination of nearest checkpoint
Transparent recovery
• Heuristics for automatic determination of where the checkpoints should be taken
(See paper discussion)
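Checkpointing with automatic rollback can be sketched on a single node as below. This is not the paper's API: the class `CheckpointedTask` and its methods are hypothetical, a raised exception stands in for fault detection, and the "nearest" checkpoint is simply the most recent one. Partition recovery and the placement heuristics are not modelled.

```python
import copy


class CheckpointedTask:
    def __init__(self, state):
        self.state = state
        self._checkpoints = []

    def checkpoint(self):
        """Save a deep copy of the current state."""
        self._checkpoints.append(copy.deepcopy(self.state))

    def run(self, step):
        """Run one step; on an exception, roll back to the most recent
        checkpoint (if any) and report that the step failed."""
        try:
            step(self.state)
            return True
        except Exception:
            if self._checkpoints:
                self.state = copy.deepcopy(self._checkpoints[-1])
            return False


def faulty_step(state):
    state["sum"] = -1           # corrupts the state ...
    raise RuntimeError("fault")  # ... before the fault is detected


task = CheckpointedTask({"sum": 0})
task.run(lambda s: s.update(sum=s["sum"] + 5))
task.checkpoint()
task.run(faulty_step)
print(task.state)
```

Deep copies are used so that later changes to the live state cannot alter a saved checkpoint.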
32
Literature
Luciana M. S. D. Souza, Harald Vogt, Michael Beigl, “A Survey on Fault Tolerance in Wireless Sensor Networks”, 2007.
Hai Liu, Amiya Nayak, and Ivan Stojmenovic, “On Fault Tolerance in Wireless Sensor Networks”, Handbook of Wireless Ad Hoc and Sensor Networks, 2007.
Lilia Paradis and Qi Han, “A Survey of Fault Management in Wireless Sensor Networks”, Journal of Network and Systems Management, 15(2), 2007.
F. Koushanfar, M. Potkonjak, A. Sangiovanni-Vincentelli, “Fault Tolerance in Wireless Ad Hoc Sensor Networks”, IEEE Sensors, Vol. 2, 2002.