full paper software aging and rejuvenation


Post on 26-Aug-2014






Rejuvenation: A Technique to Slow Down Software Aging

Authored by Urvesh Chaudhery, Associate Prof., DAVIM, Faridabad. Chaudhery.urvesh@gmail.com

Abstract

A proactive technique called software rejuvenation has been proposed as a way to counteract software aging. It involves occasionally terminating the software application, cleaning its internal state and/or its environment, and then restarting it. Because software rejuvenation incurs costs, an important question is when to schedule it. While periodic rejuvenation at constant time intervals is straightforward to implement, it may not yield the best results. Software rejuvenation should therefore be planned and initiated in response to the actual system behavior. This requires the measurement, analysis, and prediction of system resource usage. In this paper we focus on a self-healing scheme that is meant to disrupt the running service for a minimal amount of time, achieving zero downtime in most cases. We present an approach for software rejuvenation based on automated self-healing techniques that can be easily applied to off-the-shelf application servers. Software aging and transient failures are detected through continuous monitoring of system data and performability metrics of the application server. If anomalous behavior is identified, the system triggers an automatic rejuvenation action. We also explain the use of virtualization to optimize the self-recovery actions, along with simple self-healing examples.

Keywords: Software rejuvenation, software aging, virtualization, self-healing.

1.0 INTRODUCTION

The normal life of a piece of software begins with its initialization and the start of its processing. It works flawlessly for a long time, but while it is working it is slowly degrading. Little pieces of unused memory are reserved for uses that never occur, bits of information are saved even though they won't be used again, and so on. Eventually a software fault will activate. Perhaps the software runs out of memory or some other resource, or gets confused by the existence of unused data. When this occurs, an error is likely to happen, which might be followed by some type of failure. Many studies have reported that systems suffer more outages from software faults than from hardware faults. The phenomenon in which the state of the software or its environment degrades with time is known as software aging. The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and numerical error accumulation. Eventually, this may lead to performance degradation of the software, a crash/hang failure, or both. Some common causes of software aging are memory bloating and leaking, unreleased file locks, data corruption, storage space fragmentation, and accumulation of round-off errors.

Since aging leads to transient failures in software systems, environment diversity, a software fault tolerance technique that is based on changing the operating environment, can be employed proactively to prevent degradation or crashes. This involves occasionally stopping the running software, cleaning its internal state or its environment, and restarting it. This technique is known as software rejuvenation, a term proposed by Huang et al. It counteracts the aging phenomenon in a proactive manner by removing the accumulated error conditions and freeing up operating system resources. Garbage collection, flushing operating system kernel tables, and reinitializing internal data structures are some actions by which the internal state or the environment of the software can be cleaned.

Software itself never deteriorates; it appears to age because of the degradation of its operating environment (for example, the gradual depletion of resources). Likewise, software rejuvenation actually refers to rejuvenation of the environment in which the software is executing, not to re-manufacturing the software. Software rejuvenation is a periodic, pre-emptive restart of a running system to prevent future failures. It is one aspect of a self-healing system. Interesting new problems in the study of rejuvenation of large-scale systems are:

What is a state in a large-scale system for rejuvenation analysis? Here, all the states which can lead to failure are considered.
Failure symptoms are at a system/network (macro) level, but rejuvenation actions are at a component (micro) level; we study the relation between the two.
What are the models and analytical methods for rejuvenation in large-scale systems?
How does one do rejuvenation in a large system through gradual load shedding?
What is a safe (clean internal) state in which to take a backup of the data in the system?
How does the technology become common practice?
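
The measurement, analysis, and prediction of resource usage mentioned in the abstract can be illustrated with a minimal sketch. It assumes that usage of a single resource (for example, memory) grows roughly linearly, fits a least-squares line to the samples, and estimates the time remaining until a given capacity is reached. The function names, the linear trend model, and the safety margin are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Optional, Sequence


def estimate_time_to_exhaustion(
    times: Sequence[float], usage: Sequence[float], capacity: float
) -> Optional[float]:
    """Fit usage = a + b * t by least squares and return the time left
    (measured from the last sample) until the fitted line reaches capacity.
    Returns None when the fitted trend is flat or decreasing."""
    n = len(times)
    if n < 2 or n != len(usage):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_u = sum(usage) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be equal")
    sxy = sum((t - mean_t) * (u - mean_u) for t, u in zip(times, usage))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no aging trend visible in these samples
    intercept = mean_u - slope * mean_t
    t_exhaust = (capacity - intercept) / slope
    return max(0.0, t_exhaust - times[-1])


def should_rejuvenate(remaining: Optional[float], safety_margin: float) -> bool:
    """Trigger rejuvenation once the predicted remaining time falls
    within the (hypothetical) safety margin."""
    return remaining is not None and remaining <= safety_margin


# Hypothetical samples: memory use in MB observed once per hour.
remaining = estimate_time_to_exhaustion([0, 1, 2, 3], [100, 110, 120, 130], 200)
print(remaining, should_rejuvenate(remaining, safety_margin=10.0))
```

A real monitor would likely need a more robust trend estimator and several resources at once; the point here is only that the rejuvenation time is derived from observed behavior rather than from a fixed interval.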

2.0 Why Systems Fail?

A fault in an algorithm's implementation can lead to an erroneous computation for specific values of a program variable, a case of fault activation causing an error. The software can use this incorrect result internally for further calculations, in which case the error propagation leads to additional errors. A failure occurs only when the system uses one of these incorrect calculations in a way that influences perceivable system behaviour, or when error propagation causes a failure occurrence. Based on the relationships between faults, errors, and failures, we can offer two explanations as to why software may behave differently under apparently identical conditions. First, if there's a long delay between the fault activation and the final failure occurrence (for example, because the error propagation traverses several different error states), then it's difficult to identify the user actions that actually activated the fault and caused the failure. Simply repeating the steps carried out a short time before the failure occurrence might not lead to its reproduction.

3.0 Self Healing in Software (Rejuvenation)

Systems that have the ability to manage themselves and dynamically adapt to change in accordance with policies and objectives are termed autonomic computing systems. Such systems enable computers to identify and correct problems, often before they are noticed by the user. Autonomic systems have the capability to self-configure, self-optimize, self-protect, and self-heal. Systems that have these characteristics are termed self-managing systems. Autonomic pervasive computing combines characteristics of both autonomic computing and pervasive computing environments. As in pervasive computing, devices running in this area should be context- and situation-aware, and these devices form an ad-hoc ephemeral network. These devices are also expected to have the ability to self-optimize,

Figure 1

Self-healing describes devices or systems that have the ability to perceive that they are not operating correctly and, without human intervention, make the necessary adjustments to restore themselves to normal operation. A system that continues its operation even in the presence of faults is termed a fault-tolerant system. The concept of self-healing goes beyond fault tolerance, since it also gives the device the capability of recovering from a fault by itself or with the assistance of other devices present in the network. Fig. 1 depicts the scope of self-healing autonomic pervasive computing.

4.0 Characteristics of Self-healing Model

A self-healing model targeted for autonomic pervasive computing should have the following attributes:

1. Less Infrastructure. Extensive infrastructure support is not required. The main requirements are powerful servers, proxies, etc. If the focus is on truly pervasive environments, then the model should work independently without any external support, since infrastructure support is not always available in such environments.
2. Lightweight. The model should be lightweight in terms of executable file size.
3. Non-degradable performance. The model should not put much overhead on the performance of the device.
4. Energy efficient. Self-healing models should be energy efficient. They should not require much battery power for computation or communication purposes.
5. Transparent. The main idea behind designing autonomic systems is transparency. Every self-healing model should have some level of transparency. However, it should also inform the user about critical system information.
6. Secure. Self-healing systems require information distribution and backup to recover from faults. The information distribution and storage should be highly secure to maintain user privacy and security.

An effective self-healing service should address the following challenges:

No regular functionality of the network should be hampered by a fault in any device.
All significant information of the faulty device should be preserved in a secure fashion.
The device should be able to heal its fault by itself or, at most, with the assistance of other devices in the network.
After reviving, the faulty device should be able to regain its previous state in such a way that it appears as if there had been no fault.

To address the above challenges in an appropriate manner, the self-healing system has the following steps:

A. Fault detection
B. Fault notification
C. Faulty device isolation
D. Fault healing

Fig. 2 shows the architecture of the self-healing service. The fault-detection and fault-notification unit is used by each device running on the server. The fault detection unit may be utilized by both the healing manager and the device itself, but the other three units (faulty device isolation, alteration, information distribution) should be handled by the healing manager.
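
The four steps above (fault detection, fault notification, faulty device isolation, fault healing) can be sketched as a small pipeline. This is only an illustrative model, not the architecture of Fig. 2: the `Device` and `HealingManager` classes, the `probe` callback, and the idea of healing by a simulated restart followed by a state restore are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Device:
    """Hypothetical device with a health flag and some recoverable state."""
    name: str
    healthy: bool = True
    isolated: bool = False
    state: Dict[str, str] = field(default_factory=dict)


class HealingManager:
    """Hypothetical manager that runs steps A-D for one device at a time."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.backups: Dict[str, Dict[str, str]] = {}

    def detect(self, device: Device, probe: Callable[[Device], bool]) -> bool:
        # A. Fault detection: the probe returns True for a healthy device.
        return not probe(device)

    def notify(self, device: Device) -> None:
        # B. Fault notification: record the fault for the rest of the network.
        self.log.append(f"fault:{device.name}")

    def isolate(self, device: Device) -> None:
        # C. Isolation: preserve the device's information, then cut it off.
        self.backups[device.name] = dict(device.state)
        device.isolated = True
        self.log.append(f"isolated:{device.name}")

    def heal(self, device: Device) -> None:
        # D. Healing: simulate a restart (state is lost), then restore the
        # preserved state so the device resumes as if no fault had occurred.
        device.state = {}
        device.state.update(self.backups[device.name])
        device.healthy = True
        device.isolated = False
        self.log.append(f"healed:{device.name}")

    def run(self, device: Device, probe: Callable[[Device], bool]) -> bool:
        """Return True if a fault was found and handled, False otherwise."""
        if not self.detect(device, probe):
            return False
        self.notify(device)
        self.isolate(device)
        self.heal(device)
        return True


manager = HealingManager()
node = Device("node-1", healthy=False, state={"session": "42"})
manager.run(node, probe=lambda d: d.healthy)
print(manager.log)
```

Keeping isolation and healing inside the manager mirrors the statement that only detection and notification run on the devices themselves; secure storage of the backup, which the challenges call for, is not modelled in this sketch.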

Figure 2: Architecture of Self Healing Service

4.1 FAULT DETECTION

Faults can be classified from different perspectives:

1. Hardware Fault - This fault is mainly caused by physical constraints of the devices running in an autonomic pervasive computing environment. Hardware faults considered h
