
SOFTWARE FAULT TOLERANCE IN A CLUSTERED ARCHITECTURE: TECHNIQUES & RELIABILITY MODELING
Hüsnü Şensoy

AGENDA

• Introduction
• RCC Principal Techniques & Architecture Assumptions
• Reliability Techniques
• Reliability Modeling & Analysis
• Conclusion

INTRODUCTION

AVAILABILITY & DATA CONSISTENCY

AVAILABILITY IN CLUSTERED ENVIRONMENT

4+2 Configuration

RCC PRINCIPAL TECHNIQUES & ARCHITECTURE ASSUMPTIONS

CLUSTERED ARCHITECTURE RELIABILITY

[Figure: cluster of four nodes, each a stack of Commercial Hardware, OS, Database, and Application.]

Annotations on the figure:

• Error Detection
• Switchover

• Error detection
• Consequent recovery actions
• Data backup

ZOOM IN TO A PROCESSING NODE

RCC Platform Assets
• WatchDog Interface
• State Server
• Cluster Management
• Process Monitoring
• Resource Monitors: Disk, Network

RCC Aware Application
• Network Systems' Applications

Off-the-shelf Applications

Standard Libraries

RCC Libraries

Commercial UNIX Operating System

Commercial Mirroring/Journaling File System Software

Commercial UNIX System Hardware Drivers

Disk Mirror Pseudo Driver

RELIABILITY TECHNIQUES

RELIABILITY DIMENSIONS

• Availability
• Data Consistency

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
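The availability ratio can be evaluated directly. The following is a minimal sketch; the function names and the sample MTBF/MTTR figures are illustrative assumptions, not values from the slides.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected unavailable hours in a 365-day year."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures: one failure every 2000 h, 2 h to repair.
    a = availability(mtbf_hours=2000.0, mttr_hours=2.0)
    print(f"availability = {a:.4%}")
    print(f"downtime     = {downtime_hours_per_year(a):.2f} h/year")
```

Shortening MTTR (faster detection and recovery) raises availability just as lengthening MTBF does, which is why the reliability levels focus on automating recovery.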

RELIABILITY MODELS

LEVELS OF RELIABILITY

Level 0: Basic automatic fault detection by watchdog, no automatic fault recovery, no data consistency
• A small set of fault classes (hardware and software) is detected by the watchdog.
• For a hardware fault, the system is manually reconfigured.
• For a software fault, the application process is restarted at its initial internal state. This requires initialization of the faulty processor, since the application may have left its data in an inconsistent or incorrect state.

Level 1: Basic automatic fault detection by watchdog, automatic fault recovery, no data consistency
• A small set of fault classes (hardware and software) is detected by the watchdog, and recovery is automatic.
• When the watchdog detects a fault, the system is recovered automatically: reconfigured for hardware faults and initialized for software faults.

Level 2: Level 1 plus enhanced automatic fault detection by watchdog, plus periodic checkpointing, logging, and recovery of internal state
• The watchdog and the application are enhanced to automatically detect a larger set of faults.
• The internal state of the application process is periodically checkpointed.
• After a hardware failure is detected, the system is reconfigured around the faulty unit.
• The application is restarted at the most recent checkpointed internal state.

Level 3: Level 2 plus persistent data recovery (the highest level achievable with RCC)
• The persistent data of the application is replicated on a backup disk connected to a backup node, and is kept consistent with the data on the primary node throughout the normal operation of the application.
• In case of a fault, the backup disk on the backup node brings the application's persistent data as close as possible to the state at which the application crashed.

Level 4: Continuous operation without interruption
• This level of reliability is not achievable with RCC.
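The Level 2 idea of periodic checkpointing and restart from the most recent checkpoint can be illustrated with a toy program. This is only a sketch under stated assumptions: the class `CheckpointedCounter`, the JSON file format, and the simulated crash are hypothetical and do not represent the RCC libraries or their API.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Toy application whose internal state is a single counter (hypothetical)."""

    def __init__(self, checkpoint_path, checkpoint_every=5):
        self.checkpoint_path = checkpoint_path
        self.checkpoint_every = checkpoint_every
        self.processed = 0

    def checkpoint(self):
        # Write to a temporary file, then rename, so that a crash in the
        # middle of a write cannot corrupt the previous checkpoint.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"processed": self.processed}, f)
        os.replace(tmp_path, self.checkpoint_path)

    def restore(self):
        # Fall back to the initial state when no checkpoint exists yet.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                self.processed = json.load(f)["processed"]
        else:
            self.processed = 0

    def step(self):
        self.processed += 1
        if self.processed % self.checkpoint_every == 0:
            self.checkpoint()


def demo(crash_at=13, total_steps=20):
    """Run, 'crash' after crash_at steps, restart from the last checkpoint."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "state.json")
        app = CheckpointedCounter(path)
        for _ in range(crash_at):
            app.step()
        # Simulated crash: the in-memory state of `app` is discarded.
        restarted = CheckpointedCounter(path)
        restarted.restore()
        resumed_from = restarted.processed
        while restarted.processed < total_steps:
            restarted.step()
        return resumed_from, restarted.processed


if __name__ == "__main__":
    print(demo())
```

Work done between the last checkpoint and the crash is lost and must be redone, so the checkpoint interval trades runtime overhead against recovery time.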

RELIABILITY MODELING & ANALYSIS

BASIC MODEL FOR SOFTWARE FAULT TOLERANCE

[Figure: Markov state diagram with states Working, Fault Detection & Recovery, Volatile Data Recovery, Persistent Data Recovery, and Failed. Transition labels use the factors c, c1, c2 and the complements (1 − c), (1 − c1), (1 − c2), (1 − c3).]
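A Markov model of this kind is typically evaluated by solving for its steady-state probabilities; the probability of the Working state is then the availability. The sketch below solves a small continuous-time Markov chain in pure Python. The three-state example, the rate names (`lam`, `mu_recover`, `mu_repair`), and all numeric values are assumptions made for illustration, and the example simplifies the slide's model by letting automatic recovery always succeed.

```python
def steady_state(n, rates):
    """Steady-state distribution of a continuous-time Markov chain.

    n     -- number of states
    rates -- dict mapping (i, j) with i != j to the transition rate i -> j
    Solves pi * Q = 0 together with sum(pi) = 1 by Gaussian elimination.
    """
    # Build the generator matrix Q (each row sums to zero).
    q = [[0.0] * n for _ in range(n)]
    for (i, j), rate in rates.items():
        q[i][j] += rate
        q[i][i] -= rate

    # Balance equation for state j: sum_i pi_i * q[i][j] = 0 (rows of Q^T).
    a = [[q[i][j] for i in range(n)] for j in range(n)]
    b = [0.0] * n
    # One balance equation is redundant; replace it with normalization.
    a[n - 1] = [1.0] * n
    b[n - 1] = 1.0

    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            raise ValueError("singular system; check that the chain is irreducible")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
            b[r] -= factor * b[col]

    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][k] * pi[k] for k in range(r + 1, n))
        pi[r] = (b[r] - tail) / a[r][r]
    return pi


if __name__ == "__main__":
    WORKING, RECOVERY, FAILED = 0, 1, 2
    lam, c = 0.01, 0.9                  # hypothetical fault rate and coverage
    mu_recover, mu_repair = 30.0, 0.5   # hypothetical recovery and repair rates
    pi = steady_state(3, {
        (WORKING, RECOVERY): lam * c,        # fault detected and covered
        (WORKING, FAILED): lam * (1.0 - c),  # fault not covered
        (RECOVERY, WORKING): mu_recover,     # automatic recovery completes
        (FAILED, WORKING): mu_repair,        # manual repair completes
    })
    print(f"availability = {pi[WORKING]:.6f}")
```

Adding a recovery state with its own coverage factor only means adding entries to `rates`, so the same routine can be reused as the model grows from one reliability level to the next.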

LEVEL 0 RELIABILITY

[Figure: two-state Markov model with states Working and Failed.]

Parameter values shown: 1/4, 0.001

Availability: 99.96%


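LEVEL 1 RELIABILITY

[Figure: Markov model with states including Working and Failed. Transition labels use c and (1 − c).]

Value shown for c: 0.9

Other parameter values shown: 30, 30

Availability: 99.98%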


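LEVEL 2 RELIABILITY

[Figure: Markov model with states including Working, Recovery, and Failed. Transition labels use c, (1 − c), c1, and (1 − c1).]

Values shown for the c factors: 0.9, 0.99

Other parameter values shown: 1800, 1800

Availability: 99.99%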


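LEVEL 3 RELIABILITY

[Figure: Markov model with states including Working, Recovery, Persistent Data Recovery, and Failed. Transition labels use c, (1 − c), c1, (1 − c1), c2, and (1 − c2).]

Values shown for the c factors: 0.9, 0.99, 0.999

Other parameter values shown: 1800, 1800, 3600, 100

Availability: ~100%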


CONCLUSION

• In this work, an RCC has been proposed.
• Different levels of reliability have been defined.
• A reliability analysis has been carried out via Markov modelling.

QUESTIONS & COMMENTS

?
