1
Fault Tolerance in an Event Rule Framework for Distributed Systems
Hillary Caituiro Monge
2
Contents
1. Introduction
2. Related Works
3. Overview of the Event Rule Framework (ERF)
4. Overview of Fault Tolerant CORBA
5. Design of the Fault Tolerant ERF (FT-ERF)
6. Performance Analysis
7. Conclusions
3
1. Introduction
Justification
- Distributed Systems (DS)
- Fault Tolerance (FT)
- Reactive Components (RC)
- The Event Rule Framework (ERF)
Motivation
Objectives
4
Distributed Systems (DS)
A DS is a collection of software components distributed among the processors of heterogeneous platforms (servers, workstations, laptops, mainframes, PDAs).
The purpose of a DS is to share resources and workload, and to maximize availability.
The design goals of DSs are transparency, scalability, reliability, and performance.
5
Fault Tolerance (FT)
FT is the ability of a system to continue operating as expected, despite internal or external failures.
DSs are prone to failures. Some faults can be detected; others cannot.
The FT of a DS can be improved through redundancy, i.e., replication of its hardware or software components.
6
Reactive Components (RC)
Reactive components react to external stimuli (i.e., events) and initiate actions.
An RC can be asynchronous or synchronous, and non-deterministic or deterministic.
A reactive component that is both asynchronous and non-deterministic is an ANDRC.
7
The Event Rule Framework (ERF)
An example of a DS framework having ANDRCs is ERF (Event/Rule Framework).
ERF was developed at the Center for Computing Research and Development of the University of Puerto Rico, Mayagüez Campus.
It is an event-rule framework for developing distributed systems.
In ERF, events and rules are used as abstractions for specifying system behavior.
8
Motivation (1/2)
Achieving fault tolerance in ANDRCs is a challenge.
In non-deterministic components, the output can differ even when the same sequence of stimuli is input with the same initial state.
Since the component is asynchronous, timing assumptions are not valid.
Moreover, the behavior of ANDRCs follows Heisenberg's uncertainty principle: observing it can alter it.
9
Motivation (2/2)
Existing fault-tolerance techniques:
- Failure detectors: rely on timing assumptions; require synchronous or semi-synchronous systems.
- State transfer protocols: require deterministic systems; very intrusive.
- Duplicate detection and suppression mechanisms: sequencers.
10
Objectives (1/2)
This research is about the use of active and semi-active replication techniques for achieving fault tolerance in ERF, a framework that uses ANDRCs.
Active replication technique:
- All replicated components accept third-party incoming events.
- A middle-tier component is in charge of event multicasting and of detecting and suppressing duplicated events.
11
Objectives (2/2)
Semi-active replication technique:
- All replicated components accept third-party incoming events.
- Only one ("the leader") is able to post events; backup replicas listen to the leader to make a consistent production of events.
- Each replicated component is in charge of detecting and suppressing duplicated events.
12
2. Related Works
- Generic support of FT in DSs
- FT event-based DSs
13
Generic support of FT in DSs
OMG Fault-Tolerant CORBA Standard
- IRL: FT-CORBA compliant: yes. Design style: non-intrusive. Interoperability: free. Replication logic implementation: centralized with passive replication. Asynchronism support: yes. Non-determinism support: no.
- Eternal: FT-CORBA compliant: yes. Design style: non-intrusive. Interoperability: expensive. Replication logic implementation: group toolkit. Asynchronism support: no. Non-determinism support: no.
- OGS: FT-CORBA compliant: soon. Design style: non-intrusive. Interoperability: free. Replication logic implementation: group toolkit. Asynchronism support: no. Non-determinism support: no.
- SENSEI OGM: FT-CORBA compliant: no. Design style: non-intrusive. Interoperability: free. Replication logic implementation: group toolkit. Asynchronism support: no. Non-determinism support: no.
14
FT Event-Based DSs
Fault-tolerance framework used:
- NODS: RS2.7
- ISEE: not directly supported
- YEAST: Watchd and Libft
15
3. Overview of the Event Rule Framework (ERF)
Model: Event Model, Rule Model, Behavioral Model
Components: Event Channel, RUBIES
Architecture of ERF-CORBA
16
Model: Event Model
ERF provides the event abstraction to represent significant occurrences in a distributed system, e.g., in a flood-alert system.
The base class Event defines the structure and behavior applicable to all types of events.
package erf;

import erf.lang.*;
import java.io.Serializable;

public class Event implements Serializable {
    /* Attributes */
    public String id = "";
    public TimeValue ttl;
    public TimeValue daytime;
    public DistributedObject producer;

    /* Methods */
    public TimeValue t() {...}
    public TimeValue ts() {...}
    public TimeValue ttl() {...}
    public void setttl(long tv) {...}
    public DistributedObject getProducer() {...}
    public void setProducer(DistributedObject producer) {...}
    public void sett(long tv) {...}
    public String pName() {...}
    public boolean isDead() {...}
    public String getTypeName() {...}
}
Figure 3.2 Java definition of the class Event
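As a sketch of how an application-specific event might extend this base class, consider the GageLevelReport type used later in the performance tests. The field names here are assumptions for illustration, and a minimal stand-in replaces the full erf.Event so the example is self-contained:

```java
import java.io.Serializable;

// Minimal stand-in for erf.Event (the real base class also carries
// ttl, daytime, and producer, as shown in Figure 3.2).
class Event implements Serializable {
    public String id = "";
}

// Hypothetical application event for the flood-alert example: a river
// gage reports its current water level. The gageId and level fields
// are assumptions, not taken from the ERF sources.
class GageLevelReport extends Event {
    public final String gageId;
    public final double level;

    GageLevelReport(String gageId, double level) {
        this.gageId = gageId;
        this.level = level;
    }
}
```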
17
Model: Rule Model
In ERF, the behavior of a DS is defined in terms of rules.
A rule is an algorithm that is triggered when events in the event set match the rule's event pattern.
[package <package_specification>]
rule <rule_id>
[priority <priority_number>]
on <trigger_events>
[use <usage_specification>]
[if <condition> then <actions> [else <alternative_actions>]]
[do <unconditional_actions>]
Figure 3.5 Syntax of rule definition language (RDL)
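Following this grammar, a hypothetical rule for the flood-alert example might read as follows. The rule, event, condition, and action names are invented for illustration and do not come from the ERF sources:

```
package floodalert
rule HighWaterAlert
priority 1
on GageLevelReport
if GageLevelReport.level > 3.0
then post(FloodAlert)
```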
18
Model: Behavioral Model
Defines how rules are triggered and evaluated upon the occurrence of events.
The evaluation of rules needs to be performed periodically because RUBIES receives events constantly.
The evaluation of rules is performed based on rule priority.
19
Components (1/2)
Event Channel
- A middleware distributed component.
- It allows sending events to consumers and receiving events from producers.
- Events are treated as objects.
20
Components (2/2)
Rule-Based Intelligent Event Service (RUBIES)
- The main component of ERF: an engine that handles events through the evaluation of rules.
- RUBIES is a distributed component.
- It is registered to the event channel both as a consumer and as a producer.
21
Architecture of ERF-CORBA

[Figure 3.8: Architecture of ERF-CORBA. Class diagram relating RUBIES, RuleCompiler, CORBAEventChannel, EventChannel, StructuredEvent, and the CORBA notification interfaces StructuredPushConsumer, StructuredPushSupplier, StructuredProxyPushSupplier, and StructuredProxyPushConsumer.]
22
4. Overview of the Fault Tolerant CORBA (FT-CORBA)
Fault Tolerant CORBA (FT-CORBA)
- Replication Management
- Fault Management
- Logging and Recovery Management
23
Fault Tolerant CORBA
- Adopted by the OMG in 2000.
- Commitments rather than a single solution; full interoperability among different products.
- It provides support for applications that require high levels of reliability, with minimal modifications.
- This research was designed to be compliant with this standard.
24
Replication Management
Replication management covers a fault-tolerance domain.
It is done through the Replication Manager component, which inherits from the Property Manager, Object Group Manager, and Generic Factory components.

[Figure 4.3: Hierarchy of the Replication Management. ReplicationManager inherits from GenericFactory, PropertyManager, and ObjectGroupManager.]
25
Fault Management
It includes the Fault Notification, Fault Detection, and Fault Analysis services.
The Fault Notifier sends fault reports to its consumers.
The Fault Detectors are connected to replicas or hosts and report faults to the Fault Notifier.
The Fault Analyzer analyzes faults and produces reports for the Fault Notifier.

[Figure 4.8: Architecture of Fault Management. FaultDetector (PullMonitorable), FaultNotifier, and FaultAnalyzer (StructuredPushConsumer, SequencePushConsumer).]
26
Logging and Recovery Management
- Logging mechanism: logs the state of the primary member.
- Recovery mechanism: acts on failures or for new members; recovers the state from the log to the new primary.
- Consistency must be controlled by the infrastructure.
27
5. Design of the Fault Tolerant ERF (FT-ERF)
- Scalability and fault-tolerance problems in ERF-CORBA
- Architecture of scalable and fault-tolerant ERF
- Architecture of fault-tolerant ERF-CORBA
- EID uniqueness
- Event and pattern equality rules
- Pattern management
- Active replication
- Semi-active replication
28
Scalability and Fault Tolerance Problems in ERF-CORBA

[Figure 5.1: Two possible points of scalability and fault-tolerance problems in ERF: (a) the size of the rules database (RULES DB); (b) a crash of RUBIES.]
29
[Figure 5.3: Architecture of scalable and fault-tolerant ERF. An N×M grid of RUBIES replicas (γnm, δn), organized along a distribution dimension (n = 1..N) and a replication dimension (m = 1..M).]
30
[Figure 5.3: Architecture of FT ERF-CORBA. Class diagram relating FTRUBIES (with FTRUBIESServant and FTRUBIESIntOperations), RUBIES, RuleCompiler, CORBAEventChannel, ObjectGroup, the FT-CORBA components (the Updateable ReplicationManager with GenericFactory, PropertyManager, and ObjectGroupManager; FaultNotifier; FaultDetector with PullMonitor, PullMonitorable, and Monitorable), and the request-level support classes ClientRequestInterceptor/ServerRequestInterceptor and ClientIOGRSupport/ServerIOGRSupport.]
31
EID Uniqueness (1/2)
Each event in the system needs to be uniquely identified by an event identifier (EID).
EID uniqueness must be guaranteed in different contexts: local, replication group, and system.
The use of sequencers is one option to achieve EID uniqueness: each replica starts a sequencer. But this is only valid with deterministic components.
32
EID Uniqueness (2/2)
Events can be identified by their history: each event is produced due to an event pattern.
Such a history includes the list of previous events that triggered the event and the function or rule that caused its production.

[Figure 5.5: Conceptual view of the event unique identification.]
33
Event Equality Rule
Two events are equal if:
- Both are of the same type.
- Both were produced due to the same rule.
- Both have the same order of production in the time when the rule was triggered.
- Both have the same pattern.
34
Pattern Equality Rule
Two event patterns are equal if:
- Both have the same number of events.
- Both have events in the same order.
- Two events at the same position are equal if the Event Equality Rule holds, as previously defined.
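The two equality rules can be sketched in Java as a value object over an event's history. The class and field names here are hypothetical, not taken from the FT-ERF sources:

```java
import java.util.List;
import java.util.Objects;

// An event is identified by its history: its type, the rule that
// produced it, its production order within that rule firing, and the
// pattern (ordered list of source events) that triggered the rule.
final class EventHistory {
    final String type;                // event type name
    final String rule;                // rule that produced the event
    final int order;                  // order of production when the rule fired
    final List<EventHistory> pattern; // source events, in order (empty for external events)

    EventHistory(String type, String rule, int order, List<EventHistory> pattern) {
        this.type = type;
        this.rule = rule;
        this.order = order;
        this.pattern = pattern;
    }

    // Event Equality Rule: same type, same rule, same order, same pattern.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof EventHistory)) return false;
        EventHistory e = (EventHistory) o;
        return type.equals(e.type)
            && rule.equals(e.rule)
            && order == e.order
            // Pattern Equality Rule: List.equals compares size, order,
            // and elements pairwise, recursively applying this method.
            && pattern.equals(e.pattern);
    }

    @Override
    public int hashCode() {
        return Objects.hash(type, rule, order, pattern);
    }
}
```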
35
Pattern Management (1/2)
Rules use a pattern management framework to prevent events from being triggered more than once for a given event pattern.
In this framework, patterns are defined in terms of:
- Source events (i.e., events that cause rules to trigger), and
- Target events (i.e., events that are produced by rules).
36
Pattern Management (2/2)
The framework has three main components for pattern management:
- Pattern Manager, to manage patterns of events.
- Pattern, to store patterns of events.
- Indexer, to organize patterns of events.

[Figure 5.6: Architecture of Pattern Management. PatternManager, Indexer, Rule, and Pattern, with SourceEvents and TargetEvents associations from Pattern to Event.]
37
Active Replication (AR)
For systems with tight time constraints.
All replicas are running at the same time: all are accepting events, and all are sending events.
So duplicated events are going around. Therefore, it is crucial:
- To detect and suppress duplicated events.
- To deliver a unique reply.
- To keep consistency.
- To keep fault tolerance transparent.
38
AR: Pattern Naming
For duplicated-event detection and suppression.
A centralized mid-tier component that, through an analysis of an event's history, detects whether the event has already been delivered.
It relies on two primitives:
- Event binding: register an event.
- Pattern solving: resolve whether an equivalent event was already delivered.
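The two primitives can be sketched as a small registry keyed by event history. This is an illustration of the idea, not the FT-ERF implementation; a String key stands in for a history value object with proper equals/hashCode, and the names are invented:

```java
import java.util.HashMap;
import java.util.Map;

// Centralized mid-tier registry for duplicated-event detection:
// maps the history of each delivered event to the EID it was
// delivered under.
final class PatternNaming {
    private final Map<String, String> delivered = new HashMap<>();

    // Event binding: register an event under its history.
    // The first binding for a history wins; later duplicates are ignored.
    synchronized void bind(String history, String eid) {
        delivered.putIfAbsent(history, eid);
    }

    // Pattern solving: resolve whether an equivalent event was already
    // delivered; returns its EID, or null if this event is the first.
    synchronized String solve(String history) {
        return delivered.get(history);
    }
}
```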
39
AR: Pattern Naming

[Figure 5.9: Architecture of the Pattern Naming. FTRUBIES, CORBAEventChannel, EventChannel, EventHandler, PatternContext, and PatternName.]
40
Semi-Active Replication (SAR)
For systems with relatively loose time constraints.
All replicas are running at the same time, but only the primary is able to reply to clients.
When the primary fails, a new member is selected. When a backup member fails, it is released from the group.
Failure detectors are used to detect failures in group members. There is a time delay (in seconds) before the selection of a new primary.
41
SAR: Production Controller
For duplicated-event detection and suppression. It is distributed within each replica. The following algorithm is executed on backup members:

On incoming event P from the primary:
    if queue BQ contains an equivalent event B for the event P then
        update B.id with P.id across the entire system
        remove P
    else
        enqueue P in PQ

On event B produced by the backup:
    if queue PQ contains an equivalent event P for the event B then
        update B.id with P.id across the entire system
        remove P
    else
        enqueue B in BQ

On failure, if the backup replica is elected as the new primary:
    post all events of the queue BQ
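The backup-side queue logic above can be sketched in Java as follows. "Equivalent" means equal under the Event Equality Rule and is injected as a predicate so the sketch stands alone; the system-wide id reconciliation step is reduced to a comment, and the names are illustrative rather than FT-ERF's:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.BiPredicate;

// Per-backup duplicated-event detection and suppression.
final class ProductionController<E> {
    private final Deque<E> pq = new ArrayDeque<>(); // unmatched events from the primary
    private final Deque<E> bq = new ArrayDeque<>(); // unmatched events produced locally
    private final BiPredicate<E, E> equivalent;     // Event Equality Rule

    ProductionController(BiPredicate<E, E> equivalent) {
        this.equivalent = equivalent;
    }

    // On incoming event p from the primary.
    void onPrimaryEvent(E p) {
        // If an equivalent local event b is waiting in BQ, the pair is
        // reconciled (b would adopt p's id here) and p is suppressed.
        if (!bq.removeIf(b -> equivalent.test(b, p))) {
            pq.add(p); // no local duplicate yet: remember the primary's event
        }
    }

    // On event b produced by this backup.
    void onBackupEvent(E b) {
        // If the primary already sent an equivalent event, reconcile and
        // suppress; otherwise hold b back until the primary catches up.
        if (!pq.removeIf(p -> equivalent.test(b, p))) {
            bq.add(b);
        }
    }

    // On primary failure, if this backup is elected as the new primary:
    // the held-back events in BQ are the ones to post.
    Deque<E> promote() {
        return bq;
    }
}
```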
42
SAR: Production Controller

[Figure 5.14: Architecture of the Production Controller. ProductionController within FTRUBIES, related to EventChannel/CORBAEventChannel, the FaultNotifier, the pattern-management classes (PatternManager, Pattern, Rule, and Event, with SourceEvents and TargetEvents associations), and an Updater holding primaryEvent and backupEvent references.]
43
6. Performance Analysis
- Objectives
- Methodology: test scenarios, test procedure
- Test results
44
Objectives
Measure the execution time of fault-tolerant ERF using active and semi-active replication techniques for:
- An increasing number of replicas.
- An increasing number of failures.
- An increasing workload.
Compare the execution time of:
- Active versus semi-active replication techniques.
- Failure-free versus failure execution scenarios.
- Fault-tolerant versus non-fault-tolerant execution.
45
Test Scenarios: Services distribution

[Figure 7.1: UML deployment diagram of the test environment (the domain for all computers is ece.uprm.edu). Hosts 1 through 10 (parcha, melon, toronja, quenepa, pajuil, acerola, guineo, chironja, quayaba, jobo) each run a factory and a fault-tolerant RUBIES replica; china runs the test application; adaselsvr1 runs the CORBA event channel; sorelsvr1 runs the event channel; ece runs the name server, the replication manager, the fault detector, and an HTTP server.]
46
Test Scenarios: Failure schedule: First scenario
Six workstations, 3 to 8 replicas, 193 rules.
Failure schedule defined by the power set F, where n is the number of replicas:
- f(p = n) = ∞
- f(p = 1..n-1) = p·T/n determines the time of the failure
- p is the position of the replica in the subset.
- T is the arithmetic average of the execution time of ten failure-free runs with n replicas.
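The schedule for one subset can be sketched directly from this definition (T is assumed to be in milliseconds, matching the time units used in the result figures):

```java
// Failure schedule f for one subset of the power set: the replica at
// position p = n never fails (infinity); replicas at positions
// p = 1..n-1 fail at time p*T/n.
final class FailureSchedule {
    // p: position of the replica in the subset (1..n)
    // n: number of replicas
    // T: mean failure-free execution time with n replicas
    static double failureTime(int p, int n, double T) {
        return (p == n) ? Double.POSITIVE_INFINITY : p * T / n;
    }
}
```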
47
Test Scenarios: Failure schedule: Second scenario
Ten workstations, ten replicas, 193 rules.
Failure schedule defined by the set G, where n is the number of replicas:
- g(p = n) = ∞
- g(p = 1..n-1) = p·T/n determines the time of the failure
- p is the position of the replica in the subset.
- T is the arithmetic average of the execution time of ten failure-free runs with n replicas.
48
Test Scenarios: Failure schedule: Third scenario
Ten workstations, ten replicas.
Six rule sets of 6, 12, 24, 48, 96, and 193 rules, one set per run.
The failure schedule was given by the function G(n) defined for the second scenario.
49
Test Scenarios: Test application
A client consumer/producer of the event channel:
- It starts by sending two events of the GageLevelReport type to start the test.
- It ends its execution when an event of the TestEventEnd type arrives.
- It measures the execution time, starting just after the second event is posted and ending just after an event of the TestEventEnd type arrives.
50
Methodology: Test Procedure
The procedure consisted of three major steps: clear the environment; launch the infrastructure; and run the test application.
The results are the arithmetic mean of 10 runs of each test case.
The arithmetic mean of the standard deviation was 1.46%.
51
[Figure 7.4: Cost of fault-tolerant ERF execution using the active replication technique with an increasing number of replicas (3 to 8) and number of failures (0 to 7). Axes: time in milliseconds (0 to 140000), number of failures, number of replicas.]
52
[Figure 7.5: Cost of fault-tolerant ERF execution using the semi-active replication technique with an increasing number of replicas (3 to 8) and number of failures (0 to 7). Axes: time in msec (0 to 140000), number of failures, number of replicas.]
53
[Figure 7.6: Impact of (n-1)-failures execution over failure-free execution in the active replication technique. Axes: time in msec (0 to 160000) versus number of replicas n (3 to 10); series: failure-free, (n-1) failures.]
54
[Figure 7.7: Impact of (n-1)-failures execution over failure-free execution in the semi-active replication technique. Axes: time in msec (0 to 160000) versus number of replicas n (3 to 10); series: failure-free, (n-1) failures.]
55
[Figure 7.8: Comparison between the active and semi-active replication techniques in the failure-free scenario. Axes: time in msec (0 to 160000) versus number of replicas n (3 to 10); series: active, semi-active.]
56
[Figure 7.9: Comparison between the active and semi-active replication techniques in the nine-failures scenario. Axes: time in msec (0 to 160000) versus number of replicas n (3 to 10); series: active, semi-active.]
57
[Figure 7.10: Workload impact on the time of the failure-free execution and of the nine-failures execution using the active replication technique. Overhead with respect to the non-fault-tolerant execution and to the pure RUBIES execution. Axes: time in msec (0 to 160000) versus number of events (0 to 3000); series: RUBIES, non fault tolerant, failure-free, nine failures.]
58
[Figure 7.11: Workload impact on the time of the failure-free execution and of the nine-failures execution using the semi-active replication technique. Overhead with respect to the non-fault-tolerant execution and to the pure RUBIES execution. Axes: time in msec (0 to 160000) versus number of events (0 to 3000); series: RUBIES, non fault tolerant, failure-free, nine failures.]
59
7. Conclusions
The performance results of the implementation of the active and semi-active replication techniques show linear time curves for increasing numbers of replicas, failures, and workload.
Therefore, the proposed solutions were proved to be feasible, and their performance results were proved to be acceptable.
Additionally, the active replication technique has better overall performance than the semi-active replication technique.
60
Research Contributions
Active Replication for Asynchronous, Non-deterministic Reactive Components
- Tight time constraints.
- The replication logic is in a centralized component.
- Advantages:
  - It is not significantly affected by an increasing number of replicas, failures, or workload.
  - Neither clients nor replicas need to be aware of the replication mechanism.
  - It can be used for large distributed systems.
- Disadvantage:
  - It relies on a centralized component.
61
Research Contributions
Semi-active Replication for Asynchronous, Non-deterministic Reactive Components
- Loose time constraints.
- The replication logic is distributed.
- Advantage:
  - The replication mechanism is distributed.
- Disadvantage:
  - It relies on failure detectors.
62
Questions and/or Comments?