fault tolerance in an event rule framework for distributed systems


Post on 11-Feb-2016

Fault Tolerance in an Event Rule Framework for Distributed Systems

Hillary Caituiro Monge

Contents

1. Introduction
2. Related Works
3. Overview of the Event Rule Framework (ERF)
4. Overview of the Fault Tolerant CORBA
5. Design of the Fault Tolerant ERF (FT-ERF)
6. Performance Analysis
7. Conclusions

1. Introduction

- Justification
  - Distributed Systems (DS)
  - Fault Tolerance (FT)
  - Reactive Components (RC)
  - The Event Rule Framework (ERF)
- Motivation
- Objectives

Distributed Systems (DS)

- A DS is a collection of software components distributed among processors of heterogeneous platforms.
- The purposes of a DS are sharing resources and workload, and maximizing availability.
- The design goals of a DS are transparency, scalability, reliability, and performance.

[Figure: example nodes of a distributed system: computer, server, workstation, laptop, mainframe, PDA.]

Fault Tolerance (FT)

- FT is the ability of a system to continue operating as expected despite internal or external failures.
- DSs are prone to failures.
  - Some faults can be detected.
  - Others cannot be detected.
- The FT of a DS can be improved through redundancy, i.e., replication of its hardware or software components.

Reactive Components (RC)

- Reactive components
  - React to external stimuli (i.e., events)
  - Initiate actions
- An RC can be
  - Asynchronous or synchronous
  - Non-deterministic or deterministic
- A reactive component can be both asynchronous and non-deterministic (ANDRC).

The Event Rule Framework (ERF)

- An example of a DS framework that has ANDRCs is ERF (Event/Rule Framework).
- ERF
  - Was developed at the Center for Computing Research and Development of the University of Puerto Rico – Mayagüez Campus.
  - Is an event-rule framework for developing distributed systems.
- In ERF, events and rules are used as abstractions for specifying system behavior.

Motivation (1/2)

- Achieving fault tolerance in ANDRCs is a challenge.
- In non-deterministic components:
  - The output can differ even if the same sequence of stimuli is input with the same initial state.
- Since the component is asynchronous:
  - Timing assumptions are not valid.
- Moreover, the behavior of ANDRCs fulfills Heisenberg’s uncertainty principle:

Motivation (2/2)

- Existing fault-tolerance techniques
  - Failure detectors
    - Timing assumptions
    - Synchronous or semi-synchronous systems
  - State transfer protocols
    - Deterministic systems
    - Very intrusive
  - Duplicate detection and suppression mechanisms
    - Sequencers

Objectives (1/2)

- This research is about the use of active and semi-active replication techniques for achieving fault tolerance in ERF, a framework that uses ANDRCs.
- Active replication technique
  - All replicated components accept third-party incoming events.
  - A middle-tier component is in charge of
    - Event multicasting
    - Detecting and suppressing duplicated events

Objectives (2/2)

- Semi-active replication technique
  - All replicated components accept third-party incoming events.
  - Only one (“the leader”) is able to post events.
  - Backup replicas listen to the leader in order to produce events consistently.
  - Each replicated component is in charge of detecting and suppressing duplicated events.

2. Related Works

- Generic support of FT in DSs
- FT Event-Based DSs

Generic support of FT in DSs

OMG Fault-Tolerant CORBA Standard

                                  IRL                                   Eternal        OGS            SENSEI OGM
FT-CORBA compliant                Yes                                   Yes            Soon           No
Design style                      Non-intrusive                         Non-intrusive  Non-intrusive  Non-intrusive
Interoperability                  Free                                  Expensive      Free           Free
Replication logic implementation  Centralized with passive replication  Group toolkit  Group toolkit  Group toolkit
Asynchronism support              Yes                                   No             No             No
Non-determinism support           No                                    No             No             No

FT Event-Based DSs

                 NODS             ISEE                    YEAST
Fault tolerance  Framework RS2.7  Not directly supported  Watchd and Libft

3. Overview of the Event Rule Framework (ERF)

- Model
  - Event Model
  - Rule Model
  - Behavioral Model
- Components
  - Event Channel
  - RUBIES
- Architecture of ERF-CORBA

Model

- Event Model
  - ERF provides the event abstraction to represent significant occurrences in a distributed system, e.g., in a flood alert system.
  - The base class Event defines the structure and behavior applicable to all types of events.

package erf;

import erf.lang.*;
import java.io.Serializable;

public class Event implements Serializable {
    /* Attributes */
    public String id = "";
    public TimeValue ttl;
    public TimeValue daytime;
    public DistributedObject producer;

    /* Methods */
    public TimeValue t() {...}
    public TimeValue ts() {...}
    public TimeValue ttl() {...}
    public void setttl(long tv) {...}
    public DistributedObject getProducer() {...}
    public void setProducer(DistributedObject producer) {...}
    public void sett(long tv) {...}
    public String pName() {...}
    public boolean isDead() {...}
    public String getTypeName() {...}
}

Figure 3.2 Java definition of the class Event

Model

- Rule Model
  - In ERF, the behavior of a DS is defined in terms of rules.
  - A rule is an algorithm that is triggered when events in the event set match the rule’s event pattern.

[package <package_specification>]
rule <rule_id>
[priority <priority_number>]
on <trigger_events>
[use <usage_specification>]
[if <condition> then <actions> [else <alternative_actions>]]
[do <unconditional_actions>]

Figure 3.5 Syntax of rule definition language (RDL)

Model

- Behavioral Model
  - Defines how rules are triggered and evaluated upon the occurrence of events.
  - Rules need to be evaluated periodically because RUBIES receives events constantly.
  - Rules are evaluated according to their priority.
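The priority-based evaluation can be pictured with a small Java sketch. It is not the RUBIES implementation: the class names PriorityEvaluationSketch and Rule, the rule ids, and the convention that a larger priority number is evaluated first are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: orders the rules triggered in one evaluation cycle by priority.
// Names and the "larger number is evaluated first" convention are assumptions.
final class PriorityEvaluationSketch {

    static final class Rule {
        final String id;
        final int priority;

        Rule(String id, int priority) {
            this.id = id;
            this.priority = priority;
        }
    }

    // Returns the rule ids in the order they would be evaluated.
    static List<String> evaluationOrder(List<Rule> triggered) {
        List<Rule> sorted = new ArrayList<>(triggered);
        // List.sort is stable, so rules with equal priority keep their arrival order.
        sorted.sort(Comparator.comparingInt((Rule r) -> r.priority).reversed());
        List<String> order = new ArrayList<>();
        for (Rule r : sorted) {
            order.add(r.id);
        }
        return order;
    }

    public static void main(String[] args) {
        List<Rule> triggered = new ArrayList<>();
        triggered.add(new Rule("logReading", 1));
        triggered.add(new Rule("raiseFloodAlert", 10));
        triggered.add(new Rule("notifyOperator", 5));
        System.out.println(evaluationOrder(triggered));
    }
}
```

A periodic evaluation loop would call evaluationOrder once per cycle on the rules whose patterns matched since the previous cycle.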

Components (1/2)

- Event Channel
  - Is a distributed middleware component.
  - Allows sending events to consumers.
  - Allows receiving events from producers.
  - Treats events as objects.

Components (2/2)

- Rule Based Intelligent Event Service (RUBIES)
  - Is the main component of ERF.
  - Is an engine that handles events through the evaluation of rules.
  - Is a distributed component.
  - Is registered with the event channel both as a consumer and as a producer.

Architecture of ERF-CORBA

[Figure: UML class diagram relating RUBIES, RuleCompiler, EventChannel, CORBAEventChannel, StructuredEvent, StructuredPushConsumer, StructuredPushSupplier, StructuredProxyPushSupplier, and StructuredProxyPushConsumer.]

Figure 3.8 Architecture of ERF-CORBA

4. Overview of the Fault Tolerant CORBA (FT-CORBA)

- Fault Tolerant CORBA (FT-CORBA)
- Replication Management
- Fault Management
- Logging and Recovery Management

Fault Tolerant CORBA

- Adopted by the OMG in 2000.
- A set of commitments rather than a solution.
- Full interoperability among different products.
- It provides support for applications that require high levels of reliability with minimal modifications.
- This research was designed to be compliant with this standard.

Replication Management

- Replication management covers a Fault Tolerant Domain.
- It is done through the Replication Manager component, which inherits from the Property Manager, Object Group Manager, and Generic Factory components.

[Figure: ReplicationManager inherits from PropertyManager, ObjectGroupManager, and GenericFactory.]

Figure 4.3 Hierarchy of the Replication Management

Fault Management

- It includes the Fault Notification, Fault Detection, and Fault Analysis services.
- The Fault Notifier sends fault reports to its consumers.
- The Fault Detectors are connected to replicas or hosts and report “faults” to the Fault Notifier.
- The Fault Analyzer analyzes faults and produces reports for the Fault Notifier.

[Figure: UML diagram relating FaultDetector, FaultNotifier, FaultAnalyzer, PullMonitorable, StructuredPushConsumer, and SequencePushConsumer.]

Figure 4.8 Architecture of Fault Management

Logging and Recovery Management

- Logging mechanism
  - Logs the state of the primary member.
- Recovery mechanism
  - Acts on failures or for new members.
  - Recovers the state from the log to the new primary.
- Consistency must be controlled by the infrastructure.

5. Design of the Fault Tolerant ERF (FT-ERF)

- Scalability and Fault Tolerance Problems in ERF-CORBA
- Architecture of Scalable and Fault Tolerant ERF
- Architecture of Fault-Tolerant ERF-CORBA
- EID Uniqueness
- Event and Pattern Equality Rules
- Pattern Management
- Active Replication
- Semi-Active Replication

Scalability and Fault Tolerance Problems in ERF-CORBA

[Figure: the RULES DB, marked (a), and RUBIES, marked (b).]

Figure 5.1 Two possible points of scalability and fault-tolerance problems in ERF: (a) the size of the rules database; (b) a crash of RUBIES.

[Figure: grid of RUBIES instances labeled (γ11, δ1), (γ12, δ1), ..., (γ1M, δ1), (γ21, δ2), ..., (γ2M, δ2), ..., (γN1, δN), ..., (γNM, δN), arranged along a distribution dimension and a replication dimension.]

Figure 5.3 Architecture of Scalable and fault-tolerant ERF

[Figure: UML class diagram relating FTRUBIES, RUBIES, FTRUBIESServant, FTRUBIESIntOperations, RuleCompiler, CORBAEventChannel, ObjectGroup, UpdateableReplicationManager, GenericFactory, PropertyManager, ObjectGroupManager, FaultNotifier, FaultDetector, PullMonitor, PullMonitorable, Monitorable, ClientRequestInterceptor, ServerRequestInterceptor, ClientIOGRSupport, and ServerIOGRSupport.]

Figure 5.3 Architecture of FT ERF-CORBA

EID Uniqueness (1/2)

- Each event in the system needs to be uniquely identified by an event identifier (EID).
- EID uniqueness must be guaranteed in different contexts:
  - Local, replication group, system.
- The use of sequencers is an option to achieve EID uniqueness.
  - Each replica starts a sequencer.
  - However, this is only valid with deterministic components.

EID Uniqueness (2/2)

- Events can be identified by their history.
  - Each event is produced due to an event pattern.
- Such history includes
  - The list of previous events that triggered the event.
  - The function or rule that caused its production.

Figure 5.5 Conceptual View of the Event Unique Identification
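As an illustration of identifying an event by its history rather than by a per-replica sequencer, the following Java sketch derives an identifier from the event type, the producing rule, the order of production, and the EIDs of the source events. The class name HistoryEidSketch, the string layout of the identifier, and the example names FloodAlert and raiseFloodAlert are assumptions, not part of ERF.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: an EID derived from an event's history.
// The identifier layout is an assumption made for illustration only.
final class HistoryEidSketch {

    // Combines type, producing rule, production order, and the EIDs of the source events.
    static String eid(String type, String rule, int order, List<String> sourceEids) {
        return type + "@" + rule + "#" + order + "[" + String.join(",", sourceEids) + "]";
    }

    public static void main(String[] args) {
        List<String> sources = Arrays.asList("gage-1", "gage-2");
        // Two replicas that fire the same rule on the same source events derive the same EID,
        // without sharing a counter.
        String onReplicaA = eid("FloodAlert", "raiseFloodAlert", 0, sources);
        String onReplicaB = eid("FloodAlert", "raiseFloodAlert", 0, sources);
        System.out.println(onReplicaA.equals(onReplicaB)); // prints "true"
    }
}
```

Events that arrive from third parties have no history inside the system, so a scheme like this still needs their producers to supply an identifier; the sketch simply takes those identifiers as input.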

EVENT EQUALITY RULE

- Two events are equal if:
  - Both are of the same Type.
  - Both were produced due to the same Rule.
  - Both have the same Order of production at the time when the Rule was triggered.
  - Both have the same Pattern.

PATTERN EQUALITY RULE

- Two event patterns are equal if:
  - Both have the same number of events.
  - Both have events in the same order.
- Two events in the same position are equal if they satisfy the Event Equality Rule as previously defined.
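The two equality rules are mutually recursive, which a short Java sketch makes explicit. Everything here is illustrative: the class EqualitySketch, the nested Ev class, and its fields are hypothetical names. The slides do not say how events that no rule produced (external events) are compared, so the sketch assumes they carry a producer-assigned identifier and uses it as the base case of the recursion.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the Event and Pattern Equality Rules; all names are illustrative.
final class EqualitySketch {

    static final class Ev {
        final String type;       // event type name
        final String rule;       // producing rule, or null for an external event
        final int order;         // order of production within one firing of the rule
        final List<Ev> pattern;  // source events that triggered the rule
        final String externalId; // assumed producer-assigned id, used only for external events

        private Ev(String type, String rule, int order, List<Ev> pattern, String externalId) {
            this.type = type;
            this.rule = rule;
            this.order = order;
            this.pattern = pattern;
            this.externalId = externalId;
        }

        static Ev external(String type, String externalId) {
            return new Ev(type, null, 0, Collections.<Ev>emptyList(), externalId);
        }

        static Ev produced(String type, String rule, int order, Ev... pattern) {
            return new Ev(type, rule, order, Arrays.asList(pattern), null);
        }
    }

    // Event Equality Rule: same type, same rule, same order of production, same pattern.
    static boolean eventsEqual(Ev a, Ev b) {
        if (a.rule == null || b.rule == null) {
            // Assumed base case (not stated in the slides): compare external events by type and id.
            return a.rule == null && b.rule == null
                    && a.type.equals(b.type) && a.externalId.equals(b.externalId);
        }
        return a.type.equals(b.type)
                && a.rule.equals(b.rule)
                && a.order == b.order
                && patternsEqual(a.pattern, b.pattern);
    }

    // Pattern Equality Rule: same number of events, pairwise equal in the same order.
    static boolean patternsEqual(List<Ev> p, List<Ev> q) {
        if (p.size() != q.size()) {
            return false;
        }
        for (int i = 0; i < p.size(); i++) {
            if (!eventsEqual(p.get(i), q.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Ev g1 = Ev.external("GageLevelReport", "g-1");
        Ev g2 = Ev.external("GageLevelReport", "g-2");
        Ev onReplicaA = Ev.produced("FloodAlert", "raiseFloodAlert", 0, g1, g2);
        Ev onReplicaB = Ev.produced("FloodAlert", "raiseFloodAlert", 0,
                Ev.external("GageLevelReport", "g-1"), Ev.external("GageLevelReport", "g-2"));
        Ev swapped = Ev.produced("FloodAlert", "raiseFloodAlert", 0, g2, g1);
        System.out.println(eventsEqual(onReplicaA, onReplicaB)); // prints "true"
        System.out.println(eventsEqual(onReplicaA, swapped));    // prints "false"
    }
}
```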

Pattern Management (1/2)

- Rules use a pattern management framework
  - To prevent events from being triggered more than once for a given event pattern.
- In this framework, patterns are defined in terms of:
  - Source events (i.e., events that cause rules to trigger), and
  - Target events (i.e., events that are produced by rules).

Pattern Management (2/2)

- The framework has three main components for pattern management:
  - Pattern Manager, to manage patterns of events.
  - Pattern, to store patterns of events.
  - Indexer, to organize patterns of events.

[Figure: UML class diagram relating Rule, PatternManager, Pattern, Indexer, SourceEvents, TargetEvents, and Event.]

Figure 5.6 Architecture of Pattern Management

Active Replication (AR)

- For systems with tight time constraints.
- All replicas are running at the same time.
  - All are accepting events.
  - All are sending events.
- So, duplicated events are going around.
- Therefore, it is crucial
  - To detect and suppress duplicated events.
  - To deliver a unique reply.
  - To keep consistency.
  - To keep fault tolerance transparent.

AR: Pattern Naming

- For duplicated-event detection and suppression.
- Is a centralized mid-tier component that
  - Through an analysis of an event’s history,
  - Detects whether the event has already been delivered.
- It relies on two primitives:
  - Event binding: registers an event.
  - Pattern solving: resolves whether an equivalent event was already delivered.
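A minimal Java sketch of the two primitives follows. It is not the Pattern Naming component itself: the class PatternNamingSketch, the method names, and the use of a string key standing in for an event's history are assumptions chosen to keep the example short.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a centralized duplicate filter with the two primitives named above.
// A string key stands in for the analysis of an event's history (assumption).
final class PatternNamingSketch {

    private final Set<String> bound = new HashSet<>();

    // "Event binding": register the history key of a delivered event.
    synchronized void bind(String historyKey) {
        bound.add(historyKey);
    }

    // "Pattern solving": was an equivalent event already delivered?
    synchronized boolean resolve(String historyKey) {
        return bound.contains(historyKey);
    }

    // Returns true only for the first copy of an event, so later copies can be suppressed.
    synchronized boolean shouldDeliver(String historyKey) {
        if (resolve(historyKey)) {
            return false;
        }
        bind(historyKey);
        return true;
    }

    public static void main(String[] args) {
        PatternNamingSketch naming = new PatternNamingSketch();
        String key = "FloodAlert@raiseFloodAlert#0[gage-1,gage-2]";
        // Three replicas post the same event; only the first copy is delivered.
        for (int replica = 1; replica <= 3; replica++) {
            System.out.println("replica " + replica + " delivered: " + naming.shouldDeliver(key));
        }
    }
}
```

Keeping resolve and bind inside one synchronized method matters in this sketch: if two replicas posted the same event at once and the check and the registration were separate calls, both copies could pass the check.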

AR: Pattern Naming

[Figure: UML class diagram relating FTRUBIES, EventChannel, CORBAEventChannel, EventHandler, PatternContext, and PatternName.]

Figure 5.9 Architecture of the Pattern Naming

Semi-Active Replication (SAR)

- For systems with relatively loose time constraints.
- All replicas are running at the same time.
- Only the primary is able to reply to clients.
  - When the primary fails, a new primary is selected.
  - When a backup member fails, it is released from the group.
- Failure detectors are used to detect failures in group members.
  - There is a time delay (in seconds) before the selection of a new primary.

SAR: Production Controller

- For duplicated-event detection and suppression.
- It is distributed within each replica.
- The following algorithm is executed on backup members:

On incoming event P from the primary
    If queue BQ holds an equivalent event B for the event P then
        Update B.id with P.id across the entire system
        Remove P
    Else
        Enqueue P in PQ

On produced event B from the backup
    If queue PQ holds an equivalent event P for the event B then
        Update B.id with P.id across the entire system
        Remove P
    Else
        Enqueue B in BQ

On failure, if the backup replica is elected as the new primary
    Post all events of the queue BQ
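The backup-side algorithm can be sketched in Java as below. This is one reading of the slide, not the FT-ERF code. Assumptions: the class ProductionControllerSketch and its method names are hypothetical; a string historyKey stands in for the Event Equality Rule; the id update is applied only to the local object rather than "across the entire system"; and, when a primary event matches a queued backup event, the sketch also takes the backup event out of BQ so that it is not posted again after a failover. The slide itself only says "Remove P" at that step.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch of the production controller run on a backup replica.
final class ProductionControllerSketch {

    static final class Ev {
        String id;               // EID; rewritten to the primary's id once a match is found
        final String historyKey; // stands in for the Event Equality Rule (assumption)

        Ev(String id, String historyKey) {
            this.id = id;
            this.historyKey = historyKey;
        }
    }

    private final LinkedList<Ev> pq = new LinkedList<>(); // primary events with no local match yet
    private final LinkedList<Ev> bq = new LinkedList<>(); // backup events not yet seen from the primary

    // Removes and returns the first queued event equivalent to e, or null if there is none.
    private static Ev takeEquivalent(LinkedList<Ev> queue, Ev e) {
        for (Iterator<Ev> it = queue.iterator(); it.hasNext();) {
            Ev candidate = it.next();
            if (candidate.historyKey.equals(e.historyKey)) {
                it.remove();
                return candidate;
            }
        }
        return null;
    }

    // On incoming event P from the primary.
    void onPrimaryEvent(Ev p) {
        Ev b = takeEquivalent(bq, p); // assumption: the matched B leaves BQ
        if (b != null) {
            b.id = p.id;              // P is dropped; it was never queued
        } else {
            pq.add(p);
        }
    }

    // On produced event B from the backup.
    void onBackupEvent(Ev b) {
        Ev p = takeEquivalent(pq, b); // "Remove P"
        if (p != null) {
            b.id = p.id;
        } else {
            bq.add(b);
        }
    }

    // On failure, when this backup is elected as the new primary: the events to post.
    List<Ev> onElectedPrimary() {
        List<Ev> toPost = new ArrayList<>(bq);
        bq.clear();
        return toPost;
    }
}
```

With this reading, BQ holds exactly the events the backup has produced but the old primary never posted, which is what the new primary has to post after a failover.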

SAR: Production Controller

[Figure: UML class diagram relating FTRUBIES, ProductionController, Updater, FaultNotifier, EventChannel, CORBAEventChannel, Rule, PatternManager, Pattern, SourceEvents, TargetEvents, and Event (roles primaryEvent and backupEvent).]

Figure 5.14 Architecture of the Production Controller

6. Performance Analysis

- Objectives
- Methodology
  - Test Scenarios
  - Test Procedure
- Test Results

Objectives

- Measure the execution time of fault-tolerant ERF using active and semi-active replication techniques for:
  - An increasing number of replicas.
  - An increasing number of failures.
  - An increasing workload.
- Compare the execution time of:
  - Active versus semi-active replication techniques.
  - Failure-free versus failure execution scenarios.
  - Fault-tolerant versus non-fault-tolerant execution.

Test Scenarios: Services distribution

[Figure: UML deployment diagram. The test application runs on host china. Event channels run on adaselsvr1 (CORBA event channel) and sorelsvr1. Ten workstations (1. parcha, 2. melon, 3. toronja, 4. quenepa, 5. pajuil, 6. acerola, 7. guineo, 8. chironja, 9. quayaba, 10. jobo) each host a factory and a fault-tolerant RUBIES replica. Host ece runs the name server, replication manager, fault detector, and HTTP server.]

Figure 7.1 UML deployment diagram of the test environment. (The domain for all computers is ece.uprm.edu)

Test Scenarios: Failure schedule: First scenario

- Six workstations, 3 to 8 replicas.
- 193 rules.
- Failure schedule defined by power set F, where
  - n is the number of replicas.
  - f(p) determines the time of the failure: f(p) = p*T/n for p = 1, ..., n-1, and f(n) = ∞.
  - p is the position of the replica in the subset.
  - T is the arithmetic average of the execution time of ten failure-free runs with n replicas.
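The schedule formula spreads the failures evenly over the expected run time and never fails the last replica. A small Java sketch of the formula, assuming times in milliseconds; the class name FailureScheduleSketch and the value T = 8000 ms are made up for the example and are not measurements from this work.

```java
// Hypothetical sketch of the failure-time formula f(p) = p*T/n, with f(n) = infinity.
final class FailureScheduleSketch {

    // Time at which the replica in position p (1-based) fails; the replica in position n never fails.
    static double failureTime(int p, int n, double meanFailureFreeMillis) {
        if (p < 1 || p > n) {
            throw new IllegalArgumentException("p must be in 1..n");
        }
        return p == n ? Double.POSITIVE_INFINITY : p * meanFailureFreeMillis / n;
    }

    public static void main(String[] args) {
        int n = 4;         // number of replicas (example value)
        double t = 8000.0; // assumed mean failure-free execution time in ms (example value)
        for (int p = 1; p <= n; p++) {
            System.out.println("replica in position " + p + " fails at " + failureTime(p, n, t) + " ms");
        }
    }
}
```

With n = 4 and T = 8000 ms, the first three replicas fail at one quarter, one half, and three quarters of the mean failure-free run time, and the fourth survives.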

Test Scenarios: Failure schedule: Second scenario

- Ten workstations, ten replicas.
- 193 rules.
- Failure schedule defined by set G, where
  - n is the number of replicas.
  - g(p) determines the time of the failure: g(p) = p*T/n for p = 1, ..., n-1, and g(n) = ∞.
  - p is the position of the replica in the subset.
  - T is the arithmetic average of the execution time of ten failure-free runs with n replicas.

Test Scenarios: Failure schedule: Third scenario

- Ten workstations, ten replicas.
- Six rule sets of 6, 12, 24, 48, 96, and 193 rules, used one at a time.
- The failure schedule was given by the function G(n) defined for the second scenario.

Test Scenarios: Test application

- A client that is both a consumer and a producer of the event channel.
  - It starts the test by sending two events of the GageLevelReport type.
  - It ends its execution when an event of the TestEventEnd type arrives.
- It measures the execution time
  - Starting just after the second event is posted, and
  - Ending just after an event of the TestEventEnd type arrives.

Methodology: Test Procedure

- The procedure consisted of three major steps:
  - Clear the environment;
  - Launch the infrastructure; and
  - Run the test application.
- The results are
  - The arithmetic mean of 10 runs for each test case.
- The arithmetic mean of the standard deviation was 1.46%.
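For reference, the per-test-case statistics can be computed as in the Java sketch below. The slide does not say which standard-deviation estimator was used or how the percentage was formed; the sketch assumes the sample standard deviation (divisor n-1) expressed as a percentage of the mean. The class name RunStatisticsSketch and the run times in main are made-up example values, not measurements from this work.

```java
// Hypothetical sketch: mean and relative standard deviation of repeated runs.
// Assumes the sample standard deviation (divisor n-1), reported as a percentage of the mean.
final class RunStatisticsSketch {

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    static double relativeStdDevPercent(double[] xs) {
        return 100.0 * sampleStdDev(xs) / mean(xs);
    }

    public static void main(String[] args) {
        // Ten made-up execution times in milliseconds for one test case.
        double[] runs = {20110, 19870, 20340, 20050, 19920, 20230, 19790, 20160, 20010, 19950};
        System.out.println("mean (ms): " + mean(runs));
        System.out.println("relative standard deviation (%): " + relativeStdDevPercent(runs));
    }
}
```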

[Figure: chart of time (milliseconds, 0 to 140000) against number of failures (0 to 7) and number of replicas (3 to 8). Unlabeled value printed with the chart: 1722-608.]

Figure 7.4 Cost of fault tolerant ERF execution using active replication technique with increasing number of replicas and number of failures.

[Figure: chart of time (msec, 0 to 140000) against number of failures (0 to 7) and number of replicas (3 to 8). Unlabeled value printed with the chart: 9563051.]

Figure 7.5 Cost of fault tolerant ERF execution using semi-active replication technique with increasing number of replicas and number of failures.

Figure 7.6 Impact of (n-1) failures execution over failure-free execution in active replication technique.

[Figure: chart of time (msec, 0 to 160000) against number of replicas (n), 3 to 10; series: failure-free, (n-1) failures. Values printed with the legend: failure-free 1406, (n-1) failures 1176.]

Figure 7.7 Impact of (n-1) failures execution over failure-free execution in semi-active replication technique.

[Figure: chart of time (msec, 0 to 160000) against number of replicas (n), 3 to 10; series: failure-free, (n-1) failures. Values printed with the legend: failure-free 1377, (n-1) failures 5341.]

[Figure: chart of time (msec, 0 to 160000) against number of replicas (n), 3 to 10; series: active, semi-active. Values printed with the legend: active 1406, semi-active 1377.]

Figure 7.8 Comparison between active and semi-active replication techniques on failure-free scenario.

[Figure: chart of time (msec, 0 to 160000) against number of replicas (n), 3 to 10; series: active, semi-active. Values printed with the legend: active 1176, semi-active 5341.]

Figure 7.9 Comparison between active and semi-active replication techniques on nine-failures scenario.

Figure 7.10 Workload impact on the time of the failure-free execution and of the nine-failures execution using the active replication technique.

Overhead with respect to the non-fault-tolerant execution and to the pure RUBIES execution.

[Figure: chart of time (msec, 0 to 160000) against number of events (0 to 3000); series: RUBIES, non-fault-tolerant, failure-free, nine failures.]

Figure 7.11 Workload impact on the time of the failure-free execution and the nine-failures execution using the semi-active replication technique.

Overhead with respect to the non-fault-tolerant execution and to the pure RUBIES execution.

[Figure: chart of time (msec, 0 to 160000) against number of events (0 to 3000); series: RUBIES, non-fault-tolerant, failure-free, nine failures.]

7. Conclusions

- The performance results of the implementation of the active and semi-active replication techniques
  - Both show linear time curves for an increasing number of replicas, failures, and workload.
- Therefore,
  - The proposed solutions were shown to be feasible, and
  - Their performance results were shown to be acceptable.
- Additionally,
  - The active replication technique has better overall performance than the semi-active replication technique.

Research Contributions

- Active Replication for Asynchronous, Non-deterministic Reactive Components
  - Tight time constraints.
  - The replication logic is in a centralized component.
  - Advantages
    - It is not significantly affected by an increasing number of replicas, failures, or workload.
    - Neither clients nor replicas need to be aware of the replication mechanism.
    - It can be used for large distributed systems.
  - Disadvantage
    - Relies on a centralized component.

Research Contributions

- Semi-active Replication for Asynchronous, Non-deterministic Reactive Components
  - Loose time constraints.
  - The replication logic is distributed.
  - Advantages
    - The replication mechanism is distributed.
  - Disadvantages
    - Relies on failure detectors.

Questions and/or Comments?