1
Fault Tolerance in an Event Rule Framework for Distributed Systems
Hillary Caituiro Monge
2
Contents
1. Introduction
2. Related Works
3. Overview of the Event Rule Framework (ERF)
4. Overview of Fault Tolerant CORBA
5. Design of the Fault Tolerant ERF (FT-ERF)
6. Performance Analysis
7. Conclusions
3
1. Introduction
Justification
- Distributed Systems (DS)
- Fault Tolerance (FT)
- Reactive Components (RC)
- The Event Rule Framework (ERF)
Motivation
Objectives
4
Distributed Systems (DS)
A DS is a collection of software components distributed among the processors of heterogeneous platforms (servers, workstations, laptops, mainframes, PDAs).
The purpose of a DS is to share resources and workload, and to maximize availability.
The design goals of DSs are transparency, scalability, reliability, and performance.
5
Fault Tolerance (FT)
FT is the ability of a system to continue operating as expected, despite internal or external failures.
DSs are prone to failures. Some faults can be detected; others cannot.
The FT of a DS can be improved through redundancy, i.e., replication of its hardware or software components.
6
Reactive Components (RC)
Reactive components react to external stimuli (i.e., events) and initiate actions.
An RC can be asynchronous or synchronous, and non-deterministic or deterministic.
A reactive component that is both asynchronous and non-deterministic is an ANDRC.
7
The Event Rule Framework (ERF)
An example of a DS framework having ANDRCs is ERF (Event/Rule Framework).
ERF was developed at the Center for Computing Research and Development of the University of Puerto Rico, Mayagüez Campus.
It is an event-rule framework for developing distributed systems.
In ERF, events and rules are used as abstractions for specifying system behavior.
8
Motivation (1/2)
Achieving fault tolerance in ANDRCs is a challenge.
In non-deterministic components, the output can differ even when the same sequence of stimuli is input with the same initial state.
Since the component is asynchronous, timing assumptions are not valid.
Moreover, the behavior of ANDRCs follows Heisenberg's uncertainty principle: observing it can alter it.
9
Motivation (2/2)
Existing fault-tolerance techniques:
- Failure detectors: rely on timing assumptions; require synchronous or semi-synchronous systems.
- State transfer protocols: require deterministic systems; very intrusive.
- Duplicate detection and suppression mechanisms: sequencers.
10
Objectives (1/2)
This research is about the use of active and semi-active replication techniques for achieving fault tolerance in ERF, a framework that uses ANDRCs.
Active replication technique:
- All replicated components accept third-party incoming events.
- A middle-tier component is in charge of event multicasting and of detecting and suppressing duplicated events.
11
Objectives (2/2)
Semi-active replication technique:
- All replicated components accept third-party incoming events.
- Only one ("the leader") is able to post events; backup replicas listen to the leader to make a consistent production of events.
- Each replicated component is in charge of detecting and suppressing duplicated events.
12
2. Related Works
- Generic support of FT in DSs
- FT event-based DSs
13
Generic support of FT in DSs
OMG Fault-Tolerant CORBA Standard
- IRL: FT-CORBA compliant: yes. Design style: non-intrusive. Interoperability: free. Replication logic implementation: centralized with passive replication. Asynchronism support: yes. Non-determinism support: no.
- Eternal: FT-CORBA compliant: yes. Design style: non-intrusive. Interoperability: expensive. Replication logic implementation: group toolkit. Asynchronism support: no. Non-determinism support: no.
- OGS: FT-CORBA compliant: soon. Design style: non-intrusive. Interoperability: free. Replication logic implementation: group toolkit. Asynchronism support: no. Non-determinism support: no.
- SENSEI OGM: FT-CORBA compliant: no. Design style: non-intrusive. Interoperability: free. Replication logic implementation: group toolkit. Asynchronism support: no. Non-determinism support: no.
14
FT Event-Based DSs
Fault-tolerance framework used:
- NODS: RS2.7
- ISEE: not directly supported
- YEAST: Watchd and Libft
15
3. Overview of the Event Rule Framework (ERF)
Model: Event Model, Rule Model, Behavioral Model
Components: Event Channel, RUBIES
Architecture of ERF-CORBA
16
Model: Event Model
ERF provides the event abstraction to represent significant occurrences in a distributed system, e.g., in a flood-alert system.
The base class Event defines the structure and behavior applicable to all types of events.
package erf;

import erf.lang.*;
import java.io.Serializable;

public class Event implements Serializable {
    /* Attributes */
    public String id = "";
    public TimeValue ttl;
    public TimeValue daytime;
    public DistributedObject producer;

    /* Methods */
    public TimeValue t() {...}
    public TimeValue ts() {...}
    public TimeValue ttl() {...}
    public void setttl(long tv) {...}
    public DistributedObject getProducer() {...}
    public void setProducer(DistributedObject producer) {...}
    public void sett(long tv) {...}
    public String pName() {...}
    public boolean isDead() {...}
    public String getTypeName() {...}
}
Figure 3.2 Java definition of the class Event
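As a sketch of how an application-specific event might extend this base class, consider the GageLevelReport type used later in the performance tests. The field names here are assumptions for illustration, and a minimal stand-in replaces the full erf.Event so the example is self-contained:

```java
import java.io.Serializable;

// Minimal stand-in for erf.Event (the real base class also carries
// ttl, daytime, and producer, as shown in Figure 3.2).
class Event implements Serializable {
    public String id = "";
}

// Hypothetical application event for the flood-alert example: a river
// gage reports its current water level. The gageId and level fields
// are assumptions, not taken from the ERF sources.
class GageLevelReport extends Event {
    public final String gageId;
    public final double level;

    GageLevelReport(String gageId, double level) {
        this.gageId = gageId;
        this.level = level;
    }
}
```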
17
Model: Rule Model
In ERF, the behavior of a DS is defined in terms of rules.
A rule is an algorithm that is triggered when events in the event set match the rule's event pattern.
[package <package_specification>]
rule <rule_id>
[priority <priority_number>]
on <trigger_events>
[use <usage_specification>]
[if <condition> then <actions> [else <alternative_actions>]]
[do <unconditional_actions>]
Figure 3.5 Syntax of rule definition language (RDL)
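Following this grammar, a hypothetical rule for the flood-alert example might read as follows. The rule, event, condition, and action names are invented for illustration and do not come from the ERF sources:

```
package floodalert
rule HighWaterAlert
priority 1
on GageLevelReport
if GageLevelReport.level > 3.0
then post(FloodAlert)
```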
18
Model: Behavioral Model
Defines how rules are triggered and evaluated upon the occurrence of events.
The evaluation of rules needs to be performed periodically because RUBIES receives events constantly.
The evaluation of rules is performed based on rule priority.
19
Components (1/2)
Event Channel
- A middleware distributed component.
- It allows sending events to consumers and receiving events from producers.
- Events are treated as objects.
20
Components (2/2)
Rule-Based Intelligent Event Service (RUBIES)
- The main component of ERF: an engine that handles events through the evaluation of rules.
- RUBIES is a distributed component.
- It is registered to the event channel both as a consumer and as a producer.
21
Architecture of ERF-CORBA

[Figure 3.8: Architecture of ERF-CORBA. Class diagram relating RUBIES, RuleCompiler, CORBAEventChannel, EventChannel, StructuredEvent, and the CORBA notification interfaces StructuredPushConsumer, StructuredPushSupplier, StructuredProxyPushSupplier, and StructuredProxyPushConsumer.]
22
4. Overview of the Fault Tolerant CORBA (FT-CORBA)
Fault Tolerant CORBA (FT-CORBA)
- Replication Management
- Fault Management
- Logging and Recovery Management
23
Fault Tolerant CORBA
- Adopted by the OMG in 2000.
- Commitments rather than a single solution; full interoperability among different products.
- It provides support for applications that require high levels of reliability, with minimal modifications.
- This research was designed to be compliant with this standard.
24
Replication Management
Replication management covers a fault-tolerance domain.
It is done through the Replication Manager component, which inherits from the Property Manager, Object Group Manager, and Generic Factory components.

[Figure 4.3: Hierarchy of the Replication Management. ReplicationManager inherits from GenericFactory, PropertyManager, and ObjectGroupManager.]
25
Fault Management
It includes the Fault Notification, Fault Detection, and Fault Analysis services.
The Fault Notifier sends fault reports to its consumers.
The Fault Detectors are connected to replicas or hosts and report faults to the Fault Notifier.
The Fault Analyzer analyzes faults and produces reports for the Fault Notifier.

[Figure 4.8: Architecture of Fault Management. FaultDetector (PullMonitorable), FaultNotifier, and FaultAnalyzer (StructuredPushConsumer, SequencePushConsumer).]
26
Logging and Recovery Management
- Logging mechanism: logs the state of the primary member.
- Recovery mechanism: acts on failures or for new members; recovers the state from the log to the new primary.
- Consistency must be controlled by the infrastructure.
27
5. Design of the Fault Tolerant ERF (FT-ERF)
- Scalability and fault-tolerance problems in ERF-CORBA
- Architecture of scalable and fault-tolerant ERF
- Architecture of fault-tolerant ERF-CORBA
- EID uniqueness
- Event and pattern equality rules
- Pattern management
- Active replication
- Semi-active replication
28
Scalability and Fault Tolerance Problems in ERF-CORBA

[Figure 5.1: Two possible points of scalability and fault-tolerance problems in ERF: (a) the size of the rules database (RULES DB); (b) a crash of RUBIES.]
29
[Figure 5.3: Architecture of scalable and fault-tolerant ERF. An N×M grid of RUBIES replicas (γnm, δn), organized along a distribution dimension (n = 1..N) and a replication dimension (m = 1..M).]
30
[Figure 5.3: Architecture of FT ERF-CORBA. Class diagram relating FTRUBIES (with FTRUBIESServant and FTRUBIESIntOperations), RUBIES, RuleCompiler, CORBAEventChannel, ObjectGroup, the FT-CORBA components (the Updateable ReplicationManager with GenericFactory, PropertyManager, and ObjectGroupManager; FaultNotifier; FaultDetector with PullMonitor, PullMonitorable, and Monitorable), and the request-level support classes ClientRequestInterceptor/ServerRequestInterceptor and ClientIOGRSupport/ServerIOGRSupport.]
31
EID Uniqueness (1/2)
Each event in the system needs to be uniquely identified by an event identifier (EID).
EID uniqueness must be guaranteed in different contexts: local, replication group, and system.
The use of sequencers is one option to achieve EID uniqueness: each replica starts a sequencer. But this is only valid with deterministic components.
32
EID Uniqueness (2/2)
Events can be identified by their history: each event is produced due to an event pattern.
Such a history includes the list of previous events that triggered the event and the function or rule that caused its production.

[Figure 5.5: Conceptual view of the event unique identification.]
33
Event Equality Rule
Two events are equal if:
- Both are of the same type.
- Both were produced due to the same rule.
- Both have the same order of production in the time when the rule was triggered.
- Both have the same pattern.
34
Pattern Equality Rule
Two event patterns are equal if:
- Both have the same number of events.
- Both have events in the same order.
- Two events at the same position are equal if the Event Equality Rule holds, as previously defined.
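The two equality rules can be sketched in Java as a value object over an event's history. The class and field names here are hypothetical, not taken from the FT-ERF sources:

```java
import java.util.List;
import java.util.Objects;

// An event is identified by its history: its type, the rule that
// produced it, its production order within that rule firing, and the
// pattern (ordered list of source events) that triggered the rule.
final class EventHistory {
    final String type;                // event type name
    final String rule;                // rule that produced the event
    final int order;                  // order of production when the rule fired
    final List<EventHistory> pattern; // source events, in order (empty for external events)

    EventHistory(String type, String rule, int order, List<EventHistory> pattern) {
        this.type = type;
        this.rule = rule;
        this.order = order;
        this.pattern = pattern;
    }

    // Event Equality Rule: same type, same rule, same order, same pattern.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof EventHistory)) return false;
        EventHistory e = (EventHistory) o;
        return type.equals(e.type)
            && rule.equals(e.rule)
            && order == e.order
            // Pattern Equality Rule: List.equals compares size, order,
            // and elements pairwise, recursively applying this method.
            && pattern.equals(e.pattern);
    }

    @Override
    public int hashCode() {
        return Objects.hash(type, rule, order, pattern);
    }
}
```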
35
Pattern Management (1/2)
Rules use a pattern management framework to prevent events from being triggered more than once for a given event pattern.
In this framework, patterns are defined in terms of:
- Source events (i.e., events that cause rules to trigger), and
- Target events (i.e., events that are produced by rules).
36
Pattern Management (2/2)
The framework has three main components for pattern management:
- Pattern Manager, to manage patterns of events.
- Pattern, to store patterns of events.
- Indexer, to organize patterns of events.

[Figure 5.6: Architecture of Pattern Management. PatternManager, Indexer, Rule, and Pattern, with SourceEvents and TargetEvents associations from Pattern to Event.]
37
Active Replication (AR)
For systems with tight time constraints.
All replicas are running at the same time: all are accepting events, and all are sending events.
So duplicated events are going around. Therefore, it is crucial:
- To detect and suppress duplicated events.
- To deliver a unique reply.
- To keep consistency.
- To keep fault tolerance transparent.
38
AR: Pattern Naming
For duplicated-event detection and suppression.
A centralized mid-tier component that, through an analysis of an event's history, detects whether the event has already been delivered.
It relies on two primitives:
- Event binding: register an event.
- Pattern solving: resolve whether an equivalent event was already delivered.
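The two primitives can be sketched as a small registry keyed by event history. This is an illustration of the idea, not the FT-ERF implementation; a String key stands in for a history value object with proper equals/hashCode, and the names are invented:

```java
import java.util.HashMap;
import java.util.Map;

// Centralized mid-tier registry for duplicated-event detection:
// maps the history of each delivered event to the EID it was
// delivered under.
final class PatternNaming {
    private final Map<String, String> delivered = new HashMap<>();

    // Event binding: register an event under its history.
    // The first binding for a history wins; later duplicates are ignored.
    synchronized void bind(String history, String eid) {
        delivered.putIfAbsent(history, eid);
    }

    // Pattern solving: resolve whether an equivalent event was already
    // delivered; returns its EID, or null if this event is the first.
    synchronized String solve(String history) {
        return delivered.get(history);
    }
}
```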
39
AR: Pattern Naming

[Figure 5.9: Architecture of the Pattern Naming. FTRUBIES, CORBAEventChannel, EventChannel, EventHandler, PatternContext, and PatternName.]
40
Semi-Active Replication (SAR)
For systems with relatively loose time constraints.
All replicas are running at the same time, but only the primary is able to reply to clients.
When the primary fails, a new member is selected. When a backup member fails, it is released from the group.
Failure detectors are used to detect failures in group members. There is a time delay (in seconds) before the selection of a new primary.
41
SAR: Production Controller
For duplicated-event detection and suppression. It is distributed within each replica. The following algorithm is executed on backup members:

On incoming event P from the primary:
    if queue BQ contains an equivalent event B for the event P then
        update B.id with P.id across the entire system
        remove P
    else
        enqueue P in PQ

On event B produced by the backup:
    if queue PQ contains an equivalent event P for the event B then
        update B.id with P.id across the entire system
        remove P
    else
        enqueue B in BQ

On failure, if the backup replica is elected as the new primary:
    post all events of the queue BQ
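The backup-side queue logic above can be sketched in Java as follows. "Equivalent" means equal under the Event Equality Rule and is injected as a predicate so the sketch stands alone; the system-wide id reconciliation step is reduced to a comment, and the names are illustrative rather than FT-ERF's:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.BiPredicate;

// Per-backup duplicated-event detection and suppression.
final class ProductionController<E> {
    private final Deque<E> pq = new ArrayDeque<>(); // unmatched events from the primary
    private final Deque<E> bq = new ArrayDeque<>(); // unmatched events produced locally
    private final BiPredicate<E, E> equivalent;     // Event Equality Rule

    ProductionController(BiPredicate<E, E> equivalent) {
        this.equivalent = equivalent;
    }

    // On incoming event p from the primary.
    void onPrimaryEvent(E p) {
        // If an equivalent local event b is waiting in BQ, the pair is
        // reconciled (b would adopt p's id here) and p is suppressed.
        if (!bq.removeIf(b -> equivalent.test(b, p))) {
            pq.add(p); // no local duplicate yet: remember the primary's event
        }
    }

    // On event b produced by this backup.
    void onBackupEvent(E b) {
        // If the primary already sent an equivalent event, reconcile and
        // suppress; otherwise hold b back until the primary catches up.
        if (!pq.removeIf(p -> equivalent.test(b, p))) {
            bq.add(b);
        }
    }

    // On primary failure, if this backup is elected as the new primary:
    // the held-back events in BQ are the ones to post.
    Deque<E> promote() {
        return bq;
    }
}
```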
42
SAR: Production Controller

[Figure 5.14: Architecture of the Production Controller. ProductionController within FTRUBIES, related to EventChannel/CORBAEventChannel, the FaultNotifier, the pattern-management classes (PatternManager, Pattern, Rule, and Event, with SourceEvents and TargetEvents associations), and an Updater holding primaryEvent and backupEvent references.]
43
6. Performance Analysis
- Objectives
- Methodology: test scenarios, test procedure
- Test results
44
Objectives
Measure the execution time of fault-tolerant ERF using active and semi-active replication techniques for:
- An increasing number of replicas.
- An increasing number of failures.
- An increasing workload.
Compare the execution time of:
- Active versus semi-active replication techniques.
- Failure-free versus failure execution scenarios.
- Fault-tolerant versus non-fault-tolerant execution.
45
Test Scenarios: Services distribution

[Figure 7.1: UML deployment diagram of the test environment (the domain for all computers is ece.uprm.edu). Hosts 1 through 10 (parcha, melon, toronja, quenepa, pajuil, acerola, guineo, chironja, quayaba, jobo) each run a factory and a fault-tolerant RUBIES replica; china runs the test application; adaselsvr1 runs the CORBA event channel; sorelsvr1 runs the event channel; ece runs the name server, the replication manager, the fault detector, and an HTTP server.]
46
Test Scenarios: Failure schedule: First scenario
Six workstations, 3 to 8 replicas, 193 rules.
Failure schedule defined by the power set F, where n is the number of replicas:
- f(p = n) = ∞
- f(p = 1..n-1) = p·T/n determines the time of the failure
- p is the position of the replica in the subset.
- T is the arithmetic average of the execution time of ten failure-free runs with n replicas.
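The schedule for one subset can be sketched directly from this definition (T is assumed to be in milliseconds, matching the time units used in the result figures):

```java
// Failure schedule f for one subset of the power set: the replica at
// position p = n never fails (infinity); replicas at positions
// p = 1..n-1 fail at time p*T/n.
final class FailureSchedule {
    // p: position of the replica in the subset (1..n)
    // n: number of replicas
    // T: mean failure-free execution time with n replicas
    static double failureTime(int p, int n, double T) {
        return (p == n) ? Double.POSITIVE_INFINITY : p * T / n;
    }
}
```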
47
Test Scenarios: Failure schedule: Second scenario
Ten workstations, ten replicas, 193 rules.
Failure schedule defined by the set G, where n is the number of replicas:
- g(p = n) = ∞
- g(p = 1..n-1) = p·T/n determines the time of the failure
- p is the position of the replica in the subset.
- T is the arithmetic average of the execution time of ten failure-free runs with n replicas.
48
Test Scenarios: Failure schedule: Third scenario
Ten workstations, ten replicas.
Six rule sets of 6, 12, 24, 48, 96, and 193 rules, one set per run.
The failure schedule was given by the function G(n) defined for the second scenario.
49
Test Scenarios: Test application
A client consumer/producer of the event channel:
- It starts by sending two events of the GageLevelReport type to start the test.
- It ends its execution when an event of the TestEventEnd type arrives.
- It measures the execution time, starting just after the second event is posted and ending just after an event of the TestEventEnd type arrives.
50
Methodology: Test Procedure
The procedure consisted of three major steps: clear the environment; launch the infrastructure; and run the test application.
The results are the arithmetic mean of 10 runs of each test case.
The arithmetic mean of the standard deviation was 1.46%.
51
[Figure 7.4: Cost of fault-tolerant ERF execution using the active replication technique with an increasing number of replicas (3 to 8) and number of failures (0 to 7). Axes: time in milliseconds (0 to 140000), number of failures, number of replicas.]
52
[Figure 7.5: Cost of fault-tolerant ERF execution using the semi-active replication technique with an increasing number of replicas (3 to 8) and number of failures (0 to 7). Axes: time in msec (0 to 140000), number of failures, number of replicas.]
53
[Figure 7.6: Impact of (n-1)-failures execution over failure-free execution in the active replication technique. Axes: time in msec (0 to 160000) versus number of replicas n (3 to 10); series: failure-free, (n-1) failures.]
54
[Figure 7.7: Impact of (n-1)-failures execution over failure-free execution in the semi-active replication technique. Axes: time in msec (0 to 160000) versus number of replicas n (3 to 10); series: failure-free, (n-1) failures.]
55
[Figure 7.8: Comparison between the active and semi-active replication techniques in the failure-free scenario. Axes: time in msec (0 to 160000) versus number of replicas n (3 to 10); series: active, semi-active.]
56
[Figure 7.9: Comparison between the active and semi-active replication techniques in the nine-failures scenario. Axes: time in msec (0 to 160000) versus number of replicas n (3 to 10); series: active, semi-active.]
57
[Figure 7.10: Workload impact on the time of the failure-free execution and of the nine-failures execution using the active replication technique. Overhead with respect to the non-fault-tolerant execution and to the pure RUBIES execution. Axes: time in msec (0 to 160000) versus number of events (0 to 3000); series: RUBIES, non fault tolerant, failure-free, nine failures.]
58
[Figure 7.11: Workload impact on the time of the failure-free execution and of the nine-failures execution using the semi-active replication technique. Overhead with respect to the non-fault-tolerant execution and to the pure RUBIES execution. Axes: time in msec (0 to 160000) versus number of events (0 to 3000); series: RUBIES, non fault tolerant, failure-free, nine failures.]
59
7. Conclusions
The performance results of the implementation of the active and semi-active replication techniques show linear time curves for increasing numbers of replicas, failures, and workload.
Therefore, the proposed solutions were proved to be feasible, and their performance results were proved to be acceptable.
Additionally, the active replication technique has better overall performance than the semi-active replication technique.
60
Research Contributions
Active Replication for Asynchronous, Non-deterministic Reactive Components
- Tight time constraints.
- The replication logic is in a centralized component.
- Advantages:
  - It is not significantly affected by an increasing number of replicas, failures, or workload.
  - Neither clients nor replicas need to be aware of the replication mechanism.
  - It can be used for large distributed systems.
- Disadvantage:
  - It relies on a centralized component.
61
Research Contributions
Semi-active Replication for Asynchronous, Non-deterministic Reactive Components
- Loose time constraints.
- The replication logic is distributed.
- Advantage:
  - The replication mechanism is distributed.
- Disadvantage:
  - It relies on failure detectors.
62
Questions and/or Comments?