UPV / EHU
Distributed Algorithms for Failure Detection in Crash Environments
R. Cortiñas, A. Lafuente, M. Larrea
Distributed Systems Group, University of the Basque Country UPV/EHU
Master SIA – Sistemas Distribuidos
Guest Stars: P, S and Omega
• P (eventually perfect, ◇P in [CT96]): strong completeness, eventual strong accuracy
– Strong completeness: eventually every process that crashes is permanently suspected by every correct process
– Eventual strong accuracy: there is a time after which correct processes are not suspected by any correct process
• S (eventually strong, ◇S in [CT96]): strong completeness, eventual weak accuracy
– Eventual weak accuracy: there is a time after which some correct process is never suspected by any correct process
• Omega: eventual leader election
– There is a time after which all the correct processes always trust the same correct process
The First P Algorithm [CT96]
Communication Optimality
A ring arrangement of processes
[Figure: processes p1 to p6 arranged in a ring]
Communication-efficient algorithms: n links are used forever
[Figure: processes p1 to p6]
Communication-optimal algorithms: C links are used forever
[Figure: processes p1 to p6]
Communication-optimal P
Communication-optimal Omega
• We also propose an optimal implementation of S, the weakest failure detector for solving Consensus:
– processes ordered: p1, ..., pn
– heartbeat strategy
– communication pattern: one-to-successors
– based on a trusted process (instead of a list of suspected processes)
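The one-to-successors pattern can be illustrated with a minimal sketch. It assumes processes are identified by their 1-based position in p1, ..., pn and that a process sends heartbeats only while it trusts itself; the class and method names are hypothetical, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Process:
    index: int        # position in the order p1, ..., pn (1-based)
    n: int            # total number of processes
    trusted: int = 1  # index of the currently trusted process (assumed to start at p1)

    def successors(self):
        # Processes that come after this one in the order p1, ..., pn.
        return list(range(self.index + 1, self.n + 1))

    def heartbeat_targets(self):
        # One-to-successors: a process sends heartbeats only while it
        # trusts itself, and only to the processes that follow it.
        return self.successors() if self.trusted == self.index else []
```

Keeping a single trusted index per process, rather than a list of suspects, is what lets a non-leader stay silent: only the process that considers itself the leader generates traffic.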
i) Initially, p1 starts sending messages periodically to the rest of the processes, and all processes trust p1
Processes: p1, p2, p3, p4, p5
trusted1 = p1, trusted2 = p1, trusted3 = p1, trusted4 = p1, trusted5 = p1
ii) If a process does not receive a message from its trusted process pi within some timeout period, it suspects pi and takes the next process, pi+1, as its new trusted process
Processes: p1, p2, p3, p4, p5
trusted1 = p1
trusted2 = p1
trusted3 = p1
trusted4 = p2 (timeout on p1)
trusted5 = p1
iii) If a process trusts itself, then it starts sending messages periodically to its successors
Processes: p1, p2, p3, p4, p5
trusted1 = p1
trusted2 = p2 (timeout on p1)
trusted3 = p1
trusted4 = p2
trusted5 = p1
iv) If a process receives a message from a process pi that precedes its trusted process, it trusts pi again and increases its timeout period with respect to pi
Processes: p1, p2, p3, p4, p5
trusted1 = p1
trusted2 = p1 (message from p1; timeout_period21++)
trusted3 = p2
trusted4 = p1 (message from p1; timeout_period41++)
trusted5 = p1
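Rules i) to iv) can be put together in a small simulation. This is only a sketch under simplifying assumptions that the slides do not state: time advances in discrete ticks, messages sent in a tick are delivered within the same tick, and crashed processes are simply absent from the start. `OmegaProcess`, `run`, and `initial_timeout` are hypothetical names.

```python
class OmegaProcess:
    """One process out of p1..pn; all names here are illustrative."""

    def __init__(self, index, n, initial_timeout=2):
        self.index = index
        self.n = n
        self.trusted = 1  # rule i: every process initially trusts p1
        # One timeout period (in ticks) per potential trusted process.
        self.timeout = {j: initial_timeout for j in range(1, n + 1)}
        self.silence = 0  # ticks since the trusted process was last heard

    def heartbeat_targets(self):
        # Rules i and iii: a process that trusts itself sends to its successors.
        if self.trusted == self.index:
            return list(range(self.index + 1, self.n + 1))
        return []

    def on_message(self, sender):
        if sender == self.trusted:
            self.silence = 0
        elif sender < self.trusted:
            # Rule iv: a preceding process is alive after all; trust it
            # again and be more patient with it from now on.
            self.timeout[sender] += 1
            self.trusted = sender
            self.silence = 0

    def on_tick(self):
        if self.trusted == self.index:
            return  # a process never times out on itself
        self.silence += 1
        if self.silence > self.timeout[self.trusted]:
            # Rule ii: suspect the trusted process and move to the next one.
            self.trusted += 1
            self.silence = 0


def run(n, crashed, ticks):
    """Simulate `ticks` rounds; return {index: trusted} for the live processes."""
    procs = {i: OmegaProcess(i, n) for i in range(1, n + 1) if i not in crashed}
    for _ in range(ticks):
        for p in list(procs.values()):
            for target in p.heartbeat_targets():
                if target in procs:  # messages to crashed processes are lost
                    procs[target].on_message(p.index)
        for p in procs.values():
            p.on_tick()
    return {i: p.trusted for i, p in procs.items()}
```

For instance, `run(5, {1}, 10)` models a run in which p1 has crashed: the remaining processes time out on p1, p2 ends up trusting itself, and its heartbeats keep p3, p4 and p5 trusting p2.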
• Lemma. With the previous algorithm, eventually all the correct processes will permanently trust the first correct process in p1, ..., pn
• This property trivially allows us to provide the properties of S:
– Eventual weak accuracy: by not suspecting the trusted process
– Strong completeness: by suspecting all the processes except the trusted process
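The transformation from a trusted process to a list of suspected processes is a one-liner. The sketch below assumes processes are identified by their 1-based index; the function name is hypothetical.

```python
def suspected_from_trusted(trusted, n):
    """Build the suspect set of S from the trusted process:
    suspect every process in p1..pn except the trusted one."""
    return {j for j in range(1, n + 1) if j != trusted}
```

Note that correct processes other than the trusted one are also suspected. This is acceptable for S, whose accuracy property only requires that some correct process is eventually never suspected.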