fault tolerant real-time scheduling
Quasi-static fault-tolerant scheduling schemes for energy-efficient hard real-time systems
• Wei Tongquan, CS Department of East China Normal University, China
• Piyush Mishra, GE Global Research, Niskayuna, NY 12309, USA
• Kaijie Wu, ECE Department of University of Illinois, Chicago, IL 60607, USA
• Junlong Zhou, CS Department of East China Normal University, China
Journal of Systems and Software
2012
Reza Ramezani 1
In the name of God, the Most Gracious, the Most Merciful
A Unified Approach for Fault Tolerance and Dynamic Power Management in Fixed-Priority Real-Time Embedded Systems
• Ying Zhang, Senior Software Engineer, Research and Development Department, Guidant Corporation, St. Paul, MN, USA
• Krishnendu Chakrabarty, Department of Electrical and Computer Engineering, Duke University, Durham, USA
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 25, no. 1 (2006): 111-125.
2
3
Overview
• Primaries
• Checkpointing & Response Time
• Reliability, The best fault tolerance count?
• Feasibility Analysis
• Offline Application Level Voltage Scaling
• Offline Task Level Voltage Scaling
• Online DVS by Using Slacks
• Previous Work (Ying Zhang, Krishnendu Chakrabarty, 2006)
• Results
• Suggestion
4
Primaries
5
Features
• Fault Tolerance Scheduling
  o Transient faults
  o Fast detection
  o Fault occurrences at runtime, handled by checkpointing and state restoration
• Dynamic Voltage Scaling (DVS)
• Offline Scheduling
  o Application Level Voltage Scaling (A-DVS)
  o Task Level Voltage Scaling (T-DVS)
• Online Scheduling Using Slacks
• Exact Rate-Monotonic Characterization
  o Used instead of iteratively deriving the response time of each task for feasibility analysis.
6
Online DVS Outline
• The offline task schedules are adapted to the runtime behavior of fault occurrences in three steps:
(1) Pre-computing the maximum slack requirements for the processor to dynamically slow down, and saving them in a lookup table.
(2) Retrieving the stored slack time requirements at runtime and comparing them with the generated cumulative slack.
(3) Dynamically scaling down the processor speed when the generated slack time is equal to or greater than the stored slack requirements.
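The three steps above can be sketched in a few lines. This is only an illustration: the lookup-table layout (a dict keyed by voltage level, with level 1 the slowest) and the name `scaled_level` are assumptions, not taken from the paper.

```python
def scaled_level(current_level, slack, required_slack):
    """Pick the lowest voltage level whose stored slack requirement is met.

    required_slack maps a voltage level to the pre-computed slack needed to
    run at that level (assumed layout; level 1 is the slowest level).
    """
    best = current_level
    # Walk down one level at a time and stop at the first level that the
    # accumulated slack cannot pay for.
    for level in range(current_level - 1, 0, -1):
        if slack >= required_slack[level]:
            best = level
        else:
            break
    return best


# Step (1): table pre-computed offline (values are made up for the example).
table = {1: 9.0, 2: 4.0, 3: 0.0}
# Steps (2) and (3): compare runtime slack with the table and scale down.
print(scaled_level(3, 5.0, table))
```

With 5 time units of slack the processor can drop from level 3 to level 2 but not to level 1, which would need 9.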
7
System Architecture
• Focus of the study
  o Fixed-priority hard real-time embedded system
  o Uni-processor
  o DVS-capable
  o Preemptive scheduler
• Tasks: n independent tasks τ_1, …, τ_n
  o T_i is the period
  o D_i is the deadline
  o N_i is the worst case execution cycles
  o H is the hyper-period
  o Task order: rate-monotonic (RMA)
8
System Architecture (2)
• Processor
  o The uni-processor supports L discrete frequency or voltage levels.
  o f_i denotes the operating frequency of task τ_i, chosen from these L levels.
• Scheduling
  o The task set is assumed to be scheduled using RMA.
  o The resulting schedule is feasible under fault-free conditions at a certain voltage level.
• Fault and recovery model
  o A watchdog processor is used for timing checking.
  o Faults are assumed to be detected using low-latency fault detection techniques (detection time is small enough to be accounted for in task execution time).
  o Recovery uses checkpoints (rollback to the immediately previous checkpoint).
9
Checkpointing & Response Time
10
Checkpoint count
• Fault-tolerant computing refers to the correct execution of user programs and system software in the presence of faults.
• Fault tolerance is typically achieved in real-time systems through online fault detection, checkpointing, and rollback recovery.
• Checkpointing increases the task execution time, and in the absence of faults, it might cause a missed deadline for a task that completes on time without checkpointing.
• Frequent checkpointing reduces re-execution time due to faults but increases task execution time, and vice versa.
• Therefore, the checkpointing interval, i.e., the duration between two consecutive checkpoints, must be carefully chosen to balance checkpointing cost against re-execution time.
11
Fault occurrences count
• Relation between fault occurrences count and fault arrival rate
  o k is the number of fault occurrences to be tolerated.
  o For a fault arrival rate λ and a task execution interval t, the mean number of faults that arrive during the interval is λt.
  o If k is much smaller than λt, a sophisticated fault-tolerant scheme with its associated overhead is not appropriate.
  o If k is much larger than λt, a fault-tolerant scheme that provides a deterministic real-time guarantee may not exist.
  o To target a system with reasonable real-time performance and fault tolerance, the value of k can be taken to be a small multiple of λt, e.g., 2λt ≤ k ≤ 3λt.
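As a small worked example of the rule 2λt ≤ k ≤ 3λt, the sketch below returns the integer values of k that fall inside that range. The function name and the sample numbers are illustrative assumptions.

```python
import math


def fault_count_range(fault_rate, interval):
    """Return (k_min, k_max), the integer bounds of 2*lambda*t <= k <= 3*lambda*t."""
    expected = fault_rate * interval  # mean number of faults, lambda * t
    return math.ceil(2 * expected), math.floor(3 * expected)


# lambda = 0.25 faults per time unit, t = 6 time units, so lambda*t = 1.5
print(fault_count_range(0.25, 6))
```

If λt is small, the range may contain no integer at all (k_min > k_max); in that case the rule of thumb gives no direct answer and k has to be chosen by other means.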
12
Checkpointing
13
Fault placement
• Suppose m checkpoints are inserted equidistantly during the computation time to tolerate k faults in one job; m determines the checkpointing interval.
• The total execution time of the job can be divided into three categories:
  o Effective computation (the time when the job performs real computation)
  o Checkpoint saving
  o Checkpoint retrieval
• Each category has its own maximum time penalty for a fault that occurs in it: during job execution, during checkpoint saving, or during checkpoint retrieval.
14
Fault placement
15
Task response time
• The response time R for a task is composed of five terms:
  1) The fault-free task execution time
  2) The total time for saving checkpoints
  3) The additional penalty due to faults during checkpoint saving
  4) The additional penalty due to faults during state restoration
  5) The additional penalty due to faults during task execution
• The response time R is the sum of these five terms.
16
Task response time
• Worst case response time
  o The worst case response time is obtained when all k faults occur at the end of checkpoint saving.
  o The aim is to find the number of checkpoints that minimizes the worst case response time.
  o To satisfy the deadline constraint, the minimized worst case response time must not exceed the task deadline.
  o All other parameters are constants, so only the number of checkpoints is varied to minimize the response time; the optimum is one of two candidate integer values.
17
Reliability
The best fault tolerance count?
18
Reliability
• Task Reliability
  o Probability of completing the task successfully subject to faults.
• Reliability Goal
  o All tasks in a given task set share a common given reliability goal.
  o The task level reliability is maintained if all tasks in the task set finish their execution successfully under the given reliability target.
  o How is the reliability of each task computed?
19
Task Reliability
• Notations
  o k_i: the exact number of fault occurrences in task τ_i
  o f_i^l: the operating frequency of task τ_i at voltage level l (1 ≤ l ≤ L)
  o N_i: the execution cycles of task τ_i
  o C: the checkpointing overhead
• Optimal Checkpoint Number
  o m_i^l: the optimal number of checkpoints for task τ_i at voltage level l, i.e., the number that minimizes the worst case response time of the task.
20
Task Reliability (2)
• Task Execution Time
  o OE_i denotes the overall execution time of an instance of task τ_i.
• Transient Faults
  o Let λ_l be the average fault arrival rate at frequency level l.
  o The probability of a given number of fault occurrences during the execution of task τ_i at frequency level l is then determined by λ_l and OE_i.
21
Task Reliability (3)
• Task Reliability
• Reliability Goal
  o The reliability of task τ_i is maintained if it is no lower than the reliability goal.
• Max Number of Faults
  o For a given voltage level l (1 ≤ l ≤ L), the level-dependent parameters are known.
  o The worst case number of fault occurrences at voltage level l subject to the target reliability can be derived iteratively from the reliability inequality.
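The iterative derivation can be illustrated with a minimal sketch. It rests on two assumptions that are not confirmed by the slides: faults arrive as a Poisson process, and the task reliability is the probability that at most k faults occur during the task's execution. The execution time is also held fixed here, whereas in the scheme above it depends on the voltage level and on the checkpoint count.

```python
import math


def poisson_pmf(j, mean):
    """Probability of exactly j arrivals for a Poisson process with this mean."""
    return math.exp(-mean) * mean**j / math.factorial(j)


def min_faults_to_tolerate(fault_rate, exec_time, goal, k_max=50):
    """Smallest k such that P(at most k faults) >= goal (assumed reliability model)."""
    mean = fault_rate * exec_time
    reliability = 0.0
    for k in range(k_max + 1):
        reliability += poisson_pmf(k, mean)  # add P(exactly k faults)
        if reliability >= goal:
            return k
    raise ValueError("reliability goal not reachable within k_max faults")


# On average one fault per task execution, reliability goal 0.99
print(min_faults_to_tolerate(0.01, 100, 0.99))
```

A stricter goal or a higher fault rate pushes k up, which in turn lengthens the worst case execution time; this coupling is why the real derivation has to iterate.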
22
Feasibility Analysis
23
Exact Characterization of RMA (ECRMA)
• Critical Instant
  o The worst case behavior of RMA occurs when all tasks in a task set are instantiated simultaneously and are ready for execution immediately after initiation.
  o It has been shown that a schedule of independent periodic tasks is feasible if the first instance of each task is schedulable when it is instantiated at a critical instant (Lehoczky et al., 1989).
24
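Exact Characterization of RMA (ECRMA) (2)
• The entire task set is schedulable iff
  max over 1 ≤ i ≤ n of min over t ∈ S_i of ( Σ_{j=1..i} C_j ⌈t / T_j⌉ ) / t ≤ 1,
  where C_j and T_j are the execution time and the period of task τ_j.
• Scheduling Points
  o The schedulability test of each task needs to be performed only at a finite number of time instances called scheduling points.
  o S_i, the set of scheduling points of task τ_i, is defined as the multiples of T_j that are ≤ T_i, for j ≤ i.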
25
Exact Characterization of RMA (ECRMA) (3)
• Another Notation
  o Substituting the scheduling points and the task execution times gives, in explicit form, the necessary and sufficient condition for the real-time tasks in a given task set to be feasible.
  o The task execution time with fault recovery overhead, denoted by OE, is utilized in the feasibility analysis of the offline schedule.
  o As a result, the generated feasible task schedules remain feasible when faults occur.
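The scheduling-point test of Lehoczky et al. (1989) can be sketched as follows. The sketch assumes integer periods, tasks sorted by increasing period (RMA priority order), and deadlines equal to periods; the function names are illustrative. To obtain the fault-tolerant variant described above, the execution times passed in would be the overall execution times including recovery overhead.

```python
def scheduling_points(periods, i):
    """Multiples of T_j (j <= i) that do not exceed T_i, in increasing order."""
    points = set()
    for j in range(i + 1):
        for p in range(1, periods[i] // periods[j] + 1):
            points.add(p * periods[j])
    return sorted(points)


def task_set_feasible(exec_times, periods):
    """True iff every task meets the processor demand test at some scheduling point."""
    for i in range(len(periods)):
        feasible = False
        for t in scheduling_points(periods, i):
            # Cumulative demand of tasks 0..i in [0, t]; -(-t // T) is ceil(t / T).
            demand = sum(exec_times[j] * -(-t // periods[j]) for j in range(i + 1))
            if demand <= t:
                feasible = True
                break
        if not feasible:
            return False
    return True


print(scheduling_points([4, 6, 12], 2))
print(task_set_feasible([1, 2, 3], [4, 6, 12]))
print(task_set_feasible([2, 3, 3], [4, 6, 12]))
```

For the first task set the lowest-priority task only fits at t = 12 (demand 10), which is enough; the second set already fails for the second task.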
26
Offline Application Level Voltage Scaling
27
Application level voltage scaling (A-DVS)
• A-DVS
  o All tasks in a given task set run at the same processor speed.
  o A-DVS is suitable for systems in which frequent voltage and frequency scaling is inefficient.
• Inputs
  o Task set
  o The lowest voltage level (low)
  o The highest voltage level (high)
  o The maximum number of faults each task should tolerate (k)
  o Checkpoint overhead
• Output
  o The best voltage level for all tasks.
28
A-DVS algorithm
29
A-DVS algorithm (2)
• Some Considerations
  o The binary search based A-DVS algorithm is valid only if the energy consumption is monotonic with respect to frequency/voltage changes.
  o When the processor static power consumption as well as context switching overhead is considered, the monotonicity does not hold.
  o In this case, there exists a critical processor speed below which scaling down the processor speed will instead increase the energy consumption.
  o The minimum voltage level low is therefore initialized to the level corresponding to the processor critical speed.
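A binary search of this kind can be sketched as below. This is not the algorithm from the slide (which did not survive extraction); it is a generic sketch that assumes feasibility is monotonic in the voltage level and takes the feasibility check as a caller-supplied function. The names are hypothetical.

```python
def lowest_feasible_level(low, high, is_feasible):
    """Binary search for the lowest voltage level in [low, high] that is feasible.

    is_feasible(level) is a caller-supplied check (for example a feasibility
    test that accounts for k faults and checkpoint overhead). Assumes that if
    a level is feasible, every higher level is feasible too.
    Returns None when even the highest level is infeasible.
    """
    best = None
    while low <= high:
        mid = (low + high) // 2
        if is_feasible(mid):
            best = mid        # feasible: remember it and try slower levels
            high = mid - 1
        else:
            low = mid + 1     # infeasible: only faster levels can work
    return best


# Toy check: pretend the task set is feasible from level 5 upward.
print(lowest_feasible_level(1, 8, lambda level: level >= 5))
```

Starting `low` at the level of the critical speed, as the slide describes, keeps the search inside the range where lower speed still means lower energy.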
30
Feasibility Checking Algorithm (FCA)
• Inputs
  o Task set
  o The array of scheduling points (S)
  o The maximum number of faults each task should tolerate (k)
  o Checkpoint overhead
  o The current voltage level
• Output
  o Feasibility (True/False)
31
Feasibility Checking Algorithm (FCA)
32
Offline Task Level Voltage Scaling
33
Task level voltage scaling (T-DVS)
• T-DVS
  o T-DVS offers higher energy savings and improves fault tolerance at the cost of increased complexity.
  o T-DVS is similar to A-DVS, except that whenever a task, for example task τ_i, is found unschedulable, T-DVS repeatedly selects one task from among the tasks of equal and higher priority and scales the voltage level of that task up by one level, until task τ_i becomes schedulable.
  o If the highest voltage level is reached for all tasks of equal and higher priority and task τ_i is still unschedulable, the task set is deemed infeasible.
• Task selected for voltage scaling
  (1) Its voltage level is lower than the highest voltage level supported by the processor.
  (2) The increase in energy consumption caused by scaling up the voltage level of the selected task is minimal among all candidate tasks (MINIMUM).
34
T-DVS algorithm
• Inputs
  o Task set, which is assumed to be feasible at a certain voltage level
  o The maximum number of faults each task instance should tolerate (k)
  o Checkpointing overhead
• Parameters
  o f_i denotes the operating frequency of task τ_i
  o Level_i denotes the voltage level of task τ_i
  o min denotes the index of the task selected for voltage scaling
• Output
  o Likely the best voltage level for each task
• Initialize
  o Level_i = 1
35
T-DVS algorithm (2)
36
T-DVS algorithm (3)
37
T-DVS Consideration
• MINIMUM function
  o Inputs: the array of voltage levels (Level), the array of overall execution times of the tasks in the task set (OE), and the index of the task whose schedulability needs to be tested (i).
  o The algorithm derives the energy increase of each task due to voltage scaling and returns the index of the task that incurs the minimum energy increase.
• Optimum voltage level combination
  o Unlike the task level feasibility analysis algorithm of Zhang and Chakrabarty (2006), T-DVS does not need to exhaustively explore all possible combinations of tasks and voltage levels.
  o The first feasible schedule generated by the algorithm is the desired task schedule and is taken as the output.
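The greedy structure of T-DVS described above can be sketched as follows. The schedulability check and the energy model are passed in as functions because their exact form is not recoverable from the slides; all names here are hypothetical, and tasks are indexed in priority order with index 0 the highest priority.

```python
def tdvs(n_tasks, max_level, is_schedulable, energy_increase):
    """Greedy task-level voltage assignment (illustrative sketch).

    is_schedulable(i, levels): caller-supplied test for task i under `levels`.
    energy_increase(j, level): caller-supplied cost of raising task j from
    `level` to `level + 1`.
    Returns the list of levels, or None if the task set is infeasible.
    """
    levels = [1] * n_tasks  # every task starts at the lowest level
    for i in range(n_tasks):
        while not is_schedulable(i, levels):
            # Candidates: tasks of equal or higher priority below the top level.
            candidates = [j for j in range(i + 1) if levels[j] < max_level]
            if not candidates:
                return None
            # MINIMUM: pick the candidate whose scale-up costs the least energy.
            chosen = min(candidates, key=lambda j: energy_increase(j, levels[j]))
            levels[chosen] += 1
    return levels


# Toy model: task i is schedulable once the levels of tasks 0..i sum to need[i].
need = [1, 4]
cost = [2.0, 1.0]
print(tdvs(2, 3, lambda i, lv: sum(lv[: i + 1]) >= need[i], lambda j, lvl: cost[j]))
```

The search stops at the first feasible assignment, which is why it avoids enumerating every combination of tasks and levels.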
38
Schedulability Checking Algorithm (SCA)
39
Online DVS by Using Slacks
40
Online reevaluation of DVS policies
• Offline scheduling assumes that all tasks exhibit the worst case execution time and all faults occur during the checkpointing.
• The runtime behavior of task execution and fault occurrences can vary significantly.
• At runtime, not all tasks execute up to their worst case execution times and not all faults occur during task executions.
• Hence, the slack generated at runtime can be used to dynamically scale down the processor speed to save energy.
• The online reevaluation of DVS policies can save significant energy by using the slack generated due to uncertainties in fault occurrence.
41
Reevaluation of DVS at application level
• Policy
  o Some slack is available at runtime, due to uncertainties in fault occurrences.
  o The online DVS policy manager runs a test to determine whether the cumulative slack is sufficient to slow down the processor for all the unexecuted lower priority tasks in the ready queue.
  o The test compares the available slack with the amount of time needed for all the unexecuted lower priority tasks in the ready queue to be feasible at a lower voltage level.
  o The available slack used in this test is the accumulated slack time.
42
Reevaluation of DVS at application level (2)
• Updating task slack
  o The slack from task τ_i is initialized to 0.
  o It is updated at the end of the execution of the task to incorporate the generated slack.
  o It is reset to 0 at the deadline of the task, indicating that the slack from the task has expired when a new instance of the task is released.
• Order of using slacks
  o When the accumulated slack is consumed by lower priority tasks, the slack from the task of the highest priority is consumed first.
Reevaluation of DVS at application level (3)
• Execution time overflow
  o Define the execution time overflow as the additional time required by a task to be feasibly scheduled by each scheduling point at a certain voltage level.
  o It is defined per task τ_i and per voltage level l, from the worst case response time of τ_i at level l and the deadline of τ_i.
• Scale down the voltage level
  o Before running the next task, the processor can be scaled down to a lower voltage level if the accumulated slack covers the execution time overflow at that level.
44
Computing the execution time overflow
• Improvement
  o Computing the execution time overflow from response times requires iteratively estimating the response times of each task, hence it is highly computation intensive.
  o An alternative, simple yet efficient approach is proposed as follows.
  o For task τ_i, the execution time overflow is the minimum of the differences between t2 and t1, where
    o t1 ranges over the scheduling points of the task, and
    o t2 is the total demand for processor time by the task at the given voltage level.
  o This gives the execution time overflow of task τ_i (1 ≤ i ≤ n) at voltage level l (1 ≤ l ≤ L).
45
Computing the execution time overflow (2)
• Improvement
  o Both t1 and t2 are computed during the offline feasibility analysis of the A-DVS algorithm.
  o Hence the execution time overflow can be pre-computed during the offline feasibility analysis and stored in system memory to form a lookup table.
  o The scheduler searches the lookup table at the end of each task execution to calculate the sum of the execution time overflows of all the remaining unexecuted lower priority tasks.
    o It compares the derived execution time overflows with the accumulated slack time, and determines if the processor can be feasibly scaled down.
  o The proposed scheme computes the execution time overflow based on the minimum of the differences between t2 and t1,
    o hence it provides a better opportunity to scale the processor frequency.
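The offline pre-computation of the overflow table can be sketched as below, taking "minimum of the differences between t2 and t1" literally. Integer periods, deadlines equal to periods, and the table layout (a dict keyed by task index and level) are assumptions made for the example; `overall_exec[level][j]` stands for the overall execution time of task j at that level.

```python
def execution_time_overflow(i, level, periods, overall_exec):
    """min over scheduling points t1 of (t2 - t1), with t2 the demand up to t1."""
    # Scheduling points of task i: multiples of T_j (j <= i) not exceeding T_i.
    points = {
        p * periods[j]
        for j in range(i + 1)
        for p in range(1, periods[i] // periods[j] + 1)
    }

    def demand(t):
        # Processor time requested by tasks 0..i in [0, t]; -(-t // T) is ceil(t / T).
        return sum(overall_exec[level][j] * -(-t // periods[j]) for j in range(i + 1))

    return min(demand(t) - t for t in points)


def build_overflow_table(levels, periods, overall_exec):
    """Lookup table {(task index, level): overflow}, filled once offline."""
    return {
        (i, level): execution_time_overflow(i, level, periods, overall_exec)
        for level in levels
        for i in range(len(periods))
    }


periods = [4, 6, 12]
overall_exec = {1: [2, 3, 3], 2: [1, 2, 3]}  # made-up times; level 1 is slower
table = build_overflow_table([1, 2], periods, overall_exec)
print(table[(2, 1)], table[(2, 2)])
```

In this toy data the lowest-priority task is 3 time units short of being feasible at level 1 and has 2 units to spare at level 2 (a negative overflow). At runtime the scheduler only has to sum table entries and compare the sum with the accumulated slack.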
46
Dynamic ADVS Algorithm
• Policy
  o Reevaluation of DVS policies is performed whenever a task instance finishes its execution.
    o This strategy avoids incurring, during the task execution, any extra overhead that may cause the task to miss its deadline.
• Algorithm inputs
  o The index i of the task to be executed
  o The voltage level of task τ_i
  o The sum of the overflows of the unexecuted lower priority tasks in the ready queue
• Algorithm output
  o Scaled voltage level
47
Dynamic ADVS Algorithm (2)
48
Example
• Example
  o The figure denotes a feasible schedule at voltage level 3.
  o If the slack generated at runtime satisfies the scaling condition, the processor can be scaled down to level 2 without violating the feasibility of the schedule.
49
Reevaluation of DVS policies at task level
• T-DVS
  o During the offline scheduling, the T-DVS algorithm statically tries to derive the optimum combination of frequency allocations for all the tasks.
  o T-DVS stores information, such as the pre-computed execution time overflows and the overall execution times of tasks, in a lookup table for dynamic adaptation.
  o At runtime, the frequency of each task is scaled individually without affecting the operating frequency of other tasks.
• D-TDVS algorithm
  o Inputs: task τ_{i−1} finishes its execution and task τ_i is scheduled to execute at its assigned voltage level.
  o The scheduler checks if the accumulated slack time from tasks τ_j (1 ≤ j ≤ i − 1) is large enough to scale down the frequency of task τ_i by one or more levels.
50
Reevaluation of T-DVS (D-TDVS)
• Improvement The new candidate operating frequency of task can be derived as:
51
Previous Work (Ying Zhang, Krishnendu Chakrabarty, 2006)
52
Feasibility Analysis
• m* is the optimum checkpoint count, and f(m) is the worst case response time with m checkpoints minus the deadline.
• If f(m*) ≤ 0, there exist equidistant checkpointing schemes for tolerating up to k faults, and the response time is minimum when m* checkpoints are inserted.
• If f(m*) > 0, then no equidistant checkpointing scheme exists for tolerating up to k faults.
• Example
  o For a hypothetical real-time job with checkpoint saving cost 10, checkpoint retrieval cost 10, k = 1, execution time 9000, and deadline 10 000, we get m* = 29 and f(m*) < 0. This implies that there exists an equidistant checkpointing scheme to tolerate a single fault for this job, and the worst case response time is minimized when 29 checkpoints are inserted.
  o Now we change k from 1 to 3, i.e., the system is required to tolerate up to three faults. Then, we get m* = 51 and f(m*) = 89.23. Since f(m*) > 0, no equidistant checkpointing scheme exists to tolerate up to three faults for this job.
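The numbers in this example can be reproduced with a short sketch. The response time formula used here is an assumed reconstruction, since the original equation did not survive extraction: R(m) = E + m·c_s + k·(c_s + c_r) + k·E/(m + 1), with E the execution time, c_s and c_r the checkpoint saving and retrieval costs, and f(m) = R(m) − D. Under this assumption the sketch yields the m* values and the f(m*) = 89.23 quoted above.

```python
import math


def worst_case_response(m, exec_time, save_cost, restore_cost, k):
    """Assumed worst case response time with m equidistant checkpoints."""
    return (exec_time + m * save_cost + k * (save_cost + restore_cost)
            + k * exec_time / (m + 1))


def optimal_checkpoints(exec_time, save_cost, restore_cost, k):
    """Integer m minimizing the assumed response time."""
    # Setting dR/dm = 0 gives m + 1 = sqrt(k * E / c_s); check both neighbours.
    real_opt = math.sqrt(k * exec_time / save_cost) - 1
    candidates = {max(0, math.floor(real_opt)), max(0, math.ceil(real_opt))}
    return min(
        candidates,
        key=lambda m: worst_case_response(m, exec_time, save_cost, restore_cost, k),
    )


for k in (1, 3):
    m = optimal_checkpoints(9000, 10, 10, k)
    f = worst_case_response(m, 9000, 10, 10, k) - 10000
    print(k, m, round(f, 2))
```

For k = 1 the margin is negative (the job fits with 390 time units to spare under the assumed formula), while for k = 3 the best achievable response time overshoots the deadline by 89.23.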
53
Feasibility of a task set under fault-free conditions
• The feasibility analysis is based on time-demand analysis for fixed-priority scheduling: the demand of task τ_i is iterated as w(q+1) = e_i + Σ_h ⌈w(q) / p_h⌉ · e_h.
• Here, p_h and e_h are the period and the execution time of a task τ_h with higher priority than τ_i.
• The iteration is terminated either when
  o w(q+1) = w(q) and w(q) ≤ D_i for some q,
  o or when w(q+1) > D_i.
• In the former case, τ_i is schedulable; in the latter case, τ_i is not schedulable.
• The time complexity of the time-demand analysis for each task is O(n·R), where R is the ratio of the largest period to the smallest period.
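The time-demand iteration for the fault-free case can be sketched as follows. It assumes integer execution times and periods and tasks indexed in priority order (index 0 is the highest priority); the function name is illustrative.

```python
def response_time(i, exec_times, periods, deadline):
    """Iterate the time-demand equation for task i.

    Returns the converged response time, or None if the demand exceeds the
    deadline first (task i is then not schedulable).
    """
    w = exec_times[i]
    while True:
        # Own execution time plus interference from higher priority tasks;
        # -(-w // p) is the integer ceiling of w / p.
        nxt = exec_times[i] + sum(
            -(-w // periods[h]) * exec_times[h] for h in range(i)
        )
        if nxt > deadline:
            return None
        if nxt == w:
            return w
        w = nxt


print(response_time(2, [1, 2, 3], [4, 6, 12], 12))
print(response_time(1, [2, 3, 3], [4, 6, 12], 6))
```

In the first call the iteration passes through 6, 7, 9 and settles at 10, within the deadline of 12; in the second the demand reaches 7 and exceeds the deadline of 6.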
54
Tolerating k Faults in Each Task There are equidistant checkpoints for each instance of
The overall time complexity of this procedure is , where R is the ratio of the largest period to the smallest period.
55
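Fault Tolerance With DVS
• Feasibility Test
  o Without the voltage switching cost, the worst case response time for task τ_i can be expressed in terms of:
    o m_i: checkpoint count for task τ_i
    o f_i: the frequency of task τ_i
  o To minimize the response time of every task, m_i must be set to its optimal value for the chosen frequency.
  o τ_i is schedulable if the time-demand iteration converges, for some iteration, to a value no greater than its deadline; τ_i isn't schedulable if the iterate exceeds the deadline.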
56
Fault Tolerance With DVS (2)
• Checkpoint count
  o Since the optimal number of checkpoints depends on the speed assignment, we first need to choose the appropriate processor speeds.
• Application level
  o The previous formula has to be examined for each speed to determine the best processor speed.
  o There are L possibilities in total.
  o The lowest speed that satisfies the timing constraints is selected to minimize energy consumption.
57
Fault Tolerance With DVS (3)
• Task level
  o To obtain an optimal solution, a straightforward approach is to use an exhaustive search method. Since each task can be run at L speeds, there are L^n possible speed combinations for n tasks.
  o The feasibility test is performed according to the previous formula. Meanwhile, the energy consumption is calculated from the formula below.
  o The speed combination that satisfies the timing constraints with the minimum energy consumption is chosen as the optimal solution.
  o The total energy consumption Π during one hyper-period can be expressed as:
58
Heuristic Method Based on GA
• Heuristic
  o Heuristics based on genetic algorithms (GAs) have been used to solve the problem of finding the best combination of frequency levels.
  o The solution is approximate.
• Goals
  o Meeting deadlines
  o Minimizing energy consumption
• Algorithm
  o Each chromosome α is an n-dimensional vector (s_1, s_2, …, s_n), where n is the number of tasks and s_i is the corresponding speed for task τ_i.
  o α is viable if the task set can be scheduled under the corresponding speed assignment; otherwise it is not viable.
59
Heuristic Method Based on GA (2)
• Init function
  o Initializes the search space (chromosome population).
  o One chromosome is initially generated using the computationally feasible application-level speed scaling method.
  o The other chromosomes are generated randomly.
60
Heuristic Method Based on GA (3)
• GA(Ω) function
  o Procedure GA(Ω) applies crossover and mutation operators to Ω based on the fitness values.
  o If αi is not viable: fit(αi) = rand(), where rand() is a uniform random function that returns a value between 0 and 1.
  o If αi is viable, its fitness must be set so that it has a higher probability of being chosen than when it is not viable.
61
Results
62
Experiments The worst case execution time of a task is assumed to correspond to the
maximum processor speed.
It is assumed that both checkpointing and data retrieval take
The energy overhead of both checkpointing and data retrieval is 160 .
Energy values are obtained by multiplying processor power consumption and task execution time and considering checkpointing overhead and DVS transition overhead.
Simulation results are reported for both application level and task level techniques and compared with the JFTC, JFTA and JFTT techniques presented in Zhang and Chakrabarty (2006).
o JFTC: offline constant frequency scheme
o JFTA: application level voltage scaling scheme
o JFTT: task level voltage scaling scheme
63
Processors
64
Task Sets
65
Application level results on Transmeta Crusoe
Optimal: Exhaustive search
The energy saving is relatively small for the INS task set, because the execution times of the tasks in the INS task set are relatively long and much more slack is needed to scale down the processor speed.
66
Task level results on Transmeta Crusoe
67
Application level results on Intel XScale
68
Task level results on Intel XScale
69
Real life implementation
• The energy consumption of the system board excludes the processor time.
70
Suggestion
• The scheduler can tolerate at least k faults and then tries to apply DVS by using slacks.
• More than k faults can be tolerated by increasing the processor speed when more than k faults occur.