
Dedicated Systems Experts 2005 - Martin TIMMERMAN p. 1

Mars pathfinder failure


Documentation

• Speech by Dave Wilner, recorded by Mike Jones

• Comments by Glenn Reeves, Mars Pathfinder Flight Software Cognizant Engineer
– Hereafter called JPL (Jet Propulsion Lab)

• Talk by Ian A. Mason, University of New England, Australia
– http://mcs.une.edu.au/~iam/Data/threads/threads.html



Pathfinder mission

• Launch: 4 December 1996

• Mars Pathfinder was originally designed as a technology demonstration of a way to deliver an instrumented lander and a free-ranging robotic rover to the surface of the red planet.

• Pathfinder not only accomplished this goal but also returned an unprecedented amount of data and outlived its primary design life.


Budget

• Due to limited funds, Pathfinder’s development had to be dramatically different from the way in which previous spacecraft had been developed.

• Instead of the traditional 8- to 10-year schedule and $1-billion-plus budget, Pathfinder was developed in three years for less than $150 million, about the cost of some Hollywood movies!


Pathfinder exploration

• Landing: 4 July 1997

• Last transmission: 27 September 1997

• Pathfinder & Sojourner


Lander

• The lander was controlled by a derivative of the commercially available IBM RAD6000 computer, radiation-hardened to survive the flight.

• The computer featured a computing speed of 20 MIPS and 128 MB of DRAM for storage of flight software and engineering and science data, including images and rover information.

• 6 MB of ROM stored flight software and time-critical data.


Rover Sojourner

• The rover, capable of autonomous navigation and performance of tasks, communicated with Earth via the lander.

• Sojourner’s control system was built around an Intel 80C85, with a computing speed of 0.1 MIPS and 500 KB of RAM.

• ? ROM


The lander: hardware and software


[Figure: lander hardware block diagram. A VMEbus connects the CPU (RS6000), the radio, the camera and the Mil 1553 interface. The Mil 1553 bus connects the cruise stage and the lander.]

• Cruise stage: controls thrusters, valves, a sun sensor, a star scanner

• Lander: interface to accelerometers, a radar altimeter, an instrument for meteorological science known as the ASI/MET

• Mil 1553 specific paradigm: the software will schedule activity at an 8 Hz rate. This **feature** dictated the architecture of the software which controls both the 1553 bus and the devices attached to it.


The software

• VxWorks 5.x (x = 3 or 4?)

• 2 tasks to control the 1553 bus and the attached instruments.

• bc_sched task (called the bus scheduler)
– a task that controls the setup of transactions on the 1553 bus

• bc_dist task (for distribution), also referred to as the “communication task”
– handles the collection of the transaction results, i.e. the data.


Marsrobot general communication pattern

[Figure: timeline of one 125 ms (8 Hz) bus cycle, marked t1, t2, t3, t4, t5 and the next t1, showing the Mil 1553 transaction followed by bc-dist and bc-sched activity.]

t1 - bus hardware starts via hardware control on the 8 Hz boundary. The transactions for this cycle had been set up by the previous execution of the bc_sched task.

t2 - 1553 traffic is complete and the bc_dist task is awakened.

t3 - bc_dist task has completed all of the data distribution.

t4 - bc_sched task is awakened to set up transactions for the next cycle.

t5 - bc_sched activity is complete.

Task priorities shown in the figure:

• bc-sched: HIGH priority

• bc-dist: MEDIUM priority

• Spacecraft functions: LOW priority

• Science functions (ASI/MET, …): LOWEST priority

Note on the slide: check order! (bc-dist, bc-sched)
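The timing rule in this slide can be written down as a small check. The sketch below is only an illustration: the 8 Hz rate comes from the slide, while the function names and the millisecond offsets used for t1 to t5 are made-up values, not flight data.

```python
from fractions import Fraction

BUS_RATE_HZ = 8  # from the slide: activity is scheduled at an 8 Hz rate


def cycle_period_ms(rate_hz):
    """Length of one bus cycle in milliseconds."""
    return Fraction(1000, rate_hz)


def cycle_is_well_ordered(t1, t2, t3, t4, t5, period_ms):
    """True when t1 < t2 < t3 < t4 < t5 and t5 falls before the next t1."""
    marks = [t1, t2, t3, t4, t5, t1 + period_ms]
    return all(a < b for a, b in zip(marks, marks[1:]))


period = cycle_period_ms(BUS_RATE_HZ)  # 1000 / 8 = 125 ms

# Hypothetical offsets (ms) inside one cycle, for illustration only.
ok_cycle = cycle_is_well_ordered(0, 40, 70, 100, 115, period)

# bc_dist finishing (t3) after bc_sched wakes up (t4) breaks the ordering.
late_cycle = cycle_is_well_ordered(0, 40, 105, 100, 115, period)
```

A cycle in which t3 falls after t4 is the situation the rest of the deck is about: bc_sched wakes up and finds that bc_dist has not finished.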


1553 communication

• Powered 1553 devices deliver data.

• Tasks in the system that access the information collected over the 1553 do so via a double-buffered shared memory mechanism into which the bc_dist task places the latest data.

• The exception to this is the ASI/MET task, which is delivered its information via an interprocess communication (IPC) mechanism. The IPC mechanism uses the VxWorks pipe() facility.
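A double-buffered shared memory area can be pictured as two slots: the writer fills the slot nobody is reading and then swaps. The sketch below is a generic illustration, assuming a single writer (the role bc_dist plays here). The class name, method names and sample values are hypothetical, and the flight implementation is not described in the source.

```python
import threading


class DoubleBuffer:
    """Two slots: the writer fills the back slot, readers see the front slot."""

    def __init__(self, initial=None):
        self._slots = [initial, initial]
        self._front = 0                 # index of the slot readers use
        self._lock = threading.Lock()   # guards only the index swap

    def publish(self, data):
        """Called by the single writer with the latest data."""
        back = 1 - self._front
        self._slots[back] = data        # fill the slot no reader is using
        with self._lock:
            self._front = back          # swap: the new data becomes visible

    def latest(self):
        """Called by any reader; never waits for a write in progress."""
        with self._lock:
            return self._slots[self._front]


buf = DoubleBuffer(initial={})
buf.publish({"radar_altitude_m": 1200})   # hypothetical sample
first = buf.latest()
buf.publish({"radar_altitude_m": 1150})
second = buf.latest()
```

The lock is held only for the index swap, so a reader is never blocked for the duration of a data copy.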


[Figure: the lander hardware block diagram again (VMEbus with CPU RS6000, radio, camera and Mil 1553 interface; cruise stage and lander devices on the Mil 1553 bus), now with the memory (MEM) contents added: a packed buffer, three double buffers (D-Buffer), an IPC pipe and the file descriptor list.]


Dedicated Systems’ tasking graphics model - example

[Figure: legend of the tasking notation. Symbols shown: thread (P3), started thread (P4), shared data, mailbox (D), message queue (D4), data usage.]


[Figure: tasking model of the lander software. Elements shown: bc-sched, bc-dist, spacecraft function tasks (task 1 …), science function tasks (task 1 …), the ASI/MET task, four shared data areas, Mil 1553 transaction setup, Mil 1553 data setup, a pipe, and the file descriptor table with System_mutex.]


IPC mechanism

• Tasks wait on one or more IPC "queues" for messages to arrive, using the VxWorks select() mechanism.

• Multiple queues are used when both high and lower priority messages are required.

• Most of the IPC traffic in the system is not for the delivery of real-time data. The exception to this is the use of the IPC mechanism with the ASI/MET task.

• The cause of the reset on Mars was in the use and configuration of the IPC mechanism.


VxWorks select()

• Pending on multiple file descriptors: this routine permits a task to pend until one of a set of file descriptors becomes available

• Wait for multiple I/O devices (task level and driver level)

• File descriptor sets
– pReadFds, pWriteFds

• Bits set in pReadFds will cause select() to pend until data becomes available on any of the corresponding file descriptors.

• Bits set in pWriteFds will cause select() to pend until any of the corresponding file descriptors becomes writable.

• http://www.eelab.usyd.edu.au/tornado/docs/vxworks/ref/selectLib.html
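The idea of pending on several descriptors at once can be tried on any POSIX system. The sketch below uses Python's standard select module on ordinary pipes, not the VxWorks selectLib, and the two "queues" and the message are hypothetical stand-ins for the IPC queues described above.

```python
import os
import select

# Two hypothetical IPC "queues", each modelled as a POSIX pipe.
high_r, high_w = os.pipe()
low_r, low_w = os.pipe()


def wait_for_message(read_fds, timeout_s=1.0):
    """Pend until at least one descriptor is readable; return the ready ones."""
    ready, _, _ = select.select(read_fds, [], [], timeout_s)
    return ready


os.write(low_w, b"met-sample")            # a message arrives on one queue only
ready_fds = wait_for_message([high_r, low_r])
message = os.read(ready_fds[0], 64)

for fd in (high_r, high_w, low_r, low_w):
    os.close(fd)
```

The task names both descriptors in its read set and is woken only for the one that has data, which is the role pReadFds plays in the VxWorks call.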


Marsrobot design

[Figure: Marsrobot design. Labels shown: threads A, B and C; low priority thread; lowest priority sporadic meteo thread (ASI/MET); middle priority, long-lasting comm thread (bc_dist); different I/O channels; shared resource for communication using select(); System_mutex.]


The problem

• Priority inversion
– Bounded
– Unbounded


Priority Inversion

• Priority inversion occurs when a thread of low priority blocks the execution of threads of higher priority.

• Priority inversion comes in two flavours:
– bounded priority inversion (common & relatively harmless)
– unbounded priority inversion (insidious & potentially disastrous)

• Priority inversion is not new
– the earliest mention of it that I've found dates back to the Burroughs MCP (Master Control Program) of the early 1970s.


Bounded Priority Inversion

• Suppose a high priority thread becomes blocked waiting for an event to happen. A low priority thread then starts to run and in doing so obtains (i.e. locks) a mutex for a shared resource. While the mutex is locked by the low priority thread, the event occurs, waking up the high priority thread.

• Inversion takes place when the high priority thread tries to lock the mutex held by the low priority thread. In effect the high priority thread must wait for the low priority thread to finish.

• It is called bounded inversion since the inversion is limited by the duration of the critical section.


[Figure: bounded priority inversion timeline. LOW task C (30) locks mutex m; ISR A wakes HIGH task A (40), which blocks when it tries to lock m; the bounded inversion time lasts until task C unlocks m. Task states shown: run, blocked, ready.]


Unbounded Priority Inversion

• This is a simple elaboration on bounded inversion. Here the high level thread can be blocked indefinitely by a medium priority thread. The medium level thread running prevents the low priority thread from releasing the lock. All that is required for this to happen is that while the low level thread has locked the mutex, the medium level thread becomes unblocked, preempting the low level thread. The medium level thread then runs indefinitely.
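The effect can be reproduced with a toy scheduler. The sketch below is a made-up discrete-time simulation, not the flight software: three tasks with the priorities used in the diagrams (40, 35, 30, higher number wins), one mutex, and one step per tick. The `inherit` switch is included only so the two behaviours can be compared.

```python
def simulate(medium_work, inherit=False, max_ticks=10_000):
    """Return the tick at which the high-priority task finishes.

    Each step ("lock", "work", "unlock") takes one tick when it runs.
    A task whose next step is "lock" is blocked while another task owns the mutex.
    """
    tasks = {
        "low":    {"prio": 30, "release": 0, "steps": ["lock", "work", "work", "unlock"]},
        "high":   {"prio": 40, "release": 1, "steps": ["lock", "work", "unlock"]},
        "medium": {"prio": 35, "release": 1, "steps": ["work"] * medium_work},
    }
    owner = None  # name of the task holding the single mutex
    for tick in range(max_ticks):
        active = [n for n, t in tasks.items() if t["release"] <= tick and t["steps"]]
        blocked = [n for n in active
                   if tasks[n]["steps"][0] == "lock" and owner not in (None, n)]
        runnable = [n for n in active if n not in blocked]
        if not runnable:
            continue

        def effective_priority(name):
            prio = tasks[name]["prio"]
            if inherit and name == owner:
                # the owner inherits the priority of the tasks it is blocking
                prio = max([prio] + [tasks[b]["prio"] for b in blocked])
            return prio

        current = max(runnable, key=effective_priority)
        step = tasks[current]["steps"].pop(0)
        if step == "lock":
            owner = current
        elif step == "unlock":
            owner = None
        if current == "high" and not tasks["high"]["steps"]:
            return tick + 1
    return None


short_no_pi = simulate(medium_work=5)
long_no_pi = simulate(medium_work=50)
short_pi = simulate(medium_work=5, inherit=True)
long_pi = simulate(medium_work=50, inherit=True)
```

Without inheritance the finish time of the high-priority task grows with the amount of medium-priority work, which is what "unbounded" means here. With inheritance it depends only on the length of the low-priority critical section.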


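[Figure: unbounded priority inversion timeline. LOW task C (30) locks mutex m; ISR A wakes HIGH task A (40), which blocks when it tries to lock m; ISR B wakes MIDDLE task B (35), which preempts task C, so the inversion time is unbounded. Task states shown: run, blocked, ready.]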


Mission failure

• The failure was identified by the spacecraft as a failure of the bc_dist task to complete its execution before the bc_sched task started.

• The reaction to this by the spacecraft was to reset the computer.

• This reset reinitializes all of the hardware and software. It also terminates the execution of the current ground-commanded activities. No science or engineering data that has already been collected is lost (the data in RAM is recovered so long as power is not lost).

• The remainder of the activities for that day were not accomplished until the next day.
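The reaction described in this slide amounts to a hard-deadline check at the end of each bus cycle. The sketch below is a simplified model under that reading; the class, its fields and the sample values are hypothetical and only mirror the behaviour listed in the bullets (activities dropped, collected data kept).

```python
class Lander:
    """Toy model of the end-of-cycle check made when bc_sched wakes up."""

    def __init__(self):
        self.collected_data = []         # kept across a reset while power is on
        self.commanded_activities = []   # ground-commanded work for the day
        self.resets = 0

    def end_of_cycle(self, bc_dist_done):
        """bc_sched's view: bc_dist must have finished before this point."""
        if bc_dist_done:
            return "ok"
        self.reset()
        return "reset"

    def reset(self):
        self.resets += 1
        self.commanded_activities.clear()   # today's remaining work is dropped


lander = Lander()
lander.collected_data.append("image-001")                       # hypothetical
lander.commanded_activities.extend(["pan camera", "met sweep"])  # hypothetical
normal = lander.end_of_cycle(bc_dist_done=True)
failed = lander.end_of_cycle(bc_dist_done=False)
```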


[Figure: Marsrobot normal operation timeline. Rows: HIGH bus thread (bc_sched), MIDDLE comm thread (bc_dist), LOW tasks, LOWEST meteo thread. The meteo thread locks and unlocks the system mutex (m), comm thread pre-emptions occur, and bc_dist completes before the end of the cycle: OK. Task states shown: run, blocked, ready.]


[Figure: Marsrobot priority inversion timeline. Rows: HIGH bus thread (bc_sched), MIDDLE comm thread (bc_dist), LOW tasks, LOWEST meteo thread. The meteo thread locks the system mutex (m) and is preempted; bc_dist blocks when it tries to lock the same mutex; the unlock comes too late, so bc_dist has not completed at the end of the cycle: NOK, system reset. Task states shown: run, blocked, ready.]


Priority inversion

• The higher priority bc_dist task was blocked by the much lower priority ASI/MET task that was holding a shared resource.

• The ASI/MET task had acquired this resource and then been preempted by several of the medium priority tasks.

• When the bc_sched task was activated, to setup the transactions for the next 1553 bus cycle, it detected that the bc_dist task had not completed its execution.

• The resource that caused this problem was a mutex (here called system_mutex) used within the select() mechanism to control access to the list of file descriptors that the select() mechanism was to wait on.


• The select() mechanism creates a system_mutex to protect the "wait list" of file descriptors for those devices which support select().

• The VxWorks pipe mechanism is such a device and the IPC mechanism used is based on using pipes.

• The ASI/MET task had called select(), which had called pipeIoctl(), which had called selNodeAdd(), which was in the process of giving the system_mutex.

• The ASI/MET task was preempted and semGive() was not completed.

• Several medium priority tasks ran until the bc_dist task was activated.

• The bc_dist task attempted to send the newest ASI/MET data via the IPC mechanism which called pipeWrite().

• pipeWrite() blocked while trying to take the system_mutex. More of the medium priority tasks ran, still not allowing the ASI/MET task to run, until the bc_sched task was awakened.

• At that point, the bc_sched task determined that the bc_dist task had not completed its cycle (a hard deadline in the system) and declared the error that initiated the reset.


Debug the problem

• On a replica on Earth

• Total tracing on
– Context switches
– Uses of synchronisation objects
– Interrupts

• Took time to reproduce the error

• Trace analyses => priority inversion problem


Bug Detection

• The software that flies on Mars Pathfinder has several debug features within it that are used in the lab but are not used on the flight spacecraft (not used because some of them produce more information than we can send back to Earth).

• These features remain in the software by design because JPL strongly believes in the "test what you fly and fly what you test" philosophy.


• One of these tools is a trace/log facility which was originally developed to find a bug in an early version of the VxWorks port (Wind River ported VxWorks to the RS6000 processor for us for this mission).

• This trace/log facility was built by David Cummings who was one of the software engineers on the task. Lisa Stanley, of Wind River, took this facility and instrumented the pipe services, msgQ services, interrupt handling, select services, and the tExec task.

• The facility initializes at startup and continues to collect data (in ring buffers) until told to stop. The facility produces a voluminous dump of information when asked.
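A trace facility of this kind can be sketched with a bounded ring buffer that keeps only the most recent events. The code below is a generic illustration; the class name, the event fields and the capacity are assumptions and do not describe the actual JPL or Wind River implementation.

```python
from collections import deque


class TraceLog:
    """Collects events in a ring buffer until told to stop."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)  # oldest entries are overwritten
        self._running = True                   # starts collecting at startup

    def record(self, tick, source, event):
        if self._running:
            self._events.append((tick, source, event))

    def stop(self):
        """Freeze the buffer so the events leading up to an error survive."""
        self._running = False

    def dump(self):
        return list(self._events)


log = TraceLog(capacity=3)
for tick in range(5):
    log.record(tick, "bc_dist", "pipeWrite")   # hypothetical events
log.stop()
log.record(99, "bc_sched", "ignored after stop")
dump = log.dump()
```

Stopping the collection at the moment the error is declared is what makes the dump useful: the buffer then holds the context switches and semaphore operations that led up to the failure.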


System tracing

• Traces system calls or OS events

• Uses circular buffer

• Overhead

• RT if ......

[Figure: the tasks (Tx, Ty, ASI/MET, bc_dist, bc_sched) run on VxWorks 5.x, which contains the TRACE facility and the ticker routine, above physical I/O (BIOS) and the hardware.]


• After the problem occurred on Mars, JPL ran the same set of activities over and over again in the lab.

• The bc_sched was already coded so as to stop the trace/log collection and dump the data (even though JPL knew they could not get the dump in flight) for this error.

• So, when JPL went into the lab to test it they did not have to change the software.

• In less than 18 hours JPL was able to cause the problem to occur. Once they were able to reproduce the failure, the priority inversion problem was obvious.



Problem correction (1)

• Once JPL understood the problem the fix appeared obvious: change the creation flags for the semaphore so as to enable the priority inheritance.

• The Wind River folks, for many of their services, supply global configuration variables for parameters such as the "options" parameter for the semMCreate used by the select service (although this is not documented and those who do not have vxWorks source code or have not studied the source code might be unaware of this feature).


Problem correction (2)

• However, the fix is not so obvious, for several reasons:

1. The code for this is in the selectLib() and is common for all device creations. When you change this global variable all of the select semaphores created after that point will be created with the new options. There was no easy way in our initialization logic to only modify the semaphore associated with the pipe used for bc_dist task to ASI/MET task communications.

2. If you make this change, and it is applied on a global basis, how will this change the behavior of the rest of the system ?

3. The priority inheritance option was deliberately left out by Wind River in the default selectLib() service for optimum performance. How will performance degrade if we turn priority inheritance on?

4. Was there some intrinsic behavior of the select mechanism itself that would change if priority inheritance was enabled?


Problem correction (3)

• JPL did end up modifying the global variable to include priority inheritance. This corrected the problem.

• JPL asked Wind River to analyze the potential impacts for (3) and (4).

• They concluded that the performance impact would be minimal and that the behavior of select() would not change so long as there was always only one task waiting for any particular file descriptor. This is true in our system. JPL believes that the debate at Wind River still continues on whether the priority inheritance option should be on as the default.

• For (1) and (2) the change did alter the characteristics of all of the select mutexes. JPL concluded, both by analysis and test, that there was no adverse behavior. JPL tested the system extensively before they changed the software on the spacecraft.


CHANGED THE SOFTWARE ON THE SPACECRAFT

• JPL did not use the vxWorks shell to change the software (although the shell is usable on the spacecraft).

• The process of "patching" the software on the spacecraft is a specialized process. It involves sending the differences between what you have onboard and what you want (and have on Earth) to the spacecraft.

• Custom software on the spacecraft (with a whole bunch of validation) modifies the onboard copy.
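The idea of sending only the differences and validating before committing can be sketched as follows. This is not JPL's patch format, which the source does not describe: the (offset, bytes) layout, the CRC-32 check and the sample images are all assumptions chosen to keep the example small.

```python
import zlib


def make_patch(onboard, wanted):
    """List (offset, new_bytes) runs where two equal-length images differ."""
    if len(onboard) != len(wanted):
        raise ValueError("this sketch only handles images of equal length")
    patch, start = [], None
    for i, (a, b) in enumerate(zip(onboard, wanted)):
        if a != b and start is None:
            start = i                              # a differing run begins
        elif a == b and start is not None:
            patch.append((start, wanted[start:i]))  # the run ended at i
            start = None
    if start is not None:
        patch.append((start, wanted[start:]))
    return patch


def apply_patch(onboard, patch, expected_crc):
    """Apply the runs to a copy and validate it before accepting it."""
    image = bytearray(onboard)
    for offset, data in patch:
        image[offset:offset + len(data)] = data
    if zlib.crc32(image) != expected_crc:
        raise ValueError("validation failed; keeping the onboard copy")
    return bytes(image)


onboard_image = b"options=0x00;rest"   # hypothetical onboard copy
ground_image = b"options=0x08;rest"    # hypothetical corrected copy on Earth
patch = make_patch(onboard_image, ground_image)
patched = apply_patch(onboard_image, patch, zlib.crc32(ground_image))
```

Only the differing bytes travel over the link, and the onboard copy is replaced only if the patched result matches what the ground expects.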


WHY DIDN’T JPL CATCH IT BEFORE LAUNCH ?

• The problem would only manifest itself when ASI/MET data was being collected and intermediate tasks were heavily loaded.

• Our pre-launch testing was limited to the "best case" high data rates and science activities.

• The fact that data rates from the surface were higher than anticipated and the amount of science activities proportionally greater served to aggravate the problem.

• We did not expect nor test the "better than we could have ever imagined" case.


Lessons learned

• Only detailed traces of actual system behavior enabled the faulty execution sequence to be captured and identified.

• Leaving the « debugging » facilities in the system saved the day. Without the ability to modify the system in the field, the problem could not have been corrected.

• Finally, the engineer's initial analysis that "the data bus task executes very frequently and is time-critical -- we shouldn't spend the extra time in it to perform priority inheritance" was exactly wrong.

• It is precisely in such time-critical and important situations that correctness is essential, even at some additional performance cost.


Lessons learned – human factors

• JPL engineers later confessed that one or two system resets had occurred in their months of pre-flight testing. They had never been reproducible or explainable, and so the engineers, in a very human-nature response of denial, decided that they probably weren't important, using the rationale "it was probably caused by a hardware glitch".

• Part of it too was the engineers' focus. They were extremely focused on ensuring the quality and flawless operation of the landing software. Should it have failed, the mission would have been lost. It is entirely understandable for the engineers to discount occasional glitches in the less-critical land-mission software, particularly given that a spacecraft reset was a viable recovery strategy at that phase of the mission.


Priority inversion solution implementations

• History

• Priority Inheritance - pros and cons

• Priority ceiling - pros and cons


History

• Theory provides (at least) two simple solutions to priority inversion:
– Priority Inheritance Protocol
– Priority Ceiling Protocol

• The first is the simplest, while the second has nicer theoretical properties.

• The theoretical results (about both) date back to about 1987, while the actual protocols date back considerably earlier.

• Burroughs MCP implemented a version of Priority Ceiling Protocol in the 1970's

• Lampson & Redell suggest the Priority Ceiling Protocol (in Mesa) in the late 1970's

• Important IEEE paper by L. Sha, R. Rajkumar & P. Lehoczky. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. IEEE Transactions on Computers, vol. 39, pp. 1175-1185, Sep 1990



Priority Inheritance Protocol

• Priority Inheritance means that when a thread waits on a mutex owned by a lower priority thread, the priority of the owner is increased to that of the waiter. In the priority inheritance protocol when a thread locks a mutex its priority is not changed. The action takes place when a thread attempts to lock a mutex owned by another thread.

• In this situation the priority of the thread owning the mutex is raised to the priority of the blocked thread (if higher).

• When the thread releases the mutex its old priority (i.e. prior to locking this mutex) is restored.

• This prevents unbounded priority inversion since the low priority thread gets a high priority and thus cannot be pre-empted by a medium priority thread.
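The bookkeeping behind these rules is small. The sketch below is a simplified model with a single mutex, so "old priority" is simply the thread's base priority; nested locks would need a stack of saved priorities. All class and variable names are hypothetical, and higher numbers mean higher priority, as in the diagrams.

```python
class Thread:
    def __init__(self, name, priority):
        self.name = name
        self.base_priority = priority
        self.priority = priority       # current (possibly inherited) priority


class InheritanceMutex:
    def __init__(self):
        self.owner = None
        self.waiters = []

    def lock(self, thread):
        """Return True if acquired; otherwise queue the thread and boost the owner."""
        if self.owner is None:
            self.owner = thread        # locking alone does not change priority
            return True
        self.waiters.append(thread)
        if thread.priority > self.owner.priority:
            self.owner.priority = thread.priority   # inherit the waiter's priority
        return False

    def unlock(self):
        """Restore the owner's priority and hand the mutex to the top waiter."""
        self.owner.priority = self.owner.base_priority
        self.owner = None
        if self.waiters:
            nxt = max(self.waiters, key=lambda t: t.priority)
            self.waiters.remove(nxt)
            self.owner = nxt
        return self.owner


low = Thread("low", 30)
high = Thread("high", 40)
mutex = InheritanceMutex()
got_low = mutex.lock(low)      # low takes the mutex at its own priority
got_high = mutex.lock(high)    # high must wait, so low is boosted
boosted = low.priority
new_owner = mutex.unlock()     # low drops back, high becomes the owner
```

While boosted to 40, the low thread cannot be preempted by a thread at priority 35, which closes the window that makes the inversion unbounded.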


Priority Inheritance Protocol (cont'd)

• The theoretical results concerning the Priority Inheritance Protocol are:
– A thread can only be blocked once by each thread of lower priority, and the duration of each blockage is limited to one critical section.
– If there are n mutexes (which can block a thread), the thread can be blocked at most n times.
– It does not prevent deadlock.
– Blocking can be prolonged (i.e. all blocking can be chained together).

• The practical aspects of the Priority Inheritance Protocol are:
– It is easy to program using the protocol.
– It is complicated to implement.


Priority Ceiling Protocol

• Priority Ceiling means that while a thread owns the mutex it runs at a priority higher than any other thread that may acquire the mutex.

• In the priority ceiling solution each shared mutex is initialised to a priority ceiling.

• Whenever a thread locks this mutex, the priority of the thread is raised to the priority ceiling.

• This works as long as the priority ceiling is greater than the priorities of any thread that may lock the mutex (hence its name).

• Note again how this solves the unbounded priority inversion problem.
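A minimal model of the ceiling rule looks like this. It covers only the priority raise on lock and the restore on unlock, assuming one mutex and higher numbers meaning higher priority; the class names and the ceiling value of 41 are made up for the example.

```python
class Thread:
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority


class CeilingMutex:
    def __init__(self, ceiling):
        self.ceiling = ceiling     # must exceed the priority of every locker
        self.owner = None
        self._saved = None

    def lock(self, thread):
        """Return True and raise the thread to the ceiling if the mutex is free."""
        if self.owner is not None:
            return False
        if thread.priority >= self.ceiling:
            raise ValueError("ceiling must be greater than any locker's priority")
        self.owner = thread
        self._saved = thread.priority
        thread.priority = self.ceiling   # raised immediately, not on contention
        return True

    def unlock(self):
        self.owner.priority = self._saved
        self.owner = None


meteo = Thread("meteo", 30)            # hypothetical low-priority thread
mutex = CeilingMutex(ceiling=41)       # hypothetical ceiling above all lockers
acquired = mutex.lock(meteo)
priority_while_locked = meteo.priority
mutex.unlock()
priority_after = meteo.priority
```

Unlike inheritance, the raise happens at lock time whether or not anyone is waiting, so a medium-priority thread (35 in the diagrams) never gets the chance to preempt the owner inside the critical section.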


Priority Ceiling Protocol (cont'd)

• The theoretical results concerning the Priority Ceiling Protocol also require a scheduling rider:
– A thread can lock a mutex only if its priority is higher than the ceilings of all other locked mutexes.

• The theoretical results are then:
– The protocol prevents deadlock.
– A thread can only be blocked by one other thread's (maximal) critical section.
– The protocol introduces a new form of blocking (i.e. the scheduling rider).

• The practical results are:
– It is easy to implement (modulo the scheduling rider).
– It requires careful programming (i.e. correct choices of ceilings).


POSIX Solutions

• POSIX provides both solutions:
– the priority ceiling protocol (although it doesn't seem to require the scheduling rider)
– the priority inheritance protocol

• It also allows for three scheduling algorithms on a thread by thread basis:
– FIFO (used for high priority threads)
– Round Robin (used for routine threads), and
– other (a standard way to be non-standard).

• It also allows for the tweaking of which threads compete with which other threads (contention scope).


Java's Solutions

• Java doesn't specify a scheduling policy:
– Each thread has a priority that is used by the Java runtime in scheduling threads for execution. A thread that has a higher priority than another thread is typically scheduled ahead of the other thread. However, the way thread priorities precisely affect scheduling is platform dependent. In some systems, priority-based scheduling is guaranteed, while in others, priorities act only as hints to the scheduler. Therefore you should not depend on priorities in designing your program.
– (from page 1359 of The Java Class Libraries)

• The Java Library fails to even specify protocols for avoiding priority inversion.

• The reason for this total cop-out on the part of Java is perhaps its desire for platform independence, since it has to run on a variety of operating systems (pre-emptive (Microsoft) vs non-pre-emptive (Apple)).

• Consequently, multithreaded Java programs can run well on one operating system, and not run at all on another.


Case study Conclusions


Conclusions

• The Pathfinder Problem was fixed by simply flicking a switch.

• The system_mutex merely had to be initialized with priority inheritance turned on.

• Priority inversion is not new but a lot of designers ignore it.

• There are undocumented features in an RTOS.


Application Design Advice

• Rules
– No other function call between the P and V operations
– Critical code should be as short as possible. In most cases people don’t know what they are really doing and they lock too much in order to be “sure”
– Never use a mutex inside a lock of another mutex

• Multi-processor situation is much more difficult

• Both protocols (priority inheritance and priority ceiling) are OK to solve the problem

• Priority ceiling is simpler to implement in the OS but needs design attention.


Other interesting info

• Streamlined Design Approach Lands Mars Pathfinder
– Steven A. Stolper, ComTier

