A race-cure case study

A look at how some standard software tools can illuminate what is happening inside Linux

Our recent ‘race’ example

• Our ‘cmosram.c’ device-driver included a ‘race condition’ in its ‘read()’ and ‘write()’ functions, since accessing any CMOS memory-location is a two-step operation, and thus is a ‘critical section’ in our code:

outb( reg_id, 0x70 );

datum = inb( 0x71 );

• Once the first step in this sequence is taken, the second step needs to follow before anything else can write a different register-number to port 0x70

No interventions!

• To guarantee the integrity of each access to CMOS memory, we must prohibit every possibility that another control-thread may intervene and access that same i/o-port

• The main ways in which an intervention by another ‘thread’ might happen are:
– The current CPU could get ‘interrupted’; or
– Another CPU could access the same i/o-port
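The first kind of intervention can be made concrete with a small user-space simulation. This is only a sketch: the "ports", the interrupt-handler and the cli/sti helpers (sim_outb_70, sim_inb_71, sim_irq_handler, sim_cli, sim_sti and the rest) are hypothetical stand-ins written for illustration. No real i/o-ports are touched, and a second CPU is not modelled at all.

```c
#include <stdbool.h>

/* Hypothetical simulation only: no real i/o-ports are accessed. */
static unsigned char cmos[128];    /* simulated CMOS memory-locations      */
static unsigned char index_reg;    /* last value written to "port 0x70"    */

static void sim_outb_70(unsigned char reg) { index_reg = reg & 0x7F; }
static unsigned char sim_inb_71(void)      { return cmos[index_reg]; }

/* A simulated interrupt-handler that also uses the CMOS (register 0x0C). */
static void sim_irq_handler(void)
{
    sim_outb_70(0x0C);
    (void)sim_inb_71();
}

static bool irqs_enabled = true;
static bool irq_pending = false;

/* An interrupt is delivered at once if enabled, else held until sim_sti(). */
static void sim_irq_arrives(void)
{
    if (irqs_enabled)
        sim_irq_handler();
    else
        irq_pending = true;
}

static void sim_cli(void) { irqs_enabled = false; }

static void sim_sti(void)
{
    irqs_enabled = true;
    if (irq_pending) {
        irq_pending = false;
        sim_irq_handler();
    }
}

void sim_cmos_init(void)
{
    cmos[0x04] = 0x12;    /* pretend "hours" value      */
    cmos[0x0C] = 0x80;    /* pretend status-register C  */
}

/* Two-step access with an interrupt landing between the steps. */
unsigned char read_unprotected(unsigned char reg)
{
    sim_outb_70(reg);
    sim_irq_arrives();            /* handler re-selects register 0x0C */
    return sim_inb_71();          /* so we read the wrong location    */
}

/* The same access with (simulated) interrupts disabled around it. */
unsigned char read_protected(unsigned char reg)
{
    unsigned char val;

    sim_cli();
    sim_outb_70(reg);
    sim_irq_arrives();            /* deferred, not delivered yet      */
    val = sim_inb_71();
    sim_sti();                    /* handler runs now, harmlessly     */
    return val;
}
```

After sim_cmos_init(), read_unprotected(0x04) returns the status value 0x80 instead of 0x12, while read_protected(0x04) returns 0x12. Disabling interrupts closes the window only on the current CPU; the second case (another CPU) still needs a lock that all CPUs share.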

Linux’s solution

• Linux provides a function that an LKM can call which is designed to ensure ‘exclusive access’ to a CMOS memory-location:

datum = rtc_cmos_read( reg_id );

• By using this function, a programmer does not have to expend time and mental effort analyzing the race-condition and devising a suitable ‘cure’ for it

But how does it work?

• As computer science students, we are not satisfied with just using convenient ‘black-box’ solutions which we don’t understand

• Such purported ‘solutions’ may not always accomplish everything that they claim – if they perform correctly today, they still may fail in some way in the future (if hardware changes); we don’t want to be helpless!

Is ‘open source’ enough?

• In theory we could try to track down the actual behavior of the ‘rtc_cmos_read()’ function, by reading Linux’s source-code

• But is that really a practical approach?

• In some cases the answer might be ‘yes’, but in other situations it might be ‘no’!

• Life is short, and the kernel source-files are very numerous – with many layers

‘LXR’ can help

• The Linux Cross-Reference tool offers a way to automate searching kernel source

• This tool is online (see our website’s link under ‘Resources’) and it is hosted on a server in Norway:

http://lxr.linux.no/

• Here you just click on “Browse the Code”




From: <arch/i386/kernel/time.c>

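unsigned char rtc_cmos_read( unsigned char addr )
{
        unsigned char val;

        lock_cmos_prefix( addr );               /* begin exclusive access */
        outb_p( addr, RTC_PORT(0) );            /* select CMOS register (port 0x70) */
        val = inb_p( RTC_PORT(1) );             /* read its contents (port 0x71) */
        lock_cmos_suffix( addr );               /* end exclusive access */
        return val;
}
EXPORT_SYMBOL( rtc_cmos_read );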


Another approach…

• There is an alternative to searching kernel source files -- which may well be faster

• We can use some standard command-line tools, including ‘objdump’ and ‘grep’

• In this approach, we look at the compiled kernel’s object-file, named ‘vmlinux’, found normally in the ‘/usr/src/linux’ subdirectory

• Using ‘objdump’, that file can be parsed and disassembled!

‘objdump’ can disassemble

• Change the current working directory:
$ cd /usr/src/linux

• Then, to disassemble the ‘vmlinux’ kernel file we can use this command:

$ objdump -d vmlinux

• But the amount of output will be huge, so it’s hard to find the part we’re interested in

‘grep’ can do filtering

• If we want to see the ‘rtc_cmos_read’ code we could use ‘grep’ to eliminate irrelevant parts of the disassembly-output:

$ objdump -d vmlinux | grep rtc_cmos_read

• But we still see too many lines of output (because the ‘rtc_cmos_read()’ function gets called at many places in the kernel)

‘System.map’

• We can use a special textfile, located in the ‘/boot’ directory, which tells us where each kernel-symbol (exported or not) will reside at run-time in the virtual address-space

• You can use ‘cat’ to look at this textfile:
$ cat /boot/System.map

• And you can use ‘grep’ to find only the symbol you care about:

$ cat /boot/System.map | grep rtc_cmos_read

Example on our machines

$ cat /boot/System.map-2.6.22.5cslabs | grep rtc_cmos_read

c0105574 T rtc_cmos_read
c029b8a8 r __ksymtab_rtc_cmos_read
c02a0bff r __kstrtab_rtc_cmos_read

Note that the usual ‘symbolic link’ is missing from the ‘/boot’ directory on our class and lab machines, so you have to type a longer name

With superuser privileges this could be fixed (from within the ‘/boot’ directory) using the ‘ln -s’ command:

root# ln -s System.map-2.6.22.5cslabs System.map
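Each ‘System.map’ line has three fields: address, a one-letter symbol type, and the name. A plain ‘grep rtc_cmos_read’ also matches the ‘__ksymtab_’ and ‘__kstrtab_’ entries (which exist because of EXPORT_SYMBOL), so an exact-name lookup is sometimes handier. The following is a minimal C sketch of such a lookup; struct map_entry, parse_map_line and find_symbol are hypothetical names, and the input is assumed to be lines already read from the file.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One parsed System.map line: "address type name". */
struct map_entry {
    unsigned long addr;    /* run-time virtual address               */
    char type;             /* type letter, e.g. 'T', 'B', 'D', 'r'   */
    char name[128];        /* symbol name                            */
};

/* Parse one line; returns false if it lacks the three fields. */
bool parse_map_line(const char *line, struct map_entry *e)
{
    return sscanf(line, "%lx %c %127s", &e->addr, &e->type, e->name) == 3;
}

/* Find a symbol by its exact name (not a substring, unlike plain grep). */
bool find_symbol(const char *const lines[], size_t n, const char *name,
                 struct map_entry *out)
{
    for (size_t i = 0; i < n; i++) {
        struct map_entry e;
        if (parse_map_line(lines[i], &e) && strcmp(e.name, name) == 0) {
            *out = e;
            return true;
        }
    }
    return false;
}
```

In the type column, ‘T’ marks code (text), ‘D’ initialized data, ‘B’ uninitialized (bss) data, ‘r’ read-only data; lowercase letters denote symbols that are local rather than global.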

Now we know where to look…

• From the ‘System.map’ we learn where in the kernel our ‘rtc_cmos_read()’ function will reside

• We can ‘extract’ that function’s code, for study purposes, using these steps:
– Save the complete ‘vmlinux’ disassembly
– Use ‘grep’ to find its starting-address
– Use ‘vi’ to delete earlier and later instructions

• Step 1: saving the ‘vmlinux’ disassembly
$ objdump -d /usr/src/linux/vmlinux > ~/vmlinux.asm

• Step 2: finding our function’s entry-point
$ cat ~/vmlinux.asm | grep -n c0105574

What we discover

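Find the line that shows this virtual address followed by a colon (the colon matches only the line where the instruction at that address is listed, not the lines that merely refer to it), and have ‘grep’ tell us which line-number it is on:

$ cat ~/vmlinux.asm | grep -n c0105574:

6812:c0105574: 53 push %ebx

OK, here is that line: the number before the first colon, 6812, is its line-number in ‘vmlinux.asm’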

Use a text-editor

• Remove all the lines in your ‘vmlinux.asm’ textfile whose line-numbers precede 6812

• Scroll down, to find where your function ends (i.e., find its return-instruction ‘ret’):

c01055b7: c3 ret

• Delete all the lines that follow the ‘return’
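The editing steps above can also be automated. The following is a minimal C sketch, assuming the disassembly has already been read into an array of lines; extract_function and is_ret_line are hypothetical names. Like the manual method, it stops at the first ‘ret’, which works for rtc_cmos_read but not for functions with several return points or code placed after a ‘ret’.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the line contains a "ret" (or "retq") mnemonic as a whole word. */
static bool is_ret_line(const char *line)
{
    for (const char *p = strstr(line, "ret"); p; p = strstr(p + 1, "ret")) {
        bool starts = (p > line) && (p[-1] == ' ' || p[-1] == '\t');
        const char *q = p + 3;
        if (*q == 'q')
            q++;
        bool ends = (*q == '\0' || isspace((unsigned char)*q));
        if (starts && ends)
            return true;
    }
    return false;
}

/*
 * Find the instruction line for 'addr' (hex text such as "c0105574"),
 * store its index in *first, and return how many lines run up to and
 * including the first 'ret'. Returns 0 if the address is not found.
 */
size_t extract_function(const char *const lines[], size_t n,
                        const char *addr, size_t *first)
{
    size_t len = strlen(addr);

    for (size_t i = 0; i < n; i++) {
        const char *s = lines[i];
        while (*s == ' ' || *s == '\t')
            s++;                             /* objdump may pad addresses  */
        if (strncmp(s, addr, len) != 0 || s[len] != ':')
            continue;                        /* require "addr:" exactly    */
        *first = i;
        for (size_t j = i; j < n; j++)
            if (is_ret_line(lines[j]))
                return j - i + 1;
        return n - i;                        /* no 'ret': take the rest    */
    }
    return 0;
}
```

Requiring the colon skips both the label line ("c0105574 <rtc_cmos_read>:") and every "call c0105574" site, just as in the ‘grep -n c0105574:’ search.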

The complete function

c0105574 <rtc_cmos_read>:
c0105574: 53                      push   %ebx
c0105575: 9c                      pushf
c0105576: 5b                      pop    %ebx
c0105577: fa                      cli
c0105578: 64 8b 15 08 20 30 c0    mov    %fs:0xc0302008,%edx
c010557f: 0f b6 c8                movzbl %al,%ecx
c0105582: 42                      inc    %edx
c0105583: c1 e2 08                shl    $0x8,%edx
c0105586: 09 ca                   or     %ecx,%edx
c0105588: a1 3c 99 30 c0          mov    0xc030993c,%eax
c010558d: 85 c0                   test   %eax,%eax
c010558f: 75 f7                   jne    c0105588 <rtc_cmos_read+0x14>
c0105591: f0 0f b1 15 3c 99 30    lock cmpxchg %edx,0xc030993c
c0105598: c0
c0105599: 85 c0                   test   %eax,%eax
c010559b: 75 eb                   jne    c0105588 <rtc_cmos_read+0x14>
c010559d: 88 c8                   mov    %cl,%al
c010559f: e6 70                   out    %al,$0x70
c01055a1: e6 80                   out    %al,$0x80
c01055a3: e4 71                   in     $0x71,%al
c01055a5: e6 80                   out    %al,$0x80
c01055a7: c7 05 3c 99 30 c0 00    movl   $0x0,0xc030993c
c01055ae: 00 00 00
c01055b1: 53                      push   %ebx
c01055b2: 9d                      popf
c01055b3: 0f b6 c0                movzbl %al,%eax
c01055b6: 5b                      pop    %ebx
c01055b7: c3                      ret

Some ‘magic’ numbers

• There are some hexadecimal constants in this code-disassembly which we probably will not understand without more research:
– This memory-address: 0xc030993c
– This i/o-port address: 0x80
– This memory-address: %fs:0xc0302008

• There’s also a jump-target, but we do have some help in deciphering what it means:

jne c0105588 <rtc_cmos_read+0x14>
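The label ‘rtc_cmos_read+0x14’ is just an offset from the function’s entry-point: c0105588 − c0105574 = 0x14. The raw bytes encode the target differently. In "75 f7" the second byte is a signed 8-bit displacement that is added to the address of the *next* instruction. The small C sketch below (short_jump_target is a hypothetical helper) performs that decoding for both ‘jne’ instructions in the listing.

```c
#include <stdint.h>

/*
 * Target of a 2-byte "short" jump such as "75 f7" (jne rel8): the signed
 * 8-bit displacement is added to the address of the NEXT instruction,
 * i.e. the jump's own address plus its 2-byte length.
 */
uint32_t short_jump_target(uint32_t insn_addr, uint8_t disp8)
{
    /* sign-extend: 0x00..0x7f stay positive, 0x80..0xff become -128..-1 */
    int32_t disp = (disp8 < 0x80) ? (int32_t)disp8 : (int32_t)disp8 - 256;

    return insn_addr + 2u + (uint32_t)disp;   /* modulo 2^32, like the CPU */
}
```

For the first ‘jne’, 0xf7 means −9 and c0105591 − 9 = c0105588. For the second, 0xeb means −21 (−0x15) and c010559d − 0x15 = c0105588. Both branches return to the same spot.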

The ‘cmpxchg’ instruction

• The ‘cmpxchg’ instruction performs these CPU actions in a single operation:

cmpxchg source, destination

– The destination-operand is compared with the accumulator-register’s value, and the eflags-bits are adjusted to reflect this comparison’s result

– If ZF is set, the value of the source-operand is copied to the destination-operand; otherwise, the destination operand is copied to the accumulator register

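• A ‘lock’ prefix prevents other CPUs from accessing that memory-location until the instruction has completed, so the compare and the exchange happen as one indivisible step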
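The rules above can be written out as ordinary C. This sketch is illustrative and NOT atomic; cmpxchg_model is a hypothetical name, and on real hardware only the ‘lock’ prefix makes the sequence indivisible. In rtc_cmos_read, EAX is always 0 when ‘cmpxchg’ runs (the preceding loop falls through only when it saw zero), so the instruction means: "if cmos_lock is still 0, store EDX into it; otherwise load its current value into EAX".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Non-atomic C model of "cmpxchg src, dest" with EAX as the accumulator.
 * Returns the resulting ZF.
 */
bool cmpxchg_model(uint32_t *eax, uint32_t src, uint32_t *dest)
{
    if (*dest == *eax) {      /* equal: ZF is set                       */
        *dest = src;          /* source is copied to the destination    */
        return true;
    }
    *eax = *dest;             /* not equal: ZF is clear, EAX gets dest  */
    return false;
}
```

On success EAX stays 0, so the following ‘test %eax,%eax’ lets execution continue. On failure EAX holds the (non-zero) lock value, and ‘jne’ sends the CPU back to spin. In portable C the real atomic operation is available as C11’s atomic_compare_exchange_strong or GCC’s __sync_val_compare_and_swap.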

‘spinlock’

Before the code’s ‘critical section’ we have this:

c0105588: a1 3c 99 30 c0          mov    0xc030993c,%eax
c010558d: 85 c0                   test   %eax,%eax
c010558f: 75 f7                   jne    c0105588 <rtc_cmos_read+0x14>
c0105591: f0 0f b1 15 3c 99 30    lock cmpxchg %edx,0xc030993c
c0105598: c0
c0105599: 85 c0                   test   %eax,%eax
c010559b: 75 eb                   jne    c0105588 <rtc_cmos_read+0x14>

Then we have the function’s ‘critical section’ of code:

c010559d: 88 c8                   mov    %cl,%al
c010559f: e6 70                   out    %al,$0x70
c01055a1: e6 80                   out    %al,$0x80
c01055a3: e4 71                   in     $0x71,%al
c01055a5: e6 80                   out    %al,$0x80

I/O-port 0x80 has no defined system function, so writing to it serves only as a brief time-delay

And then after the code’s ‘critical section’ we have this:

c01055a7: c7 05 3c 99 30 c0 00    movl   $0x0,0xc030993c
c01055ae: 00 00 00
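Read together, these pieces form a test-and-test-and-set spinlock. The following user-space C11 sketch is reconstructed from the disassembly alone. It is not the kernel’s source, and cmos_lock_model and the helper names are hypothetical. Two facts come from the listing: the lock value is built as ((cpu + 1) << 8) | register, from the ‘inc’, ‘shl $0x8’ and ‘or’ instructions, and releasing the lock just stores 0. The surrounding ‘pushf; pop %ebx; cli’ … ‘push %ebx; popf’, which save the flags, disable interrupts and later restore them, cannot be modelled in user space.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the kernel's 'cmos_lock' word: 0 means "free". */
static atomic_ulong cmos_lock_model;

/* Lock value as built in the disassembly: ((cpu + 1) << 8) | register. */
unsigned long cmos_lock_value(unsigned cpu, unsigned char reg)
{
    return ((unsigned long)(cpu + 1) << 8) | reg;
}

/* One attempt, like the lock cmpxchg step: succeeds only if the word is 0. */
bool cmos_trylock_model(unsigned cpu, unsigned char reg)
{
    unsigned long expected = 0;
    return atomic_compare_exchange_strong(&cmos_lock_model, &expected,
                                          cmos_lock_value(cpu, reg));
}

/* Spin: read until free (mov/test/jne), then try the exchange; retry on failure. */
void cmos_lock_model_acquire(unsigned cpu, unsigned char reg)
{
    for (;;) {
        while (atomic_load(&cmos_lock_model) != 0)
            ;                               /* held by someone: keep reading */
        if (cmos_trylock_model(cpu, reg))
            return;                         /* the lock is ours now          */
    }
}

/* Release, like "movl $0x0,0xc030993c". */
void cmos_lock_model_release(void)
{
    atomic_store(&cmos_lock_model, 0);
}

unsigned long cmos_lock_model_peek(void)
{
    return atomic_load(&cmos_lock_model);
}
```

The "+ 1" matters: it keeps the lock value non-zero even for CPU 0 and register 0, because zero is reserved to mean "free". The plain reads before the ‘lock cmpxchg’ let a waiting CPU spin without repeatedly issuing locked operations.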

The ‘System.map’ again

• The ‘System.map’ shows what the other mysterious memory-addresses mean:

• We see that memory-address c030993c has the label ‘cmos_lock’ (supporting our previous conclusion about a ‘spinlock’); also we get a ‘clue’ about 0xc0302008

$ cat /boot/System.map-2.6.22.5cslabs | grep c030993c
c030993c B cmos_lock

$ cat /boot/System.map-2.6.22.5cslabs | grep c0302008
c0302008 D per_cpu__cpu_number

What is ‘per_cpu’ data?

• With SMP systems there is often a need for each CPU to have its own version of some program-variable’s value

• One example: each CPU needs a unique identification-number (used in scheduling tasks for ‘load-balancing’ and respecting ‘processor-affinity’, and keeping track of which CPU now owns a particular ‘lock’)

• That’s what ‘per_cpu__cpu_number’ is

Role of segmentation

• Linux has a clever way of allowing CPUs to access their ‘per_cpu’ variables using the same name for different locations

• This can be arranged by exploiting the CPU’s memory-segmentation architecture

• The FS segment-register is used by the kernel to reference identically-named, but differently positioned, storage-locations
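User space has a familiar cousin of per-CPU data: thread-local storage. The sketch below is only an analogy and is not how the kernel’s ‘per_cpu’ machinery is written. It uses C11’s _Thread_local with POSIX threads, and the names my_id, worker and thread_local_demo are hypothetical. On x86-64 Linux, compilers and the C library typically reach such variables through an offset from the FS segment-base, which is set differently for each thread, much as the kernel here gives each CPU a different FS base.

```c
#include <pthread.h>
#include <stdbool.h>

#define NTHREADS 4

/* Same name in every thread, but each thread gets its own storage. */
static _Thread_local int my_id = -1;

struct job { int id; bool saw_own_copy; };

static void *worker(void *arg)
{
    struct job *j = arg;

    my_id = j->id;                        /* writes this thread's copy only */
    for (volatile int spin = 0; spin < 100000; spin++)
        ;                                 /* give other threads time to run */
    j->saw_own_copy = (my_id == j->id);   /* nobody else changed our copy   */
    return NULL;
}

/* True if every thread kept its own value and main's copy is untouched. */
bool thread_local_demo(void)
{
    pthread_t t[NTHREADS];
    struct job jobs[NTHREADS];
    int created = 0;
    bool ok = true;

    for (int i = 0; i < NTHREADS; i++) {
        jobs[i].id = i;
        jobs[i].saw_own_copy = false;
        if (pthread_create(&t[i], NULL, worker, &jobs[i]) != 0) {
            ok = false;
            break;
        }
        created++;
    }
    for (int i = 0; i < created; i++) {
        pthread_join(t[i], NULL);
        ok = ok && jobs[i].saw_own_copy;
    }
    return ok && my_id == -1;
}
```

Compile with ‘-pthread’. Every thread writes to a variable named ‘my_id’, yet none of them disturbs the others, which is the same effect ‘per_cpu__cpu_number’ achieves for CPUs.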

Each CPU has its own GDT

• The Operating System sets up a Global Descriptor Table for each CPU; it’s an array of memory-segment descriptors:

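Layout of one 64-bit segment-descriptor:
– bits 63..56: segment-base[ 31..24 ]
– bit 55: G (granularity); bit 54: D (default operand-size)
– bits 51..48: segment-limit[ 19..16 ]
– bits 47..40: segment access-rights
– bits 39..32: segment-base[ 23..16 ]
– bits 31..16: segment-base[ 15..0 ]
– bits 15..0: segment-limit[ 15..0 ]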

‘segment-base’ tells where the memory-area begins, ‘segment-limit’ tells how far the memory-area extends, and ‘access rights’ specifies how the memory-area will be used by the CPU (e.g., user or kernel)
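The following C sketch pulls these fields out of one quadword such as those displayed by ‘showgdt’. The struct and function names (seg_desc, decode_descriptor, effective_limit) are hypothetical. The test input 0x00CF9A000000FFFF is a typical "flat" 4GB kernel code-segment descriptor, chosen here for illustration rather than copied from our machines.

```c
#include <stdbool.h>
#include <stdint.h>

struct seg_desc {
    uint32_t base;      /* where the memory-area begins                 */
    uint32_t limit;     /* 20-bit segment-limit field                   */
    uint8_t  access;    /* access-rights byte (bits 47..40)             */
    bool     g;         /* granularity: limit counted in 4KB pages      */
    bool     d;         /* default operand-size is 32 bits if set       */
};

/* Split one GDT quadword into its fields. */
struct seg_desc decode_descriptor(uint64_t q)
{
    struct seg_desc s;

    s.base   = (uint32_t)((q >> 16) & 0x00FFFFFF)     /* base[23..0]   from bits 39..16 */
             | (uint32_t)((q >> 32) & 0xFF000000);    /* base[31..24]  from bits 63..56 */
    s.limit  = (uint32_t)(q & 0xFFFF)                 /* limit[15..0]  from bits 15..0  */
             | (uint32_t)((q >> 32) & 0x000F0000);    /* limit[19..16] from bits 51..48 */
    s.access = (uint8_t)((q >> 40) & 0xFF);
    s.g      = (q >> 55) & 1;
    s.d      = (q >> 54) & 1;
    return s;
}

/* Highest valid offset, in bytes, taking granularity into account. */
uint32_t effective_limit(struct seg_desc s)
{
    return s.g ? (s.limit << 12) | 0xFFF : s.limit;
}
```

For 0x00CF9A000000FFFF this gives base 0, limit 0xFFFFF with G set (so the segment covers all 4GB), and access-rights 0x9A (present, privilege-level 0, readable code).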

In-class exercise #1

• Install our ‘dram.c’ device-driver, so you can run our ‘showgdt.cpp’ application

• You will see a CPU’s memory-descriptors (displayed as quadwords in hex format)

• You will probably see a slightly different table when you run ‘showgdt’ again – if Linux schedules it on a different CPU

What’s in register FS?

• You can use our ‘newinfo.cpp’ utility to quickly create an LKM that displays the values in the CPU’s segment-registers:

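// using 'global variables' simplifies the inline assembly language
unsigned short _cs, _ds, _es, _fs, _gs, _ss;   // global variables

// signature of a 2.6-era /proc 'get_info' callback (assumed to match the newinfo template)
int my_get_info( char *buf, char **start, off_t off, int count )
{
        int len;

        // copy each segment-register into its same-named global variable
        asm(" mov %cs, _cs \n mov %ds, _ds \n mov %es, _es \n"
            " mov %fs, _fs \n mov %gs, _gs \n mov %ss, _ss ");
        len = sprintf( buf, "CS=%04X DS=%04X ES=%04X FS=%04X GS=%04X SS=%04X \n",
                        _cs, _ds, _es, _fs, _gs, _ss );
        return len;
}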

In-class exercise #2

• Use the value in the FS segment-register to look up that segment’s ‘base-address’ (different base-address on different CPU)

• Convert the ‘virtual’ base-address to its corresponding ‘physical’ base-address

• Use our ‘fileview’ utility to look at what’s stored in physical memory at those spots

• Check the location: %fs:0xc0302008

‘virtual-to-physical’

• If a kernel virtual address is not in the ‘high’ area (i.e., if it lies at or above 0xC0000000 but below 0xF8000000), then it is easy to calculate its physical address by doing a simple subtraction

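Figure: the 4GB virtual address-space
– user space (3GB): below 0xC0000000
– kernel space (1GB): from 0xC0000000 up to 4GB; its top part, from 0xF8000000 upward, is the ‘High Memory Area’ (HMA)

Subtract 0xC0000000 from a virtual address to get its physical address – but NOT in the HMA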
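For exercise #2 the arithmetic has two steps. First, FS base + offset gives a linear (virtual) address. Second, if that address is directly mapped, subtracting 0xC0000000 gives the physical address. The C sketch below encodes both steps under the boundaries shown in the figure. The function and macro names are hypothetical (the kernel itself calls the 0xC0000000 boundary PAGE_OFFSET), and the FS base 0x00004000 in the usage check is an invented example, not a value read from our machines.

```c
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_VIRT_START 0xC0000000u   /* start of the kernel's 1GB area */
#define HMA_START         0xF8000000u   /* start of the High Memory Area  */

/* Linear address of seg:offset, given the segment-base from the GDT. */
uint32_t linear_address(uint32_t seg_base, uint32_t offset)
{
    return seg_base + offset;           /* wraps modulo 2^32, like the CPU */
}

/* Physical address for a directly-mapped kernel virtual address. */
bool low_virt_to_phys(uint32_t virt, uint32_t *phys)
{
    if (virt < KERNEL_VIRT_START || virt >= HMA_START)
        return false;                   /* user space or HMA: no simple rule */
    *phys = virt - KERNEL_VIRT_START;
    return true;
}
```

With a zero FS base, %fs:0xc0302008 maps to physical address 0x00302008. With the hypothetical base 0x00004000 it becomes 0xc0306008, which maps to physical 0x00306008.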
