Linux Kernel Tour

By:

Samrat Das

AED600

Tour Map Starting From

To – a fully functional, working OS

Topics to be covered

o Introduction

o Kernel Source Organization

o Compilation Process

o Booting Process

o Loading of Kernel

o Initialization Process

o Working of Kernel

o Subsystem of Kernel

o Introduction to common Kernel APIs

o Kernel Symbols usage

o Introduction to mailing List & How to contribute to kernel tree

- (creating a patch and submitting)

Introduction

Introduction - Kernel Map

Let's see how the Linux kernel source is organized.

Next: Kernel Source Organization

You can get the source from kernel.org or via git.

Kernel Source Organization

Browsing source code

cscope - Tool to browse source code

http://lxr.free-electrons.com – Online Source Browser

Compilation Process

After configuration, when the user types 'make zImage' or 'make bzImage', the resulting bootable kernel image is stored as arch/i386/boot/zImage or arch/i386/boot/bzImage respectively.

Here is how the image is built:

Compilation Process

I. C and assembly source files are compiled into ELF relocatable object format (.o) and some of them are grouped logically into archives (.a) using ar(1).

II. Using ld(1), the above .o and .a are linked into vmlinux, which is a statically linked, non-stripped ELF 32-bit LSB 80386 executable file.

III. System.map is produced by nm vmlinux; irrelevant or uninteresting symbols are grepped out.

IV. Enter directory arch/i386/boot.

V. Bootsector asm code bootsect.S is preprocessed either with or without -D__BIG_KERNEL__, depending on whether the target is bzImage or zImage, into bbootsect.s or bootsect.s respectively.

VI. bbootsect.s is assembled and then converted into 'raw binary' form called bbootsect (or bootsect.s assembled and raw-converted into bootsect for zImage).

VII. Setup code setup.S (setup.S includes video.S) is preprocessed into bsetup.s for bzImage or setup.s for zImage. As with the bootsector code, the difference is that -D__BIG_KERNEL__ is present for bzImage. The result is then converted into 'raw binary' form called bsetup.

Compilation Process cont.

VIII. Enter directory arch/i386/boot/compressed and convert /usr/src/linux/vmlinux to $tmppiggy (tmp filename) in raw binary format, removing .note and .comment ELF sections.

IX. gzip -9 < $tmppiggy > $tmppiggy.gz

X. Link $tmppiggy.gz into ELF relocatable (ld -r) piggy.o.

XI. Compile compression routines head.S and misc.c (still in arch/i386/boot/compressed directory) into ELF objects head.o and misc.o.

XII. Link together head.o, misc.o and piggy.o into bvmlinux (or vmlinux for zImage; don't mistake this for /usr/src/linux/vmlinux!). Note the difference between -Ttext 0x1000 used for vmlinux and -Ttext 0x100000 for bvmlinux, i.e. for bzImage the compression loader is high-loaded.

XIII. Convert bvmlinux to 'raw binary' bvmlinux.out, removing .note and .comment ELF sections.

XIV. Go back to arch/i386/boot directory and, using the program tools/build, cat together bbootsect, bsetup and compressed/bvmlinux.out into bzImage (delete the extra 'b' above for zImage). This writes important variables like setup_sects and root_dev at the end of the bootsector.

Result after compilation - bzImage

What's inside?

objdump -D bzImage

Let us see how this kernel works.

Let's start from the boot process.

Booting Process

I. BIOS selects the boot device.

II. BIOS loads the bootsector from the boot device.

III. Bootsector loads setup, decompression routines and compressed kernel image.

IV. The kernel is uncompressed in protected mode.

V. Low-level initialization is performed by asm code.

VI. High-level C initialization.

Mapping of Kernel and other peripherals

Initializations – asm

I. Initialize segment values.

II. Initialize page tables.

III. Enable paging by setting PG bit in %cr0.

IV. Zero-clean BSS (on SMP, only first CPU does this).

V. Copy the first 2k of bootup parameters (kernel command line).

VI. Check CPU type using EFLAGS and, if possible, cpuid; able to detect 386 and higher.

VII. The first CPU calls start_kernel(); all others call arch/i386/kernel/smpboot.c:initialize_secondary() if ready=1, which just reloads esp/eip and doesn't return.

Initializations – high level

I. Take a global kernel lock (it is needed so that only one CPU goes through initialization).

II. Perform arch-specific setup (memory layout analysis, copying boot command line again, etc.).

III. Print Linux kernel "banner" containing the version.

IV. Initialize traps.

V. Initialize irqs.

VI. Initialize data required for scheduler.

VII. Initialize time keeping data.

VIII. Initialize softirq subsystem.

IX. Parse boot command line options.

X. Initialize console.

XI. If module support was compiled into the kernel, initialize dynamic module loading facility.

XII. If "profile=" command line was supplied, initialize profiling buffers.

XIII. kmem_cache_init(), initialize most of slab allocator.

XIV. Enable interrupts.


XV. Calculate BogoMips value for this CPU.

XVI. Call mem_init(), which calculates max_mapnr, totalram_pages and high_memory and prints out the "Memory: ..." line.

XVII. kmem_cache_sizes_init(), finish slab allocator initialization.

XVIII. Initialize data structures used by procfs.

XIX. fork_init(), create uid_cache, initialise max_threads based on the amount of memory available and configure RLIMIT_NPROC for init_task to be max_threads/2.

XX. Create various slab caches needed for VFS, VM, buffer cache, etc.


XXI. If System V IPC support is compiled in, initialise the IPC subsystem. Note that for System V shm, this includes mounting an internal (in-kernel) instance of shmfs filesystem.

XXII. If quota support is compiled into the kernel, create and initialise a special slab cache for it.

XXIII. Perform arch-specific "check for bugs" and, whenever possible, activate workarounds for processor/bus/etc. bugs. Comparing various architectures reveals that "ia64 has no bugs" and "ia32 has quite a few bugs"; a good example is the "f00f bug", which is only checked if the kernel is compiled for less than 686 and worked around accordingly.

Initializations – high level

Finally the kernel is ready to move_to_user_mode()

XXIV. Set a flag to indicate that a schedule should be invoked at "next opportunity" and create a kernel thread init(), which execs execute_command if supplied via "init=" boot parameter, or tries to exec /sbin/init, /etc/init, /bin/init, /bin/sh in this order; if all these fail, panic with "suggestion" to use "init=" parameter.

XXV. Go into the idle loop; this is an idle thread with pid=0.
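
The fallback order in step XXIV can be sketched in user space. This is only an illustration, not kernel code: pick_init() is a hypothetical helper, and it uses access(2) as a stand-in for actually exec()ing each candidate.

```c
#include <stddef.h>
#include <unistd.h>

/* Fallback order described in step XXIV. */
static const char *const init_candidates[] = {
    "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
};

/*
 * Hypothetical helper: return the first path that looks executable,
 * or NULL if none does (the point where the kernel would panic).
 * `requested` models the "init=" boot parameter and may be NULL.
 */
const char *pick_init(const char *requested)
{
    size_t i;

    if (requested != NULL && access(requested, X_OK) == 0)
        return requested;
    for (i = 0; i < sizeof(init_candidates) / sizeof(init_candidates[0]); i++) {
        if (access(init_candidates[i], X_OK) == 0)
            return init_candidates[i];
    }
    return NULL;
}
```

Calling pick_init(NULL) corresponds to booting without an "init=" parameter.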

Working of Kernel

After exec()ing the init program from one of the standard places, the kernel has no direct control over the program flow.

Its role, from now on, is to provide processes with system calls, as well as servicing asynchronous events.

Multitasking has been set up, and it is now init which manages multiuser access by fork()ing system daemons and login processes.

Working of Kernel

Whenever a program tries to use a system resource, it makes a system call.

System Call Implementation

• The mechanism to signal the kernel is a software interrupt.

• The software interrupt raises an exception; the system then switches to kernel mode and executes the exception handler, which in this case is the system call handler.

• The defined software interrupt on x86 is the int $0x80 instruction.

• It triggers a switch to kernel mode and the execution of exception vector 128, which is the system call handler.

• The system call handler is the aptly named function system_call(). It is architecture dependent and typically implemented in assembly in entry.S.

• Newer x86 processors added a feature known as sysenter. This feature provides a faster, more specialized way of trapping into the kernel to execute a system call than using the int interrupt instruction.

System Call Implementation

Denoting the Correct System Call

• On x86, the syscall number is fed to the kernel via the eax register.

• Before causing the trap into the kernel, user-space sticks in eax the number corresponding to the desired system call.

• The system call handler then reads the value from eax.

• The system_call() function checks the validity of the given system call number by comparing it to NR_syscalls.

• If it is larger than or equal to NR_syscalls, the function returns -ENOSYS. Otherwise, the specified system call is invoked:

call *sys_call_table(,%eax,4)

Because each element in the system call table is 32 bits (four bytes), the kernel multiplies the given system call number by four to arrive at its location in the system call table.
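
The bounds check and table lookup described above can be mimicked in plain C. This is a minimal sketch with made-up handlers (sys_double, sys_negate) and a made-up table size, not the kernel's real table; indexing an array of function pointers does the same "number times element size" arithmetic as the assembly line.

```c
#include <errno.h>

#define NR_SYSCALLS 2   /* hypothetical table size for this sketch */

typedef long (*syscall_fn)(long arg);

/* Two toy "system calls", invented for illustration. */
static long sys_double(long arg) { return arg * 2; }
static long sys_negate(long arg) { return -arg; }

static const syscall_fn sys_call_table[NR_SYSCALLS] = {
    sys_double,   /* number 0 */
    sys_negate,   /* number 1 */
};

/* Validate the number first, then jump through the table. */
long dispatch(unsigned long nr, long arg)
{
    if (nr >= NR_SYSCALLS)
        return -ENOSYS;
    return sys_call_table[nr](arg);
}
```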

System Call Implementation

Parameter Passing

In addition to the system call number, most syscalls require that one or more parameters be passed to them. The easiest way to do this is via the same means that the syscall number is passed:

• The parameters are stored in registers. On x86, the registers ebx, ecx, edx, esi, and edi contain, in order, the first five arguments.

• In the unlikely case of six or more arguments, a single register is used to hold a pointer to user-space where all the parameters are stored.

The return value is also sent to user-space via a register. On x86, it is written into the eax register.

We have seen how system calls are implemented. But what do the system calls actually call?

System calls are calls into the subsystems of the kernel.

Now let us look at the subsystems of the kernel.

Subsystem of Kernel

Human Interface

System Interface

Process Management

Memory Management

Storage Handling

Networking

Human Interface

The subsystem of the kernel required to handle the system's input and output.

It controls the functionality of:

• Keyboard

• Console screen

• Mouse

• Etc.

System Interface

Device drivers are part of the system interface, which is responsible for interfacing the system with peripherals and system hardware components.

Types of drivers:

• Character Drivers

• Block Drivers

• USB Drivers

• Network Drivers

Process Management

From the kernel point of view, a process is an entry in the process table. Nothing more.

The process table, then, is one of the most important data structures within the system, together with the memory-management tables and the buffer cache. The individual item in the process table is the task_struct structure, defined in include/linux/sched.h.

The process table is both an array and a doubly linked list, as well as a tree. The physical implementation is a static array of pointers, whose length is NR_TASKS, a constant defined in include/linux/tasks.h, and each structure resides in a reserved memory page. The list structure is achieved through the pointers next_task and prev_task.

Process Management Cont.

After booting is over, the kernel is always working on behalf of one of the processes, and the global variable current, a pointer to a task_struct item, is used to record the running one. current is only changed by the scheduler, in kernel/sched.c. When, however, all processes must be looked at, the macro for_each_task is used. It is considerably faster than a sequential scan of the array when the system is lightly loaded.
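
The list walk can be sketched with a much reduced structure. The struct task below is a stand-in that keeps only a pid and the two link pointers, and link_task() and count_tasks() are hypothetical helpers; the macro follows the same idea as for_each_task, visiting only linked entries rather than every array slot.

```c
#include <stddef.h>

/* Reduced stand-in for task_struct: just a pid and the list links. */
struct task {
    int pid;
    struct task *next_task;
    struct task *prev_task;
};

/* Head of the circular list, playing the role of the pid 0 task. */
static struct task init_task = { 0, &init_task, &init_task };

/* Visit every linked task except the list head itself. */
#define for_each_task(p) \
    for ((p) = init_task.next_task; (p) != &init_task; (p) = (p)->next_task)

/* Link t in just before the head, i.e. at the tail of the list. */
void link_task(struct task *t)
{
    t->next_task = &init_task;
    t->prev_task = init_task.prev_task;
    init_task.prev_task->next_task = t;
    init_task.prev_task = t;
}

/* Count the linked tasks by walking the list once. */
int count_tasks(void)
{
    struct task *p;
    int n = 0;

    for_each_task(p)
        n++;
    return n;
}
```

The walk costs one step per existing task, which is why it beats scanning a mostly empty array.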

A process is always running in either "user mode" or "kernel mode". The main body of a user program is executed in user mode and system calls are executed in kernel mode.

System calls, within the kernel, exist as C language functions, their 'official' name being prefixed by 'sys_'. A system call named, for example, burnout invokes the kernel function sys_burnout().

Process Management

Creating processes

A Unix system creates a process through the fork() system call, and process termination is performed either by exit() or by receiving a signal.

The Linux implementation for them resides in kernel/fork.c and kernel/exit.c.

Fork's main task is filling the data structure for the new process. Relevant steps, apart from filling fields, are:

• getting a free page to hold the task_struct

• finding an empty process slot (find_empty_process())

• getting another free page for the kernel_stack_page

• copying the parent's LDT to the child

• duplicating mmap information of the parent

sys_fork() also manages file descriptors and inodes.

Process Management

Destroying processes

Exiting from a process is trickier, because the parent process must be notified about any child who exits.

Moreover, a process can exit by being kill()ed by another process (these are Unix features).

The file exit.c is therefore the home of sys_kill() and the various flavors of sys_wait(), in addition to sys_exit().
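
Seen from user space, creation and termination look roughly like the sketch below. The helper names run_child() and kill_child() are made up for this example; fork(), _exit(), kill() and waitpid() are the standard POSIX calls.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`; return the status the parent
 * collects through waitpid(), or -1 on error. */
int run_child(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                /* child: terminate immediately */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Fork a child that only sleeps, kill() it from the parent, and return
 * the signal that terminated it, or -1 on error. */
int kill_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();                /* child: wait until killed */
    }
    if (kill(pid, SIGKILL) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

In both cases the parent learns how the child ended only by waiting for it, which is the notification described above.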

Process Management

Executing programs

• After fork()ing, two copies of the same program are running. One of them usually exec()s another program.

• The exec() system call must locate the binary image of the executable file, load it and run it.

• The Linux implementation of exec() supports different binary formats. This is accomplished through the linux_binfmt structure.

• Loading of shared libraries is implemented in the same source file as exec(), but let's stick to exec() itself.

• Unix systems provide the programmer with six flavors of the exec() function. All but one of them can be implemented as library functions, and the Linux kernel implements sys_execve() alone.

sys_execve() performs quite a simple task: loading the head of the executable and trying to execute it. If the first two bytes are "#!", the first line is parsed and an interpreter is invoked; otherwise the registered binary formats are tried sequentially.
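
The "#!" check on the head of the file can be sketched as follows. parse_shebang() is a hypothetical user-space helper that only extracts the interpreter path; the kernel's own parsing also handles interpreter arguments and is more involved.

```c
#include <stddef.h>

/*
 * Given the first `len` bytes of a file, copy the interpreter path from
 * a "#!" line into `out` (at most outlen - 1 characters).
 * Returns 1 if the head is a script with an interpreter, 0 otherwise.
 */
int parse_shebang(const char *head, size_t len, char *out, size_t outlen)
{
    size_t i = 2, n = 0;

    if (len < 2 || head[0] != '#' || head[1] != '!' || outlen == 0)
        return 0;
    while (i < len && (head[i] == ' ' || head[i] == '\t'))
        i++;                        /* skip blanks after "#!" */
    while (i < len && head[i] != ' ' && head[i] != '\t' &&
           head[i] != '\n' && n + 1 < outlen)
        out[n++] = head[i++];       /* copy up to the first separator */
    out[n] = '\0';
    return n > 0;
}
```

A head that does not start with "#!" returns 0, which corresponds to falling through to the registered binary formats.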

Process Management

State

As a process executes it changes state according to its circumstances.

Linux processes have the following states:

• Running: The process is either running or it is ready to run.

• Waiting: The process is waiting for an event or for a resource. Linux differentiates between two types of waiting process: interruptible and uninterruptible.

• Stopped: The process has been stopped, usually by receiving a signal. A process that is being debugged can be in a stopped state.

• Zombie: This is a halted process which, for some reason, still has a task_struct data structure in the task vector. It is what it sounds like, a dead process.

The scheduler needs this information in order to fairly decide which process in the system most deserves to run.
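
The states listed above can be modelled as a small enumeration. The names and numeric values here are invented for this sketch and are not the kernel's own constants.

```c
/* Hypothetical state codes for this sketch. */
enum task_state {
    ST_RUNNING,          /* running or ready to run */
    ST_INTERRUPTIBLE,    /* waiting, can be woken by a signal */
    ST_UNINTERRUPTIBLE,  /* waiting, signals do not wake it */
    ST_STOPPED,          /* stopped, e.g. by a signal or a debugger */
    ST_ZOMBIE            /* dead, but its table entry still exists */
};

/* Only runnable tasks are candidates for the scheduler. */
int is_runnable(enum task_state s)
{
    return s == ST_RUNNING;
}

/* Both waiting flavors count as waiting. */
int is_waiting(enum task_state s)
{
    return s == ST_INTERRUPTIBLE || s == ST_UNINTERRUPTIBLE;
}
```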

Process Management

Process Handling - Schedulers

History of Schedulers

• O(n) - before 2.6

• O(1) - Ingo Molnar - 2.6 to 2.6.23

• Rotating Staircase Deadline Scheduler - Con Kolivas

• Completely Fair Scheduler - Ingo Molnar - 2.6.23 to 3.18

• Brain Fuck Scheduler - Con Kolivas - 3.18.1

Processes System Calls

Scheduler

Memory Management

Linux uses segmentation plus paging, which simplifies notation.

Linux uses only 4 segments:

2 segments (code and data/stack) for KERNEL SPACE, from 3 GB to 4 GB

2 segments (code and data/stack) for USER SPACE, from 0 GB to 3 GB
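
With this 3 GB / 1 GB split on 32-bit x86, telling the two regions apart is a single comparison against the 3 GB boundary, which the i386 kernel calls PAGE_OFFSET. The helper names below are made up for this sketch.

```c
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u   /* 3 GB: start of kernel space */

/* Non-zero if a 32-bit virtual address lies in kernel space. */
int is_kernel_address(uint32_t vaddr)
{
    return vaddr >= PAGE_OFFSET;
}

/* Size of the kernel region: from 3 GB up to the 4 GB limit. */
uint64_t kernel_space_bytes(void)
{
    return (UINT64_C(1) << 32) - PAGE_OFFSET;
}
```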

Memory Management

Storage Handling

The Virtual Filesystem (sometimes called the Virtual File Switch or, more commonly, simply the VFS) is the subsystem of the kernel that implements the file and filesystem-related interfaces provided to user-space programs.

The VFS is the glue that enables system calls such as open(), read(), and write() to work regardless of the filesystem or underlying physical medium.
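
One way to picture this glue is a table of function pointers per filesystem, with a generic entry point that dispatches through it. The sketch below is a toy model: struct file_ops, mem_ops, null_ops and vfs_read() are names invented here, only loosely inspired by the kernel's own structures.

```c
#include <stddef.h>
#include <string.h>

struct file;

/* Per-backend operations table. */
struct file_ops {
    long (*read)(struct file *f, char *buf, size_t len);
};

struct file {
    const struct file_ops *ops;
    const char *data;   /* backing bytes for the in-memory backend */
    size_t pos;         /* current read offset */
};

/* In-memory backend: copy out of a string and advance the offset. */
static long mem_read(struct file *f, char *buf, size_t len)
{
    size_t left = strlen(f->data) - f->pos;

    if (len > left)
        len = left;
    memcpy(buf, f->data + f->pos, len);
    f->pos += len;
    return (long)len;
}

/* Backend that always reports end of file. */
static long null_read(struct file *f, char *buf, size_t len)
{
    (void)f; (void)buf; (void)len;
    return 0;
}

const struct file_ops mem_ops  = { mem_read };
const struct file_ops null_ops = { null_read };

/* Generic entry point: the caller never sees which backend runs. */
long vfs_read(struct file *f, char *buf, size_t len)
{
    return f->ops->read(f, buf, len);
}
```

The caller uses vfs_read() the same way for both backends, which is the property the VFS gives to read() and write().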

Networking

This layer is responsible for handling network packets.

The required protocol stacks are implemented here.

It is also responsible for encrypting and decrypting network packets.

How To Program

How to use the features of the kernel or change existing things in the kernel.

Kernel Common APIs

Kernel APIs are documented here:

https://www.kernel.org/doc/htmldocs/kernel-api/

• Data Types

• Basic C Library Functions

• Basic Kernel Library Functions

• Memory Management in Linux

• Kernel IPC facilities

• FIFO Buffer

• relay interface support

• Module Support

• Hardware Interfaces

• Firmware Interfaces

• ……. Etc.


Kernel Symbol Usage

When modules are loaded, they are dynamically linked into the kernel. As with user-space, dynamically linked binaries can call only into external functions that are explicitly exported for use. In the kernel, this is handled via special directives called EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL().

Functions that are exported are available for use by modules. Functions that are not exported cannot be invoked from modules.

The set of kernel symbols that are exported is known as the exported kernel interfaces or even the kernel API.

Exporting a symbol is easy. After the function is defined, it is usually followed by an EXPORT_SYMBOL(). For example,

int get_pirate_beard_color(void)
{
        /* pirate is assumed to be a pointer defined elsewhere in the kernel */
        return pirate->beard->color;
}
EXPORT_SYMBOL(get_pirate_beard_color);

Introduction to mailing List & How to contribute

---------------------------------------------------------------------------------------

git diff

git commit

git show

git format-patch

git send-email

References

The Linux Documentation Project – TLDP

http://www.tldp.org/LDP/lki/lki.html

Kernelnewbies.org

http://kernelnewbies.org/Documentation/Subsystems

Free-electrons

http://free-electrons.com

http://lxr.free-electrons.com

Kernel Map

http://www.makelinux.net/kernel_map/


Thank you

Samrat Das
