Linux Kernel Tour
By: Samrat Das
AED600
Tour Map Starting From
To – Full functionality working of OS
Topics to be covered
o Introduction
o Kernel Source Organization
o Compilation Process
o Booting Process
o Loading of Kernel
o Initialization Process
o Working of Kernel
o Subsystem of Kernel
o Introduction to common Kernel APIs
o Kernel Symbols usage
o Introduction to mailing List & How to contribute to kernel tree
- (creating a patch and submitting)
Introduction
Introduction - Kernel Map
Let's see how the Linux kernel source is organized.
You can get the source from kernel.org or via git.
Kernel Source Organization
Browsing source code
cscope - Tool to browse source code
http://lxr.free-electrons.com – Online Source Browser
Compilation Process
After configuration, when the user types 'make zImage' or 'make
bzImage', the resulting bootable kernel image is stored as
arch/i386/boot/zImage or arch/i386/boot/bzImage.
Here is how the image is built
I. C and assembly source files are compiled into ELF relocatable object format
(.o) and some of them are grouped logically into archives (.a) using ar(1).
II. Using ld(1), the above .o and .a are linked into vmlinux which is a statically
linked, non-stripped ELF 32-bit LSB 80386 executable file.
III. System.map is produced by nm vmlinux, irrelevant or uninteresting symbols
are grepped out.
IV. Enter directory arch/i386/boot.
V. Bootsector asm code bootsect.S is preprocessed either with or without
-D__BIG_KERNEL__, depending on whether the target is bzImage or zImage,
into bbootsect.s or bootsect.s respectively.
VI. bbootsect.s is assembled and then converted into 'raw binary' form called
bbootsect (or bootsect.s assembled and raw-converted into bootsect for
zImage).
VII. Setup code setup.S (setup.S includes video.S) is preprocessed into bsetup.s for
bzImage or setup.s for zImage. In the same way as the bootsector code, the
difference is marked by -D__BIG_KERNEL__ present for bzImage. The result is
then converted into 'raw binary' form called bsetup.
VIII. Enter directory arch/i386/boot/compressed and convert
/usr/src/linux/vmlinux to $tmppiggy (tmp filename) in raw binary format,
removing .note and .comment ELF sections.
IX. gzip -9 < $tmppiggy > $tmppiggy.gz
X. Link $tmppiggy.gz into ELF relocatable (ld -r) piggy.o.
XI. Compile compression routines head.S and misc.c (still in
arch/i386/boot/compressed directory) into ELF objects head.o and misc.o.
XII. Link together head.o, misc.o and piggy.o into bvmlinux (or vmlinux for
zImage, don't mistake this for /usr/src/linux/vmlinux!). Note the difference
between -Ttext 0x1000 used for vmlinux and -Ttext 0x100000 for bvmlinux,
i.e. for bzImage compression loader is high-loaded.
XIII. Convert bvmlinux to 'raw binary' bvmlinux.out removing .note and .comment
ELF sections.
XIV. Go back to arch/i386/boot directory and, using the program tools/build, cat
together bbootsect, bsetup and compressed/bvmlinux.out into bzImage
(delete extra 'b' above for zImage). This writes important variables like
setup_sects and root_dev at the end of the bootsector.
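The variables that tools/build writes near the end of the bootsector can be read back from the first 512 bytes of the image. The following is a minimal sketch, assuming the field offsets of the x86 boot protocol (setup_sects at 0x1F1, root_dev at 0x1FC, the 0xAA55 signature at 0x1FE); these offsets are not stated in the slides, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets assumed from the x86 boot protocol, not from the slides. */
#define SETUP_SECTS_OFF 0x1F1
#define ROOT_DEV_OFF    0x1FC
#define BOOT_FLAG_OFF   0x1FE

/* Hypothetical container for the two fields named in step XIV. */
struct boot_fields {
    uint8_t  setup_sects; /* size of the setup code in 512-byte sectors */
    uint16_t root_dev;    /* default root device number */
};

/* Read a little-endian 16-bit value from a byte buffer. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns 0 on success, -1 if the 0xAA55 boot signature is missing. */
int parse_bootsector(const uint8_t *sector, struct boot_fields *out)
{
    assert(sector != NULL && out != NULL);
    if (read_le16(sector + BOOT_FLAG_OFF) != 0xAA55)
        return -1;
    out->setup_sects = sector[SETUP_SECTS_OFF];
    out->root_dev = read_le16(sector + ROOT_DEV_OFF);
    return 0;
}
```

To try it, read the first 512 bytes of a bzImage into a buffer and pass the buffer to parse_bootsector().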
Result after compilation - bzImage
What's there inside
objdump -D bzImage
Let us see how this kernel works.
Let's start from the boot process.
Booting Process
I. BIOS selects the boot device.
II. BIOS loads the bootsector from the boot device.
III. Bootsector loads setup, decompression routines and
compressed kernel image.
IV. The kernel is uncompressed in protected mode.
V. Low-level initialization is performed by asm code.
VI. High-level C initialization.
Mapping of Kernel and
other peripherals
Initializations – asm
I. Initialize segment values.
II. Initialize page tables.
III. Enable paging by setting PG bit in %cr0.
IV. Zero-clean BSS (on SMP, only first CPU does this).
V. Copy the first 2k of bootup parameters (kernel command
line).
VI. Check CPU type using EFLAGS and, if possible, cpuid, able
to detect 386 and higher.
VII. The first CPU calls start_kernel(), all others call
arch/i386/kernel/smpboot.c:initialize_secondary() if
ready=1, which just reloads esp/eip and doesn't return.
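Steps III and IV above can be modeled in user space. The real code is assembly and writes to the actual %cr0 register; the sketch below only computes the value and clears an ordinary buffer. The bit positions (PE is bit 0, PG is bit 31) come from the x86 architecture, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CR0_PE (1u << 0)   /* protection enable */
#define CR0_PG (1u << 31)  /* paging enable */

/* Value that would be written back to %cr0 to turn paging on. */
uint32_t cr0_enable_paging(uint32_t cr0)
{
    return cr0 | CR0_PG;
}

/* Zero the region [start, end), as the startup code does for BSS. */
void clear_bss(unsigned char *start, unsigned char *end)
{
    assert(start != NULL && end >= start);
    memset(start, 0, (size_t)(end - start));
}
```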
Initializations – high level
I. Take a global kernel lock (it is needed so that only one
CPU goes through initialization).
II. Perform arch-specific setup (memory layout analysis,
copying boot command line again, etc.).
III. Print Linux kernel "banner" containing the version.
IV. Initialize traps.
V. Initialize irqs.
VI. Initialize data required for scheduler.
VII. Initialize time keeping data.
VIII. Initialize softirq subsystem.
IX. Parse boot commandline options.
X. Initialize console.
XI. If module support was compiled into the kernel, initialize
dynamic module loading facility.
XII. If "profile=" command line was supplied, initialize
profiling buffers.
XIII. kmem_cache_init(), initialize most of slab allocator.
XIV. Enable interrupts.
XV. Calculate BogoMips value for this CPU.
XVI. Call mem_init() which calculates max_mapnr,
totalram_pages and high_memory and prints out the
"Memory: ..." line.
XVII. kmem_cache_sizes_init(), finish slab allocator
initialization.
XVIII. Initialize data structures used by procfs.
XIX. fork_init(), create uid_cache, initialise max_threads
based on the amount of memory available and configure
RLIMIT_NPROC for init_task to be max_threads/2.
XX. Create various slab caches needed for VFS, VM, buffer
cache, etc.
XXI. If System V IPC support is compiled in, initialise the IPC
subsystem. Note that for System V shm, this includes
mounting an internal (in-kernel) instance of shmfs
filesystem.
XXII. If quota support is compiled into the kernel, create and
initialise a special slab cache for it.
XXIII. Perform arch-specific "check for bugs" and, whenever
possible, activate workaround for processor/bus/etc
bugs. Comparing various architectures reveals that "ia64
has no bugs" and "ia32 has quite a few bugs", good
example is "f00f bug" which is only checked if kernel is
compiled for less than 686 and worked around
accordingly.
Finally, the kernel is ready to move_to_user_mode().
XXIV. Set a flag to indicate that a schedule should be invoked
at "next opportunity" and create a kernel thread init()
which execs execute_command if supplied via "init=" boot
parameter, or tries to exec /sbin/init, /etc/init,
/bin/init, /bin/sh in this order; if all these fail, panic
with "suggestion" to use "init=" parameter.
XXV. Go into the idle loop, this is an idle thread with pid=0.
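The fallback order in step XXIV can be sketched as a small selection function. This is a user-space model, not the kernel's code: the list of "available" paths stands in for whether an exec attempt would succeed, and the sketch assumes that a failing "init=" program falls through to the standard places. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fallback order taken from step XXIV. */
static const char *const init_candidates[] = {
    "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
};

/* Nonzero if path appears in the list of executable programs. */
static int is_available(const char *path,
                        const char *const *available, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(available[i], path) == 0)
            return 1;
    return 0;
}

/*
 * Returns the program that would become init, or NULL in the case
 * where the kernel would panic and suggest the "init=" parameter.
 */
const char *choose_init(const char *execute_command,
                        const char *const *available, size_t n)
{
    size_t i;

    assert(available != NULL || n == 0);
    if (execute_command != NULL &&
        is_available(execute_command, available, n))
        return execute_command;
    for (i = 0; i < sizeof(init_candidates) / sizeof(init_candidates[0]); i++)
        if (is_available(init_candidates[i], available, n))
            return init_candidates[i];
    return NULL;
}
```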
Working of Kernel
After exec()ing the init program from one of the
standard places, the kernel has no direct control over
the program flow.
Its role from now on is to provide processes with
system calls, as well as to service asynchronous
events.
Multitasking has been set up, and it is now init which
manages multiuser access by fork()ing system
daemons and login processes.
Whenever a program tries to use a system resource, it
uses a system call.
System Call Implementation
• The mechanism to signal the kernel is a software interrupt.
• The interrupt raises an exception, and the system switches to kernel mode and
executes the exception handler, which in this case is the system call handler.
• The defined software interrupt on x86 is the int $0x80 instruction.
• It triggers a switch to kernel mode and the execution of exception
vector 128, which is the system call handler.
• The system call handler is the aptly named function system_call(). It is
architecture dependent and typically implemented in assembly in
entry.S.
• Newer x86 processors added a feature known as sysenter. This feature
provides a faster, more specialized way of trapping into a kernel to
execute a system call than using the int interrupt instruction.
System Call Implementation
Denoting the Correct System Call
• On x86, the syscall number is fed to the kernel via the eax register.
• Before causing the trap into the kernel, user-space sticks in eax the
number corresponding to the desired system call.
• The system call handler then reads the value from eax.
• The system_call() function checks the validity of the given system call
number by comparing it to NR_syscalls.
• If it is larger than or equal to NR_syscalls, the function returns
-ENOSYS. Otherwise, the specified system call is invoked:
    call *sys_call_table(,%eax,4)
Because each element in the system call table is 32 bits (four bytes), the
kernel multiplies the given system call number by four to arrive at its
location in the system call table.
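The check-then-index logic described above can be modeled in user-space C with a table of function pointers. This is only a sketch: the three handlers and the table size are made up for illustration, and the real system_call() is assembly that indexes the table through %eax.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_syscalls 3  /* tiny table, for illustration only */

typedef long (*syscall_fn)(long a, long b);

/* Hypothetical handlers; real ones are the kernel's sys_*() functions. */
static long sys_add(long a, long b) { return a + b; }
static long sys_sub(long a, long b) { return a - b; }
static long sys_first(long a, long b) { (void)b; return a; }

static const syscall_fn sys_call_table[NR_syscalls] = {
    sys_add, sys_sub, sys_first,
};

/* Reject out-of-range numbers, otherwise call through the table. */
long dispatch(unsigned long nr, long a, long b)
{
    if (nr >= NR_syscalls)
        return -ENOSYS;
    assert(sys_call_table[nr] != NULL);
    return sys_call_table[nr](a, b);
}
```

In C the array index is scaled by the element size automatically, which corresponds to the explicit factor of four in the assembly addressing mode.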
System Call Implementation
Parameter Passing
In addition to the system call number, most syscalls require that
one or more parameters be passed to them. The easiest way to
do this is via the same means that the syscall number is passed:
• The parameters are stored in registers. On x86, the registers
ebx, ecx, edx, esi, and edi contain, in order, the first five
arguments.
• In the unlikely case of six or more arguments, a single register
is used to hold a pointer to user-space where all the
parameters are stored.
The return value is sent to user-space also via register. On x86,
it is written into the eax register.
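From user space, the same mechanism can be exercised without the usual library wrappers. The sketch below assumes a Linux system with glibc, whose syscall() function takes the system call number followed by its arguments and places them where the architecture's convention expects them (the registers listed above on 32-bit x86; other architectures use different registers). The wrapper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke getpid by number; it takes no arguments. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Invoke write by number with its three arguments. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    assert(buf != NULL);
    return syscall(SYS_write, fd, buf, len);
}
```

The value returned by syscall() is the system call's return value, which the kernel handed back in a register as described above.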
We have seen how system calls are implemented. But what about the
system calls themselves?
System calls are calls into the subsystems of the kernel.
Now let us look at the subsystems of the kernel.
Subsystem of Kernel
Human Interface
System Interface
Process Management
Memory Management
Storage Handling
Networking
Human Interface
The subsystem of the kernel required to handle the input and output of the
system.
It controls the functionality of:
• Keyboard
• Console screen
• Mouse
• Etc.
System Interface
Device drivers are part of the system interface, which is responsible for
interfacing the system with the peripherals and system hardware components.
Types of drivers:
• Character Drivers
• Block Drivers
• USB Drivers
• Network Drivers
Process Management
From the kernel's point of view, a process is an entry in the process table.
Nothing more.
The process table, then, is one of the most important data structures
within the system, together with the memory-management tables and the
buffer cache. The individual item in the process table is the task_struct
structure, defined in include/linux/sched.h.
The process table is both an array and a doubly linked list, as well as a
tree. The physical implementation is a static array of pointers, whose
length is NR_TASKS, a constant defined in include/linux/tasks.h, and each
structure resides in a reserved memory page. The list structure is
achieved through the pointers next_task and prev_task.
Process Management Cont.
After booting is over, the kernel is always working on behalf of one of the
processes, and the global variable current, a pointer to a task_struct
item, is used to record the running one. current is only changed by the
scheduler, in kernel/sched.c. When, however, all processes must be
looked at, the macro for_each_task is used. It is considerably faster than
a sequential scan of the array, when the system is lightly loaded.
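The list view of the process table can be sketched as a circular doubly linked list walked by a macro. This is a simplified model: the struct keeps only a pid and the two link pointers, and this for_each_task takes the list head as an explicit second argument, which is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for task_struct: only the list links and a pid. */
struct task {
    int pid;
    struct task *next_task;
    struct task *prev_task;
};

/* Visit every task except the head of the circular list. */
#define for_each_task(p, init) \
    for ((p) = (init)->next_task; (p) != (init); (p) = (p)->next_task)

/* Make a one-element circular list. */
void task_list_init(struct task *init)
{
    assert(init != NULL);
    init->next_task = init;
    init->prev_task = init;
}

/* Link t in just before init, i.e. at the tail of the list. */
void task_list_add(struct task *init, struct task *t)
{
    assert(init != NULL && t != NULL);
    t->next_task = init;
    t->prev_task = init->prev_task;
    init->prev_task->next_task = t;
    init->prev_task = t;
}

/* Count the tasks other than init by following the links. */
int count_tasks(struct task *init)
{
    struct task *p;
    int n = 0;

    for_each_task(p, init)
        n++;
    return n;
}
```

Walking the links touches only tasks that exist, whereas scanning a fixed-size array also visits empty slots; that is the speed difference mentioned above.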
A process is always running in either "user mode" or "kernel mode". The
main body of a user program is executed in user mode and system calls
are executed in kernel mode.
System calls, within the kernel, exist as C language functions, their
'official' name being prefixed by 'sys_'. A system call named, for
example, burnout invokes the kernel function sys_burnout().
Process Management
Creating processes
A Unix system creates a process through the fork() system call, and process
termination is performed either by exit() or by receiving a signal.
The Linux implementation for them resides in kernel/fork.c and
kernel/exit.c.
Fork’s main task is filling the data structure for the new process. Relevant
steps, apart from filling fields, are:
• getting a free page to hold the task_struct
• finding an empty process slot (find_empty_process())
• getting another free page for the kernel_stack_page
• copying the parent's LDT to the child
• duplicating mmap information of the parent
sys_fork() also manages file descriptors and inodes.
Process Management
Destroying processes
Exiting from a process is trickier, because the parent process must be
notified about any child who exits.
Moreover, a process can exit by being kill()ed by another process (these
are Unix features).
The file exit.c is therefore the home of sys_kill() and the various flavors
of sys_wait(), in addition to sys_exit().
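Seen from user space, creation and termination look like the sketch below: the parent calls fork(), the child exits, and the parent is notified of the exit by waiting for it. This uses the standard POSIX calls fork(), _exit() and waitpid(); the function name run_child is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits immediately with `code`, then wait for it.
 * Returns the child's exit status, or -1 on error or abnormal termination.
 */
int run_child(int code)
{
    pid_t pid;
    int status;

    assert(code >= 0 && code <= 255);
    pid = fork();
    if (pid < 0)
        return -1;          /* fork failed */
    if (pid == 0)
        _exit(code);        /* child: terminate at once */
    if (waitpid(pid, &status, 0) != pid)
        return -1;          /* parent: could not collect the child */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Until the parent calls waitpid(), the terminated child's entry is still kept by the kernel so that the exit status can be delivered.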
Process Management
Executing programs
• After fork()ing, two copies of the same program are running. One of them
usually exec()s another program.
• The exec() system call must locate the binary image of the executable file,
load and run it.
• The Linux implementation of exec() supports different binary formats. This is
accomplished through the linux_binfmt structure.
• Loading of shared libraries is implemented in the same source file as exec() is,
but let's stick to exec() itself.
• The Unix systems provide the programmer with six flavors of the exec()
function. All but one of them can be implemented as library functions, and the
Linux kernel implements sys_execve() alone.
It performs quite a simple task: loading the head of the executable, and trying to
execute it. If the first two bytes are "#!", then the first line is parsed and an
interpreter is invoked; otherwise the registered binary formats are tried
sequentially.
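The "#!" check can be illustrated with a small parser that looks at the head of a file and extracts the interpreter path. This is a user-space sketch under simplifying assumptions (the head is a NUL-terminated string, and only the path is extracted, not the optional argument); it is not the kernel's script loader, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * If `head` starts with "#!", copy the interpreter path into `out` and
 * return 1. Otherwise return 0, which is the case where the registered
 * binary formats would be tried instead.
 */
int parse_shebang(const char *head, char *out, size_t out_size)
{
    const char *p;
    size_t len;

    assert(head != NULL && out != NULL && out_size > 0);
    if (head[0] != '#' || head[1] != '!')
        return 0;
    p = head + 2;
    p += strspn(p, " \t");          /* skip blanks after "#!" */
    len = strcspn(p, " \t\n");      /* path ends at a blank or newline */
    if (len == 0 || len >= out_size)
        return 0;
    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}
```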
Process Management
State
As a process executes, it changes state according to its circumstances.
Linux processes have the following states:
• Running: The process is either running or it is ready to run
• Waiting: The process is waiting for an event or for a resource. Linux
differentiates between two types of waiting process; interruptible and
uninterruptible.
• Stopped: The process has been stopped, usually by receiving a signal.
A process that is being debugged can be in a stopped state.
• Zombie: This is a halted process which, for some reason, still has a
task_struct data structure in the task vector. It is what it sounds like, a
dead process.
The scheduler needs this information in order to fairly decide which process in
the system most deserves to run.
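These states are visible from user space as the one-letter codes printed by ps(1): R, S, D, T and Z. The sketch below maps those letters onto the states listed above. The enum and function names are hypothetical; the kernel itself represents states with its own TASK_* constants.

```c
#include <assert.h>

/* Hypothetical enum mirroring the states listed above. */
enum proc_state {
    STATE_RUNNING,
    STATE_WAIT_INTERRUPTIBLE,
    STATE_WAIT_UNINTERRUPTIBLE,
    STATE_STOPPED,
    STATE_ZOMBIE,
    STATE_UNKNOWN
};

/* Map the one-letter state code shown by ps(1) to a state. */
enum proc_state state_from_letter(char c)
{
    switch (c) {
    case 'R': return STATE_RUNNING;
    case 'S': return STATE_WAIT_INTERRUPTIBLE;
    case 'D': return STATE_WAIT_UNINTERRUPTIBLE;
    case 'T': return STATE_STOPPED;
    case 'Z': return STATE_ZOMBIE;
    default:  return STATE_UNKNOWN;
    }
}

/* Only running or ready processes are candidates for the CPU. */
int is_schedulable(enum proc_state s)
{
    assert(s >= STATE_RUNNING && s <= STATE_UNKNOWN);
    return s == STATE_RUNNING;
}
```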
Process Management
Process Handling - Schedulers
History of Schedulers
• O(n) scheduler - before 2.6
• O(1) scheduler - Ingo Molnar - 2.6.0 up to 2.6.23
• Rotating Staircase Deadline Scheduler - Con Kolivas - out-of-tree patch set
• Completely Fair Scheduler (CFS) - Ingo Molnar - 2.6.23 onward (including 3.18)
• Brain Fuck Scheduler (BFS) - Con Kolivas - out-of-tree patch set (3.18.1)
Processes System Calls
Scheduler
Memory Management
Linux uses segmentation + paging, which simplifies notation.
Linux uses only 4 segments:
2 segments (code and data/stack) for KERNEL SPACE, from 3 GB to 4 GB
2 segments (code and data/stack) for USER SPACE, from 0 GB to 3 GB
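The 3 GB boundary can be expressed as a constant and two helpers. The sketch assumes the usual 32-bit x86 value of 0xC0000000 for the start of kernel space and a simple linear mapping for directly mapped kernel memory; both are assumptions of this example rather than statements from the slides, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Start of kernel space with the 3 GB / 1 GB split described above. */
#define PAGE_OFFSET 0xC0000000u

/* Nonzero if a 32-bit virtual address lies in kernel space. */
int is_kernel_address(uint32_t vaddr)
{
    return vaddr >= PAGE_OFFSET;
}

/*
 * Physical address of a directly mapped kernel virtual address.
 * Assumes the linear mapping: virtual = physical + PAGE_OFFSET.
 */
uint32_t kernel_virt_to_phys(uint32_t vaddr)
{
    assert(is_kernel_address(vaddr));
    return vaddr - PAGE_OFFSET;
}
```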
Storage Handling
The Virtual Filesystem (sometimes called the Virtual File Switch or more
commonly simply the VFS) is the subsystem of the kernel that implements
the file and filesystem-related interfaces provided to user-space programs.
The VFS is the glue that enables system calls such as open(), read(), and
write() to work regardless of the filesystem or underlying physical
medium.
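The "glue" idea can be sketched as a table of function pointers that each filesystem fills in, with one generic entry point calling through it. This is a toy user-space model: the two "filesystems", the struct layout and all names here are invented for illustration and are much simpler than the kernel's real structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct file;

/* Per-filesystem operations: each filesystem supplies its own read. */
struct file_ops {
    long (*read)(struct file *f, char *buf, size_t len);
};

struct file {
    const struct file_ops *ops;  /* set by the owning filesystem */
    const char *data;            /* backing data for this toy example */
};

/* Toy filesystem A: returns the stored bytes, up to len. */
static long plain_read(struct file *f, char *buf, size_t len)
{
    size_t n = strlen(f->data);

    if (n > len)
        n = len;
    memcpy(buf, f->data, n);
    return (long)n;
}

/* Toy filesystem B: every file reads as empty. */
static long empty_read(struct file *f, char *buf, size_t len)
{
    (void)f; (void)buf; (void)len;
    return 0;
}

const struct file_ops plain_ops = { plain_read };
const struct file_ops empty_ops = { empty_read };

/* Generic entry point: the caller never needs to know the filesystem. */
long toy_vfs_read(struct file *f, char *buf, size_t len)
{
    assert(f != NULL && f->ops != NULL && f->ops->read != NULL);
    return f->ops->read(f, buf, len);
}
```

The caller uses toy_vfs_read() for every file; which implementation runs is decided by the ops pointer stored in the file.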
Networking
This layer is responsible for handling network packets.
The required protocol stacks are implemented here.
It is also responsible for encrypting and decrypting network
packets.
How To Program
How to use the features of the kernel or change existing things in the kernel.
Kernel Common APIs
Kernel APIs are documented here:
https://www.kernel.org/doc/htmldocs/kernel-api/
• Data Types
• Basic C Library Functions
• Basic Kernel Library Functions
• Memory Management in Linux
• Kernel IPC facilities
• FIFO Buffer
• relay interface support
• Module Support
• Hardware Interfaces
• Firmware Interfaces
• ……. Etc.
Kernel Symbol Usage
When modules are loaded, they are dynamically linked into the kernel. As with
user-space, dynamically linked binaries can call only into external functions that
are explicitly exported for use. In the kernel, this is handled via special directives
called EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL().
Functions that are exported are available for use by modules. Functions that are
not exported cannot be invoked from modules.
The set of exported kernel symbols is known as the exported kernel
interfaces or even the kernel API.
Exporting a symbol is easy. After the function is defined, it is usually followed by
an EXPORT_SYMBOL(). For example,
/* Return the color of the current pirate's beard. */
int get_pirate_beard_color(void)
{
        return pirate->beard->color;
}
EXPORT_SYMBOL(get_pirate_beard_color);
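The effect of exporting can be modeled in user space as a lookup table that maps names to function addresses: only listed functions can be resolved by name. This is an analogy, not how EXPORT_SYMBOL() is implemented, and every name in the sketch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*toy_fn)(void);

/* One entry of a toy export table: a name and the function's address. */
struct toy_symbol {
    const char *name;
    toy_fn fn;
};

/* Internal helper: deliberately left out of the export table. */
static int hidden_helper(void) { return 40; }

/* Exported function; it may still use the internal helper itself. */
static int get_answer(void) { return hidden_helper() + 2; }

/* Only functions listed here can be resolved by name. */
static const struct toy_symbol export_table[] = {
    { "get_answer", get_answer },
};

/* Return the exported function with this name, or NULL if not exported. */
toy_fn toy_resolve(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(export_table) / sizeof(export_table[0]); i++)
        if (strcmp(export_table[i].name, name) == 0)
            return export_table[i].fn;
    return NULL;
}
```

A "module" in this model would call toy_resolve("get_answer") and use the returned pointer; a request for "hidden_helper" fails because that function was never listed.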
Introduction to Mailing List & How to Contribute
Git commands used when creating and submitting a patch:
git diff
git commit
git show
git format-patch
git send-email
References
The Linux Documentation Project (TLDP)
http://www.tldp.org/LDP/lki/lki.html
Kernelnewbies.org
http://kernelnewbies.org/Documentation/Subsystems
Free-electrons
http://free-electrons.com
http://lxr.free-electrons.com
Kernel Map
http://www.makelinux.net/kernel_map/
Thank you
Samrat Das