CS533 Concepts of Operating Systems
Class 6
Micro-kernels: Mach vs L3 vs L4

Binary Compatibility
Emulation libraries
  o Trampoline mechanism
Single server architecture
Multi-server architecture
  o IPC overhead proportional to number of servers (independent protection domains)

Optimizing IPC
Liedtke argues Mach’s overhead is due to poor implementation!
Optimized IPC implementation in L3
  o Architectural level
    • System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks.
  o Algorithmic level
    • Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages.
  o Interface level
    • Unnecessary Copies, Parameter passing.
  o Coding level
    • Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.

L3 IPC Performance vs Mach IPC

L3 RPC Performance vs Previous Systems

But Is That Enough?
What is the impact on overall system performance?
Haertig et al. explore performance and extensibility of L4-based Linux OS vs Mach-based Linux and native Linux
  o L4 has even more IPC optimizations than L3!

L4Linux – Design & Implementation
Fully binary compliant with Linux/X86
Restricted modifications to architecture-dependent part of Linux
No Linux-specific modifications to L4 kernel

Experiment
What is the penalty of using L4Linux?
  o Compare L4Linux to native Linux
Does the performance of the underlying micro-kernel matter?
  o Compare L4Linux to MkLinux
Does co-location improve performance?
  o Compare L4Linux to an in-kernel version of MkLinux

Microbenchmarks
Measured system call overhead on the shortest system call, “getpid()”.

Microbenchmarks (cont.)
Measured specific system calls to determine basic performance.

Macrobenchmarks
Measured the time to recompile the Linux server.

Macrobenchmarks (cont.)
Next, used a commercial test suite to simulate a system under full load.

Performance Analysis
L4Linux is, on average, 8.3% slower than native Linux. Only 6.8% slower at maximum load.
MkLinux: 49% slower on average, 60% at maximum load.
Co-located MkLinux: 29% slower on average, 37% at maximum load.
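
To put the percentages in perspective, here is a minimal sketch. It assumes "X% slower" means a job takes X% longer than on native Linux (the slide does not define the metric), and the 100-second job in the usage note is a made-up example, not a measurement.

```python
# Slowdown percentages quoted on the slide, relative to native Linux.
SLOWDOWN_PCT = {
    "L4Linux": {"average": 8.3, "max_load": 6.8},
    "MkLinux": {"average": 49.0, "max_load": 60.0},
    "MkLinux (co-located)": {"average": 29.0, "max_load": 37.0},
}


def relative_time(native_seconds, slowdown_pct):
    """Run time of a job that takes native_seconds on native Linux,
    assuming the slowdown applies directly to run time."""
    return native_seconds * (1.0 + slowdown_pct / 100.0)


def overhead_ratio(system_a, system_b, load="average"):
    """How many times larger system_a's overhead is than system_b's."""
    return SLOWDOWN_PCT[system_a][load] / SLOWDOWN_PCT[system_b][load]
```

Under that assumption, a hypothetical job of 100 s on native Linux would take about 108.3 s on L4Linux, 129 s on co-located MkLinux, and 149 s on MkLinux; the average MkLinux overhead is roughly six times that of L4Linux.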

Conclusion?
Can hardware-based protection be made to work efficiently enough?
Did these experiments explore the cost of “fine grained” protection?

Spare Slides

The IPC Dilemma
IPC is very important in μ-kernel design
  o Increases modularity, flexibility, security and scalability.
Past implementations have been inefficient.
  o Message transfer takes 50 - 500 μs.

The L3 (μ-kernel based) OS
A task consists of:
  o Threads
    • Communicate via messages that consist of strings and/or memory objects.
  o Dataspaces
    • Memory objects.
  o Address space
    • Where dataspaces are mapped.

Redesign Principles
IPC performance is the Master.
All design decisions require a performance discussion.
If something performs poorly, look for new techniques.
Synergetic effects have to be taken into consideration.
The design has to cover all levels from architecture down to coding.
The design has to be made on a concrete basis.
The design has to aim at a concrete performance goal.

Achievable Performance
A simple scenario
  o Thread A sends a null message to thread B
  o Minimum of 172 cycles
Will aim at 350 cycles (7 μs)
  o Will actually achieve 250 cycles (5 μs)
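
The cycle counts and times above imply a clock rate, which a few lines of arithmetic make explicit. This is only a sketch: the 50 cycles per microsecond figure is derived from the slide's own pair "350 cycles (7 μs)", and the function names are illustrative.

```python
# Clock rate implied by the slide: 350 cycles take 7 microseconds.
CYCLES_PER_US = 350 / 7  # 50 cycles per microsecond, i.e. a 50 MHz clock


def cycles_to_us(cycles):
    """Convert a cycle count to microseconds at the implied clock rate."""
    return cycles / CYCLES_PER_US


def fraction_over_minimum(cycles, minimum=172):
    """How far a cycle count lies above the 172-cycle minimum, as a fraction."""
    return (cycles - minimum) / minimum
```

At that rate the 172-cycle minimum corresponds to about 3.44 μs, so the achieved 250 cycles (5 μs) are roughly 45% above the minimum, while the 350-cycle goal would have been about twice the minimum.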

Levels of the redesign
Architectural
  o System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks.
Algorithmic
  o Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages.
Interface
  o Unnecessary Copies, Parameter passing.
Coding
  o Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.

Architectural Level
System Calls
  o Expensive! So, require as few as possible.
  o Implement two calls:
    • Call
    • Reply & Receive Next
  o Combines sending an outgoing message with waiting for an incoming message.
    • Schedulers can handle replies the same as requests.
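
The saving from the two combined calls can be illustrated by counting kernel entries per RPC. The sketch below is a toy model, not L3 code: the function names and the labels in the counters are made up for illustration, and it assumes a conventional interface needs separate send and receive calls on both sides.

```python
from collections import Counter


def rpc_with_separate_calls(requests):
    """Conventional interface: send and receive are separate system calls."""
    syscalls = Counter()
    for _ in range(requests):
        syscalls["client send"] += 1      # client sends the request
        syscalls["server receive"] += 1   # server picks the request up
        syscalls["server send"] += 1      # server sends the reply
        syscalls["client receive"] += 1   # client waits for the reply
    return syscalls


def rpc_with_combined_calls(requests):
    """Combined interface: 'call' and 'reply & receive next'."""
    syscalls = Counter()
    if requests:
        syscalls["server receive"] += 1   # one initial wait for the first request
    for _ in range(requests):
        syscalls["client call"] += 1                  # send request + wait for reply
        syscalls["server reply & receive next"] += 1  # send reply + wait for next request
    return syscalls
```

For 10 RPCs the first model enters the kernel 40 times and the second 21 times, so in steady state the combined calls halve the number of system calls per RPC.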

Messages
Complex Messages:
  o Direct String, Indirect Strings (optional)
  o Memory Objects
Used to combine sends if no reply is needed.
Can transfer values directly from sender’s variables to receiver’s variables.
[Figure: A Complex Message]

Direct Transfer
Each address space has a fixed kernel accessible part.
  o Messages transferred via the kernel part
  o User A space -> Kernel -> User B space
  o Requires 2 copies.
  o Larger messages lead to higher costs
[Figure: User A, User B, Kernel]

Shared User Level memory (LRPC, SRC RPC)
  o Security can be penetrated.
  o Cannot check message’s legality.
  o Long messages -> address space becoming a critical resource.
  o Explicit opening of communication channels.
  o Not application friendly.

Temporary Mapping
L3 uses a Communication Window
  o Only kernel accessible, and exists per address space.
  o Target region is temporarily mapped there.
  o Then the message is copied to the communication window and ends up in the correct place in the target address space.
[Figure: User A, User B, Kernel]

Temporary Mapping
Must be fast!
2 level page table only requires one word to be copied.
  o pdir A -> pdir B
TLB must be clean of entries relating to the use of the communication window by other operations.
  o One thread
    • TLB is always “window clean”.
  o Multiple threads
    • Interrupts – TLB is flushed
    • Thread switch – Invalidate Communication window entries.
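
The difference between the two-copy path and the communication window can be sketched with a toy model. Everything here is an illustrative assumption rather than L3's actual layout: each page-directory entry is modelled directly as the memory it maps, `REGION_SIZE` and `WINDOW_SLOT` are invented constants, and TLB handling is not modelled at all.

```python
REGION_SIZE = 64   # bytes covered by one directory entry in this toy model
WINDOW_SLOT = 3    # hypothetical directory slot reserved for the window


class AddressSpace:
    def __init__(self, slots=4):
        # Each directory entry refers to a second-level table; here the
        # table is modelled directly as the memory it maps.
        self.pdir = [bytearray(REGION_SIZE) for _ in range(slots)]
        self.pdir[WINDOW_SLOT] = None  # window stays unmapped until needed


def send_via_kernel_buffer(src, src_slot, dst, dst_slot, offset, length):
    """Two copies: sender -> kernel buffer -> receiver."""
    kernel_buffer = bytes(src.pdir[src_slot][offset:offset + length])  # copy 1
    dst.pdir[dst_slot][offset:offset + length] = kernel_buffer         # copy 2
    return 2


def send_via_window(src, src_slot, dst, dst_slot, offset, length):
    """One copy: alias the target region in the window, then copy once."""
    src.pdir[WINDOW_SLOT] = dst.pdir[dst_slot]  # one directory word copied
    window = src.pdir[WINDOW_SLOT]
    window[offset:offset + length] = src.pdir[src_slot][offset:offset + length]
    src.pdir[WINDOW_SLOT] = None                # drop the temporary mapping
    return 1
```

Because the window entry refers to the same region as the receiver's directory entry, the single copy lands directly in the receiver's memory. Setting up the window costs one word regardless of message length, whereas the kernel-buffer path pays for a second copy of every byte.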

Strict Process Orientation
Kernel mode handled in same way as User mode
One kernel stack per thread
May lead to a large number of stacks
  o Minor problem if stacks are objects in virtual memory

Thread Control Blocks (tcb’s)
Hold kernel, hardware, and thread-specific data.
Stored in a virtual array in shared kernel space.
[Figure: User area and Kernel area; tcb and Kernel stack]

Tcb Benefits
Fast tcb access
Saves 3 TLB misses per IPC
Threads can be locked by unmapping the tcb
Helps make thread persistent
IPC independent from memory management

Algorithmic Level
Thread ID’s
  o L3 uses a 64 bit unique identifier (uid) containing the thread number.
  o Tcb address is easily obtained:
    • AND the lower 32 bits with a bit mask and add the tcb base address.
Virtual Queues
  o Busy queue, present queue, polling-me queue.
  o Unmapping the tcb includes removal from queues
    • Prevents page faults while parsing/adding/deleting from the queues.
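
The uid-to-tcb computation described above is a mask and an add. The sketch below shows the idea with invented constants: `TCB_BASE`, `TCB_SIZE`, `THREAD_BITS`, and the position of the thread number inside the uid are all assumptions for illustration, since the slide does not give L3's real layout.

```python
# All constants are hypothetical; the slide does not give L3's real layout.
TCB_BASE = 0xE000_0000   # assumed start of the tcb array in kernel space
TCB_SIZE = 1 << 10       # assumed tcb size in bytes (a power of two)
THREAD_BITS = 12         # assumed width of the thread-number field

# Assume the thread number sits in the low 32 bits of the uid, pre-shifted
# so that masking it out directly yields the byte offset of the tcb.
TCB_MASK = ((1 << THREAD_BITS) - 1) * TCB_SIZE


def make_uid(thread_no, low_tag, high_word):
    """Build a 64-bit uid; low_tag and high_word stand for other uid fields."""
    if not 0 <= thread_no < (1 << THREAD_BITS):
        raise ValueError("thread number out of range")
    if not 0 <= low_tag < TCB_SIZE:
        raise ValueError("low tag out of range")
    low = (thread_no * TCB_SIZE) | low_tag
    return (high_word << 32) | low


def tcb_address(uid):
    """AND the lower 32 bits with a mask and add the tcb base address."""
    low32 = uid & 0xFFFF_FFFF
    return TCB_BASE + (low32 & TCB_MASK)
```

No table lookup or multiplication is needed at IPC time: under these assumptions thread 5 resolves to `TCB_BASE + 5 * TCB_SIZE` whatever the other uid fields contain.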

Algorithmic Level
Timeouts and Wakeups
  o Operation fails if message transfer has not started t ms after invoking it.
  o Kept in n unordered wakeup lists.
    • A new thread’s tcb is linked into the list τ mod n.
  o Threads with wakeups far away are kept in a long time wakeup list and reinserted into the normal lists when time approaches.
  o Scheduler will only have to check k/n entries per clock interrupt.
  o Usually costs less than 4% of ipc time.
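
The wakeup lists can be sketched as n unordered buckets. This is a simplified model under stated assumptions: τ is taken to be the wakeup time in clock ticks, k the number of pending wakeups, the class and method names are invented, and the separate long time wakeup list is not modelled.

```python
class WakeupLists:
    """n unordered wakeup lists; a wakeup at tick tau lives in list tau mod n."""

    def __init__(self, n):
        self.n = n
        self.lists = [[] for _ in range(n)]

    def insert(self, thread, wakeup_tick):
        # Unordered insert: constant time, no sorting.
        self.lists[wakeup_tick % self.n].append((wakeup_tick, thread))

    def cancel(self, thread, wakeup_tick):
        # The IPC started in time, so the timeout is no longer needed.
        self.lists[wakeup_tick % self.n].remove((wakeup_tick, thread))

    def tick(self, now):
        """Run on each clock interrupt; scans only the list now mod n.

        Returns the threads whose wakeup is due and how many entries
        were inspected.
        """
        bucket = self.lists[now % self.n]
        inspected = len(bucket)
        due = [thread for (wakeup, thread) in bucket if wakeup <= now]
        bucket[:] = [(w, t) for (w, t) in bucket if w > now]
        return due, inspected
```

With k pending wakeups spread evenly over the n lists, each clock interrupt inspects about k/n entries instead of all k, which is the saving the slide refers to.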

Algorithmic Level
Lazy Scheduling
  o Only a thread state variable is changed (ready/waiting).
  o Deletion from queues happens when queues are parsed.
    • Reduces delete operations.
    • Reduces insert operations when a thread needs to be inserted that hasn’t been deleted yet.
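
Lazy scheduling can be sketched with a ready queue that tolerates stale entries. This is a toy model, not L3's scheduler: the class names, the `queued` flag, and the insert/delete counters are illustrative assumptions.

```python
from collections import deque

READY, WAITING = "ready", "waiting"


class Thread:
    def __init__(self, name):
        self.name = name
        self.state = WAITING
        self.queued = False  # is this thread still linked into the ready queue?


class LazyReadyQueue:
    def __init__(self):
        self.queue = deque()
        self.inserts = 0
        self.deletes = 0

    def make_ready(self, thread):
        thread.state = READY
        if not thread.queued:  # still linked from earlier: no insert needed
            self.queue.append(thread)
            thread.queued = True
            self.inserts += 1

    def block(self, thread):
        thread.state = WAITING  # only the state variable changes

    def pick_next(self):
        """Parse the queue, dropping stale (non-ready) entries on the way."""
        while self.queue:
            thread = self.queue[0]
            if thread.state == READY:
                self.queue.rotate(-1)  # round robin: move it to the back
                return thread
            self.queue.popleft()
            thread.queued = False
            self.deletes += 1
        return None
```

A thread that blocks and becomes ready again before the scheduler next parses the queue causes no queue operation at all, which is the common case for a short IPC round trip.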

Algorithmic Level
Short messages via registers
  o Register transfers are fast
  o 50-80% of messages ≤ 8 bytes
  o Up to 8 byte messages can be transferred by registers with a decent performance gain.
  o May not pay off for other processors.

Interface Level
Unnecessary Copies
  o Message objects grouped by types
  o Send/receive buffers structured in the same way
  o Use same variable for sending and receiving
    • Avoid unnecessary copies
Parameter Passing
  o Use registers whenever possible.
    • Far more efficient
    • Give compilers better opportunities to optimize code.

Coding Level
Cache Misses
  o Cache line fill sequence should match the usual data access sequence.
TLB Misses
  o Try and pack in one page:
    • Ipc related kernel code
    • Processor internal tables
    • Start/end of larger tables
    • Most heavily used entries

Coding Level
Registers
  o Segment register loading is expensive.
  o One flat segment covering the complete address space.
    • On entry, kernel checks if registers contain the flat descriptor.
    • Guarantees they contain it when returning to user level.
Jumps and Checks
  o Basic code blocks should be arranged so that as few jumps are taken as possible.
Process switch
  o Save/restore of stack pointer and address space only invoked when really necessary.

L4 Slides

Introduction
μ-kernels have a reputation for being too slow and inflexible
Can a 2nd generation μ-kernel (L4) overcome these limitations?
Experiment:
  o Port Linux to run on L4
  o Compare to native Linux and MkLinux (Linux on a 1st generation, Mach 3.0 derived μ-kernel)

Introduction (cont.)
Test speed of standard OS personality on top of fast μ-kernel: Linux implemented on L4
Test extensibility of system:
  o pipe-based communication implemented directly on μ-kernel
  o mapping-related OS extensions implemented as user tasks
  o user-level real-time memory management implemented
Test if L4 abstractions independent of platform

L4 Essentials
Based on threads and address spaces
Recursive construction of address spaces by user-level servers
  o Initial address space σ0 represents physical memory
  o Basic operations: granting, mapping, and unmapping.
Owner of address space can grant or map page to another address space
All address spaces maintained by user-level servers (pagers)

L4Linux – Design & Implementation
Address Spaces
  o Initial address space σ0 represents physical memory
  o Basic operations: granting, mapping, and unmapping.
  o L4 uses “flexpages”: logical memory ranging from one physical page up to a complete address space.
  o An invoker can only map and unmap pages that have been mapped into its own address space
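
The three address-space operations can be sketched as bookkeeping over who received a page from whom. This is a toy model under stated assumptions: it handles single pages rather than flexpages, the class and function names are invented, and it is not the L4 kernel interface.

```python
class AddressSpace:
    def __init__(self, name):
        self.name = name
        self.pages = {}     # virtual page -> physical frame
        self.children = {}  # virtual page -> [(space, vpage)] derived from it
        self.parent = {}    # virtual page -> (space, vpage) it came from


def sigma0(num_frames):
    """Initial address space: every physical frame, mapped one to one."""
    space = AddressSpace("sigma0")
    space.pages = {frame: frame for frame in range(num_frames)}
    return space


def _check_owned(space, vpage):
    if vpage not in space.pages:
        raise PermissionError("page is not in the invoker's address space")


def map_page(src, src_vp, dst, dst_vp):
    """Share a page: it stays in src and also appears in dst."""
    _check_owned(src, src_vp)
    dst.pages[dst_vp] = src.pages[src_vp]
    dst.parent[dst_vp] = (src, src_vp)
    src.children.setdefault(src_vp, []).append((dst, dst_vp))


def grant_page(src, src_vp, dst, dst_vp):
    """Hand a page over: it appears in dst and disappears from src."""
    _check_owned(src, src_vp)
    dst.pages[dst_vp] = src.pages.pop(src_vp)
    kids = src.children.pop(src_vp, [])
    dst.children[dst_vp] = kids           # derived mappings follow the page
    for child, child_vp in kids:
        child.parent[child_vp] = (dst, dst_vp)
    origin = src.parent.pop(src_vp, None)
    if origin is not None:                # re-link to whoever mapped it to src
        siblings = origin[0].children[origin[1]]
        siblings[siblings.index((src, src_vp))] = (dst, dst_vp)
        dst.parent[dst_vp] = origin


def unmap_page(space, vpage):
    """Revoke the page from all spaces that received it; the invoker keeps it."""
    _check_owned(space, vpage)
    for child, child_vp in space.children.pop(vpage, []):
        unmap_page(child, child_vp)       # recurse into further mappings
        del child.pages[child_vp]
        child.parent.pop(child_vp, None)
```

Because unmapping recurses through every derived mapping, a pager such as the Linux server can always take back memory it handed to a user process, no matter how far that process passed it on.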

L4Linux – Design & Implementation
Address Spaces (cont.)
  o I/O ports are parts of address spaces.
  o Hardware interrupts are handled by user-level processes. The L4 kernel will send a message via IPC.

L4Linux – Design & Implementation
The Linux server
  o L4Linux will use a single-server approach.
  o A single Linux server will run on top of L4, multiplexing a single thread for system calls and page faults.
  o The Linux server maps physical memory into its address space, and acts as the pager for any user processes it creates.
  o The Server cannot directly access the hardware page tables, and must maintain logical pages in its own address space.

L4Linux – Design & Implementation
Interrupt Handling
  o All interrupt handlers are mapped to messages.
  o The Linux server contains threads that do nothing but wait for interrupt messages.
  o Interrupt threads have a higher priority than the main thread.

L4Linux – Design & Implementation
User Processes
  o Each different user process is implemented as a different L4 task: it has its own address space and threads.
  o The Linux Server is the pager for these processes. Any fault by the user-level processes is sent by RPC from the L4 kernel to the Server.

L4Linux – Design & Implementation
System Calls
  o Three system call interfaces:
    • A modified version of libc.so that uses L4 primitives.
    • A modified version of libc.a
    • A user-level exception handler (trampoline) calls the corresponding routine in the modified shared library.
  o The first two options are the fastest. The third is maintained for compatibility.
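
The extra steps on the trampoline path can be made visible with a small trace. This is only a schematic sketch: the function names are invented, the pid value is arbitrary, and it assumes that an unmodified binary's trap instruction raises an exception that is reflected to the user-level handler.

```python
def linux_server(request):
    """Stand-in for the Linux server answering one system-call IPC."""
    handlers = {"getpid": lambda: 42}  # 42 is an arbitrary illustrative pid
    return handlers[request]()


def l4_ipc_call(request, trace):
    trace.append("ipc call to Linux server")
    return linux_server(request)


def modified_libc_getpid(trace):
    """Modified libc path: the library stub issues the IPC directly."""
    trace.append("libc stub")
    return l4_ipc_call("getpid", trace)


def unmodified_binary_getpid(trace):
    """Unmodified binary: the trap is reflected to the trampoline, which
    calls the corresponding routine in the modified library."""
    trace.append("trap instruction")
    trace.append("exception reflected to trampoline")
    return modified_libc_getpid(trace)
```

Both paths return the same result, but the trampoline path adds a kernel entry and an exception reflection before the IPC even starts, which is why it is kept only for binary compatibility.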

L4Linux – Design & Implementation
Signalling
  o Each user-level process has an additional thread for signal handling.
  o Main server thread sends a message to the signal handling thread, telling the user thread to save its state and enter Linux

L4Linux – Design & Implementation
Scheduling
  o All thread scheduling is done by the L4 kernel
  o The Linux server’s schedule() routine is only used for multiplexing its single thread.
  o After each system call, if no other system call is pending, it simply resumes the user process thread and sleeps.

L4Linux – Design & Implementation
Tagged TLB & Small Space.
  o In order to reduce TLB conflicts, L4Linux has a special library to customize code and data for communicating with the Linux Server
  o The emulation library and signal thread are mapped close to the application, instead of default high-memory area.

Performance
What is the penalty of using L4Linux?
  o Compare L4Linux to native Linux
Does the performance of the underlying micro-kernel matter?
  o Compare L4Linux to MkLinux
Does co-location improve performance?
  o Compare L4Linux to an in-kernel version of MkLinux

Extensibility Performance
A micro-kernel must provide more than just the features of the OS running on top of it.
Specialization – improved implementation of OS functionality
Extensibility – permits implementation of new services that cannot be easily added to a conventional OS.

Pipes and RPC
(1) The first five use the standard pipe mechanism of the Linux kernel.
(2) Is asynchronous and uses only L4 IPC primitives. Emulates POSIX standard pipes, without signalling. Added thread for buffering and cross-address-space communication.
(3) Is synchronous and uses blocking IPC without buffering data.
(4) Maps pages into the receiver’s address space.

Virtual Memory Operations
The “Fault” operation is an example of extensibility – measures the time to resolve a page fault by a user-defined pager in a separate address space.
“Trap” – Latency between a write operation to a protected page, and the invocation of related exception handler.
“Appel1” – Time to access a random protected page. The fault handler unprotects the page, protects some other page, and resumes.
“Appel2” – Time to access a random protected page where the fault handler only unprotects the page and resumes.
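
The difference between the two Appel handlers can be sketched as follows. This is a toy model of page protection, not the benchmark itself: the class and function names are invented, timing is not modelled, and choosing "some other page" as the first unprotected one is an arbitrary assumption.

```python
class Pages:
    """Toy model of a set of pages that all start out protected."""

    def __init__(self, count):
        self.count = count
        self.protected = set(range(count))
        self.faults = 0

    def access(self, page, handler):
        """Touch a page; a protected page first traps to the handler."""
        if page in self.protected:
            self.faults += 1
            handler(self, page)
        return page not in self.protected  # True once the access can proceed


def appel1_handler(pages, faulting):
    """Unprotect the faulting page, then protect some other page."""
    pages.protected.discard(faulting)
    for other in range(pages.count):
        if other != faulting and other not in pages.protected:
            pages.protected.add(other)
            break


def appel2_handler(pages, faulting):
    """Only unprotect the faulting page."""
    pages.protected.discard(faulting)
```

With the first handler a page that was made accessible becomes protected again on the next fault, so later accesses to it trap once more; with the second handler each page faults exactly once.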

Conclusion
Using the L4 micro-kernel imposes a 5-10% slowdown relative to native Linux. Much faster than previous micro-kernels.
Further optimizations, such as co-locating the Linux Server and providing extensibility, could improve L4Linux even further.