ipc in microkernel systems, capabilities
TRANSCRIPT
http://d3s.mff.cuni.czhttp://d3s.mff.cuni.cz/aosy
Martin Děcký
IPC in Microkernel Systems, Capabilities
IPC in Microkernel Systems, Capabilities
2Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Inter-Process CommunicationInter-Process Communication
Sharing data between processes (tasks)
Crossing the process isolation in a managed and predictable way
Technically, any means of sharing data can be considered IPC (e.g. files, networking, middleware)
In monolithic systems, this usually works without usinga dedicated IPC mechanism
Crucial for microkernel systems
In microkernel systems, even files and networking cannot be implemented without an IPC mechanism
3Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Classical IPCClassical IPC
POSIX signals
Since UNIX Version 4
Asynchronous notification sent to a process (thread)
Similar to level-triggered interrupts (including masking)
Sender uses the kill(2) syscallRun-time exceptions and state changes also cause signals (SIGFPE, SIGSEGV; SIGPIPE, SIGINT, SIGSTOP/SIGTSTP, SIGCONT, SIGTRAP)
Receiver thread is interrupted and a signal handler is executed (installed using signal(2) or sigaction(2))
Race conditions due to nested signals
Calling non-reentrant functions (e.g. malloc(), printf()) is undefined behavior
Interruption of some syscalls
Real-time signalsQueued, guaranteed sending order
4Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Classical IPC (2)Classical IPC (2)
Anonymous pipes
Named pipes
Persistent uni-directional pipes
Same API as files (anonymous pipes)
Pipe identification: File system i-node (bound to a directory entry)
No identification of senders on the receiver endWrites of data larger than PIPE_BUF bytes can be interleaved
Windows named pipes
Dedicated namespace (Named Pipe File System \\.\pipe\)
Non-persistent (removed when all clients close the pipe)
Anonymous pipes are named pipes with random names
5Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Classical IPC (3)Classical IPC (3)
UNIX domain sockets
Reliable bi-directional stream of bytes (akin to TCP), or
Unordered unreliable datagrams (akin to UDP), or
Reliable ordered stream of datagrams ...
... between local processes
Same API as BSD sockets
Socket identification: File system i-node (bound to a directory entry or to an abstract socket namespace)
Sending file descriptors (sendmsg(), rescvmsg())Rudimentary capabilities
6Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Classical IPC (4)Classical IPC (4)
Software shared memory
POSIX Shared Memory, System V Shared Memory
Persistent shared memory objects in dedicated namespaceIn Linux, objects created as tmpfs files (usually /dev/shm)
shm_open(3), mmap(2), munmap(2), shm_unlink(3) vs. shmget(2), shmat(2), shmdt(2)
Memory mapped files
Shared memory backed by a file (or anonymous memory)
mmap(2), munmap(2)
memfd_create(2)Removed when no longer referenced
File sealing
7Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Classical IPC (5)Classical IPC (5)
Message passing
Synchronous/asynchronous, blocking/non-blocking, symmetrical/asymmetrical/indirect addressing
POSIX message queues, System V Message Passing
Indirect addressing using a message queue (key for msgget(2), i-node for mq_open(3))
msgsnd(2), mq_send(3) asynchronous, non-blocking (unless the queue is full), msgrcv(2), mq_receive(3) blocking by default
Windows Messages
Symmetrical addressing using window/thread handles
SendMessage() synchronous, SendMessageCallback(), SendNotifyMessage(), PostMessage() asynchronous
GetMessage() blocking, PeekMessage() non-blocking
8Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Classical IPC (6)Classical IPC (6)
IPC abstractions
D-Bus
Single-node middleware replacing CORBA (GNOME) and DCOP (KDE)
Software bus abstraction (end-points communicating over a shared virtual channel)
System bus vs. session bus
End-points identified by a component string and unique connection (instance) name
Method calls and signals implemented on top of message passing– Synchronous one-to-one request-response (libdbus)
● UNIX domain sockets, TCP sockets, kdbus (abandoned), BUS1– Asynchronous publish/subscribe (dbus-daemon)
9Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Classical IPC (7)Classical IPC (7)
IPC abstractions
Doors
Synchronous remote procedure call
Originally implemented for Spring, later ported to Solaris
Request and reply buffer, request and reply list of file descriptors
Binder (OpenBinder)
Middleware and component framework, originally implemented for BeOS, now used by Android (similar to Microsoft COM)
Synchronous remote method invocationCustom kernel IPC mechanism
– Method invocation implemented as thread migration● BINDER_WRITE_READ ioctl with a request and reply buffer
– Object reference tracking
Windows Dynamic Data Exchange, Object Linking and Embedding, Component Object Model
Based on [Advanced] Local Procedure Call ([A]LPC)
10Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
CapabilitiesCapabilities
Capability
Object identifying an OS resource
Logical objects (open files, connections), typed memory areas (physical memory regions)
Capability reference
Local user space identification of a capability (file handles, virtual memory regions)
Operations with capabilities
Invoking a method with a capability reference
Permissible methods defined by the capability itself
11Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Trivial Capability ExampleTrivial Capability Example
kernel space
user space
read(0, ...);
0 1 2 3 file descriptor table(capabilities)
file descriptor(capability reference)
vfs_file_t operating system resource(open file)
12Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Trivial Capability Example (2)Trivial Capability Example (2)
kernel space
user space
struct msghdr msg;struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);// ...
memmove(CMSG_DATA(cmsg), &fd, sizeof(fd));sendmsg(socket, &msg, 0);
0 1 2 3
vfs_file_t
0 1 2 3
13Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Trivial Capability Example (2)Trivial Capability Example (2)
kernel space
user space
struct msghdr msg;struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);// ...
memmove(CMSG_DATA(cmsg), &fd, sizeof(fd));sendmsg(socket, &msg, 0);
0 1 2 3
vfs_file_t
struct msghdr msg;struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);// ...
recvmsg(socket, &msg, 0);
int fd;memmove(&fd, CMSG_DATA(cmsg), sizeof(fd));
0 1 2 3 4
14Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Capabilities as a Directed GraphCapabilities as a Directed Graph
kernel space
user space
00 01 11
cnodecap
cnode_t
cnodecap
untypedcap
cnode_t
untypedcap
endpointcap
pagecap
untypedcap cnode_t
mem_region_t
cspace
cref_t
15Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Capabilities as a Directed Graph (2)Capabilities as a Directed Graph (2)
Capability space (cspace)
Directed graph of capability nodes (cnodes) belonging to a task (defines the root cnode)
Capability nodes (cnodes)
Fixed array of capabilities (power-of-two sized)
Given number of bits from the capability reference identifies the capability slot
cnode capability points to another cnodeThe node contains a radix value to define how many bits should be skipped
Other capability types point to other kernel objects
Similar to hierarchical page tables
16Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Operations with CapabilitiesOperations with Capabilities
Cloning
Creating a new capability pointing to the same resource as the original capability
Minting
Creating a new capability with a subset of permissions to the original resource
Deriving
Creating a new capability based on a hierarchy of capability types
Untyped memory (representing a region of physical memory)Can be split into several untyped memory regions
Can be retyped to a different capability (thus creating a different type of kernel object)
Capability derivation tree
Maintaining sole authority on the cloned/minted/derived capabilities by the original owner of the capability
17Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
User Space-managed Capability SpaceUser Space-managed Capability Space
Microkernel provides capability mechanisms
User space is responsible for managing the cspace by deriving capabilities from the untyped memory capability and managing cnodes
untypedcap
untypedcap
untypedcap
untypedcap
untypedcap
untypedcap
untypedcap
cnodecap
TCBcap
L1 PTcap
L2 PTcap
18Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
User Space-managed Capability Space (2)User Space-managed Capability Space (2)
History of capability-based microkernels
HYDRA (Carnegie-Mellon, William Wulf, 1971)
Targeting early MIMD architecture (16 PDP-11 computing nodes)
GNOSIS, KeyKOS (Tymshare Inc., Key Logic, 1979)
IBM S/370, emulating VM, MVS and UNIX environments
EROS, CapROS, Coyotos (Johns Hopinks University, University of Pennsylvania, Jonathan Shapiro, 1991)
386, 486
seL4 (NICTA, General Dynamics, 2006)
Capability-based mechanisms inspired by Jonathan Shapiro
19Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Mach IPCMach IPC
Prototypical asynchronous message passing
Ports
Receive end-points and associated message queues
Port rights
Client capabilities for accessing a port (send, receive, send-once)Only a single server can have a receive right
Each task has an initial set of port rightsCommunicating with the kernel, etc.
Tagged message structure
Kernel enforces type correctness
Port rights can be also passed
Timeouts
20Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Mach IPC (2)Mach IPC (2)
Prototypical performance issues
50% IPC overhead compared to monolithic UNIX
With a single UNIX server
Complex kernel-side code
Tagged data type evaluation, handling of timeouts, etc.
Dynamic data structuresBut unfortunately only linked lists
Excessive cache footprint
Asynchronicity rarely used in practice
User space tasks (originally ported from UNIX) usually written for sequential performance and blocking I/O
21Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
The Era of Synchronous IPCThe Era of Synchronous IPC
L3 (1988), L4 (1993) by Jochen Liedke [1]
3% IPC overhead compared to monolithic UNIX
With a single UNIX server
Single IPC call overhead comparable to single syscall overhead in UNIX (approx. 20 times faster than on Mach)
Synchronous IPC
Explicit client/server rendez-vous and thread migrationNo need for full context switch (just address space switch)
No buffering, no scheduling, data passed directly in registers
Highly target-optimized implementationSmall working set, cache-friendly code
No complex algorithms (capabilities, etc.)
22Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
The Era of Synchronous IPC (2)The Era of Synchronous IPC (2)
[2]
23Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
The Era of Synchronous IPC (3)The Era of Synchronous IPC (3)
Drawbacks
The microkernel is unportable (by design)
The readability and maintainability of the IPC code is very poor
Preoccupation with single-threaded performance conflicted with other goals
Design issues of synchronous IPC
Unresponsive server blocks the client indefinitivelyInitial response: Synchronous IPC with timeouts
– In hindsight a bad solution [3]
Asynchronous communication on top of synchronous IPCAbstraction inversion anti-pattern (e.g. multithreading)
Paradoxically, on massive multicore machines the scalability suffers
24Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
The Return of Asynchronous IPCThe Return of Asynchronous IPC
The best of both worlds
Reasonably simple, cache-friendly, fast-path kernel code
Bounded kernel buffers (additional buffering possible on the client user space side)
Intelligent bookkeeping data structures (hash tables, trees)
Simple IPC message structure (only integer payload that fits into registers)
Additional semantics for memory copying and memory sharing
Possibility to build rich abstractions in user space
Actors, agents, continuations, futures, promises
25Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
HelenOS IPCHelenOS IPC
Basic design
Asynchronous messages over uni-directional connections
6-integer payload (1st integer interpreted as method ID)
Bounded kernel buffers
Every message needs to be answered (6-integer return value)
Connections identified by capabilities
The receiver can forward a message via its connection to a different receiver (even recursively)
The responsibility to reply is passed on
Kernel events and hardware interrupts generate IPC messages (no reply needed)
Initially, every task has just the connection to the naming service
26Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
HelenOS IPC (2)HelenOS IPC (2)
Kernel API
Global method IDs with special semantics
IPC_M_CONNECTION_CLONE (clone a connection capability from the client to the server)
IPC_M_CONNECT_TO_ME (establish a callback connection)
IPC_M_CONNECT_ME_TO (establish a new connection)When forwarded, the connection is established to the next receiver
– A broker (naming service, location service, device manager, VFS) connects a client to the target server
IPC_M_SHARE_IN / IPC_M_SHARE_OUT (receive/send a shared virtual address space area)
IPC_M_DATA_READ / IPC_M_DATA_WRITE (receive/send data)
IPC_M_STATE_CHANGE_AUTHORIZE (update a server state on behalf of a different client)
Three-way handshake
27Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
HelenOS IPC (3)HelenOS IPC (3)
Async framework
Writing single-threaded IPC code that can make use of the parallelism
User space-scheduler cooperative threads (fibrils)Preempted only when blocking on IPC calls
Abstracting the low-level IPC connections into sessions
Each session can have a different threading model
Grouping of atomic IPC messages into logical exchanges
28Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
HelenOS IPC (4)HelenOS IPC (4)
async_exch_t *ns_exch = async_exchange_begin(session_ns);
async_sess_t *sess =async_connect_me_to_iface(ns_exch, INTERFACE_VFS, SERVICE_VFS, 0);
async_exchange_end(ns_exch);
async_exch_t *exch = async_exchange_begin(sess);
ipc_call_t answer;aid_t req =
async_send_3(exch, VFS_IN_OPEN, lflags, oflags, 0, &answer);
async_data_write_start(exch, path, path_size);
async_exchange_end(exch);
// Do some other useful work in the meantime
sysarg_t rc;async_wait_for(req, &rc);
if (rc == EOK)fd = (int) IPC_GET_ARG1(answer);
29Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
Q&A
30Martin Děcký, Advanced Operating Systems, March 10th 2017 IPC, Capabilities
ReferencesReferences
[1] Härtig H., Hohmuth M., Liedtke J., Schönberg S., Wolter J.: The Performance of μ-Kernel-Based Systems, Proceedings of 16th ACM Symposium on Operating Systems Principles (SOSP), ACM, 1997 [2] Gernot Heiser: https://en.wikipedia.org/wiki/File:L4_family_tree.png[3] Heiser G., Elphinstone K.: L4 Microkernels: The Lessons from 20 Years of Research and Deployment, ACM Transactions on Computer Systems (TOCS), Volume 34, Issue 1, 2016