From: Arnd Bergmann
To: ksummit-discuss@lists.linuxfoundation.org
Date: Sat, 30 Jul 2016 11:24:07 +0200
Subject: Re: [Ksummit-discuss] [TECH TOPIC] Bus IPC

On Friday, July 29, 2016 12:24:03 AM CEST David Herrmann wrote:
> Tom Gundersen and I would like to propose a technical session on
> in-kernel IPC systems. For roughly half a year now we have been
> developing (with others) a capability-based [1] IPC system for linux,
> called bus1 [2]. We would like to present bus1, start a discussion on
> open problems, and talk about the possible path toward upstream
> inclusion.
>
> While bus1 emerged out of the kdbus project, it is a new, independent
> project, designed from scratch. Its main goal is to implement an
> n-to-n communication bus on linux. A lot of inspiration is taken from
> DBus, from the most commonly used IPC systems of other OSs, and from
> related research projects (including Android Binder, OS-X/Hurd Mach
> IPC, Solaris Doors, Microsoft Midori IPC, seL4, Sandstorm's
> Cap'n'Proto, ...).
>
> The bus1 IPC system was designed to...
>
>  o be a machine-local IPC system. It is a fast communication channel
>    between local threads and processes, independent of the marshaling
>    format used.
>
>  o provide secure, reliable capability-based [1] communication. A
>    message is always invoked on a capability; the caller must own
>    that capability, otherwise it cannot perform the operation.
>
>  o efficiently support n-to-n communication. Every peer can
>    communicate with every other peer (given the right capabilities),
>    with minimal overhead for state-tracking.
>
>  o be well-suited for both unicast and multicast messages.
>
>  o guarantee a global message order [3], allowing clients to rely on
>    causal ordering between the messages they send and receive (for
>    further reading, see Leslie Lamport's work on distributed
>    systems [4]).
>
>  o scale with the number of CPUs available. There is no global
>    context specific to the bus1 IPC; all communication happens based
>    on local context only. That is, if two independent peers never
>    talk to each other, their operations never share any memory (no
>    shared locks, no shared state, etc.).
>
>  o avoid any in-kernel buffering and instead transfer data directly
>    from the sender into the receiver's mappable queue (single-copy).
>
> A user-space implementation of bus1 (or of any bus-based IPC) was
> considered, but was found to have several seemingly unavoidable
> issues:
>
>  o To guarantee reliable, global message ordering including
>    multicasts, as well as to provide reliable capabilities, a
>    bus-broker is required. In other words, the current linux syscall
>    API is not sufficient to implement the design described above
>    efficiently without a dedicated, trusted, privileged process that
>    manages the bus and routes messages between the peers.
>  o Whenever a bus-broker is involved, any message transaction between
>    two clients requires the broker process to execute code in its own
>    time-slice. While this time-slice can be distributed fairly across
>    clients, it is ultimately always accounted to the user running the
>    broker rather than to the originating user. Kernel time-slice
>    accounting and the accounting done in the broker are completely
>    separate and cannot take each other's data into account.
>    Furthermore, the broker needs to run with rather excessive
>    resource limits and execution rights to be able to serve requests
>    of high-priority peers, making the same resources available to
>    low-priority peers as well.
>    An in-kernel IPC mechanism removes the need for such a highly
>    privileged bus-broker and instead accounts every operation and
>    resource exactly to the calling user, cgroup, and process.
>
>  o Bus IPC often involves peers requesting services from other,
>    trusted peers and waiting for a possible result before continuing.
>    If such a trust relationship is given, privileged processes
>    actively want priority inheritance when calling into less
>    privileged, but trusted, processes. There is currently no known
>    way to implement this in a user-space broker without requiring n^2
>    PI-futex pairs.
>
>  o A user-space broker entails two UDS transactions and potentially
>    an extra context switch, compared to a single bus1 transaction
>    with the in-kernel broker. Our x86 benchmarks (run before any
>    serious optimization work) show that two UDS transactions are
>    always slower than one bus1 transaction. On top of that comes the
>    extra context switch, which has about the same cost as a full bus1
>    transaction, as well as any time spent in the broker itself. Even
>    with an imaginary no-overhead broker, we found the in-kernel
>    broker to be >40% faster. The numbers will differ between
>    machines, but the reduced latency is undeniable.
>
>  o Accounting of in-flight resources (e.g., file descriptors) in a
>    broker is completely broken. Right now, any outgoing message of a
>    broker will account FDs on the broker itself, yet there is no way
>    for the broker to track outgoing FDs. As such, it cannot attribute
>    them to the original sender of the FD, which opens the door to DoS
>    attacks.
>
>  o LSMs and audit cannot hook into the broker, nor can they get any
>    additional routing information. Thus, audit cannot log proper
>    information, and LSMs would have to hook into a user-space process
>    and rely on it to implement the wanted security model.
>
>  o The kernel itself can never operate on the bus, nor provide
>    services seamlessly to user-space (e.g., like netlink does),
>    unless the bus is implemented in the kernel.
>
>  o If a broker is involved, no communication can be ordered against
>    side-channels. A kernel implementation, on the other hand,
>    provides strong ordering against any other event happening on the
>    system.
>
> The implementation of bus1.ko, at less than 5k LOC, is relatively
> small, but still takes a considerable amount of time to review and
> understand. We would like to use the kernel summit as an opportunity
> to present bus1 and answer questions on its design, implementation,
> and use of other kernel subsystems. We encourage everyone to look
> into the sources, but we still believe that a personal discussion
> up-front would save everyone a lot of time and energy. Furthermore,
> it would also allow us to collectively solve the remaining issues.
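A quick aside on the UDS comparison above: the baseline being compared
against is easy to get a feel for on any given machine. The program
below is a purely illustrative socketpair ping-pong (it is not the
bus1 benchmark harness, and the iteration count and message size are
arbitrary); one round-trip corresponds to the two UDS transactions
mentioned above.

/*
 * Rough UNIX domain socket ping-pong, to illustrate what "two UDS
 * transactions" cost on a given machine.  Illustration only.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/wait.h>

#define ITERATIONS	100000L

int main(void)
{
	struct timespec start, end;
	char buf[64] = "ping";
	long long ns;
	int sv[2];
	long i;

	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0) {
		perror("socketpair");
		return 1;
	}

	if (fork() == 0) {
		/* child: echo every message straight back */
		for (i = 0; i < ITERATIONS; i++) {
			if (read(sv[1], buf, sizeof(buf)) <= 0)
				return 1;
			if (write(sv[1], buf, sizeof(buf)) < 0)
				return 1;
		}
		return 0;
	}

	/* parent: time the round-trips (two UDS transactions each) */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < ITERATIONS; i++) {
		if (write(sv[0], buf, sizeof(buf)) < 0 ||
		    read(sv[0], buf, sizeof(buf)) <= 0) {
			perror("transfer");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	wait(NULL);

	ns = (end.tv_sec - start.tv_sec) * 1000000000LL +
	     (end.tv_nsec - start.tv_nsec);
	printf("avg round-trip: %lld ns\n", ns / ITERATIONS);
	return 0;
}

Built with a plain "gcc -O2" (add -lrt on older glibc) and run a few
times, this gives a rough per-round-trip number; as noted above, it
varies quite a bit between machines.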
> Everyone interested in IPC is invited to the discussion. In
> particular, we would welcome everyone who participated in the Binder
> and kdbus discussions, or who is involved in shmem+memcg (or other
> bus1-related subsystems), possibly including:
>
>  o Andy Lutomirski
>  o Greg Kroah-Hartman
>  o Steven Rostedt
>  o Eric W. Biederman
>  o Jiri Kosina
>  o Borislav Petkov
>  o Michal Hocko (memcg)
>  o Johannes Weiner (memcg)
>  o Hugh Dickins (shmem)
>  o Tom Gundersen (bus1)
>  o David Herrmann (bus1)

I'd like to join in discussing the user interface. The current version
seems (compared to kdbus) simple enough that we could consider using
syscalls instead of a miscdev.

	Arnd
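P.S.: To make the miscdev-vs-syscall question a bit more concrete,
here is a sketch of the two user-space API styles. All names below
(example_*) are made up for illustration and do not reflect the actual
bus1 interface. With a miscdev, every operation is multiplexed through
ioctl() on a character-device fd; with dedicated syscalls, each
operation is a first-class entry point that strace, seccomp and audit
can see directly.

/* Illustration only: made-up names, not the actual bus1 user-space API. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/uio.h>
#include <unistd.h>

/* (a) miscdev flavour: one fd per peer, operations go through ioctl() */
struct example_cmd_send {
	uint64_t destination;	/* handle/capability naming the receiver */
	uint64_t ptr_vecs;	/* pointer to a struct iovec array (payload) */
	uint64_t n_vecs;
};
#define EXAMPLE_CMD_SEND _IOWR('x', 0x01, struct example_cmd_send)

int send_via_miscdev(int peer_fd, struct example_cmd_send *cmd)
{
	/* peer_fd was obtained by opening the bus character device */
	return ioctl(peer_fd, EXAMPLE_CMD_SEND, cmd);
}

/* (b) syscall flavour: the same operation as a dedicated entry point,
 * visible as such to strace, seccomp, audit, ... */
long example_send(int peer_fd, uint64_t destination,
		  const struct iovec *vecs, size_t n_vecs,
		  unsigned int flags);

Either style can express the same operations; the difference is mainly
in how the ABI is reviewed, filtered, and introspected.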