Message-ID: <17702e7f-479a-22b8-70d9-56e418c8120b@huawei.com>
Date: Tue, 4 Jul 2023 17:18:43 +0200
Subject: Re: [QUESTION] Full user space process isolation?
From: Petr Tesarik
To: Roberto Sassu, Jann Horn
CC: Oleg Nesterov, Paul Moore, James Morris, "Serge E. Hallyn",
 Stephen Smalley, Eric Paris, Andrew Morton, Mimi Zohar, Kees Cook,
 Casey Schaufler, David Howells, Luis Chamberlain, Eric Biederman,
 Christoph Hellwig, Petr Mladek, Peter Zijlstra, Thomas Gleixner,
 Tejun Heo

On 7/3/2023 5:28 PM, Roberto Sassu wrote:
> On Mon, 2023-07-03 at 17:06 +0200, Jann Horn wrote:
>> On Thu, Jun 22, 2023 at 4:45 PM Roberto Sassu wrote:
>>> I wanted to execute some kernel workloads in a fully isolated user
>>> space process, started from a binary statically linked with klibc,
>>> connected to the kernel only through a pipe.
>>
>> FWIW, the kernel has some infrastructure for this already, see
>> CONFIG_USERMODE_DRIVER and kernel/usermode_driver.c, with a usage
>> example in net/bpfilter/.
>
> Thanks, I actually took that code to make a generic UMD management
> library, that can be used by all use cases:
>
> https://lore.kernel.org/linux-kernel/20230317145240.363908-1-roberto.sassu@huaweicloud.com/
>
>>> I also wanted that, for the root user, tampering with that process is
>>> as hard as if the same code runs in kernel space.
>>
>> I believe that actually making it that hard would probably mean that
>> you'd have to ensure that the process doesn't use swap (in other
>> words, it would have to run with all memory locked), because root can
>> choose where swapped pages are stored. Other than that, if you mark it
>> as a kthread so that no ptrace access is allowed, you can probably get
>> pretty close. But if you do anything like that, please leave some way
>> (like a kernel build config option or such) to enable debugging for
>> these processes.
>
> I didn't think about the swapping part... thanks!
>
> Ok to enable debugging with a config option.
>
>> But I'm not convinced that it makes sense to try to draw a security
>> boundary between fully-privileged root (with the ability to mount
>> things and configure swap and so on) and the kernel - my understanding
>> is that some kernel subsystems don't treat root-to-kernel privilege
>> escalation issues as security bugs that have to be fixed.
>
> Yes, that is unfortunately true, and in that case the trustworthy UMD
> would not make things worse. On the other hand, on systems where that
> separation is defined, the advantage would be to run more exploitable
> code in user space, leaving the kernel safe.
>
> I'm thinking about all the cases where the code had to be included in
> the kernel to run at the same privilege level, but would not use any of
> the kernel facilities (e.g. parsers).

Thanks for reminding me of kexec-tools. The complete image for booting a
new kernel was originally prepared in user space. With kernel lockdown,
all this code had to move into the kernel, adding a new syscall and lots
of complexity to build purgatory code, etc. Yet, this new implementation
in the kernel does not offer all features of kexec-tools, so both code
bases continue to exist and are happily diverging...

> If the boundary is extended to user space, some of these components
> could be moved away from the kernel, and the functionality would be the
> same without decreasing the security.

All right, AFAICS your idea is limited to relatively simple cases for
now. I mean, allowing kexec-tools to run in user space is not easily
possible when UID 0 is not trusted, because kexec needs to open various
files and make various other syscalls, which would require a complex LSM
policy. It looks technically possible to write one, but then the big
question is if it would be simpler to review and maintain than adding
more kexec-tools features to the kernel.

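For readers less familiar with the in-kernel path, the syscall that lockdown pushed this work into is kexec_file_load(2). The following is only a minimal sketch of the user-space side, not code from kexec-tools: the helper name load_kernel_image() and its error handling are made up for illustration, and the caller is assumed to have CAP_SYS_BOOT. The point is that user space hands over nothing but two file descriptors and a command line; parsing the image and building purgatory happen in the kernel.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from kexec-tools): pass a kernel image
 * and an initrd to the in-kernel loader.
 * Returns 0 on success, or -1 with errno set.
 */
long load_kernel_image(const char *kernel_path,
                       const char *initrd_path,
                       const char *cmdline)
{
    long ret = -1;
    int saved_errno;
    int kernel_fd, initrd_fd;

    kernel_fd = open(kernel_path, O_RDONLY | O_CLOEXEC);
    if (kernel_fd < 0)
        return -1;

    initrd_fd = open(initrd_path, O_RDONLY | O_CLOEXEC);
    if (initrd_fd < 0) {
        saved_errno = errno;
        close(kernel_fd);
        errno = saved_errno;
        return -1;
    }

#ifdef SYS_kexec_file_load
    /* glibc has no wrapper; cmdline_len must count the terminating NUL. */
    ret = syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
                  (unsigned long)strlen(cmdline) + 1, cmdline, 0UL);
#else
    /* Architecture or headers without this syscall number. */
    errno = ENOSYS;
#endif

    /* Preserve the syscall's errno across the cleanup. */
    saved_errno = errno;
    close(kernel_fd);
    close(initrd_fd);
    errno = saved_errno;
    return ret;
}
```

A caller would invoke it as load_kernel_image("/boot/vmlinuz", "/boot/initrd.img", "root=/dev/sda1") (paths are placeholders) and then trigger the reboot separately. Everything kexec-tools used to do between open() and the final load is what now lives in the kernel.
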
Anyway, I can sense a general desire to run less code in the most
privileged system environment. Roberto's proposal is one of the few that
go in this direction. What are the alternatives?

Petr T