From: Axel Rasmussen
Date: Tue, 19 Jul 2022 15:45:37 -0700
Subject: Re: [PATCH v4 2/5] userfaultfd: add /dev/userfaultfd for fine grained access control
To: Nadav Amit
Cc: Alexander Viro, Andrew Morton, Dave Hansen, "Dmitry V. Levin",
    Gleb Fotengauer-Malinovskiy, Hugh Dickins, Jan Kara, Jonathan Corbet,
    Mel Gorman, Mike Kravetz, Mike Rapoport, Peter Xu, Shuah Khan,
    Suren Baghdasaryan, Vlastimil Babka, zhangyi,
    "linux-doc@vger.kernel.org", linux-fsdevel, LKML, Linux MM,
    "linux-kselftest@vger.kernel.org"
References: <20220719195628.3415852-1-axelrasmussen@google.com>
    <20220719195628.3415852-3-axelrasmussen@google.com>

On Tue, Jul 19, 2022 at 3:32 PM Nadav Amit wrote:
>
> On Jul 19, 2022, at 12:56 PM, Axel Rasmussen wrote:
>
> > Historically, it has been shown that intercepting kernel faults with
> > userfaultfd (thereby forcing the kernel to wait for an arbitrary amount
> > of time) can be exploited, or at least can make some kinds of exploits
> > easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
> > changed things so, in order for kernel faults to be handled by
> > userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl
> > must be configured so that any unprivileged user can do it.
> >
> > In a typical implementation of a hypervisor with live migration (take
> > QEMU/KVM as one such example), we do indeed need to be able to handle
> > kernel faults. But, both options above are less than ideal:
> >
> > - Toggling the sysctl increases attack surface by allowing any
> >   unprivileged user to do it.
> >
> > - Granting the live migration process CAP_SYS_PTRACE gives it this
> >   ability, but *also* the ability to "observe and control the
> >   execution of another process [...], and examine and change [its]
> >   memory and registers" (from ptrace(2)). This isn't something we need
> >   or want to be able to do, so granting this permission violates the
> >   "principle of least privilege".
> >
> > This is all a long winded way to say: we want a more fine-grained way to
> > grant access to userfaultfd, without granting other additional
> > permissions at the same time.
> >
> > To achieve this, add a /dev/userfaultfd misc device. This device
> > provides an alternative to the userfaultfd(2) syscall for the creation
> > of new userfaultfds.
> > The idea is, any userfaultfds created this way will
> > be able to handle kernel faults, without the caller having any special
> > capabilities. Access to this mechanism is instead restricted using e.g.
> > standard filesystem permissions.
>
> Are there any other "devices" that when opened by different processes
> provide such isolated interfaces in each process? I.e., devices that if you
> read from them in different processes you get completely unrelated data?
> (putting aside namespaces).
>
> It all sounds so wrong to me, that I am going to try again to pushback
> (sorry).

No need to be sorry. :)

>
> From a semantic point of view - userfaultfd is process specific. It is
> therefore similar to /proc/[pid]/mem (or /proc/[pid]/pagemap and so on).
>
> So why can't we put it there? I saw that you argued against it in your
> cover-letter, and I think that your argument is you would need
> CAP_SYS_PTRACE if you want to access userfaultfd of other processes. But
> this is EXACTLY the way opening /proc/[pid]/mem is performed - see
> proc_mem_open().
>
> So instead of having some strange device that behaves differently in the
> context of each process, you can just have /proc/[pid]/userfaultfd and then
> use mm_access() to check if you have permissions to access userfaultfd (just
> like proc_mem_open() does). This would be more intuitive for users as it is
> similar to other /proc/[pid]/X, and would cover both local and remote
> use-cases.

Ah, so actually I find this argument much more compelling. I don't
find it persuasive that we should put it in /proc for the purpose of
supporting cross-process memory manipulation, because I think the
syscall works better for that, and in that case we don't mind
depending on CAP_SYS_PTRACE.

But, what you've argued here I do find persuasive. :) You are right, I
can't think of any other example of a device node in /dev that works
like this, where it is completely independent on a per-process basis.
The closest I could come up with was /dev/zero or /dev/null or similar.
You won't affect any other process by touching these, but I don't
think these are good examples.

I'll send a v5 which does this. I do worry that cross-process support
is probably complex to get right, so I might leave that out and only
allow a process to open its own device for now.
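
For readers following the thread, here is a rough userspace sketch of the
two creation paths the v4 patch description contrasts: the userfaultfd(2)
syscall and the /dev/userfaultfd device node. It is only an illustration
under stated assumptions. The ioctl name USERFAULTFD_IOC_NEW and its
number come from the mainline uapi header, not from this message, and the
helper names uffd_from_syscall() and uffd_from_device() are made up for
the example. Nothing here covers the /proc/[pid]/userfaultfd idea, which
has no interface defined in this thread.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Fallbacks for older uapi headers (assumption: values match mainline). */
#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1
#endif
#ifndef USERFAULTFD_IOC_NEW
#define USERFAULTFD_IOC 0xAA
#define USERFAULTFD_IOC_NEW _IO(USERFAULTFD_IOC, 0x00)
#endif

/*
 * Path 1: the userfaultfd(2) syscall. Without UFFD_USER_MODE_ONLY in
 * flags, the caller needs CAP_SYS_PTRACE or the sysctl to be opened up,
 * as described in the quoted commit message.
 * Returns a new fd, or -1 with errno set.
 */
int uffd_from_syscall(int flags)
{
    return (int)syscall(SYS_userfaultfd, flags);
}

/*
 * Path 2: the device node. Access is gated by the file permissions on
 * /dev/userfaultfd rather than by a capability.
 * Returns a new fd, or -1 with errno set.
 */
int uffd_from_device(int flags)
{
    int dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
    if (dev < 0)
        return -1;

    int uffd = ioctl(dev, USERFAULTFD_IOC_NEW, (unsigned long)flags);
    int saved_errno = errno;

    /* The device fd is only needed to mint the userfaultfd. */
    close(dev);
    errno = saved_errno;
    return uffd;
}
```

A caller would typically try uffd_from_device(O_CLOEXEC | O_NONBLOCK)
first and fall back to uffd_from_syscall() with UFFD_USER_MODE_ONLY added
if the device is missing or not accessible. Either way, the returned fd
still needs the usual UFFDIO_API handshake before it can be used.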