From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20220719195628.3415852-1-axelrasmussen@google.com>
 <20220719195628.3415852-3-axelrasmussen@google.com> <D43534E1-7982-45EE-8B16-2C4687F49E77@vmware.com>
In-Reply-To: <D43534E1-7982-45EE-8B16-2C4687F49E77@vmware.com>
From: Axel Rasmussen <axelrasmussen@google.com>
Date: Tue, 19 Jul 2022 15:45:37 -0700
Message-ID: <CAJHvVcigVqAibm0JODkiR=Pcd3E14xp0NB6acw2q2enwnrnLSA@mail.gmail.com>
Subject: Re: [PATCH v4 2/5] userfaultfd: add /dev/userfaultfd for fine grained
 access control
To: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>, Andrew Morton <akpm@linux-foundation.org>, 
	Dave Hansen <dave.hansen@linux.intel.com>, "Dmitry V . Levin" <ldv@altlinux.org>, 
	Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>, Hugh Dickins <hughd@google.com>, Jan Kara <jack@suse.cz>, 
	Jonathan Corbet <corbet@lwn.net>, Mel Gorman <mgorman@techsingularity.net>, 
	Mike Kravetz <mike.kravetz@oracle.com>, Mike Rapoport <rppt@kernel.org>, Peter Xu <peterx@redhat.com>, 
	Shuah Khan <shuah@kernel.org>, Suren Baghdasaryan <surenb@google.com>, Vlastimil Babka <vbabka@suse.cz>, 
	zhangyi <yi.zhang@huawei.com>, 
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, 
	LKML <linux-kernel@vger.kernel.org>, Linux MM <linux-mm@kvack.org>, 
	"linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
List-ID: <linux-mm.kvack.org>

On Tue, Jul 19, 2022 at 3:32 PM Nadav Amit <namit@vmware.com> wrote:
>
> On Jul 19, 2022, at 12:56 PM, Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> > Historically, it has been shown that intercepting kernel faults with
> > userfaultfd (thereby forcing the kernel to wait for an arbitrary amount
> > of time) can be exploited, or at least can make some kinds of exploits
> > easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
> > changed things so, in order for kernel faults to be handled by
> > userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl
> > must be configured so that any unprivileged user can do it.
> >
> > In a typical implementation of a hypervisor with live migration (take
> > QEMU/KVM as one such example), we do indeed need to be able to handle
> > kernel faults. But, both options above are less than ideal:
> >
> > - Toggling the sysctl increases attack surface by allowing any
> >  unprivileged user to do it.
> >
> > - Granting the live migration process CAP_SYS_PTRACE gives it this
> >  ability, but *also* the ability to "observe and control the
> >  execution of another process [...], and examine and change [its]
> >  memory and registers" (from ptrace(2)). This isn't something we need
> >  or want to be able to do, so granting this permission violates the
> >  "principle of least privilege".
> >
> > This is all a long winded way to say: we want a more fine-grained way to
> > grant access to userfaultfd, without granting other additional
> > permissions at the same time.
> >
> > To achieve this, add a /dev/userfaultfd misc device. This device
> > provides an alternative to the userfaultfd(2) syscall for the creation
> > of new userfaultfds. The idea is, any userfaultfds created this way will
> > be able to handle kernel faults, without the caller having any special
> > capabilities. Access to this mechanism is instead restricted using e.g.
> > standard filesystem permissions.
>
> Are there any other "devices" that when opened by different processes
> provide such isolated interfaces in each process? I.e., devices that if you
> read from them in different processes you get completely unrelated data?
> (putting aside namespaces).
>
> It all sounds so wrong to me, that I am going to try again to pushback
> (sorry).

No need to be sorry. :)

>
> From a semantic point of view - userfaultfd is process specific. It is
> therefore similar to /proc/[pid]/mem (or /proc/[pid]/pagemap and so on).
>
> So why can't we put it there? I saw that you argued against it in your
> cover-letter, and I think that your argument is you would need
> CAP_SYS_PTRACE if you want to access userfaultfd of other processes. But
> this is EXACTLY the way opening /proc/[pid]/mem is performed - see
> proc_mem_open().
>
> So instead of having some strange device that behaves differently in the
> context of each process, you can just have /proc/[pid]/userfaultfd and then
> use mm_access() to check if you have permissions to access userfaultfd (just
> like proc_mem_open() does). This would be more intuitive for users as it is
> similar to other /proc/[pid]/X, and would cover both local and remote
> use-cases.

Ah, so actually I find this argument much more compelling.

I don't find it persuasive that we should put it in /proc for the
purpose of supporting cross-process memory manipulation, because I
think the syscall works better for that, and in that case we don't
mind depending on CAP_SYS_PTRACE.

But, what you've argued here I do find persuasive. :) You are right, I
can't think of any other example of a device node in /dev that works
like this, where it is completely independent on a per-process basis.
The closest I could come up with was /dev/zero or /dev/null or
similar. You won't affect any other process by touching these, but I
don't think these are good examples.

I'll send a v5 which does this. I do worry that cross-process support
is probably complex to get right, so I might leave that out and only
allow a process to open its own device for now.

>