Date: Tue, 14 Jun 2022 15:36:36 -0400
From: Peter Xu
To: Axel Rasmussen
Cc: Alexander Viro, Andrew Morton, Charan Teja Reddy, Dave Hansen,
    "Dmitry V. Levin", Gleb Fotengauer-Malinovskiy, Hugh Dickins, Jan Kara,
    Jonathan Corbet, Mel Gorman, Mike Kravetz, Mike Rapoport, Nadav Amit,
    Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, zhangyi,
    linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-kselftest@vger.kernel.org
Subject: Re: [PATCH v3 4/6] userfaultfd: update documentation to describe /dev/userfaultfd
References: <20220601210951.3916598-1-axelrasmussen@google.com>
    <20220601210951.3916598-5-axelrasmussen@google.com>
In-Reply-To: <20220601210951.3916598-5-axelrasmussen@google.com>

On Wed, Jun 01, 2022 at 02:09:49PM -0700, Axel Rasmussen wrote:
> Explain the different ways to create a new userfaultfd, and how access
> control works for each way.
>
> Signed-off-by: Axel Rasmussen
> ---
>  Documentation/admin-guide/mm/userfaultfd.rst | 40 ++++++++++++++++++--
>  Documentation/admin-guide/sysctl/vm.rst      |  3 ++
>  2 files changed, 40 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
> index 6528036093e1..9bae1acd431f 100644
> --- a/Documentation/admin-guide/mm/userfaultfd.rst
> +++ b/Documentation/admin-guide/mm/userfaultfd.rst
> @@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
>  Design
>  ======
>
> -Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
> +Userspace creates a new userfaultfd, initializes it, and registers one or more
> +regions of virtual memory with it. Then, any page faults which occur within the
> +region(s) result in a message being delivered to the userfaultfd, notifying
> +userspace of the fault.
>
>  The ``userfaultfd`` (aside from registering and unregistering virtual
>  memory ranges) provides two primary functionalities:
> @@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory
>  management of mremap/mprotect is that the userfaults in all their
>  operations never involve heavyweight structures like vmas (in fact the
>  ``userfaultfd`` runtime load never takes the mmap_lock for writing).
> -
>  Vmas are not suitable for page- (or hugepage) granular fault tracking
>  when dealing with virtual address spaces that could span
>  Terabytes. Too many vmas would be needed for that.
>
> -The ``userfaultfd`` once opened by invoking the syscall, can also be
> +The ``userfaultfd``, once created, can also be
>  passed using unix domain sockets to a manager process, so the same
>  manager process could handle the userfaults of a multitude of
>  different processes without them being aware about what is going on
> @@ -50,6 +52,38 @@ is a corner case that would currently return ``-EBUSY``).
>  API
>  ===
>
> +Creating a userfaultfd
> +----------------------
> +
> +There are two ways to create a new userfaultfd, each of which provide ways to
> +restrict access to this functionality (since historically userfaultfds which
> +handle kernel page faults have been a useful tool for exploiting the kernel).
> +
> +The first way, supported by older kernels, is the userfaultfd(2) syscall.

How about "supported since userfaultfd was introduced"?  Otherwise the reader
may get the impression that the syscall won't work on new kernels, although
it does.

> +Access to this is controlled in several ways:
> +
> +- By default, the userfaultfd will be able to handle kernel page faults. This

s/kernel/both user and kernel/?

> +  can be disabled by passing in UFFD_USER_MODE_ONLY.
> +
> +- If vm.unprivileged_userfaultfd is 0, then the caller must *either* have
> +  CAP_SYS_PTRACE, or pass in UFFD_USER_MODE_ONLY.
> +
> +- If vm.unprivileged_userfaultfd is 1, then no particular privilege is needed to
> +  use this syscall, even if UFFD_USER_MODE_ONLY is *not* set.

The way the three paragraphs above are separated doesn't make these flags very
clear to me.  Entry 1) tries to define UFFD_USER_MODE_ONLY, but entry 2) also
refers to it in another context.  How about using two paragraphs to explain
these two flags one by one?  My try..

  The user can always create a userfaultfd that traps only userspace page
  faults.  To do so, create the userfaultfd object using the userfaultfd()
  syscall with the UFFD_USER_MODE_ONLY flag passed in.

  If the user would also like to trap kernel page faults for the address
  space, then either the process needs to have the CAP_SYS_PTRACE
  capability, or the system must have vm.unprivileged_userfaultfd set to 1.
  By default, vm.unprivileged_userfaultfd is set to 0.

> +
> +The second way, added to the kernel more recently, is by opening and issuing a
> +USERFAULTFD_IOC_NEW ioctl to /dev/userfaultfd. This method yields equivalent
> +userfaultfds to the userfaultfd(2) syscall; its benefit is in how access to
> +creating userfaultfds is controlled.

Since the benefit is mentioned immediately afterwards, how about dropping "its
benefit is in how ... is controlled" and just connecting these two paragraphs?

Again, please take my English-related comments (that is, all of the comments
above :) with a grain of salt.

Thanks,

> +
> +Access to /dev/userfaultfd is controlled via normal filesystem permissions
> +(user/group/mode for example), which gives fine grained access to userfaultfd
> +specifically, without also granting other unrelated privileges at the same time
> +(as e.g. granting CAP_SYS_PTRACE would do).
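
For reference, the two creation paths described above could be sketched in C
roughly as follows.  This is only a minimal sketch and is not taken from the
patch: the helper names uffd_from_syscall() and uffd_from_dev() are made up
for illustration, and it assumes that USERFAULTFD_IOC_NEW takes the same flags
as the syscall in its ioctl argument and returns the new descriptor.  If the
installed uapi header does not define USERFAULTFD_IOC_NEW, the second helper
simply fails with ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1	/* older uapi headers may lack this flag */
#endif

/*
 * Way 1: the userfaultfd(2) syscall.  With UFFD_USER_MODE_ONLY in flags no
 * special privilege should be needed; without it the caller needs
 * CAP_SYS_PTRACE unless vm.unprivileged_userfaultfd is 1.
 * Returns the new fd, or -1 with errno set.
 */
static int uffd_from_syscall(int flags)
{
	return (int)syscall(SYS_userfaultfd, flags);
}

/*
 * Way 2: open /dev/userfaultfd and issue USERFAULTFD_IOC_NEW.  Access is
 * governed by the permissions on the device node.  ASSUMPTION: the ioctl
 * argument carries the creation flags and the return value is the new fd.
 * Returns the new fd, or -1 with errno set.
 */
static int uffd_from_dev(int flags)
{
#ifdef USERFAULTFD_IOC_NEW
	int dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
	int uffd, saved;

	if (dev < 0)
		return -1;

	uffd = ioctl(dev, USERFAULTFD_IOC_NEW, (unsigned long)flags);
	saved = errno;		/* keep the ioctl's errno across close() */
	close(dev);
	errno = saved;
	return uffd;
#else
	(void)flags;
	errno = ENOSYS;		/* header predates /dev/userfaultfd */
	return -1;
#endif
}
```

A caller would typically try one helper with UFFD_USER_MODE_ONLY | O_CLOEXEC
and then go on to the UFFDIO_API handshake on the returned descriptor; both
helpers return -1 with errno set when the kernel or the permissions do not
allow the request.
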
> +
> +Initializing up a userfaultfd
> +-----------------------------
> +
>  When first opened the ``userfaultfd`` must be enabled invoking the
>  ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
>  a later API version) which will specify the ``read/POLLIN`` protocol
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index d7374a1e8ac9..e3a952d1fd35 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -927,6 +927,9 @@ calls without any restrictions.
>
>  The default value is 0.
>
> +An alternative to this sysctl / the userfaultfd(2) syscall is to create
> +userfaultfds via /dev/userfaultfd. See
> +Documentation/admin-guide/mm/userfaultfd.rst.
>
>  user_reserve_kbytes
>  ===================
> --
> 2.36.1.255.ge46751e96f-goog
>

-- 
Peter Xu