From: David Hildenbrand <david@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>,
linux-kernel@vger.kernel.org, patches@lists.linux.dev,
tglx@linutronix.de, linux-crypto@vger.kernel.org,
linux-api@vger.kernel.org, x86@kernel.org,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>,
Carlos O'Donell <carlos@redhat.com>,
Florian Weimer <fweimer@redhat.com>,
Arnd Bergmann <arnd@arndb.de>, Jann Horn <jannh@google.com>,
Christian Brauner <brauner@kernel.org>,
David Hildenbrand <dhildenb@redhat.com>,
linux-mm@kvack.org
Subject: Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings
Date: Mon, 8 Jul 2024 10:11:24 +0200
Message-ID: <7439da2e-4a60-4643-9804-17e99ce6e312@redhat.com>
In-Reply-To: <CAHk-=wi=XvCZ9r897LjEb4ZarLzLtKN1p+Fyig+F2fmQDF8GSA@mail.gmail.com>
On 08.07.24 02:08, Linus Torvalds wrote:
> On Sun, 7 Jul 2024 at 14:01, David Hildenbrand <david@redhat.com> wrote:
>>
>> At least MAP_DROPPABLE doesn't quite make sense with hugetlb, but at least
>> the other ones do have semantics with hugetlb?
>
> Hmm.
>
> How about we just say that VM_DROPPABLE really is something separate
> from MAP_PRIVATE or MAP_SHARED..

So it would currently, in essence, imply MAP_ANON|MAP_PRIVATE, but
without COW (nothing is shared with a child process).

Then we should ignore any fd+offset that is passed (or bail out). I
haven't dived into the code, but I assume that's what your proposal
below does automatically.

>
> And then we make the rule be that VM_DROPPABLE is never dumped and
> always dropped on fork, just to make things simpler.

The semantics are much more intuitive. No need for separate mmap flags.

>
> It not only avoids a flag, but it actually makes sense: the pages
> aren't stable for dumping anyway, and not copying them on fork() not
> only avoids some overhead, but makes it much more reliable and
> testable.
>
> IOW, how about taking this approach:
>
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -17,5 +17,6 @@
> #define MAP_SHARED 0x01 /* Share changes */
> #define MAP_PRIVATE 0x02 /* Changes are private */
> #define MAP_SHARED_VALIDATE 0x03 /* share + validate extension flags */
> +#define MAP_DROPPABLE 0x08 /* 4 is not in MAP_TYPE on parisc? */
>
> /*
>
> with do_mmap() doing:
>
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1369,6 +1369,23 @@ unsigned long do_mmap(struct file *file,
>                         pgoff = 0;
>                         vm_flags |= VM_SHARED | VM_MAYSHARE;
>                         break;
> +               case MAP_DROPPABLE:
> +                       /*
> +                        * A locked or stack area makes no sense to
> +                        * be droppable.
> +                        *
> +                        * Also, since droppable pages can just go
> +                        * away at any time, it makes no sense to
> +                        * copy them on fork or dump them.
> +                        */
> +                       if (flags & MAP_LOCKED)
> +                               return -EINVAL;

We'll likely have to adjust mlock() as well. I also think we should
just bail out for hugetlb.

> +                       if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
> +                               return -EINVAL;
> +
> +                       vm_flags |= VM_DROPPABLE;
> +                       vm_flags |= VM_WIPEONFORK | VM_DONTDUMP;

Further, maybe we want to disallow madvise() clearing these flags here,
just to be consistent.

> +                       fallthrough;
>                 case MAP_PRIVATE:
>                         /*
>                          * Set pgoff according to addr for anon_vma.
>
> which looks rather simple.
>
> The only oddity is that parisc thing - every other architecture has the
> MAP_TYPE bits being 0xf, but parisc uses 0x2b (also four bits, but
> instead of the low four bits it's 00101011 - strange).

I assume changing that would risk breaking stupid user space (code that
sets a bit without any semantics), right?

>
> So using 8 as a MAP_TYPE bit for MAP_DROPPABLE works everywhere, and
> if we eventually want to do a "signaling" MAP_DROPPABLE we could use
> 9.

Sounds good enough.
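To make the mask arithmetic concrete, here is a minimal userspace sketch. It only uses the two MAP_TYPE masks quoted above (0x0f and 0x2b); the macro and helper names are made up for illustration and are not kernel API.

```c
#include <stdbool.h>

/* MAP_TYPE masks as quoted in this thread (names are illustrative only). */
#define DEMO_MAP_TYPE_GENERIC 0x0fu /* low four bits */
#define DEMO_MAP_TYPE_PARISC  0x2bu /* 00101011: bits 0, 1, 3 and 5 */

/*
 * Hypothetical helper: true if every bit of `type` lies inside `mask`,
 * i.e. "flags & MAP_TYPE" would hand the value back unchanged.
 */
bool fits_in_map_type(unsigned int type, unsigned int mask)
{
	return (type & mask) == type;
}

/* Hypothetical helper: true if `type` survives masking on both layouts. */
bool works_everywhere(unsigned int type)
{
	return fits_in_map_type(type, DEMO_MAP_TYPE_GENERIC) &&
	       fits_in_map_type(type, DEMO_MAP_TYPE_PARISC);
}
```

With these masks, 0x04 fits the generic mask but not the parisc one, while 0x08 and 0x09 fit both, which matches the choice of 8 (and 9 as a possible follow-up) above.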
>
> This has the added advantage that if somebody does this on an old
> kernel, they *will* get an error. Because unlike the 'flag' bits in
> general, the MAP_TYPE bit space has always been tested.
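A rough userspace model of that difference, under stated assumptions: the MAP_* type values come from the uapi hunk quoted above, MAP_LOCKED uses the asm-generic value 0x2000, and both function names are hypothetical. It only mirrors the shape of the switch on (flags & MAP_TYPE); it is not the actual do_mmap() code and leaves out the stack, fork and dump handling.

```c
#include <errno.h>

/* Values from the quoted uapi hunk; DEMO_ prefix marks them as a model. */
#define DEMO_MAP_SHARED          0x01
#define DEMO_MAP_PRIVATE         0x02
#define DEMO_MAP_SHARED_VALIDATE 0x03
#define DEMO_MAP_DROPPABLE       0x08
#define DEMO_MAP_TYPE            0x0f
#define DEMO_MAP_LOCKED          0x2000 /* asm-generic value, assumed here */

/* Models a kernel without the patch: unknown type values hit `default`. */
int old_kernel_map_type_check(unsigned long flags)
{
	switch (flags & DEMO_MAP_TYPE) {
	case DEMO_MAP_SHARED:
	case DEMO_MAP_SHARED_VALIDATE:
	case DEMO_MAP_PRIVATE:
		return 0;
	default:
		return -EINVAL;
	}
}

/* Models the proposed hunk: droppable is accepted unless MAP_LOCKED is set. */
int proposed_map_type_check(unsigned long flags)
{
	switch (flags & DEMO_MAP_TYPE) {
	case DEMO_MAP_DROPPABLE:
		if (flags & DEMO_MAP_LOCKED)
			return -EINVAL;
		return 0;
	case DEMO_MAP_SHARED:
	case DEMO_MAP_SHARED_VALIDATE:
	case DEMO_MAP_PRIVATE:
		return 0;
	default:
		return -EINVAL;
	}
}
```

In this model, the old check rejects 0x08 with -EINVAL, whereas an unknown bit outside MAP_TYPE combined with MAP_PRIVATE passes through unnoticed, which is the contrast with ordinary flag bits described above.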
>
> Hmm?

As a side note, I am not a particular fan of the "droppable"
terminology, at least with the "read 0s" approach.

From a user perspective, the memory might suddenly lose its state and
read as 0s, just like volatile memory when it loses power. "Dropping
pages" sounds more like an implementation detail.

Something like MAP_VOLATILE might be more intuitive (similar to the
proposed MADV_VOLATILE).

But naming is hard; I'm just sharing my thoughts :)

--
Cheers,
David / dhildenb