From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Pedro Falcato <pfalcato@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@redhat.com>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>, Jann Horn <jannh@google.com>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
Date: Wed, 14 May 2025 11:14:46 +0100 [thread overview]
Message-ID: <cdfa1a38-2740-4d87-a16e-c1e73e276453@lucifer.local> (raw)
In-Reply-To: <e3olqweuiyxlkewat26n4vyltpn5ordg6ckfndejwjleoemrtw@e2mo6nocaepy>
On Wed, May 14, 2025 at 11:01:49AM +0100, Pedro Falcato wrote:
> On Wed, May 14, 2025 at 10:12:49AM +0100, Lorenzo Stoakes wrote:
> > On Wed, May 14, 2025 at 10:04:06AM +0100, Pedro Falcato wrote:
> > > On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
> > > > Provide a means by which drivers can specify which fields of those
> > > > permitted to be changed should be altered to prior to mmap()'ing a
> > > > range (which may either result from a merge or from mapping an entirely new
> > > > VMA).
> > > >
> > > > Doing so is substantially safer than the existing .mmap() calback which
> > > > provides unrestricted access to the part-constructed VMA and permits
> > > > drivers and file systems to do 'creative' things which makes it hard to
> > > > reason about the state of the VMA after the function returns.
> > > >
> > > > The existing .mmap() callback's freedom has caused a great deal of issues,
> > > > especially in error handling, as unwinding the mmap() state has proven to
> > > > be non-trivial and caused significant issues in the past, for instance
> > > > those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > > > error path behaviour").
> > > >
> > > > It also necessitates a second attempt at merge once the .mmap() callback
> > > > has completed, which has caused issues in the past, is awkward, adds
> > > > overhead and is difficult to reason about.
> > > >
> > > > The .mmap_prepare() callback eliminates this requirement, as we can update
> > > > fields prior to even attempting the first merge. It is safer, as we heavily
> > > > restrict what can actually be modified, and being invoked very early in the
> > > > mmap() process, error handling can be performed safely with very little
> > > > unwinding of state required.
> > > >
> > > > The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> > > > exclusive, so we permit only one to be invoked at a time.
> > > >
> > > > Update vma userland test stubs to account for changes.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > >
> > > Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> > >
> > > Neat idea, thanks. This should also help out with the insane proliferation of
> > > vm_flags_set in ->mmap() callbacks all over. Hopefully.
> [snip]
> > >
> > > 2) Possibly add a ->mmap_finish()? With a fully constructed vma at that point.
> > > So things like remap_pfn_range can still be used by drivers' mmap()
> > > implementation.
> >
> > Thanks for raising the remap_pfn_range() case! Yes this is definitely a
> > thing.
> >
> > However this proposed callback would totally undermine the purpose of the
> > change - the idea is to never give a vma because if we do so we lose all of
> > the advantages here and may as well just leave the mmap in place for
> > this...
> >
>
> Yes, good point.
>
> > However I do think we'll need a new callback at some point (previously
> > discussed in thread).
> >
> > We could perhaps provide the option to _explicitly_ remap for instance. I
> > would want it to be heavily locked down as to what can happen and to happen
> > as early as possible.
> >
>
> I think we can simply combine various ideas here. Like:
>
> struct vm_area_desc_private {
> struct vm_area_desc desc;
> struct vm_area_struct *vma;
> };
>
> Then, for this "mmap_finish" callback and associated infra:
>
> int (*mmap_finish)(struct vm_area_desc *desc);
>
> int mmap_remap_pfn_range(struct vm_area_desc *desc, /*...*/)
> {
> struct vm_area_desc_private *private = container_of(desc, struct vm_area_desc_private, desc);
> return remap_pfn_range(private->vma, /*...*/);
> }
>
> int random_driver_mmap_finish(struct vm_area_desc *desc)
> {
> return mmap_remap_pfn_range(desc, desc->start, some_pfn, some_size,
> desc->page_prot);
> }
>
>
Yeah that's really nice actually! I like that. Will note down for when we
come to this.
Obviously we need to know whent the finish implies a remap, but we can have
a field for this potentially set in the vm_area_desc to indicate this.
> I think something of the sort would be quite less prone to abuse, and we could
> take the time to then even polish up the interface (e.g maybe it would be nicer
> if mmap_remap_pfn_range took a vma offset, and not a start address).
Right yeah, David definitely wanted some improvements in this area so this
aligns with other intentions/work.
>
> Anyway, just brainstorming. This idea came to mind, I think it's quite interesting.
Nice ideas :) yeah, good chance to undo some sins here.
>
> > This is something we can iterate on, as trying to find the ideal scheme
> > immediately will just lead to inaction, the big advantage with the approach
> > here is we can be iterative.
> >
> > We provide this, use it in a scenario which allows us to eliminate merge
> > retry, and can take it from there :)
>
> Yep, totally.
>
> >
> > So indeed, watch this space basically... I will be highly proactive on this
> > stuff moving forward.
> >
> > >
> > > 1) is particularly important so our VFS and driver friends know this is supposed
> > > to be The Way Forward.
> >
> > I think probably the answer is for me to fully update the document to be
> > bang up to date, right? But that would obviously work best as a follow up
> > patch :)
> >
>
> You love your big projects :p I had the impression the docs were more or less up to date?
> The VFS people do update it somewhat diligently. And for mm we only have ->mmap, ->get_unmapped_area,
> and now ->mmap_prepare. And the descriptions are ATM quite useless, just
> "called by the mmap(2) system call".
Much like leg day, sometimes painful things are worthwhile things :)
As for docs I saw there were some missing uring entries, and it explicitly
says 'as of kernel 4.18'.
So I feel that it really needs to be properly updated and it ends up being
a little out of scope to the series.
Since this series only uses this callback in one, mm-specific, place, I
think it makes sense to attack this on a follow up, alongside updating file
system callbacks etc.
But it's definitely on the todo! :)
Cheers, Lorenzo
>
> --
> Pedro
next prev parent reply other threads:[~2025-05-14 10:15 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-09 12:13 [PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook Lorenzo Stoakes
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
2025-05-12 9:24 ` Christian Brauner
2025-05-12 11:31 ` Lorenzo Stoakes
2025-05-13 7:29 ` Christian Brauner
2025-05-12 13:37 ` Vlastimil Babka
2025-05-13 9:01 ` David Hildenbrand
2025-05-13 9:32 ` Lorenzo Stoakes
2025-05-13 13:25 ` David Hildenbrand
2025-05-13 15:31 ` Suren Baghdasaryan
2025-05-13 13:22 ` Liam R. Howlett
2025-05-14 9:04 ` Pedro Falcato
2025-05-14 9:12 ` Lorenzo Stoakes
2025-05-14 10:01 ` Pedro Falcato
2025-05-14 10:14 ` Lorenzo Stoakes [this message]
2025-05-09 12:13 ` [PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook Lorenzo Stoakes
2025-05-12 13:41 ` Vlastimil Babka
2025-05-13 13:23 ` Liam R. Howlett
2025-05-13 15:32 ` Suren Baghdasaryan
2025-05-09 12:13 ` [PATCH v2 3/3] mm/vma: remove mmap() retry merge Lorenzo Stoakes
2025-05-12 13:44 ` Vlastimil Babka
2025-05-13 13:25 ` Liam R. Howlett
2025-05-13 15:38 ` Suren Baghdasaryan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cdfa1a38-2740-4d87-a16e-c1e73e276453@lucifer.local \
--to=lorenzo.stoakes@oracle.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=david@redhat.com \
--cc=jack@suse.cz \
--cc=jannh@google.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=pfalcato@suse.de \
--cc=rppt@kernel.org \
--cc=surenb@google.com \
--cc=vbabka@suse.cz \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox