Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: "Kasireddy, Vivek" <vivek.kasireddy@intel.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>,
	David Hildenbrand <david@kernel.org>,
	Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Subject: Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping
Date: Fri, 21 Nov 2025 07:52:31 +0000	[thread overview]
Message-ID: <070dcd0f-ef77-4924-baf1-6c380bca192d@lucifer.local> (raw)
In-Reply-To: <IA0PR11MB71851184AA5977BBAFBD55C7F8D5A@IA0PR11MB7185.namprd11.prod.outlook.com>

On Fri, Nov 21, 2025 at 06:51:31AM +0000, Kasireddy, Vivek wrote:
> Hi Lorenzo,
>
> > Subject: Re: [PATCH] mm/mremap: allow VMAs with
> > VM_DONTEXPAND|VM_PFNMAP when creating new mapping
> >
> > Hi Vivek, thanks for the patch.
> >
> > In general though, let's please not make a fundamental change to mremap()
> > behaviour in late -rc6. Late in cycle/during merge window we're really only
> > interested in existing series, series that are less involved than this.
> I am sorry about the timing of this patch. My intended goal was to just start
> a discussion to address this issue we were seeing in Qemu.

No problem, just underlining here :) No reason you'd be aware of this. We are
trying slowly to manage workload. Slowly :)

>
> >
> > On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote:
> > > When mremap is used to create a new mapping, we should not return
> > > -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set
> > because
> > > the old VMA would neither be expanded nor shrunk in this case. This
> >
> > I guess you're trying to be succinct here and 'clone' each input VMA using
> > the 0 source size input.
> Yes, cloning would be the right term to describe what we are doing here.

Right.

>
> >
> > However this can't work.
> >
> > This operation is not equivalent to an mmap(). It may seem to be for
> > ordinary mappings but in practice it isn't:
> >
> > (syscall)
> > -> do_mremap()
> > -> mremap_at()
> > -> expand_vma()
> > -> move_vma()
> > -> copy_vma_and_data()
> > -> copy_vma()
> >
> > Essentially copying the properties of the VMA to the new region.
> >
> > But this doesn't work for PFN map.
> >
> > At _no point_ are you invoking the original f_op->mmap or
> > f_op->mmap_prepare handler.
> If all the properties of the original VMA are going to be copied over to
> the new VMA, wouldn't that mean whatever state, page tables were
> setup via f_op->mmap or f_op->mmap_prepare handlers for the original
> VMA, would also be copied over to the new VMA right?

The problem is for these 'special' VMAs the VMA doesn't fully describe the
state, some of the state ends up in the page tables in a way that cannot be
reconstructed on fault.

This is really why we have flags like VM_DONTEXPAND.

>
> >
> > And these handles for PFN maps set up page tables, because PFN maps
> > literally do not exist as VMAs which have properties independent of their
> > page tables like this.
> >
> > Equally VM_DONTEXPAND implies the same kind of semantics.
> >
> > > is particularly useful when trying to create a new VMA using other
> > > existing VMAs that have these flags set, such as the ones associated
> > > with VFIO devices.
> > >
> > > Specifically, there are use-cases where a VMM such as Qemu would
> > > want to map a non-contiguous buffer associated with a VFIO device
> > > in the following way:
> > >
> > >     void *start, *cur;
> > >     int i;
> > >
> > >     start = mmap(NULL, size, PROT_NONE, MAP_SHARED, -1, 0);
> > >     if (start == MAP_FAILED) {
> > >         return start;
> > >     }
> >
> > Err, MAP_SHARED and -1 fd? Doesn't this always fail? It does locally with
> > -EBADFD?
> Sorry, I was trying different things and did not copy the version that had
> the MAP_ANONYMOUS flag to the commit message. In other words, I had this
> in my test code, which works:
>     start = mmap(NULL, size, PROT_NONE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>     if (start == MAP_FAILED) {
>         return start;
>     }

Right yeah :)

>
> >
> > >
> > >     cur = start;
> > >     for (i = 0; i < iov_cnt; i++) {
> > >          if (mremap(iov[i].iov_base, 0, iov[i].iov_len,
> > >              MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) {
> >
> > Trying to cleverly clone a VMA I guess.
> >
> > >              goto err;
> > >          }
> > >          cur += iov[i].iov_len;
> > >     }
> > >     return start;
> > >
> > > The above code currently works when mapping buffers backed by
> > > shmem (memfd) but fails with -EFAULT when mapping VFIO backed
> >
> > Right, which is expected, as described above, you _are_ expanding here.
> If the original VMA would remain unaltered and all we are doing is
> cloning it to a new address, would that still be considered expansion
> of the original VMA?

Yes.

You're saying 'make this 0 size VMA size N'. N > 0, so it's an expansion. It's
the same thing as if you were expanding N to N + pgsize - you are trying to map
a range without having invoked .mmap[_prepare], but the .mmap[_prepare] callback
establishes page table state (and performs checks) that are not encdoed in the
VMA metadata and are critical to the mapping being valid.

>
> >
> > > buffers because the VMAs associated with iov[i].iov_base addresses
> > > have VM_DONTEXPAND and VM_PFNMAP flags set. Therefore, fix this
> > > issue by not returning -EFAULT when a new mapping is being created.
> >
> > We're not fixing an issue, we're (unfortunately, incorrectly) introducing
> > an entirely new behaviour of permitting the remapping of page tables
> > belonging to VMAs posssesing VM_DONTEXPAND and VM_PFNMAP flags.
> I mentioned fixing because with this patch applied mremap no longer fails
> with VFIO backed buffers and my use-case works without any further issues.
> However, I do understand that just because it works doesn't mean anything
> and that there may be other pitfalls that I did not run into, but I am wondering
> what those are?

It absolutely will break, see the .mmap callback for vfio, you can go past the
end of the physical range...

Your change makes expansion of non-expandable VMAs valid in all cases where you
aren't explicitly setting a fixed address. This will break in all cases except
where you get lucky.

This is obviously a non-viable solution.

>
> >
> > >
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > Cc: Jann Horn <jannh@google.com>
> > > Cc: Pedro Falcato <pfalcato@suse.de>
> > > Cc: David Hildenbrand <david@kernel.org>
> > > Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > > ---
> > >  mm/mremap.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index fdb0485ede74..d3868d941f72 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -1736,7 +1736,8 @@ static int check_prep_vma(struct
> > vma_remap_struct *vrm)
> > >  	if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
> > >  		return -EINVAL;
> > >
> > > -	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
> > > +	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP) &&
> > > +	    !vrm_implies_new_addr(vrm))
> > >  		return -EFAULT;
> >
> > This is just incorrect. I mean firstly, as explained above, we simply do
> > not support this kind of operation and cannot correctly.
> >
> > But also here you're allowing !(MREMAP_FIXED | MREMAP_DONTUNMAP)
> > VMAs to
> > expand VM_DONTEXPAND, VM_PFNMAP VMAs which is really really not
> > right.
> >
> > The missing context here is:
> >
> > 	if (new_len == old_len)
> > 		return 0;
> >
> > 	... all code below involves an expansion of a VMA ...
> >
> > 	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
> > 		return -EFAULT;
> >
> > And making _all_ expands that aren't _definitely_ specifying a new address
> > work correclty. But you are in fact 100% specifying a new address because
> > you're setting input length of 0...
> >
> > So this allows _anybody_ to try to incorrectly expand VM_PFNMAP,
> > VM_DONTEXPAND VMAs. So is wildly incorrect.
> Is the problem here with VM_DONTEXPAND or VM_PFNMAP or both?
> In other words, assuming the original VMA did not have VM_DONTEXPAND,
> would this behavior (what I am trying to do with mremap) be acceptable?
>
> >
> > >
> > >  	if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta))
> > > --
> > > 2.50.1
> > >
> > >
> >
> > In any case, I don't want to try to support this behaviour, I don't think
> > there's really a sensible way to be sure we can _copy_ page tables
> > correctly, and even though the source size = 0 is a cute trick you _ARE_
> > expanding, and VM_DONTEXPAND is very clear - do not do this.
> So, the thing that is still not clear to me is whether cloning a VMA would be
> considered a form of expansion (of the original VMA) forbidden by
> VM_DONTEXPAND?

I feel I've covered this above :)

>
> >
> > Additionally, please if trying to make such fundamental changes going
> > forward _always_ provide test code, it's absolutely a requirement.
> Ok, will do. Just to confirm, your suggestion is to add a new mm selftest
> to exercise the targeted codepath right?

No please don't, your patch isn't upstreamable.

I hate to say it, but I guess it's not come across clearly :P

NAK to your patch or anything like it sorry :(

>
> Thanks,
> Vivek
>
> >
> > Thanks, Lorenzo
>

Another way round here might be to try to use userland to figure out the file
that a mapping belongs to via e.g. /proc/$pid/map_files, determine the
attributes from /proc/$pid/maps and then generate a new mmap() call with the
same properties.

Cheers, Lorenzo

next prev parent reply	other threads:[~2025-11-21  7:52 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-20  5:35 Vivek Kasireddy
2025-11-20  9:04 ` Lorenzo Stoakes
2025-11-20  9:16   ` David Hildenbrand (Red Hat)
2025-11-20  9:35     ` Lorenzo Stoakes
2025-11-20  9:49       ` David Hildenbrand (Red Hat)
2025-11-20  9:58         ` Lorenzo Stoakes
2025-11-21  3:05           ` Akihiko Odaki
2025-11-21  8:03             ` Lorenzo Stoakes
2025-11-21  8:48               ` Akihiko Odaki
2025-11-21  9:10                 ` Lorenzo Stoakes
2025-11-21 10:16                   ` Akihiko Odaki
2025-11-21 10:52                     ` Lorenzo Stoakes
2025-11-21  7:26           ` David Hildenbrand (Red Hat)
2025-11-21  6:51   ` Kasireddy, Vivek
2025-11-21  7:52     ` Lorenzo Stoakes [this message]
2025-11-21  8:13       ` David Hildenbrand (Red Hat)
2025-11-21 15:03         ` Liam R. Howlett
2025-11-22  6:56           ` Kasireddy, Vivek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=070dcd0f-ef77-4924-baf1-6c380bca192d@lucifer.local \
    --to=lorenzo.stoakes@oracle.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=jannh@google.com \
    --cc=linux-mm@kvack.org \
    --cc=odaki@rsg.ci.i.u-tokyo.ac.jp \
    --cc=pfalcato@suse.de \
    --cc=vbabka@suse.cz \
    --cc=vivek.kasireddy@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox