linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Kasireddy, Vivek" <vivek.kasireddy@intel.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>,
	David Hildenbrand <david@kernel.org>,
	Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Subject: RE: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping
Date: Fri, 21 Nov 2025 06:51:31 +0000	[thread overview]
Message-ID: <IA0PR11MB71851184AA5977BBAFBD55C7F8D5A@IA0PR11MB7185.namprd11.prod.outlook.com> (raw)
In-Reply-To: <976e9916-c949-4fa0-b92e-87f6841b5cbe@lucifer.local>

Hi Lorenzo,

> Subject: Re: [PATCH] mm/mremap: allow VMAs with
> VM_DONTEXPAND|VM_PFNMAP when creating new mapping
> 
> Hi Vivek, thanks for the patch.
> 
> In general though, let's please not make a fundamental change to mremap()
> behaviour in late -rc6. Late in cycle/during merge window we're really only
> interested in existing series, series that are less involved than this.
I am sorry about the timing of this patch. My intended goal was to just start
a discussion to address this issue we were seeing in Qemu.

> 
> On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote:
> > When mremap is used to create a new mapping, we should not return
> > -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set
> because
> > the old VMA would neither be expanded nor shrunk in this case. This
> 
> I guess you're trying to be succinct here and 'clone' each input VMA using
> the 0 source size input.
Yes, cloning would be the right term to describe what we are doing here.

> 
> However this can't work.
> 
> This operation is not equivalent to an mmap(). It may seem to be for
> ordinary mappings but in practice it isn't:
> 
> (syscall)
> -> do_mremap()
> -> mremap_at()
> -> expand_vma()
> -> move_vma()
> -> copy_vma_and_data()
> -> copy_vma()
> 
> Essentially copying the properties of the VMA to the new region.
> 
> But this doesn't work for PFN map.
> 
> At _no point_ are you invoking the original f_op->mmap or
> f_op->mmap_prepare handler.
If all the properties of the original VMA are going to be copied over to
the new VMA, wouldn't that mean whatever state, page tables were
setup via f_op->mmap or f_op->mmap_prepare handlers for the original
VMA, would also be copied over to the new VMA right?

> 
> And these handles for PFN maps set up page tables, because PFN maps
> literally do not exist as VMAs which have properties independent of their
> page tables like this.
> 
> Equally VM_DONTEXPAND implies the same kind of semantics.
> 
> > is particularly useful when trying to create a new VMA using other
> > existing VMAs that have these flags set, such as the ones associated
> > with VFIO devices.
> >
> > Specifically, there are use-cases where a VMM such as Qemu would
> > want to map a non-contiguous buffer associated with a VFIO device
> > in the following way:
> >
> >     void *start, *cur;
> >     int i;
> >
> >     start = mmap(NULL, size, PROT_NONE, MAP_SHARED, -1, 0);
> >     if (start == MAP_FAILED) {
> >         return start;
> >     }
> 
> Err, MAP_SHARED and -1 fd? Doesn't this always fail? It does locally with
> -EBADFD?
Sorry, I was trying different things and did not copy the version that had
the MAP_ANONYMOUS flag to the commit message. In other words, I had this
in my test code, which works:
    start = mmap(NULL, size, PROT_NONE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (start == MAP_FAILED) {
        return start;
    }

> 
> >
> >     cur = start;
> >     for (i = 0; i < iov_cnt; i++) {
> >          if (mremap(iov[i].iov_base, 0, iov[i].iov_len,
> >              MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) {
> 
> Trying to cleverly clone a VMA I guess.
> 
> >              goto err;
> >          }
> >          cur += iov[i].iov_len;
> >     }
> >     return start;
> >
> > The above code currently works when mapping buffers backed by
> > shmem (memfd) but fails with -EFAULT when mapping VFIO backed
> 
> Right, which is expected, as described above, you _are_ expanding here.
If the original VMA would remain unaltered and all we are doing is
cloning it to a new address, would that still be considered expansion
of the original VMA?

> 
> > buffers because the VMAs associated with iov[i].iov_base addresses
> > have VM_DONTEXPAND and VM_PFNMAP flags set. Therefore, fix this
> > issue by not returning -EFAULT when a new mapping is being created.
> 
> We're not fixing an issue, we're (unfortunately, incorrectly) introducing
> an entirely new behaviour of permitting the remapping of page tables
> belonging to VMAs posssesing VM_DONTEXPAND and VM_PFNMAP flags.
I mentioned fixing because with this patch applied mremap no longer fails
with VFIO backed buffers and my use-case works without any further issues.
However, I do understand that just because it works doesn't mean anything
and that there may be other pitfalls that I did not run into, but I am wondering
what those are?

> 
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Cc: David Hildenbrand <david@kernel.org>
> > Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > ---
> >  mm/mremap.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index fdb0485ede74..d3868d941f72 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -1736,7 +1736,8 @@ static int check_prep_vma(struct
> vma_remap_struct *vrm)
> >  	if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
> >  		return -EINVAL;
> >
> > -	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
> > +	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP) &&
> > +	    !vrm_implies_new_addr(vrm))
> >  		return -EFAULT;
> 
> This is just incorrect. I mean firstly, as explained above, we simply do
> not support this kind of operation and cannot correctly.
> 
> But also here you're allowing !(MREMAP_FIXED | MREMAP_DONTUNMAP)
> VMAs to
> expand VM_DONTEXPAND, VM_PFNMAP VMAs which is really really not
> right.
> 
> The missing context here is:
> 
> 	if (new_len == old_len)
> 		return 0;
> 
> 	... all code below involves an expansion of a VMA ...
> 
> 	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
> 		return -EFAULT;
> 
> And making _all_ expands that aren't _definitely_ specifying a new address
> work correclty. But you are in fact 100% specifying a new address because
> you're setting input length of 0...
> 
> So this allows _anybody_ to try to incorrectly expand VM_PFNMAP,
> VM_DONTEXPAND VMAs. So is wildly incorrect.
Is the problem here with VM_DONTEXPAND or VM_PFNMAP or both?
In other words, assuming the original VMA did not have VM_DONTEXPAND,
would this behavior (what I am trying to do with mremap) be acceptable?

> 
> >
> >  	if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta))
> > --
> > 2.50.1
> >
> >
> 
> In any case, I don't want to try to support this behaviour, I don't think
> there's really a sensible way to be sure we can _copy_ page tables
> correctly, and even though the source size = 0 is a cute trick you _ARE_
> expanding, and VM_DONTEXPAND is very clear - do not do this.
So, the thing that is still not clear to me is whether cloning a VMA would be
considered a form of expansion (of the original VMA) forbidden by
VM_DONTEXPAND?

> 
> Additionally, please if trying to make such fundamental changes going
> forward _always_ provide test code, it's absolutely a requirement.
Ok, will do. Just to confirm, your suggestion is to add a new mm selftest
to exercise the targeted codepath right?

Thanks,
Vivek

> 
> Thanks, Lorenzo


  parent reply	other threads:[~2025-11-21  6:51 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-20  5:35 Vivek Kasireddy
2025-11-20  9:04 ` Lorenzo Stoakes
2025-11-20  9:16   ` David Hildenbrand (Red Hat)
2025-11-20  9:35     ` Lorenzo Stoakes
2025-11-20  9:49       ` David Hildenbrand (Red Hat)
2025-11-20  9:58         ` Lorenzo Stoakes
2025-11-21  3:05           ` Akihiko Odaki
2025-11-21  8:03             ` Lorenzo Stoakes
2025-11-21  8:48               ` Akihiko Odaki
2025-11-21  9:10                 ` Lorenzo Stoakes
2025-11-21 10:16                   ` Akihiko Odaki
2025-11-21 10:52                     ` Lorenzo Stoakes
2025-11-21  7:26           ` David Hildenbrand (Red Hat)
2025-11-21  6:51   ` Kasireddy, Vivek [this message]
2025-11-21  7:52     ` Lorenzo Stoakes
2025-11-21  8:13       ` David Hildenbrand (Red Hat)
2025-11-21 15:03         ` Liam R. Howlett
2025-11-22  6:56           ` Kasireddy, Vivek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=IA0PR11MB71851184AA5977BBAFBD55C7F8D5A@IA0PR11MB7185.namprd11.prod.outlook.com \
    --to=vivek.kasireddy@intel.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=jannh@google.com \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=odaki@rsg.ci.i.u-tokyo.ac.jp \
    --cc=pfalcato@suse.de \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox