From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>,
	Vivek Kasireddy <vivek.kasireddy@intel.com>,
	linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>
Subject: Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping
Date: Fri, 21 Nov 2025 09:10:27 +0000
Message-ID: <a752bfd5-92ad-4283-959d-37f39bbe722e@lucifer.local>
In-Reply-To: <c6dc4179-b2c8-4991-8bbe-662dd32a31f5@rsg.ci.i.u-tokyo.ac.jp>

On Fri, Nov 21, 2025 at 05:48:33PM +0900, Akihiko Odaki wrote:
>
>
> On 2025/11/21 17:03, Lorenzo Stoakes wrote:
> > On Fri, Nov 21, 2025 at 12:05:56PM +0900, Akihiko Odaki wrote:
> > > Hi,
> > >
> > > I'm another QEMU developer who has been discussing the problem motivating
> > > the mremap() usage.
> > >
> >
> > Hmm, for some reason this mail hasn't appeared on lore, how strange.
> >
> > > On 2025/11/20 18:58, Lorenzo Stoakes wrote:
> > > > On Thu, Nov 20, 2025 at 10:49:59AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > > On 11/20/25 10:35, Lorenzo Stoakes wrote:
> > > > > > On Thu, Nov 20, 2025 at 10:16:26AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > > > > On 11/20/25 10:04, Lorenzo Stoakes wrote:
> > > > > > > > Hi Vivek, thanks for the patch.
> > > > > > > >
> > > > > > > > In general though, let's please not make a fundamental change to mremap()
> > > > > > > > behaviour in late -rc6. Late in cycle/during merge window we're really only
> > > > > > > > interested in existing series, series that are less involved than this.
> > > > > > > >
> > > > > > > > On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote:
> > > > > > > > > When mremap is used to create a new mapping, we should not return
> > > > > > > > > -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because
> > > > > > > > > the old VMA would neither be expanded nor shrunk in this case. This
> > > > > > > >
> > > > > > > > I guess you're trying to be succinct here and 'clone' each input VMA using
> > > > > > > > the 0 source size input.
> > > > > > > >
> > > > > > > > However this can't work.
> > > > > > > >
> > > > > > > > This operation is not equivalent to an mmap(). It may seem to be for
> > > > > > > > ordinary mappings but in practice it isn't:
> > > > > > > >
> > > > > > > > (syscall)
> > > > > > > > -> do_mremap()
> > > > > > > > -> mremap_at()
> > > > > > > > -> expand_vma()
> > > > > > > > -> move_vma()
> > > > > > > > -> copy_vma_and_data()
> > > > > > > > -> copy_vma()
> > > > > > > >
> > > > > > > > Essentially copying the properties of the VMA to the new region.
> > > > > > > >
> > > > > > > > But this doesn't work for PFN map.
> > > > > > > >
> > > > > > > > At _no point_ are you invoking the original f_op->mmap or
> > > > > > > > f_op->mmap_prepare handler.
> > > > > > > >
> > > > > > > > > And these handlers for PFN maps set up page tables, because PFN maps
> > > > > > > > literally do not exist as VMAs which have properties independent of their
> > > > > > > > page tables like this.
> > > > > > >
> > > > > > > vfio-pci is a bit different, though, as it uses
> > > > > > > vmf_insert_pfn()/vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() at fault time to
> > > > > > > insert PFNs, not at mmap time using remap_pfn_range() and friends.
> > > > > > >
> > > > > > > (see vfio_pci_mmap_page_fault() )
> > > > > >
> > > > > > It sets VM_DONTEXPAND but is fine with being expanded? :) That sounds like a
> > > > > > bug there:
> > > > >
> > > > > Yeah, I am all confused about expansion. The example code looks like all it
> > > > > wants to do is move a VM_PFNMAP mapping.
> > > > >
> > > > >            if (mremap(iov[i].iov_base, 0, iov[i].iov_len,
> > > > >                MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) {
> > > > >                goto err;
> > > > >            }
> > > > >
> > > > > > I guess the expansion is because iov[i].iov_len is bigger than the
> > > > > > original VMA?
> > > > >
> > > > > Is that maybe a bug in QEMU or why are we even expanding here?
> > > >
> > > > We're going from size 0 to iov[i].iov_len, which is saying 'please make a copy
> > > > of this VMA at a new address'.
> > > >
> > > > There's never any moving, as input size is 0 :)
> > > >
> > > > It's a cute corner case way of using mremap().
> > > >
> > > > We're basically asking for a _copy_. But you can't get a copy of a
> > > > VM_DONTEXPAND/VM_PFNMAP because you need to invoke mmap_prepare (or legacy mmap)
> > > > to get something sensible and you are bypassing that on expansion, even if it's
> > > > a 'clone' style expansion.
> > >
> > > Apparently fork() copies VM_PFNMAP without invoking mmap_prepare or legacy
> > > mmap unless VM_DONTCOPY is set, so I wonder if mremap() can use the same
> > > logic.
> >
> > It's because it's literally copying page tables in the exact same range exactly
> > as they are to the exact same virtual address.
> >
> > You're asking for a _brand new mapping_ of effectively _any size whatsoever_ at
> > a _new virtual address_ while _retaining the original mapping_.
> >
> > Also note that you're copying the VMA exactly as-is with _all internal private
> > metadata_ duplicated, but now in another process.
> >
> > It's entirely different.
> >
> > For better or for worse (*ahem*) we've given huge flexibility to drivers to do
> > what they want with this stuff. Which means -literally anything- might be stored
> > in page tables, which means there might be alignment requirements for the
> > mapping, which means that page tables may be established in .mmap,
> > .mmap_prepare, which means that internal state might be tied to the VMA that is
> > only correctly set up in .mmap[_prepare], etc. etc.
> >
> > So yes - if we exactly duplicate this with everything virtual, metadata being
> > _exactly the same_ in a _brand new process_ - with the driver _knowing_ that a
> > fork might happen (and setting VM_DONTCOPY in cases where it doesn't want it) -
> > then we're good.
> >
> > But that's something very different from 'allow arbitrary copies of the VMA'.
> >
> > In terms of mremap() this is very simply an expansion and we won't be supporting
> > this kind of operation there sorry.
> >
> > I may go work on an idea to allow this behaviour via a new approach, but it
> > won't be in mremap().
> >
> > Note that I replied to Vivek with some ideas as to how to do this in userland
> > (thanks to David for suggesting btw forgot to say ;) so you _should_ be able to
> > get what you need here without needing mremap() to do something different.
>
> I understand that the logic to copy page table cannot be borrowed from
> fork(), but I thought that copy_vma_and_data() could be extended to support
> this scenario.
>
> If I understand it correctly, it does almost what we want; copying a VMA and
> page table with a new size. It also calls vma->vm_ops->mremap to let drivers
> know the new VMA. However it doesn't copy the page table if old_len == 0 and
> clears the old page table entries, which prevents using the function to copy
> VM_PFNMAP.

It doesn't come close to doing what we want. All the drivers know that
VM_PFNMAP and VM_DONTEXPAND mappings will _not_ be mremap()'d, so unless you
have a time machine I don't know about, we can't in any way take the existence
of this callback to be meaningful here :)

>
> So my idea is simple: change copy_vma_and_data() to copy the page table
> without clearing the old page table entries if !old_len && (vma->vm_flags &
> VM_PFNMAP).

No, absolutely not.

I already went over the reasons, but to highlight:

- There may be alignment requirements that are no longer fulfilled.

- There may be metadata associated with the VMA that no longer exists in the
  copied VMA.

- There may be some requirement that only one mapping exists at a time of the
  given range.

And who knows what else.

We give drivers a great deal of freedom to do what they want with these
callbacks. We've built in the assumption that:

- VM_PFNMAP means .mmap[_prepare] will _always_ be called for any new mapping.
- VM_DONTEXPAND means that we will _never_ mremap() in a way that _copies_ the
  VMA.

Now these semantics are non-obvious and may be inconvenient, but that ship has
sailed, and trying to do something different now is broken.
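
For illustration only, here is a toy userspace model of why the zero-length
form counts as an expansion and so trips these flags. The struct, the TOY_*
flag values and toy_mremap_check() are all hypothetical stand-ins rather than
the kernel's definitions, and the ordering of the checks is simplified.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative stand-ins; these are NOT the kernel's flag values. */
#define TOY_VM_SHARED     0x1u
#define TOY_VM_PFNMAP     0x2u
#define TOY_VM_DONTEXPAND 0x4u

/* Hypothetical, heavily reduced picture of a VMA. */
struct toy_vma {
	size_t len;         /* size of the existing mapping in bytes */
	unsigned int flags; /* combination of TOY_VM_* bits */
};

/*
 * Simplified model of the size/flag checks applied to an mremap() request.
 * Returns 0 if the request would be allowed, or a negative errno value.
 */
int toy_mremap_check(const struct toy_vma *vma, size_t old_len, size_t new_len)
{
	assert(vma != NULL);

	if (new_len == 0)
		return -EINVAL;

	/* The zero-length "clone" form is only accepted for shared mappings. */
	if (old_len == 0 && !(vma->flags & TOY_VM_SHARED))
		return -EINVAL;

	if (old_len > vma->len)
		return -EFAULT;

	/* Same size or shrinking: nothing grows, so the flags do not matter. */
	if (new_len <= old_len)
		return 0;

	/*
	 * Growing. old_len == 0 with new_len > 0 always lands here, which is
	 * why the clone request is treated as an expansion.
	 */
	if (vma->flags & (TOY_VM_PFNMAP | TOY_VM_DONTEXPAND))
		return -EFAULT;

	return 0;
}
```

In this model a clone request (old_len == 0) against a shared PFN-mapped VMA
yields -EFAULT, while the same request against an ordinary shared mapping is
allowed.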

I don't particularly fancy auditing every single driver for this behaviour
(inevitably missing some) either. I am already having to do this for .mmap in my
.mmap_prepare work and that was... already an 'interesting' addition to my
workload :)

Also to be clear, as perhaps I've not been quite firm enough - I will NAK any
patch that tries to bolt on more 'special behaviour' to mremap().

It already has enough of that. If we had that time machine I mentioned, I would
never have allowed this ridiculous 'cute' mremap(ptr, 0, new_size, ...)
behaviour.

Note that we explicitly disallow it for anon mappings, so there are already
non-obvious caveats on top of caveats on top of caveats.

There will be absolutely no more of this :)
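
The anon caveat can be observed from userland with a small probe. This is a
minimal sketch, assuming a private anonymous mapping and a Linux system; the
helper name clone_private_anon_errno() is made up for the example. The
zero-length request is expected to be rejected with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Try the mremap(ptr, 0, new_size, ...) "clone" form on a private anonymous
 * mapping. Returns the errno of the failed mremap(), 0 if it unexpectedly
 * succeeded, or -1 if the initial mmap() failed.
 */
int clone_private_anon_errno(void)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);
	size_t len = (size_t)page;

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* old_size == 0 asks for a second mapping of the same pages. */
	void *q = mremap(p, 0, len, MREMAP_MAYMOVE);
	int err = (q == MAP_FAILED) ? errno : 0;

	if (q != MAP_FAILED)
		munmap(q, len);
	munmap(p, len);
	return err;
}
```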

>
> Of course we still need to respect VM_DONTEXPAND so it should be also
> checked that the new VMA is a subset of the old one.

Yeah no, sorry.

>
> Can this work?

Nope, but I + David already suggested a way forward that should work - just
mmap() something new utilising the existing fd.

You could even explicitly try to do this only when the mremap()-clone behaviour
fails.

I leave exploring the details of this to you guys ;)
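
As a starting point, the fallback could be sketched roughly as below. This is
a minimal sketch under stated assumptions, not a worked-out QEMU change: the
helper names are hypothetical, and it assumes the caller still has the fd,
file offset and protection used for the original mapping. The self-check runs
on an ordinary temporary file, not on a VM_PFNMAP mapping.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Attempt 1: the zero-length mremap() "clone" discussed in this thread. */
void *clone_via_mremap(void *src, size_t len, void *dst)
{
	return mremap(src, 0, len, MREMAP_FIXED | MREMAP_MAYMOVE, dst);
}

/*
 * Attempt 2: a plain mmap() of the same fd and offset over the target range,
 * so the file's (or driver's) own mmap handler runs for the new mapping.
 */
void *clone_via_mmap(int fd, off_t off, size_t len, int prot, void *dst)
{
	return mmap(dst, len, prot, MAP_SHARED | MAP_FIXED, fd, off);
}

/* Try the mremap() clone first and fall back to mmap() only if it fails. */
void *clone_mapping(void *src, int fd, off_t off, size_t len, int prot,
		    void *dst)
{
	assert(len > 0);
	void *p = clone_via_mremap(src, len, dst);
	if (p != MAP_FAILED)
		return p;
	return clone_via_mmap(fd, off, len, prot, dst);
}

/* Self-check on a temporary file. Returns 0 on success, -1 otherwise. */
int demo_clone_roundtrip(void)
{
	long page = sysconf(_SC_PAGESIZE);
	FILE *f = tmpfile();
	if (page <= 0 || !f)
		return -1;
	size_t len = (size_t)page;
	int fd = fileno(f);
	int prot = PROT_READ | PROT_WRITE;
	if (ftruncate(fd, (off_t)len) != 0) {
		fclose(f);
		return -1;
	}

	char *src = mmap(NULL, len, prot, MAP_SHARED, fd, 0);
	/* Reserve two target ranges; MAP_FIXED / MREMAP_FIXED replace them. */
	char *dst1 = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *dst2 = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED || dst1 == MAP_FAILED || dst2 == MAP_FAILED) {
		fclose(f);
		return -1;
	}
	strcpy(src, "hello");

	/* Combined helper, plus the mmap() path on its own. */
	char *a = clone_mapping(src, fd, 0, len, prot, dst1);
	char *b = clone_via_mmap(fd, 0, len, prot, dst2);
	int ok = a != MAP_FAILED && b != MAP_FAILED &&
		 strcmp(a, "hello") == 0 && strcmp(b, "hello") == 0;
	if (ok) {
		/* A write through one mapping must be visible in the others. */
		b[0] = 'J';
		ok = src[0] == 'J' && a[0] == 'J';
	}

	munmap(src, len);
	munmap(dst1, len);
	munmap(dst2, len);
	fclose(f);
	return ok ? 0 : -1;
}
```

For a device mapping, only the mmap() path would go through the driver's
handler; whether a second mmap() of the same region is acceptable is up to
that driver.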

>
> Regards,
> Akihiko Odaki

Like I said, I may look into adding some _new_ kernel functionality that gives
you what you want. I will cc you and Vivek if/when I put something forward.

Cheers, Lorenzo

