* [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping
@ 2025-11-20 5:35 Vivek Kasireddy
2025-11-20 9:04 ` Lorenzo Stoakes
0 siblings, 1 reply; 18+ messages in thread
From: Vivek Kasireddy @ 2025-11-20 5:35 UTC (permalink / raw)
To: linux-mm
Cc: Vivek Kasireddy, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand,
Akihiko Odaki
When mremap is used to create a new mapping, we should not return
-EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because
the old VMA would neither be expanded nor shrunk in this case. This
is particularly useful when trying to create a new VMA using other
existing VMAs that have these flags set, such as the ones associated
with VFIO devices.
Specifically, there are use-cases where a VMM such as Qemu would
want to map a non-contiguous buffer associated with a VFIO device
in the following way:
void *start, *cur;
int i;
start = mmap(NULL, size, PROT_NONE, MAP_SHARED, -1, 0);
if (start == MAP_FAILED) {
return start;
}
cur = start;
for (i = 0; i < iov_cnt; i++) {
if (mremap(iov[i].iov_base, 0, iov[i].iov_len,
MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) {
goto err;
}
cur += iov[i].iov_len;
}
return start;
The above code currently works when mapping buffers backed by
shmem (memfd) but fails with -EFAULT when mapping VFIO backed
buffers because the VMAs associated with iov[i].iov_base addresses
have VM_DONTEXPAND and VM_PFNMAP flags set. Therefore, fix this
issue by not returning -EFAULT when a new mapping is being created.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
mm/mremap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index fdb0485ede74..d3868d941f72 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1736,7 +1736,8 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
return -EINVAL;
- if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
+ if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP) &&
+ !vrm_implies_new_addr(vrm))
return -EFAULT;
if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta))
--
2.50.1
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-20 5:35 [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping Vivek Kasireddy @ 2025-11-20 9:04 ` Lorenzo Stoakes 2025-11-20 9:16 ` David Hildenbrand (Red Hat) 2025-11-21 6:51 ` Kasireddy, Vivek 0 siblings, 2 replies; 18+ messages in thread From: Lorenzo Stoakes @ 2025-11-20 9:04 UTC (permalink / raw) To: Vivek Kasireddy Cc: linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Akihiko Odaki Hi Vivek, thanks for the patch. In general though, let's please not make a fundamental change to mremap() behaviour in late -rc6. Late in cycle/during merge window we're really only interested in existing series, series that are less involved than this. On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: > When mremap is used to create a new mapping, we should not return > -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because > the old VMA would neither be expanded nor shrunk in this case. This I guess you're trying to be succinct here and 'clone' each input VMA using the 0 source size input. However this can't work. This operation is not equivalent to an mmap(). It may seem to be for ordinary mappings but in practice it isn't: (syscall) -> do_mremap() -> mremap_at() -> expand_vma() -> move_vma() -> copy_vma_and_data() -> copy_vma() Essentially copying the properties of the VMA to the new region. But this doesn't work for PFN map. At _no point_ are you invoking the original f_op->mmap or f_op->mmap_prepare handler. And these handles for PFN maps set up page tables, because PFN maps literally do not exist as VMAs which have properties independent of their page tables like this. Equally VM_DONTEXPAND implies the same kind of semantics. > is particularly useful when trying to create a new VMA using other > existing VMAs that have these flags set, such as the ones associated > with VFIO devices. > > Specifically, there are use-cases where a VMM such as Qemu would > want to map a non-contiguous buffer associated with a VFIO device > in the following way: > > void *start, *cur; > int i; > > start = mmap(NULL, size, PROT_NONE, MAP_SHARED, -1, 0); > if (start == MAP_FAILED) { > return start; > } Err, MAP_SHARED and -1 fd? Doesn't this always fail? It does locally with -EBADFD? > > cur = start; > for (i = 0; i < iov_cnt; i++) { > if (mremap(iov[i].iov_base, 0, iov[i].iov_len, > MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) { Trying to cleverly clone a VMA I guess. > goto err; > } > cur += iov[i].iov_len; > } > return start; > > The above code currently works when mapping buffers backed by > shmem (memfd) but fails with -EFAULT when mapping VFIO backed Right, which is expected, as described above, you _are_ expanding here. > buffers because the VMAs associated with iov[i].iov_base addresses > have VM_DONTEXPAND and VM_PFNMAP flags set. Therefore, fix this > issue by not returning -EFAULT when a new mapping is being created. We're not fixing an issue, we're (unfortunately, incorrectly) introducing an entirely new behaviour of permitting the remapping of page tables belonging to VMAs posssesing VM_DONTEXPAND and VM_PFNMAP flags. > > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Jann Horn <jannh@google.com> > Cc: Pedro Falcato <pfalcato@suse.de> > Cc: David Hildenbrand <david@kernel.org> > Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> > --- > mm/mremap.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/mm/mremap.c b/mm/mremap.c > index fdb0485ede74..d3868d941f72 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -1736,7 +1736,8 @@ static int check_prep_vma(struct vma_remap_struct *vrm) > if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) > return -EINVAL; > > - if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) > + if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP) && > + !vrm_implies_new_addr(vrm)) > return -EFAULT; This is just incorrect. I mean firstly, as explained above, we simply do not support this kind of operation and cannot correctly. But also here you're allowing !(MREMAP_FIXED | MREMAP_DONTUNMAP) VMAs to expand VM_DONTEXPAND, VM_PFNMAP VMAs which is really really not right. The missing context here is: if (new_len == old_len) return 0; ... all code below involves an expansion of a VMA ... if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) return -EFAULT; And making _all_ expands that aren't _definitely_ specifying a new address work correclty. But you are in fact 100% specifying a new address because you're setting input length of 0... So this allows _anybody_ to try to incorrectly expand VM_PFNMAP, VM_DONTEXPAND VMAs. So is wildly incorrect. > > if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta)) > -- > 2.50.1 > > In any case, I don't want to try to support this behaviour, I don't think there's really a sensible way to be sure we can _copy_ page tables correctly, and even though the source size = 0 is a cute trick you _ARE_ expanding, and VM_DONTEXPAND is very clear - do not do this. Additionally, please if trying to make such fundamental changes going forward _always_ provide test code, it's absolutely a requirement. Thanks, Lorenzo ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-20 9:04 ` Lorenzo Stoakes @ 2025-11-20 9:16 ` David Hildenbrand (Red Hat) 2025-11-20 9:35 ` Lorenzo Stoakes 2025-11-21 6:51 ` Kasireddy, Vivek 1 sibling, 1 reply; 18+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-11-20 9:16 UTC (permalink / raw) To: Lorenzo Stoakes, Vivek Kasireddy Cc: linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, Akihiko Odaki On 11/20/25 10:04, Lorenzo Stoakes wrote: > Hi Vivek, thanks for the patch. > > In general though, let's please not make a fundamental change to mremap() > behaviour in late -rc6. Late in cycle/during merge window we're really only > interested in existing series, series that are less involved than this. > > On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: >> When mremap is used to create a new mapping, we should not return >> -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because >> the old VMA would neither be expanded nor shrunk in this case. This > > I guess you're trying to be succinct here and 'clone' each input VMA using > the 0 source size input. > > However this can't work. > > This operation is not equivalent to an mmap(). It may seem to be for > ordinary mappings but in practice it isn't: > > (syscall) > -> do_mremap() > -> mremap_at() > -> expand_vma() > -> move_vma() > -> copy_vma_and_data() > -> copy_vma() > > Essentially copying the properties of the VMA to the new region. > > But this doesn't work for PFN map. > > At _no point_ are you invoking the original f_op->mmap or > f_op->mmap_prepare handler. > > And these handles for PFN maps set up page tables, because PFN maps > literally do not exist as VMAs which have properties independent of their > page tables like this. vfio-pci is a bit different, though, as it uses vmf_insert_pfn()/vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() at fault time to insert PFNs, not at mmap time using remap_pfn_range() and friends. (see vfio_pci_mmap_page_fault() ) Now, I have to idea if that is the main use case we want to target here and how we could handle it, just wanted to point it out :) -- Cheers David ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-20 9:16 ` David Hildenbrand (Red Hat) @ 2025-11-20 9:35 ` Lorenzo Stoakes 2025-11-20 9:49 ` David Hildenbrand (Red Hat) 0 siblings, 1 reply; 18+ messages in thread From: Lorenzo Stoakes @ 2025-11-20 9:35 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: Vivek Kasireddy, linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, Akihiko Odaki On Thu, Nov 20, 2025 at 10:16:26AM +0100, David Hildenbrand (Red Hat) wrote: > On 11/20/25 10:04, Lorenzo Stoakes wrote: > > Hi Vivek, thanks for the patch. > > > > In general though, let's please not make a fundamental change to mremap() > > behaviour in late -rc6. Late in cycle/during merge window we're really only > > interested in existing series, series that are less involved than this. > > > > On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: > > > When mremap is used to create a new mapping, we should not return > > > -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because > > > the old VMA would neither be expanded nor shrunk in this case. This > > > > I guess you're trying to be succinct here and 'clone' each input VMA using > > the 0 source size input. > > > > However this can't work. > > > > This operation is not equivalent to an mmap(). It may seem to be for > > ordinary mappings but in practice it isn't: > > > > (syscall) > > -> do_mremap() > > -> mremap_at() > > -> expand_vma() > > -> move_vma() > > -> copy_vma_and_data() > > -> copy_vma() > > > > Essentially copying the properties of the VMA to the new region. > > > > But this doesn't work for PFN map. > > > > At _no point_ are you invoking the original f_op->mmap or > > f_op->mmap_prepare handler. > > > > And these handles for PFN maps set up page tables, because PFN maps > > literally do not exist as VMAs which have properties independent of their > > page tables like this. > > vfio-pci is a bit different, though, as it uses > vmf_insert_pfn()/vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() at fault time to > insert PFNs, not at mmap time using remap_pfn_range() and friends. > > (see vfio_pci_mmap_page_fault() ) It sets VM_DONTEXPAND but is fine with being expanded? :) That sounds like a bug there: vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); Drop the VM_DONTEXPAND then? But on the other hand... I see a _whole bunch_ of logic in vfio_pci_core_mmap() that just wouldn't be executed on expansion (and the requested case here is 100% an expand due to not invoking mmap callbacks...) So we'd still be in a broken state here surely if we allowed it? > > Now, I have to idea if that is the main use case we want to target here and > how we could handle it, just wanted to point it out :) Right, the objections all remain, we absolutely cannot take this patch or anything like it. I guess VM_DONTEXPAND is really 'we do stuff in the (soon to be) mmap_prepare callback that is dependent on region size'. Most notably in this case: phys_len = PAGE_ALIGN(pci_resource_len(pdev, index)); ... if (req_start + req_len > phys_len) return -EINVAL; > > -- > Cheers > > David > I wouldn't be entirely opposed to a _new system call_ explicitly for cloning an existing memory region that explicitly invokes an mmap with equivalent parameters for VM_DONTEXPAND/VM_PFNMAP cases/does a copy for ordinary cases. But I'm not interested in doing that for mremap(ptr, 0, new_size, ...). An alternative would be to have a new VMA callback for expansion where the driver could explicitly do these checks :) but not sure if worth it. Cheers, Lorenzo ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-20 9:35 ` Lorenzo Stoakes @ 2025-11-20 9:49 ` David Hildenbrand (Red Hat) 2025-11-20 9:58 ` Lorenzo Stoakes 0 siblings, 1 reply; 18+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-11-20 9:49 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Vivek Kasireddy, linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, Akihiko Odaki On 11/20/25 10:35, Lorenzo Stoakes wrote: > On Thu, Nov 20, 2025 at 10:16:26AM +0100, David Hildenbrand (Red Hat) wrote: >> On 11/20/25 10:04, Lorenzo Stoakes wrote: >>> Hi Vivek, thanks for the patch. >>> >>> In general though, let's please not make a fundamental change to mremap() >>> behaviour in late -rc6. Late in cycle/during merge window we're really only >>> interested in existing series, series that are less involved than this. >>> >>> On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: >>>> When mremap is used to create a new mapping, we should not return >>>> -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because >>>> the old VMA would neither be expanded nor shrunk in this case. This >>> >>> I guess you're trying to be succinct here and 'clone' each input VMA using >>> the 0 source size input. >>> >>> However this can't work. >>> >>> This operation is not equivalent to an mmap(). It may seem to be for >>> ordinary mappings but in practice it isn't: >>> >>> (syscall) >>> -> do_mremap() >>> -> mremap_at() >>> -> expand_vma() >>> -> move_vma() >>> -> copy_vma_and_data() >>> -> copy_vma() >>> >>> Essentially copying the properties of the VMA to the new region. >>> >>> But this doesn't work for PFN map. >>> >>> At _no point_ are you invoking the original f_op->mmap or >>> f_op->mmap_prepare handler. >>> >>> And these handles for PFN maps set up page tables, because PFN maps >>> literally do not exist as VMAs which have properties independent of their >>> page tables like this. >> >> vfio-pci is a bit different, though, as it uses >> vmf_insert_pfn()/vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() at fault time to >> insert PFNs, not at mmap time using remap_pfn_range() and friends. >> >> (see vfio_pci_mmap_page_fault() ) > > It sets VM_DONTEXPAND but is fine with being expanded? :) That sounds like a > bug there: Yeah, I am all confused about expansion. The example code looks like all it wants to do is move a VM_PFNMAP mapping. if (mremap(iov[i].iov_base, 0, iov[i].iov_len, MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) { goto err; } I guess the expansion is because of iov[i].iov_len is bigger than the original VMA? Is that maybe a bug in QEMU or why are we even expanding here? -- Cheers David ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-20 9:49 ` David Hildenbrand (Red Hat) @ 2025-11-20 9:58 ` Lorenzo Stoakes 2025-11-21 3:05 ` Akihiko Odaki 2025-11-21 7:26 ` David Hildenbrand (Red Hat) 0 siblings, 2 replies; 18+ messages in thread From: Lorenzo Stoakes @ 2025-11-20 9:58 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: Vivek Kasireddy, linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, Akihiko Odaki On Thu, Nov 20, 2025 at 10:49:59AM +0100, David Hildenbrand (Red Hat) wrote: > On 11/20/25 10:35, Lorenzo Stoakes wrote: > > On Thu, Nov 20, 2025 at 10:16:26AM +0100, David Hildenbrand (Red Hat) wrote: > > > On 11/20/25 10:04, Lorenzo Stoakes wrote: > > > > Hi Vivek, thanks for the patch. > > > > > > > > In general though, let's please not make a fundamental change to mremap() > > > > behaviour in late -rc6. Late in cycle/during merge window we're really only > > > > interested in existing series, series that are less involved than this. > > > > > > > > On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: > > > > > When mremap is used to create a new mapping, we should not return > > > > > -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because > > > > > the old VMA would neither be expanded nor shrunk in this case. This > > > > > > > > I guess you're trying to be succinct here and 'clone' each input VMA using > > > > the 0 source size input. > > > > > > > > However this can't work. > > > > > > > > This operation is not equivalent to an mmap(). It may seem to be for > > > > ordinary mappings but in practice it isn't: > > > > > > > > (syscall) > > > > -> do_mremap() > > > > -> mremap_at() > > > > -> expand_vma() > > > > -> move_vma() > > > > -> copy_vma_and_data() > > > > -> copy_vma() > > > > > > > > Essentially copying the properties of the VMA to the new region. > > > > > > > > But this doesn't work for PFN map. > > > > > > > > At _no point_ are you invoking the original f_op->mmap or > > > > f_op->mmap_prepare handler. > > > > > > > > And these handles for PFN maps set up page tables, because PFN maps > > > > literally do not exist as VMAs which have properties independent of their > > > > page tables like this. > > > > > > vfio-pci is a bit different, though, as it uses > > > vmf_insert_pfn()/vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() at fault time to > > > insert PFNs, not at mmap time using remap_pfn_range() and friends. > > > > > > (see vfio_pci_mmap_page_fault() ) > > > > It sets VM_DONTEXPAND but is fine with being expanded? :) That sounds like a > > bug there: > > Yeah, I am all confused about expansion. The example code looks like all it > wants to do is move a VM_PFNMAP mapping. > > if (mremap(iov[i].iov_base, 0, iov[i].iov_len, > MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) { > goto err; > } > > I guess the expansion is because of iov[i].iov_len is bigger than the > original VMA? > > Is that maybe a bug in QEMU or why are we even expanding here? We're going from size 0 to iov[i].iov_len, which is saying 'please make a copy of this VMA at a new address'. There's never any moving, as input size is 0 :) It's a cute corner case way of using mremap(). We're basically asking for a _copy_. But you can't get a copy of a VM_DONTEXPAND/VM_PFNMAP because you need to invoke mmap_prepare (or legacy mmap) to get something sensible and you are bypassing that on expansion, even if it's a 'clone' style expansion. > > -- > Cheers > > David > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-20 9:58 ` Lorenzo Stoakes @ 2025-11-21 3:05 ` Akihiko Odaki 2025-11-21 8:03 ` Lorenzo Stoakes 2025-11-21 7:26 ` David Hildenbrand (Red Hat) 1 sibling, 1 reply; 18+ messages in thread From: Akihiko Odaki @ 2025-11-21 3:05 UTC (permalink / raw) To: Lorenzo Stoakes, David Hildenbrand (Red Hat) Cc: Vivek Kasireddy, linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato Hi, I'm another QEMU developer who have been discussing the problem motivating the mremap() usage. On 2025/11/20 18:58, Lorenzo Stoakes wrote: > On Thu, Nov 20, 2025 at 10:49:59AM +0100, David Hildenbrand (Red Hat) wrote: >> On 11/20/25 10:35, Lorenzo Stoakes wrote: >>> On Thu, Nov 20, 2025 at 10:16:26AM +0100, David Hildenbrand (Red Hat) wrote: >>>> On 11/20/25 10:04, Lorenzo Stoakes wrote: >>>>> Hi Vivek, thanks for the patch. >>>>> >>>>> In general though, let's please not make a fundamental change to mremap() >>>>> behaviour in late -rc6. Late in cycle/during merge window we're really only >>>>> interested in existing series, series that are less involved than this. >>>>> >>>>> On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: >>>>>> When mremap is used to create a new mapping, we should not return >>>>>> -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because >>>>>> the old VMA would neither be expanded nor shrunk in this case. This >>>>> >>>>> I guess you're trying to be succinct here and 'clone' each input VMA using >>>>> the 0 source size input. >>>>> >>>>> However this can't work. >>>>> >>>>> This operation is not equivalent to an mmap(). It may seem to be for >>>>> ordinary mappings but in practice it isn't: >>>>> >>>>> (syscall) >>>>> -> do_mremap() >>>>> -> mremap_at() >>>>> -> expand_vma() >>>>> -> move_vma() >>>>> -> copy_vma_and_data() >>>>> -> copy_vma() >>>>> >>>>> Essentially copying the properties of the VMA to the new region. >>>>> >>>>> But this doesn't work for PFN map. >>>>> >>>>> At _no point_ are you invoking the original f_op->mmap or >>>>> f_op->mmap_prepare handler. >>>>> >>>>> And these handles for PFN maps set up page tables, because PFN maps >>>>> literally do not exist as VMAs which have properties independent of their >>>>> page tables like this. >>>> >>>> vfio-pci is a bit different, though, as it uses >>>> vmf_insert_pfn()/vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() at fault time to >>>> insert PFNs, not at mmap time using remap_pfn_range() and friends. >>>> >>>> (see vfio_pci_mmap_page_fault() ) >>> >>> It sets VM_DONTEXPAND but is fine with being expanded? :) That sounds like a >>> bug there: >> >> Yeah, I am all confused about expansion. The example code looks like all it >> wants to do is move a VM_PFNMAP mapping. >> >> if (mremap(iov[i].iov_base, 0, iov[i].iov_len, >> MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) { >> goto err; >> } >> >> I guess the expansion is because of iov[i].iov_len is bigger than the >> original VMA? >> >> Is that maybe a bug in QEMU or why are we even expanding here? > > We're going from size 0 to iov[i].iov_len, which is saying 'please make a copy > of this VMA at a new address'. > > There's never any moving, as input size is 0 :) > > It's a cute corner case way of using mremap(). > > We're basically asking for a _copy_. But you can't get a copy of a > VM_DONTEXPAND/VM_PFNMAP because you need to invoke mmap_prepare (or legacy mmap) > to get something sensible and you are bypassing that on expansion, even if it's > a 'clone' style expansion. Apparently fork() copies VM_PFNMAP without invoking mmap_prepare or legacy mmap unless VM_DONTCOPY is set, so I wonder if mremap() can use the same logic. Regards, Akihiko Odaki ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-21 3:05 ` Akihiko Odaki @ 2025-11-21 8:03 ` Lorenzo Stoakes 2025-11-21 8:48 ` Akihiko Odaki 0 siblings, 1 reply; 18+ messages in thread From: Lorenzo Stoakes @ 2025-11-21 8:03 UTC (permalink / raw) To: Akihiko Odaki Cc: David Hildenbrand (Red Hat), Vivek Kasireddy, linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato On Fri, Nov 21, 2025 at 12:05:56PM +0900, Akihiko Odaki wrote: > Hi, > > I'm another QEMU developer who have been discussing the problem motivating > the mremap() usage. > Hmm for some reason this mail hasn't appeared at lore how strange. > On 2025/11/20 18:58, Lorenzo Stoakes wrote: > > On Thu, Nov 20, 2025 at 10:49:59AM +0100, David Hildenbrand (Red Hat) wrote: > > > On 11/20/25 10:35, Lorenzo Stoakes wrote: > > > > On Thu, Nov 20, 2025 at 10:16:26AM +0100, David Hildenbrand (Red Hat) wrote: > > > > > On 11/20/25 10:04, Lorenzo Stoakes wrote: > > > > > > Hi Vivek, thanks for the patch. > > > > > > > > > > > > In general though, let's please not make a fundamental change to mremap() > > > > > > behaviour in late -rc6. Late in cycle/during merge window we're really only > > > > > > interested in existing series, series that are less involved than this. > > > > > > > > > > > > On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: > > > > > > > When mremap is used to create a new mapping, we should not return > > > > > > > -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because > > > > > > > the old VMA would neither be expanded nor shrunk in this case. This > > > > > > > > > > > > I guess you're trying to be succinct here and 'clone' each input VMA using > > > > > > the 0 source size input. > > > > > > > > > > > > However this can't work. > > > > > > > > > > > > This operation is not equivalent to an mmap(). It may seem to be for > > > > > > ordinary mappings but in practice it isn't: > > > > > > > > > > > > (syscall) > > > > > > -> do_mremap() > > > > > > -> mremap_at() > > > > > > -> expand_vma() > > > > > > -> move_vma() > > > > > > -> copy_vma_and_data() > > > > > > -> copy_vma() > > > > > > > > > > > > Essentially copying the properties of the VMA to the new region. > > > > > > > > > > > > But this doesn't work for PFN map. > > > > > > > > > > > > At _no point_ are you invoking the original f_op->mmap or > > > > > > f_op->mmap_prepare handler. > > > > > > > > > > > > And these handles for PFN maps set up page tables, because PFN maps > > > > > > literally do not exist as VMAs which have properties independent of their > > > > > > page tables like this. > > > > > > > > > > vfio-pci is a bit different, though, as it uses > > > > > vmf_insert_pfn()/vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() at fault time to > > > > > insert PFNs, not at mmap time using remap_pfn_range() and friends. > > > > > > > > > > (see vfio_pci_mmap_page_fault() ) > > > > > > > > It sets VM_DONTEXPAND but is fine with being expanded? :) That sounds like a > > > > bug there: > > > > > > Yeah, I am all confused about expansion. The example code looks like all it > > > wants to do is move a VM_PFNMAP mapping. > > > > > > if (mremap(iov[i].iov_base, 0, iov[i].iov_len, > > > MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) { > > > goto err; > > > } > > > > > > I guess the expansion is because of iov[i].iov_len is bigger than the > > > original VMA? > > > > > > Is that maybe a bug in QEMU or why are we even expanding here? > > > > We're going from size 0 to iov[i].iov_len, which is saying 'please make a copy > > of this VMA at a new address'. > > > > There's never any moving, as input size is 0 :) > > > > It's a cute corner case way of using mremap(). > > > > We're basically asking for a _copy_. But you can't get a copy of a > > VM_DONTEXPAND/VM_PFNMAP because you need to invoke mmap_prepare (or legacy mmap) > > to get something sensible and you are bypassing that on expansion, even if it's > > a 'clone' style expansion. > > Apparently fork() copies VM_PFNMAP without invoking mmap_prepare or legacy > mmap unless VM_DONTCOPY is set, so I wonder if mremap() can use the same > logic. It's because it's literally copying page tables in the exact same range exactly as they are to the exact same virtual address. You're asking for a _brand new mapping_ of effectively _any size whatsoever_ at a _new virtual address_ while _retaining the original mapping_. Also note that you're copying the VMA exactly as-is with _all internal private metadata_ duplicated, but now in another process. It's entirely different. For better or for worse (*ahem*) we've given huge flexibility to drivers to do what they want with this stuff. Which means -literally anything- might be stored in page tables, whcih means there might be alignment requirements for the mapping, which means that page tables may be established in .mmap, .mmap_prepare, which means that internal state might be tied to the VMA that is only correctly set up in .mmap[_prepare], etc. etc. So yes - if we exactly duplicate this with everything virtual, metadata being _exactly the same_ in a _brand new process_ - with the driver _knowing_ that a fork might happen (and setting VM_DONTCOPY in cases where it doesn't want it) - then we're good. But that's something very different from 'allow arbitrary copies of the VMA'. In terms of mremap() this is very simply an expansion and we won't be supporting this kind of operation there sorry. I may go work on an idea to allow this behaviour via a new approach, but it won't be in mremap(). Note that I replied to Vivek with some ideas as to how to do this in userland (thanks to David for suggesting btw forgot to say ;) so you _should_ be able to get what you need here without needing mremap() to do something different. > > Regards, > Akihiko Odaki Cheers, Lorenzo ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-21 8:03 ` Lorenzo Stoakes @ 2025-11-21 8:48 ` Akihiko Odaki 2025-11-21 9:10 ` Lorenzo Stoakes 0 siblings, 1 reply; 18+ messages in thread From: Akihiko Odaki @ 2025-11-21 8:48 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand (Red Hat), Vivek Kasireddy, linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato On 2025/11/21 17:03, Lorenzo Stoakes wrote: > On Fri, Nov 21, 2025 at 12:05:56PM +0900, Akihiko Odaki wrote: >> Hi, >> >> I'm another QEMU developer who have been discussing the problem motivating >> the mremap() usage. >> > > Hmm for some reason this mail hasn't appeared at lore how strange. > >> On 2025/11/20 18:58, Lorenzo Stoakes wrote: >>> On Thu, Nov 20, 2025 at 10:49:59AM +0100, David Hildenbrand (Red Hat) wrote: >>>> On 11/20/25 10:35, Lorenzo Stoakes wrote: >>>>> On Thu, Nov 20, 2025 at 10:16:26AM +0100, David Hildenbrand (Red Hat) wrote: >>>>>> On 11/20/25 10:04, Lorenzo Stoakes wrote: >>>>>>> Hi Vivek, thanks for the patch. >>>>>>> >>>>>>> In general though, let's please not make a fundamental change to mremap() >>>>>>> behaviour in late -rc6. Late in cycle/during merge window we're really only >>>>>>> interested in existing series, series that are less involved than this. >>>>>>> >>>>>>> On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: >>>>>>>> When mremap is used to create a new mapping, we should not return >>>>>>>> -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because >>>>>>>> the old VMA would neither be expanded nor shrunk in this case. This >>>>>>> >>>>>>> I guess you're trying to be succinct here and 'clone' each input VMA using >>>>>>> the 0 source size input. >>>>>>> >>>>>>> However this can't work. >>>>>>> >>>>>>> This operation is not equivalent to an mmap(). It may seem to be for >>>>>>> ordinary mappings but in practice it isn't: >>>>>>> >>>>>>> (syscall) >>>>>>> -> do_mremap() >>>>>>> -> mremap_at() >>>>>>> -> expand_vma() >>>>>>> -> move_vma() >>>>>>> -> copy_vma_and_data() >>>>>>> -> copy_vma() >>>>>>> >>>>>>> Essentially copying the properties of the VMA to the new region. >>>>>>> >>>>>>> But this doesn't work for PFN map. >>>>>>> >>>>>>> At _no point_ are you invoking the original f_op->mmap or >>>>>>> f_op->mmap_prepare handler. >>>>>>> >>>>>>> And these handles for PFN maps set up page tables, because PFN maps >>>>>>> literally do not exist as VMAs which have properties independent of their >>>>>>> page tables like this. >>>>>> >>>>>> vfio-pci is a bit different, though, as it uses >>>>>> vmf_insert_pfn()/vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() at fault time to >>>>>> insert PFNs, not at mmap time using remap_pfn_range() and friends. >>>>>> >>>>>> (see vfio_pci_mmap_page_fault() ) >>>>> >>>>> It sets VM_DONTEXPAND but is fine with being expanded? :) That sounds like a >>>>> bug there: >>>> >>>> Yeah, I am all confused about expansion. The example code looks like all it >>>> wants to do is move a VM_PFNMAP mapping. >>>> >>>> if (mremap(iov[i].iov_base, 0, iov[i].iov_len, >>>> MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) { >>>> goto err; >>>> } >>>> >>>> I guess the expansion is because of iov[i].iov_len is bigger than the >>>> original VMA? >>>> >>>> Is that maybe a bug in QEMU or why are we even expanding here? >>> >>> We're going from size 0 to iov[i].iov_len, which is saying 'please make a copy >>> of this VMA at a new address'. >>> >>> There's never any moving, as input size is 0 :) >>> >>> It's a cute corner case way of using mremap(). >>> >>> We're basically asking for a _copy_. But you can't get a copy of a >>> VM_DONTEXPAND/VM_PFNMAP because you need to invoke mmap_prepare (or legacy mmap) >>> to get something sensible and you are bypassing that on expansion, even if it's >>> a 'clone' style expansion. >> >> Apparently fork() copies VM_PFNMAP without invoking mmap_prepare or legacy >> mmap unless VM_DONTCOPY is set, so I wonder if mremap() can use the same >> logic. > > It's because it's literally copying page tables in the exact same range exactly > as they are to the exact same virtual address. > > You're asking for a _brand new mapping_ of effectively _any size whatsoever_ at > a _new virtual address_ while _retaining the original mapping_. > > Also note that you're copying the VMA exactly as-is with _all internal private > metadata_ duplicated, but now in another process. > > It's entirely different. > > For better or for worse (*ahem*) we've given huge flexibility to drivers to do > what they want with this stuff. Which means -literally anything- might be stored > in page tables, whcih means there might be alignment requirements for the > mapping, which means that page tables may be established in .mmap, > .mmap_prepare, which means that internal state might be tied to the VMA that is > only correctly set up in .mmap[_prepare], etc. etc. > > So yes - if we exactly duplicate this with everything virtual, metadata being > _exactly the same_ in a _brand new process_ - with the driver _knowing_ that a > fork might happen (and setting VM_DONTCOPY in cases where it doesn't want it) - > then we're good. > > But that's something very different from 'allow arbitrary copies of the VMA'. > > In terms of mremap() this is very simply an expansion and we won't be supporting > this kind of operation there sorry. > > I may go work on an idea to allow this behaviour via a new approach, but it > won't be in mremap(). > > Note that I replied to Vivek with some ideas as to how to do this in userland > (thanks to David for suggesting btw forgot to say ;) so you _should_ be able to > get what you need here without needing mremap() to do something different. I understand that the logic to copy page table cannot be borrowed from fork(), but I thought that copy_vma_and_data() could be extended to support this scenario. If I understand it correctly, it does almost what we want; copying a VMA and page table with a new size. It also calls vma->vm_ops->mremap to let drivers know the new VMA. However it doesn't copy the page table if old_len == 0 and clears the old page table entries, which prevents using the function to copy VM_PFNMAP. So my idea is simple: change copy_vma_and_data() to copy the page table without clearing the old page table entries if !old_len && (vma->vm_flags & VM_PFNMAP). Of course we still need to respect VM_DONTEXPAND so it should be also checked that the new VMA is a subset of the old one. Can this work? Regards, Akihiko Odaki ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-21 8:48 ` Akihiko Odaki @ 2025-11-21 9:10 ` Lorenzo Stoakes 2025-11-21 10:16 ` Akihiko Odaki 0 siblings, 1 reply; 18+ messages in thread From: Lorenzo Stoakes @ 2025-11-21 9:10 UTC (permalink / raw) To: Akihiko Odaki Cc: David Hildenbrand (Red Hat), Vivek Kasireddy, linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato On Fri, Nov 21, 2025 at 05:48:33PM +0900, Akihiko Odaki wrote: > > > On 2025/11/21 17:03, Lorenzo Stoakes wrote: > > On Fri, Nov 21, 2025 at 12:05:56PM +0900, Akihiko Odaki wrote: > > > Hi, > > > > > > I'm another QEMU developer who have been discussing the problem motivating > > > the mremap() usage. > > > > > > > Hmm for some reason this mail hasn't appeared at lore how strange. > > > > > On 2025/11/20 18:58, Lorenzo Stoakes wrote: > > > > On Thu, Nov 20, 2025 at 10:49:59AM +0100, David Hildenbrand (Red Hat) wrote: > > > > > On 11/20/25 10:35, Lorenzo Stoakes wrote: > > > > > > On Thu, Nov 20, 2025 at 10:16:26AM +0100, David Hildenbrand (Red Hat) wrote: > > > > > > > On 11/20/25 10:04, Lorenzo Stoakes wrote: > > > > > > > > Hi Vivek, thanks for the patch. > > > > > > > > > > > > > > > > In general though, let's please not make a fundamental change to mremap() > > > > > > > > behaviour in late -rc6. Late in cycle/during merge window we're really only > > > > > > > > interested in existing series, series that are less involved than this. > > > > > > > > > > > > > > > > On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: > > > > > > > > > When mremap is used to create a new mapping, we should not return > > > > > > > > > -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because > > > > > > > > > the old VMA would neither be expanded nor shrunk in this case. This > > > > > > > > > > > > > > > > I guess you're trying to be succinct here and 'clone' each input VMA using > > > > > > > > the 0 source size input. > > > > > > > > > > > > > > > > However this can't work. > > > > > > > > > > > > > > > > This operation is not equivalent to an mmap(). It may seem to be for > > > > > > > > ordinary mappings but in practice it isn't: > > > > > > > > > > > > > > > > (syscall) > > > > > > > > -> do_mremap() > > > > > > > > -> mremap_at() > > > > > > > > -> expand_vma() > > > > > > > > -> move_vma() > > > > > > > > -> copy_vma_and_data() > > > > > > > > -> copy_vma() > > > > > > > > > > > > > > > > Essentially copying the properties of the VMA to the new region. > > > > > > > > > > > > > > > > But this doesn't work for PFN map. > > > > > > > > > > > > > > > > At _no point_ are you invoking the original f_op->mmap or > > > > > > > > f_op->mmap_prepare handler. > > > > > > > > > > > > > > > > And these handles for PFN maps set up page tables, because PFN maps > > > > > > > > literally do not exist as VMAs which have properties independent of their > > > > > > > > page tables like this. > > > > > > > > > > > > > > vfio-pci is a bit different, though, as it uses > > > > > > > vmf_insert_pfn()/vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() at fault time to > > > > > > > insert PFNs, not at mmap time using remap_pfn_range() and friends. > > > > > > > > > > > > > > (see vfio_pci_mmap_page_fault() ) > > > > > > > > > > > > It sets VM_DONTEXPAND but is fine with being expanded? :) That sounds like a > > > > > > bug there: > > > > > > > > > > Yeah, I am all confused about expansion. The example code looks like all it > > > > > wants to do is move a VM_PFNMAP mapping. > > > > > > > > > > if (mremap(iov[i].iov_base, 0, iov[i].iov_len, > > > > > MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) { > > > > > goto err; > > > > > } > > > > > > > > > > I guess the expansion is because of iov[i].iov_len is bigger than the > > > > > original VMA? > > > > > > > > > > Is that maybe a bug in QEMU or why are we even expanding here? > > > > > > > > We're going from size 0 to iov[i].iov_len, which is saying 'please make a copy > > > > of this VMA at a new address'. > > > > > > > > There's never any moving, as input size is 0 :) > > > > > > > > It's a cute corner case way of using mremap(). > > > > > > > > We're basically asking for a _copy_. But you can't get a copy of a > > > > VM_DONTEXPAND/VM_PFNMAP because you need to invoke mmap_prepare (or legacy mmap) > > > > to get something sensible and you are bypassing that on expansion, even if it's > > > > a 'clone' style expansion. > > > > > > Apparently fork() copies VM_PFNMAP without invoking mmap_prepare or legacy > > > mmap unless VM_DONTCOPY is set, so I wonder if mremap() can use the same > > > logic. > > > > It's because it's literally copying page tables in the exact same range exactly > > as they are to the exact same virtual address. > > > > You're asking for a _brand new mapping_ of effectively _any size whatsoever_ at > > a _new virtual address_ while _retaining the original mapping_. > > > > Also note that you're copying the VMA exactly as-is with _all internal private > > metadata_ duplicated, but now in another process. > > > > It's entirely different. > > > > For better or for worse (*ahem*) we've given huge flexibility to drivers to do > > what they want with this stuff. Which means -literally anything- might be stored > > in page tables, whcih means there might be alignment requirements for the > > mapping, which means that page tables may be established in .mmap, > > .mmap_prepare, which means that internal state might be tied to the VMA that is > > only correctly set up in .mmap[_prepare], etc. etc. > > > > So yes - if we exactly duplicate this with everything virtual, metadata being > > _exactly the same_ in a _brand new process_ - with the driver _knowing_ that a > > fork might happen (and setting VM_DONTCOPY in cases where it doesn't want it) - > > then we're good. > > > > But that's something very different from 'allow arbitrary copies of the VMA'. > > > > In terms of mremap() this is very simply an expansion and we won't be supporting > > this kind of operation there sorry. > > > > I may go work on an idea to allow this behaviour via a new approach, but it > > won't be in mremap(). > > > > Note that I replied to Vivek with some ideas as to how to do this in userland > > (thanks to David for suggesting btw forgot to say ;) so you _should_ be able to > > get what you need here without needing mremap() to do something different. > > I understand that the logic to copy page table cannot be borrowed from > fork(), but I thought that copy_vma_and_data() could be extended to support > this scenario. > > If I understand it correctly, it does almost what we want; copying a VMA and > page table with a new size. It also calls vma->vm_ops->mremap to let drivers > know the new VMA. However it doesn't copy the page table if old_len == 0 and > clears the old page table entries, which prevents using the function to copy > VM_PFNMAP. It doesn't almost do what we want at all. All the drivers known VM_PFNMAP and VM_DONTEXPAND will _not_ be mremap()'d so unless you have a time machine I don't know about we can't in any way take the existence of this callback to be meaningful here :) > > So my idea is simple: change copy_vma_and_data() to copy the page table > without clearing the old page table entries if !old_len && (vma->vm_flags & > VM_PFNMAP). No, absolutely not. I already went over the reasons, but to highlight: - There may be alignment requirements that are no longer fulfilled. - There may be metadata associated with the VMA that no longer exists in the copied VMA. - There may be some requirement that only one mapping exists at a time of the given range. And who knows what else. we give drivers a great deal of freedom to do what they want with these callbacks. We've built in the assumption that: - VM_PFNMAP means .mmap[_prepare] will _always_ be called for any new mapping. - VM_DONTEXPAND means that we will _never_ mremap() in a way that _copies_ the VMA. Now these semantics are non-obvious and may be inconvenient, but that ship has sailed, and trying to do something different now is broken. I don't particularly fancy auditing every single driver for this behaviour (inevitably missing some) either. I am already having to do this for .mmap in my .mmap_prepare work and that was... already an 'interesting' addition to my workload :) Also to be clear, as perhaps I've not been quite firm enough - I will NAK any patch that tries to bolt on more 'special behaviour' to mremap(). It already has enough of that, if we had that time machine I mentioned I would never have allowed this ridiculous 'cute' mremap(ptr, 0, new_size, ...) behaviour. Note that we explicitly disallow it for anon mappings, so there's already non-obvious caveats on top of caveats on top of caveats. There will be absolutely no more of this :) > > Of course we still need to respect VM_DONTEXPAND so it should be also > checked that the new VMA is a subset of the old one. Yeah no, sorry. > > Can this work? Nope, but I + David already suggested a way forward that should work - just mmap() something new utilising the existing fd. You could even explicit try to do this only when the mremap()-clone behaviour fails. I leave exploring the details of this to you guys ;) > > Regards, > Akihiko Odaki Like I said, I may look into adding some _new_ kernel functionality that gives you what you want. I will cc you and Vivek if/when I put something forward. Cheers, Lorenzo ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-21 9:10 ` Lorenzo Stoakes @ 2025-11-21 10:16 ` Akihiko Odaki 2025-11-21 10:52 ` Lorenzo Stoakes 0 siblings, 1 reply; 18+ messages in thread From: Akihiko Odaki @ 2025-11-21 10:16 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand (Red Hat), Vivek Kasireddy, linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato On 2025/11/21 18:10, Lorenzo Stoakes wrote: > On Fri, Nov 21, 2025 at 05:48:33PM +0900, Akihiko Odaki wrote: >> >> >> On 2025/11/21 17:03, Lorenzo Stoakes wrote: >>> On Fri, Nov 21, 2025 at 12:05:56PM +0900, Akihiko Odaki wrote: >>>> Hi, >>>> >>>> I'm another QEMU developer who have been discussing the problem motivating >>>> the mremap() usage. >>>> >>> >>> Hmm for some reason this mail hasn't appeared at lore how strange. >>> >>>> On 2025/11/20 18:58, Lorenzo Stoakes wrote: >>>>> On Thu, Nov 20, 2025 at 10:49:59AM +0100, David Hildenbrand (Red Hat) wrote: >>>>>> On 11/20/25 10:35, Lorenzo Stoakes wrote: >>>>>>> On Thu, Nov 20, 2025 at 10:16:26AM +0100, David Hildenbrand (Red Hat) wrote: >>>>>>>> On 11/20/25 10:04, Lorenzo Stoakes wrote: >>>>>>>>> Hi Vivek, thanks for the patch. >>>>>>>>> >>>>>>>>> In general though, let's please not make a fundamental change to mremap() >>>>>>>>> behaviour in late -rc6. Late in cycle/during merge window we're really only >>>>>>>>> interested in existing series, series that are less involved than this. >>>>>>>>> >>>>>>>>> On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: >>>>>>>>>> When mremap is used to create a new mapping, we should not return >>>>>>>>>> -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because >>>>>>>>>> the old VMA would neither be expanded nor shrunk in this case. This >>>>>>>>> >>>>>>>>> I guess you're trying to be succinct here and 'clone' each input VMA using >>>>>>>>> the 0 source size input. >>>>>>>>> >>>>>>>>> However this can't work. >>>>>>>>> >>>>>>>>> This operation is not equivalent to an mmap(). It may seem to be for >>>>>>>>> ordinary mappings but in practice it isn't: >>>>>>>>> >>>>>>>>> (syscall) >>>>>>>>> -> do_mremap() >>>>>>>>> -> mremap_at() >>>>>>>>> -> expand_vma() >>>>>>>>> -> move_vma() >>>>>>>>> -> copy_vma_and_data() >>>>>>>>> -> copy_vma() >>>>>>>>> >>>>>>>>> Essentially copying the properties of the VMA to the new region. >>>>>>>>> >>>>>>>>> But this doesn't work for PFN map. >>>>>>>>> >>>>>>>>> At _no point_ are you invoking the original f_op->mmap or >>>>>>>>> f_op->mmap_prepare handler. >>>>>>>>> >>>>>>>>> And these handles for PFN maps set up page tables, because PFN maps >>>>>>>>> literally do not exist as VMAs which have properties independent of their >>>>>>>>> page tables like this. >>>>>>>> >>>>>>>> vfio-pci is a bit different, though, as it uses >>>>>>>> vmf_insert_pfn()/vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() at fault time to >>>>>>>> insert PFNs, not at mmap time using remap_pfn_range() and friends. >>>>>>>> >>>>>>>> (see vfio_pci_mmap_page_fault() ) >>>>>>> >>>>>>> It sets VM_DONTEXPAND but is fine with being expanded? :) That sounds like a >>>>>>> bug there: >>>>>> >>>>>> Yeah, I am all confused about expansion. The example code looks like all it >>>>>> wants to do is move a VM_PFNMAP mapping. >>>>>> >>>>>> if (mremap(iov[i].iov_base, 0, iov[i].iov_len, >>>>>> MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) { >>>>>> goto err; >>>>>> } >>>>>> >>>>>> I guess the expansion is because of iov[i].iov_len is bigger than the >>>>>> original VMA? >>>>>> >>>>>> Is that maybe a bug in QEMU or why are we even expanding here? >>>>> >>>>> We're going from size 0 to iov[i].iov_len, which is saying 'please make a copy >>>>> of this VMA at a new address'. >>>>> >>>>> There's never any moving, as input size is 0 :) >>>>> >>>>> It's a cute corner case way of using mremap(). >>>>> >>>>> We're basically asking for a _copy_. But you can't get a copy of a >>>>> VM_DONTEXPAND/VM_PFNMAP because you need to invoke mmap_prepare (or legacy mmap) >>>>> to get something sensible and you are bypassing that on expansion, even if it's >>>>> a 'clone' style expansion. >>>> >>>> Apparently fork() copies VM_PFNMAP without invoking mmap_prepare or legacy >>>> mmap unless VM_DONTCOPY is set, so I wonder if mremap() can use the same >>>> logic. >>> >>> It's because it's literally copying page tables in the exact same range exactly >>> as they are to the exact same virtual address. >>> >>> You're asking for a _brand new mapping_ of effectively _any size whatsoever_ at >>> a _new virtual address_ while _retaining the original mapping_. >>> >>> Also note that you're copying the VMA exactly as-is with _all internal private >>> metadata_ duplicated, but now in another process. >>> >>> It's entirely different. >>> >>> For better or for worse (*ahem*) we've given huge flexibility to drivers to do >>> what they want with this stuff. Which means -literally anything- might be stored >>> in page tables, whcih means there might be alignment requirements for the >>> mapping, which means that page tables may be established in .mmap, >>> .mmap_prepare, which means that internal state might be tied to the VMA that is >>> only correctly set up in .mmap[_prepare], etc. etc. >>> >>> So yes - if we exactly duplicate this with everything virtual, metadata being >>> _exactly the same_ in a _brand new process_ - with the driver _knowing_ that a >>> fork might happen (and setting VM_DONTCOPY in cases where it doesn't want it) - >>> then we're good. >>> >>> But that's something very different from 'allow arbitrary copies of the VMA'. >>> >>> In terms of mremap() this is very simply an expansion and we won't be supporting >>> this kind of operation there sorry. >>> >>> I may go work on an idea to allow this behaviour via a new approach, but it >>> won't be in mremap(). >>> >>> Note that I replied to Vivek with some ideas as to how to do this in userland >>> (thanks to David for suggesting btw forgot to say ;) so you _should_ be able to >>> get what you need here without needing mremap() to do something different. >> >> I understand that the logic to copy page table cannot be borrowed from >> fork(), but I thought that copy_vma_and_data() could be extended to support >> this scenario. >> >> If I understand it correctly, it does almost what we want; copying a VMA and >> page table with a new size. It also calls vma->vm_ops->mremap to let drivers >> know the new VMA. However it doesn't copy the page table if old_len == 0 and >> clears the old page table entries, which prevents using the function to copy >> VM_PFNMAP. > > It doesn't almost do what we want at all. All the drivers known VM_PFNMAP and > VM_DONTEXPAND will _not_ be mremap()'d so unless you have a time machine I don't > know about we can't in any way take the existence of this callback to be > meaningful here :) Correct me if I'm wrong, but looking at check_prep_vma(), VM_PFNMAP is checked only if MREMAP_DONTUNMAP is set or expansion is requested. So it is already possible to move and shrink VM_PFNMAP, and if you need to e.g., assert alignment requirements or synchronize metadata. That said... > >> >> So my idea is simple: change copy_vma_and_data() to copy the page table >> without clearing the old page table entries if !old_len && (vma->vm_flags & >> VM_PFNMAP). > > No, absolutely not. > > I already went over the reasons, but to highlight: > > - There may be alignment requirements that are no longer fulfilled. > > - There may be metadata associated with the VMA that no longer exists in the > copied VMA. > > - There may be some requirement that only one mapping exists at a time of the > given range. Obviously what I suggested goes against "the only one mapping exists at a time" so, taking that into account, I agree that it will not work. > > And who knows what else. > > we give drivers a great deal of freedom to do what they want with these > callbacks. We've built in the assumption that: > > - VM_PFNMAP means .mmap[_prepare] will _always_ be called for any new mapping. > - VM_DONTEXPAND means that we will _never_ mremap() in a way that _copies_ the > VMA. > > Now these semantics are non-obvious and may be inconvenient, but that ship has > sailed, and trying to do something different now is broken. > > I don't particularly fancy auditing every single driver for this behaviour > (inevitably missing some) either. I am already having to do this for .mmap in my > .mmap_prepare work and that was... already an 'interesting' addition to my > workload :) > > Also to be clear, as perhaps I've not been quite firm enough - I will NAK any > patch that tries to bolt on more 'special behaviour' to mremap(). > > It already has enough of that, if we had that time machine I mentioned I would > never have allowed this ridiculous 'cute' mremap(ptr, 0, new_size, ...) > behaviour. > > Note that we explicitly disallow it for anon mappings, so there's already > non-obvious caveats on top of caveats on top of caveats. > > There will be absolutely no more of this :) > >> >> Of course we still need to respect VM_DONTEXPAND so it should be also >> checked that the new VMA is a subset of the old one. > > Yeah no, sorry. > >> >> Can this work? > > Nope, but I + David already suggested a way forward that should work - just > mmap() something new utilising the existing fd. > > You could even explicit try to do this only when the mremap()-clone behaviour > fails. > > I leave exploring the details of this to you guys ;) > >> >> Regards, >> Akihiko Odaki > > Like I said, I may look into adding some _new_ kernel functionality that gives > you what you want. I will cc you and Vivek if/when I put something forward. Thank you. Regards, Akihiko Odaki ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-21 10:16 ` Akihiko Odaki @ 2025-11-21 10:52 ` Lorenzo Stoakes 0 siblings, 0 replies; 18+ messages in thread From: Lorenzo Stoakes @ 2025-11-21 10:52 UTC (permalink / raw) To: Akihiko Odaki Cc: David Hildenbrand (Red Hat), Vivek Kasireddy, linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato On Fri, Nov 21, 2025 at 07:16:39PM +0900, Akihiko Odaki wrote: > On 2025/11/21 18:10, Lorenzo Stoakes wrote: > > It doesn't almost do what we want at all. All the drivers known VM_PFNMAP and > > VM_DONTEXPAND will _not_ be mremap()'d so unless you have a time machine I don't > > know about we can't in any way take the existence of this callback to be > > meaningful here :) > > Correct me if I'm wrong, but looking at check_prep_vma(), VM_PFNMAP is > checked only if MREMAP_DONTUNMAP is set or expansion is requested. Actually it is the case that a straight move is permitted you're right. So virtual alignment not a concern then. I should make the logic clearer on this. Metadata state + any assumption about multiple mappings remains problematic. > > So it is already possible to move and shrink VM_PFNMAP, and if you need to > e.g., assert alignment requirements or synchronize metadata. Well no re: metadata, because you're _moving_ an existing mapping so the metadata is still valid. And generally anything regarding span will still be functional if you map less of the span (keep in mind you can munmap() parts of a mapping at any time you like, so this is impliled). The problem remains with a copy/expansion and having correct page table state. And again, drivers are all implemented on the assumption that things behave as they do right now. Also as I said I"m not interested in us making mremap() _even more of a complicated mess_ by supporting further weird edge case behaviours. Most people are not aware of this cloning behaviour so adding in additional very specific behaviour here is only going to lead to confusion. > > That said... > > > > > > > > > So my idea is simple: change copy_vma_and_data() to copy the page table > > > without clearing the old page table entries if !old_len && (vma->vm_flags & > > > VM_PFNMAP). > > > > No, absolutely not. > > > > I already went over the reasons, but to highlight: > > > > - There may be alignment requirements that are no longer fulfilled. We can drop this, I was mistaken as it turns out we do in fact allow moves in this case (patch will be incoming to make this clearer in the mremap code :) > > > > - There may be metadata associated with the VMA that no longer exists in the > > copied VMA. > > > > - There may be some requirement that only one mapping exists at a time of the > > given range. > > Obviously what I suggested goes against "the only one mapping exists at a > time" so, taking that into account, I agree that it will not work. Well also the metadata remains a problem. A typical usage is a reference count... > > > > Like I said, I may look into adding some _new_ kernel functionality that gives > > you what you want. I will cc you and Vivek if/when I put something forward. > > Thank you. No problem :) in the meantime you can explore using the /proc/%pid/... to do something similar in userspace, I pointed out some ideas in the other thread. > > Regards, > Akihiko Odaki Cheers, Lorenzo ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-20 9:58 ` Lorenzo Stoakes 2025-11-21 3:05 ` Akihiko Odaki @ 2025-11-21 7:26 ` David Hildenbrand (Red Hat) 1 sibling, 0 replies; 18+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-11-21 7:26 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Vivek Kasireddy, linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, Akihiko Odaki On 11/20/25 10:58, Lorenzo Stoakes wrote: > On Thu, Nov 20, 2025 at 10:49:59AM +0100, David Hildenbrand (Red Hat) wrote: >> On 11/20/25 10:35, Lorenzo Stoakes wrote: >>> On Thu, Nov 20, 2025 at 10:16:26AM +0100, David Hildenbrand (Red Hat) wrote: >>>> On 11/20/25 10:04, Lorenzo Stoakes wrote: >>>>> Hi Vivek, thanks for the patch. >>>>> >>>>> In general though, let's please not make a fundamental change to mremap() >>>>> behaviour in late -rc6. Late in cycle/during merge window we're really only >>>>> interested in existing series, series that are less involved than this. >>>>> >>>>> On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: >>>>>> When mremap is used to create a new mapping, we should not return >>>>>> -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set because >>>>>> the old VMA would neither be expanded nor shrunk in this case. This >>>>> >>>>> I guess you're trying to be succinct here and 'clone' each input VMA using >>>>> the 0 source size input. >>>>> >>>>> However this can't work. >>>>> >>>>> This operation is not equivalent to an mmap(). It may seem to be for >>>>> ordinary mappings but in practice it isn't: >>>>> >>>>> (syscall) >>>>> -> do_mremap() >>>>> -> mremap_at() >>>>> -> expand_vma() >>>>> -> move_vma() >>>>> -> copy_vma_and_data() >>>>> -> copy_vma() >>>>> >>>>> Essentially copying the properties of the VMA to the new region. >>>>> >>>>> But this doesn't work for PFN map. >>>>> >>>>> At _no point_ are you invoking the original f_op->mmap or >>>>> f_op->mmap_prepare handler. >>>>> >>>>> And these handles for PFN maps set up page tables, because PFN maps >>>>> literally do not exist as VMAs which have properties independent of their >>>>> page tables like this. >>>> >>>> vfio-pci is a bit different, though, as it uses >>>> vmf_insert_pfn()/vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() at fault time to >>>> insert PFNs, not at mmap time using remap_pfn_range() and friends. >>>> >>>> (see vfio_pci_mmap_page_fault() ) >>> >>> It sets VM_DONTEXPAND but is fine with being expanded? :) That sounds like a >>> bug there: >> >> Yeah, I am all confused about expansion. The example code looks like all it >> wants to do is move a VM_PFNMAP mapping. >> >> if (mremap(iov[i].iov_base, 0, iov[i].iov_len, >> MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) { >> goto err; >> } >> >> I guess the expansion is because of iov[i].iov_len is bigger than the >> original VMA? >> >> Is that maybe a bug in QEMU or why are we even expanding here? > > We're going from size 0 to iov[i].iov_len, which is saying 'please make a copy > of this VMA at a new address'. > > There's never any moving, as input size is 0 :) Ah, so it is indeed cloning. The cloning as part of a "remap" operation is really confusing. > > It's a cute corner case way of using mremap(). > > We're basically asking for a _copy_. But you can't get a copy of a > VM_DONTEXPAND/VM_PFNMAP because you need to invoke mmap_prepare (or legacy mmap) > to get something sensible and you are bypassing that on expansion, even if it's > a 'clone' style expansion. Yes, agreed. As Akihiko writes, what they want to achieve resemble a bit what fork() does. But there, the flow is rather different. -- Cheers David ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-20 9:04 ` Lorenzo Stoakes 2025-11-20 9:16 ` David Hildenbrand (Red Hat) @ 2025-11-21 6:51 ` Kasireddy, Vivek 2025-11-21 7:52 ` Lorenzo Stoakes 1 sibling, 1 reply; 18+ messages in thread From: Kasireddy, Vivek @ 2025-11-21 6:51 UTC (permalink / raw) To: Lorenzo Stoakes Cc: linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Akihiko Odaki Hi Lorenzo, > Subject: Re: [PATCH] mm/mremap: allow VMAs with > VM_DONTEXPAND|VM_PFNMAP when creating new mapping > > Hi Vivek, thanks for the patch. > > In general though, let's please not make a fundamental change to mremap() > behaviour in late -rc6. Late in cycle/during merge window we're really only > interested in existing series, series that are less involved than this. I am sorry about the timing of this patch. My intended goal was to just start a discussion to address this issue we were seeing in Qemu. > > On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: > > When mremap is used to create a new mapping, we should not return > > -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set > because > > the old VMA would neither be expanded nor shrunk in this case. This > > I guess you're trying to be succinct here and 'clone' each input VMA using > the 0 source size input. Yes, cloning would be the right term to describe what we are doing here. > > However this can't work. > > This operation is not equivalent to an mmap(). It may seem to be for > ordinary mappings but in practice it isn't: > > (syscall) > -> do_mremap() > -> mremap_at() > -> expand_vma() > -> move_vma() > -> copy_vma_and_data() > -> copy_vma() > > Essentially copying the properties of the VMA to the new region. > > But this doesn't work for PFN map. > > At _no point_ are you invoking the original f_op->mmap or > f_op->mmap_prepare handler. If all the properties of the original VMA are going to be copied over to the new VMA, wouldn't that mean whatever state, page tables were setup via f_op->mmap or f_op->mmap_prepare handlers for the original VMA, would also be copied over to the new VMA right? > > And these handles for PFN maps set up page tables, because PFN maps > literally do not exist as VMAs which have properties independent of their > page tables like this. > > Equally VM_DONTEXPAND implies the same kind of semantics. > > > is particularly useful when trying to create a new VMA using other > > existing VMAs that have these flags set, such as the ones associated > > with VFIO devices. > > > > Specifically, there are use-cases where a VMM such as Qemu would > > want to map a non-contiguous buffer associated with a VFIO device > > in the following way: > > > > void *start, *cur; > > int i; > > > > start = mmap(NULL, size, PROT_NONE, MAP_SHARED, -1, 0); > > if (start == MAP_FAILED) { > > return start; > > } > > Err, MAP_SHARED and -1 fd? Doesn't this always fail? It does locally with > -EBADFD? Sorry, I was trying different things and did not copy the version that had the MAP_ANONYMOUS flag to the commit message. In other words, I had this in my test code, which works: start = mmap(NULL, size, PROT_NONE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); if (start == MAP_FAILED) { return start; } > > > > > cur = start; > > for (i = 0; i < iov_cnt; i++) { > > if (mremap(iov[i].iov_base, 0, iov[i].iov_len, > > MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) { > > Trying to cleverly clone a VMA I guess. > > > goto err; > > } > > cur += iov[i].iov_len; > > } > > return start; > > > > The above code currently works when mapping buffers backed by > > shmem (memfd) but fails with -EFAULT when mapping VFIO backed > > Right, which is expected, as described above, you _are_ expanding here. If the original VMA would remain unaltered and all we are doing is cloning it to a new address, would that still be considered expansion of the original VMA? > > > buffers because the VMAs associated with iov[i].iov_base addresses > > have VM_DONTEXPAND and VM_PFNMAP flags set. Therefore, fix this > > issue by not returning -EFAULT when a new mapping is being created. > > We're not fixing an issue, we're (unfortunately, incorrectly) introducing > an entirely new behaviour of permitting the remapping of page tables > belonging to VMAs posssesing VM_DONTEXPAND and VM_PFNMAP flags. I mentioned fixing because with this patch applied mremap no longer fails with VFIO backed buffers and my use-case works without any further issues. However, I do understand that just because it works doesn't mean anything and that there may be other pitfalls that I did not run into, but I am wondering what those are? > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > Cc: Vlastimil Babka <vbabka@suse.cz> > > Cc: Jann Horn <jannh@google.com> > > Cc: Pedro Falcato <pfalcato@suse.de> > > Cc: David Hildenbrand <david@kernel.org> > > Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> > > --- > > mm/mremap.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/mm/mremap.c b/mm/mremap.c > > index fdb0485ede74..d3868d941f72 100644 > > --- a/mm/mremap.c > > +++ b/mm/mremap.c > > @@ -1736,7 +1736,8 @@ static int check_prep_vma(struct > vma_remap_struct *vrm) > > if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) > > return -EINVAL; > > > > - if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) > > + if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP) && > > + !vrm_implies_new_addr(vrm)) > > return -EFAULT; > > This is just incorrect. I mean firstly, as explained above, we simply do > not support this kind of operation and cannot correctly. > > But also here you're allowing !(MREMAP_FIXED | MREMAP_DONTUNMAP) > VMAs to > expand VM_DONTEXPAND, VM_PFNMAP VMAs which is really really not > right. > > The missing context here is: > > if (new_len == old_len) > return 0; > > ... all code below involves an expansion of a VMA ... > > if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) > return -EFAULT; > > And making _all_ expands that aren't _definitely_ specifying a new address > work correclty. But you are in fact 100% specifying a new address because > you're setting input length of 0... > > So this allows _anybody_ to try to incorrectly expand VM_PFNMAP, > VM_DONTEXPAND VMAs. So is wildly incorrect. Is the problem here with VM_DONTEXPAND or VM_PFNMAP or both? In other words, assuming the original VMA did not have VM_DONTEXPAND, would this behavior (what I am trying to do with mremap) be acceptable? > > > > > if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta)) > > -- > > 2.50.1 > > > > > > In any case, I don't want to try to support this behaviour, I don't think > there's really a sensible way to be sure we can _copy_ page tables > correctly, and even though the source size = 0 is a cute trick you _ARE_ > expanding, and VM_DONTEXPAND is very clear - do not do this. So, the thing that is still not clear to me is whether cloning a VMA would be considered a form of expansion (of the original VMA) forbidden by VM_DONTEXPAND? > > Additionally, please if trying to make such fundamental changes going > forward _always_ provide test code, it's absolutely a requirement. Ok, will do. Just to confirm, your suggestion is to add a new mm selftest to exercise the targeted codepath right? Thanks, Vivek > > Thanks, Lorenzo ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-21 6:51 ` Kasireddy, Vivek @ 2025-11-21 7:52 ` Lorenzo Stoakes 2025-11-21 8:13 ` David Hildenbrand (Red Hat) 0 siblings, 1 reply; 18+ messages in thread From: Lorenzo Stoakes @ 2025-11-21 7:52 UTC (permalink / raw) To: Kasireddy, Vivek Cc: linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, David Hildenbrand, Akihiko Odaki On Fri, Nov 21, 2025 at 06:51:31AM +0000, Kasireddy, Vivek wrote: > Hi Lorenzo, > > > Subject: Re: [PATCH] mm/mremap: allow VMAs with > > VM_DONTEXPAND|VM_PFNMAP when creating new mapping > > > > Hi Vivek, thanks for the patch. > > > > In general though, let's please not make a fundamental change to mremap() > > behaviour in late -rc6. Late in cycle/during merge window we're really only > > interested in existing series, series that are less involved than this. > I am sorry about the timing of this patch. My intended goal was to just start > a discussion to address this issue we were seeing in Qemu. No problem, just underlining here :) No reason you'd be aware of this. We are trying slowly to manage workload. Slowly :) > > > > > On Wed, Nov 19, 2025 at 09:35:46PM -0800, Vivek Kasireddy wrote: > > > When mremap is used to create a new mapping, we should not return > > > -EFAULT for VMAs with VM_DONTEXPAND or VM_PFNMAP flags set > > because > > > the old VMA would neither be expanded nor shrunk in this case. This > > > > I guess you're trying to be succinct here and 'clone' each input VMA using > > the 0 source size input. > Yes, cloning would be the right term to describe what we are doing here. Right. > > > > > However this can't work. > > > > This operation is not equivalent to an mmap(). It may seem to be for > > ordinary mappings but in practice it isn't: > > > > (syscall) > > -> do_mremap() > > -> mremap_at() > > -> expand_vma() > > -> move_vma() > > -> copy_vma_and_data() > > -> copy_vma() > > > > Essentially copying the properties of the VMA to the new region. > > > > But this doesn't work for PFN map. > > > > At _no point_ are you invoking the original f_op->mmap or > > f_op->mmap_prepare handler. > If all the properties of the original VMA are going to be copied over to > the new VMA, wouldn't that mean whatever state, page tables were > setup via f_op->mmap or f_op->mmap_prepare handlers for the original > VMA, would also be copied over to the new VMA right? The problem is for these 'special' VMAs the VMA doesn't fully describe the state, some of the state ends up in the page tables in a way that cannot be reconstructed on fault. This is really why we have flags like VM_DONTEXPAND. > > > > > And these handles for PFN maps set up page tables, because PFN maps > > literally do not exist as VMAs which have properties independent of their > > page tables like this. > > > > Equally VM_DONTEXPAND implies the same kind of semantics. > > > > > is particularly useful when trying to create a new VMA using other > > > existing VMAs that have these flags set, such as the ones associated > > > with VFIO devices. > > > > > > Specifically, there are use-cases where a VMM such as Qemu would > > > want to map a non-contiguous buffer associated with a VFIO device > > > in the following way: > > > > > > void *start, *cur; > > > int i; > > > > > > start = mmap(NULL, size, PROT_NONE, MAP_SHARED, -1, 0); > > > if (start == MAP_FAILED) { > > > return start; > > > } > > > > Err, MAP_SHARED and -1 fd? Doesn't this always fail? It does locally with > > -EBADFD? > Sorry, I was trying different things and did not copy the version that had > the MAP_ANONYMOUS flag to the commit message. In other words, I had this > in my test code, which works: > start = mmap(NULL, size, PROT_NONE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); > if (start == MAP_FAILED) { > return start; > } Right yeah :) > > > > > > > > > cur = start; > > > for (i = 0; i < iov_cnt; i++) { > > > if (mremap(iov[i].iov_base, 0, iov[i].iov_len, > > > MREMAP_FIXED | MREMAP_MAYMOVE, cur) == MAP_FAILED) { > > > > Trying to cleverly clone a VMA I guess. > > > > > goto err; > > > } > > > cur += iov[i].iov_len; > > > } > > > return start; > > > > > > The above code currently works when mapping buffers backed by > > > shmem (memfd) but fails with -EFAULT when mapping VFIO backed > > > > Right, which is expected, as described above, you _are_ expanding here. > If the original VMA would remain unaltered and all we are doing is > cloning it to a new address, would that still be considered expansion > of the original VMA? Yes. You're saying 'make this 0 size VMA size N'. N > 0, so it's an expansion. It's the same thing as if you were expanding N to N + pgsize - you are trying to map a range without having invoked .mmap[_prepare], but the .mmap[_prepare] callback establishes page table state (and performs checks) that are not encdoed in the VMA metadata and are critical to the mapping being valid. > > > > > > buffers because the VMAs associated with iov[i].iov_base addresses > > > have VM_DONTEXPAND and VM_PFNMAP flags set. Therefore, fix this > > > issue by not returning -EFAULT when a new mapping is being created. > > > > We're not fixing an issue, we're (unfortunately, incorrectly) introducing > > an entirely new behaviour of permitting the remapping of page tables > > belonging to VMAs posssesing VM_DONTEXPAND and VM_PFNMAP flags. > I mentioned fixing because with this patch applied mremap no longer fails > with VFIO backed buffers and my use-case works without any further issues. > However, I do understand that just because it works doesn't mean anything > and that there may be other pitfalls that I did not run into, but I am wondering > what those are? It absolutely will break, see the .mmap callback for vfio, you can go past the end of the physical range... Your change makes expansion of non-expandable VMAs valid in all cases where you aren't explicitly setting a fixed address. This will break in all cases except where you get lucky. This is obviously a non-viable solution. > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > Cc: Jann Horn <jannh@google.com> > > > Cc: Pedro Falcato <pfalcato@suse.de> > > > Cc: David Hildenbrand <david@kernel.org> > > > Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> > > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> > > > --- > > > mm/mremap.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/mm/mremap.c b/mm/mremap.c > > > index fdb0485ede74..d3868d941f72 100644 > > > --- a/mm/mremap.c > > > +++ b/mm/mremap.c > > > @@ -1736,7 +1736,8 @@ static int check_prep_vma(struct > > vma_remap_struct *vrm) > > > if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) > > > return -EINVAL; > > > > > > - if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) > > > + if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP) && > > > + !vrm_implies_new_addr(vrm)) > > > return -EFAULT; > > > > This is just incorrect. I mean firstly, as explained above, we simply do > > not support this kind of operation and cannot correctly. > > > > But also here you're allowing !(MREMAP_FIXED | MREMAP_DONTUNMAP) > > VMAs to > > expand VM_DONTEXPAND, VM_PFNMAP VMAs which is really really not > > right. > > > > The missing context here is: > > > > if (new_len == old_len) > > return 0; > > > > ... all code below involves an expansion of a VMA ... > > > > if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) > > return -EFAULT; > > > > And making _all_ expands that aren't _definitely_ specifying a new address > > work correclty. But you are in fact 100% specifying a new address because > > you're setting input length of 0... > > > > So this allows _anybody_ to try to incorrectly expand VM_PFNMAP, > > VM_DONTEXPAND VMAs. So is wildly incorrect. > Is the problem here with VM_DONTEXPAND or VM_PFNMAP or both? > In other words, assuming the original VMA did not have VM_DONTEXPAND, > would this behavior (what I am trying to do with mremap) be acceptable? > > > > > > > > > if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta)) > > > -- > > > 2.50.1 > > > > > > > > > > In any case, I don't want to try to support this behaviour, I don't think > > there's really a sensible way to be sure we can _copy_ page tables > > correctly, and even though the source size = 0 is a cute trick you _ARE_ > > expanding, and VM_DONTEXPAND is very clear - do not do this. > So, the thing that is still not clear to me is whether cloning a VMA would be > considered a form of expansion (of the original VMA) forbidden by > VM_DONTEXPAND? I feel I've covered this above :) > > > > > Additionally, please if trying to make such fundamental changes going > > forward _always_ provide test code, it's absolutely a requirement. > Ok, will do. Just to confirm, your suggestion is to add a new mm selftest > to exercise the targeted codepath right? No please don't, your patch isn't upstreamable. I hate to say it, but I guess it's not come across clearly :P NAK to your patch or anything like it sorry :( > > Thanks, > Vivek > > > > > Thanks, Lorenzo > Another way round here might be to try to use userland to figure out the file that a mapping belongs to via e.g. /proc/$pid/map_files, determine the attributes from /proc/$pid/maps and then generate a new mmap() call with the same properties. Cheers, Lorenzo ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-21 7:52 ` Lorenzo Stoakes @ 2025-11-21 8:13 ` David Hildenbrand (Red Hat) 2025-11-21 15:03 ` Liam R. Howlett 0 siblings, 1 reply; 18+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-11-21 8:13 UTC (permalink / raw) To: Lorenzo Stoakes, Kasireddy, Vivek Cc: linux-mm, Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato, Akihiko Odaki > > Another way round here might be to try to use userland to figure out the file > that a mapping belongs to via e.g. /proc/$pid/map_files, determine the > attributes from /proc/$pid/maps and then generate a new mmap() call with the > same properties. Right, if we could find some way to just obtain the file/fd we could just avoid mremap altogether. I recall that CRIU used something similar to obtain the fd of a MAP_SHARED|MAP_ANON mapping (anon shmem) VMA, but my memory is a bit vague on that one. -- Cheers David ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-21 8:13 ` David Hildenbrand (Red Hat) @ 2025-11-21 15:03 ` Liam R. Howlett 2025-11-22 6:56 ` Kasireddy, Vivek 0 siblings, 1 reply; 18+ messages in thread From: Liam R. Howlett @ 2025-11-21 15:03 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: Lorenzo Stoakes, Kasireddy, Vivek, linux-mm, Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato, Akihiko Odaki * David Hildenbrand (Red Hat) <david@kernel.org> [251121 03:13]: > > > > Another way round here might be to try to use userland to figure out the file > > that a mapping belongs to via e.g. /proc/$pid/map_files, determine the > > attributes from /proc/$pid/maps and then generate a new mmap() call with the > > same properties. > > Right, if we could find some way to just obtain the file/fd we could just > avoid mremap altogether. > > I recall that CRIU used something similar to obtain the fd of a > MAP_SHARED|MAP_ANON mapping (anon shmem) VMA, but my memory is a bit vague > on that one. There is an ioctl query method to get information on VMAs since 6.11. Check Documentation/filesystems/proc.rst for this: Starting with 6.11 kernel, /proc/PID/maps provides an alternative ioctl()-based API that gives ability to flexibly and efficiently query and filter individual VMAs. This interface is binary and is meant for more efficient and easy programmatic use. `struct procmap_query`, defined in linux/fs.h UAPI header, serves as an input/output argument to the `PROCMAP_QUERY` ioctl() command. See comments in linus/fs.h UAPI header for details on query semantics, supported flags, data returned, and general API usage information. ... maybe that should be linux/fs.h and not linus/fs.h. The actual file seems to be include/uapi/linux/fs.h (grepping for procmap_query). Maybe you could make use of that ioctl to get what you want? If not, maybe look at adding what you need there would be more of the change you want to make? Thanks, Liam ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping 2025-11-21 15:03 ` Liam R. Howlett @ 2025-11-22 6:56 ` Kasireddy, Vivek 0 siblings, 0 replies; 18+ messages in thread From: Kasireddy, Vivek @ 2025-11-22 6:56 UTC (permalink / raw) To: Liam R. Howlett, David Hildenbrand (Red Hat), lorenzo.stoakes Cc: linux-mm, Andrew Morton, Vlastimil Babka, Jann Horn, Pedro Falcato, Akihiko Odaki > Subject: Re: [PATCH] mm/mremap: allow VMAs with > VM_DONTEXPAND|VM_PFNMAP when creating new mapping > > * David Hildenbrand (Red Hat) <david@kernel.org> [251121 03:13]: > > > > > > Another way round here might be to try to use userland to figure out > the file > > > that a mapping belongs to via e.g. /proc/$pid/map_files, determine the > > > attributes from /proc/$pid/maps and then generate a new mmap() call > with the > > > same properties. > > > > Right, if we could find some way to just obtain the file/fd we could just > > avoid mremap altogether. > > > > I recall that CRIU used something similar to obtain the fd of a > > MAP_SHARED|MAP_ANON mapping (anon shmem) VMA, but my memory > is a bit vague > > on that one. > > There is an ioctl query method to get information on VMAs since 6.11. > > Check Documentation/filesystems/proc.rst for this: > > Starting with 6.11 kernel, /proc/PID/maps provides an alternative > ioctl()-based API that gives ability to flexibly and efficiently query and > filter individual VMAs. This interface is binary and is meant for more > efficient and easy programmatic use. `struct procmap_query`, defined in > linux/fs.h UAPI header, serves as an input/output argument to the > `PROCMAP_QUERY` ioctl() command. See comments in linus/fs.h UAPI > header for > details on query semantics, supported flags, data returned, and general API > usage information. > > > ... maybe that should be linux/fs.h and not linus/fs.h. The actual file > seems to be include/uapi/linux/fs.h (grepping for procmap_query). > > Maybe you could make use of that ioctl to get what you want? > > If not, maybe look at adding what you need there would be more of the > change you want to make? Thank you Liam, Lorenzo, David for your suggestions and comments. We'll try other ways to address our use-case including exploring the idea of obtaining the underlying fd via VFIO APIs or using /proc/PID/maps. Thanks, Vivek > > Thanks, > Liam ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2025-11-22 6:56 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-11-20 5:35 [PATCH] mm/mremap: allow VMAs with VM_DONTEXPAND|VM_PFNMAP when creating new mapping Vivek Kasireddy 2025-11-20 9:04 ` Lorenzo Stoakes 2025-11-20 9:16 ` David Hildenbrand (Red Hat) 2025-11-20 9:35 ` Lorenzo Stoakes 2025-11-20 9:49 ` David Hildenbrand (Red Hat) 2025-11-20 9:58 ` Lorenzo Stoakes 2025-11-21 3:05 ` Akihiko Odaki 2025-11-21 8:03 ` Lorenzo Stoakes 2025-11-21 8:48 ` Akihiko Odaki 2025-11-21 9:10 ` Lorenzo Stoakes 2025-11-21 10:16 ` Akihiko Odaki 2025-11-21 10:52 ` Lorenzo Stoakes 2025-11-21 7:26 ` David Hildenbrand (Red Hat) 2025-11-21 6:51 ` Kasireddy, Vivek 2025-11-21 7:52 ` Lorenzo Stoakes 2025-11-21 8:13 ` David Hildenbrand (Red Hat) 2025-11-21 15:03 ` Liam R. Howlett 2025-11-22 6:56 ` Kasireddy, Vivek
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox