From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Xu <peterx@redhat.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>, Rik van Riel <riel@surriel.com>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH 09/10] mm/mremap: permit mremap() move of multiple VMAs
Date: Thu, 10 Jul 2025 11:41:51 +0100	[thread overview]
Message-ID: <a8886e8d-b335-45fa-abaa-559c81247e89@lucifer.local>
In-Reply-To: <vn7djxc5cnadmqxsxtd7frgx346fnsvfvlzyh253dxnwn3pe7f@kqci2ing4coj>

On Wed, Jul 09, 2025 at 02:13:41PM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250707 01:28]:
> > Historically we've made it a uAPI requirement that mremap() may only
> > operate on a single VMA at a time.
> >
> > For instances where VMAs need to be resized, this makes sense, as it
> > becomes very difficult to determine what a user actually wants should they
> > indicate a desire to expand or shrink the size of multiple VMAs (truncate?
> > Adjust sizes individually? Some other strategy?).
> >
> > However, in instances where a user is moving VMAs, it is restrictive to
> > disallow this.
> >
> > This is especially the case with anonymous mappings, whose mergeability
> > after a move depends on whether the VMAs have been faulted (and thus had
> > an anon_vma assigned) and whether folio indices align with vma->vm_pgoff.
> >
> > Often this results in surprising behaviour: a moved region is faulted,
> > then moved back, and the user fails to observe a merge of otherwise
> > compatible, adjacent VMAs.
> >
> > This change allows such cases to work without the user having to be
> > cognizant of whether a prior mremap() move or other VMA operations have
> > resulted in VMA fragmentation.
> >
> > Having refactored mremap code to aggregate per-VMA and parameter checks, we
> > are now in a position to permit this kind of move.
> >
> > We do so by detecting if this is a move-only operation up-front, and then
> > utilising a separate code path via remap_move() rather than the ordinary
> > single-VMA path.
> >
> > There are two tasks that occur outside of the mmap write lock - userfaultfd
> > notification and population of unmapped regions of expanded VMAs should the
> > VMA be mlock()'d.
> >
> > The latter doesn't apply, as this is logic for a move only and thus no
> > expansion can take place. In the former case, we explicitly disallow
> > multi-VMA operations on uffd-armed VMAs.
> >
> > The mmap lock is never dropped in the move-only case; this only occurs on a
> > VMA shrink.
> >
> > We take care to handle cases where a VMA merge has occurred, by resetting
> > the VMA iterator in such instances.
> >
> > We needn't worry about self-merges, as in those cases we would, by
> > definition, not be spanning multiple VMAs. The overlapping range test is
> > performed on the whole range and so specifically disallows this.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  mm/mremap.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 99 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 28e776cddc08..2e6005e1d22c 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -69,6 +69,8 @@ struct vma_remap_struct {
> >  	enum mremap_type remap_type;	/* expand, shrink, etc. */
> >  	bool mmap_locked;		/* Is mm currently write-locked? */
> >  	unsigned long charged;		/* If VM_ACCOUNT, # pages to account. */
> > +	bool multi_vma;			/* Is >1 VMA being moved? */
> > +	bool vma_reset;			/* Was the VMA merged or an unmap performed? */
>
> The name doesn't read well in code.  vmi_reset or reset_iter might be
> better, but I don't really mind it like this.

Yeah it is a bit odd I agree.

>
> >  };
> >
> >  static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> > @@ -1111,6 +1113,7 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
> >
> >  	err = do_vmi_munmap(&vmi, mm, addr, len, vrm->uf_unmap, /* unlock= */false);
> >  	vrm->vma = NULL; /* Invalidated. */
> > +	vrm->vma_reset = true;
>
> I believe the munmap() operation leaves the vmi in the correct position
> to reuse, so this is caution that costs an extra walk of the tree.  I
> don't think it's critical to performance, but if it is we can look here.
> It would have to be passed through which might be a pain.

Yeah I think this means we _always_ reset the VMI as you mention below, unless
MREMAP_DONTUNMAP | MREMAP_FIXED is used.

It's right to invalidate vrm->vma here, as it points at the source VMA and is
now dangling.

I think the problem I was worried about here was a partial unmap causing a
split, and keep in mind we might be moving things backwards also.

But I don't think the _iterator_ should actually be invalidated by this, right?
We'd still be in the correct position.

So yeah, I'll drop this.
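
For anyone following along, a rough sketch of the two variants discussed
(hypothetical and untested, surrounding declarations omitted):

	/* Cautious variant, as in the patch: always re-walk the tree. */
	err = do_vmi_munmap(&vmi, mm, addr, len, uf_unmap, /* unlock= */false);
	vma_iter_reset(&vmi);	/* costs an extra maple tree walk */

	/* Trusting variant: rely on do_vmi_munmap() leaving the iterator
	 * at the gap it created, so the enclosing for_each_vma_range()
	 * walk can simply continue from there.
	 */
	err = do_vmi_munmap(&vmi, mm, addr, len, uf_unmap, /* unlock= */false);
	/* No reset: the next iterator step picks up after the hole. */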

>
> >  	if (err) {
> >  		/* OOM: unable to split vma, just get accounts right */
> >  		vm_acct_memory(len >> PAGE_SHIFT);
> > @@ -1181,6 +1184,7 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> >
> >  	new_vma = copy_vma(&vma, vrm->new_addr, vrm->new_len, new_pgoff,
> >  			   &pmc.need_rmap_locks);
> > +	vrm->vma_reset = vma != vrm->vma;
> >  	if (!new_vma) {
> >  		vrm_uncharge(vrm);
> >  		*new_vma_ptr = NULL;
> > @@ -1325,6 +1329,7 @@ static unsigned long shrink_vma(struct vma_remap_struct *vrm,
> >  	res = do_vmi_munmap(&vmi, mm, unmap_start, unmap_bytes,
> >  			    vrm->uf_unmap, drop_lock);
> >  	vrm->vma = NULL; /* Invalidated. */
> > +	vrm->vma_reset = true;
>
> Ditto here, lock depending..

To be clear, we won't ever drop the lock in a move path, only on shrink, which
is disallowed for a multi-VMA move (as is expand).

So probably this is overcautious and I'll drop it.
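
To illustrate from userspace (a runnable sketch of long-standing mremap()
semantics, nothing introduced by this series): an in-place shrink simply
unmaps the tail, and this is the one path that may drop the mmap lock:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* Shrink in place: the tail [p + psz, p + 4 * psz) is unmapped. */
		char *q = mremap(p, 4 * psz, psz, 0);

		printf("shrunk %p -> %p (%s)\n", (void *)p, (void *)q,
		       q == p ? "in place" : "moved");
		return 0;
	}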

>
> >  	if (res)
> >  		return res;
> >
> > @@ -1362,6 +1367,7 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm)
> >  		err = do_munmap(mm, vrm->new_addr, vrm->new_len,
> >  				vrm->uf_unmap_early);
> >  		vrm->vma = NULL; /* Invalidated. */
> > +		vrm->vma_reset = true;
>
> Pretty sure this one is needed, regardless of passing through (and
> updating this call).

Yes this one for sure.
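
It's also easy to see why from userspace: per the documented uAPI,
MREMAP_FIXED first unmaps whatever occupies the destination range, so any
cached lookup state can't be trusted afterwards. A runnable sketch of that
behaviour:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		char *src = mmap(NULL, psz, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		char *dst = mmap(NULL, psz, PROT_READ,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* Any existing mapping at dst is atomically unmapped first. */
		char *q = mremap(src, psz, psz,
				 MREMAP_MAYMOVE | MREMAP_FIXED, dst);

		printf("moved %p -> %p (dst was %p)\n",
		       (void *)src, (void *)q, (void *)dst);
		return 0;
	}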

>
> >  		if (err)
> >  			return err;
> >
> > @@ -1581,6 +1587,18 @@ static bool vrm_will_map_new(struct vma_remap_struct *vrm)
> >  	return false;
> >  }
> >
> > +/* Does this remap ONLY move mappings? */
> > +static bool vrm_move_only(struct vma_remap_struct *vrm)
> > +{
> > +	if (!vrm_implies_new_addr(vrm))
> > +		return false;
> > +
> > +	if (vrm->old_len != vrm->new_len)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
> >  {
> >  	struct mm_struct *mm = current->mm;
> > @@ -1644,10 +1662,29 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
> >  			(vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
> >  		return -EINVAL;
> >
> > -	/* We can't remap across vm area boundaries */
> > +	/*
> > +	 * We can't remap across the end of VMAs, as another VMA may be
> > +	 * adjacent:
> > +	 *
> > +	 *       addr   vma->vm_end
> > +	 *  |-----.----------|
> > +	 *  |     .          |
> > +	 *  |-----.----------|
> > +	 *        .<--------->xxx>
> > +	 *            old_len
> > +	 *
> > +	 * We also require that vma->vm_start <= addr < vma->vm_end.
> > +	 */
> >  	if (old_len > vma->vm_end - addr)
> >  		return -EFAULT;
> >
> > +	/*
> > +	 * We can't support moving multiple uffd-armed VMAs, as notification
> > +	 * requires the mmap lock to be dropped.
> > +	 */
> > +	if (vrm->multi_vma && userfaultfd_armed(vma))
> > +		return -EINVAL;
> > +
> >  	if (new_len <= old_len)
> >  		return 0;
> >
> > @@ -1744,6 +1781,57 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
> >  	return 0;
> >  }
> >
> > +static unsigned long remap_move(struct vma_remap_struct *vrm)
> > +{
> > +	struct vm_area_struct *vma;
> > +	unsigned long start = vrm->addr;
> > +	unsigned long end = vrm->addr + vrm->old_len;
> > +	unsigned long new_addr = vrm->new_addr;
> > +	unsigned long prev_addr = start;
> > +	VMA_ITERATOR(vmi, current->mm, start);
> > +
> > +	/*
> > +	 * When moving VMAs we allow for batched moves across multiple VMAs,
> > +	 * with all VMAs in the input range [addr, addr + old_len) being moved
> > +	 * (and split as necessary).
> > +	 */
> > +	for_each_vma_range(vmi, vma, end) {
> > +		unsigned long addr = max(vma->vm_start, start);
> > +		unsigned long len = min(end, vma->vm_end) - addr;
> > +		unsigned long offset = addr - start;
> > +		unsigned long res;
> > +
> > +		/* Merged with self, move on. */
> > +		if (vrm->multi_vma && prev_addr == addr)
> > +			continue;
> > +
> > +		vrm->vma = vma;
> > +		vrm->addr = addr;
> > +		vrm->new_addr = new_addr + offset;
> > +		vrm->old_len = vrm->new_len = len;
> > +
> > +		res = check_prep_vma(vrm);
> > +		if (!res)
> > +			res = mremap_to(vrm);
> > +		if (IS_ERR_VALUE(res))
> > +			return res;
> > +
> > +		/* mmap lock is only dropped on shrink. */
> > +		VM_WARN_ON_ONCE(!vrm->mmap_locked);
> > +		/* This is a move, no expand should occur. */
> > +		VM_WARN_ON_ONCE(vrm->populate_expand);
> > +
> > +		if (vrm->vma_reset) {
> > +			vma_iter_reset(&vmi);
> > +			vrm->vma_reset = false;
> > +		}
>
> What code path results in vma_reset == false here?

Yeah that's a good point, only MREMAP_DONTUNMAP | MREMAP_FIXED will fail to hit
it, so let's drop it for unmaps.

I will test this is all good too.
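
For reference, this is the kind of userspace sequence the series enables (a
runnable sketch; before this patch the mremap() call fails with EFAULT, since
mprotect() splits the region into three VMAs):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, 3 * psz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		char *dst = mmap(NULL, 3 * psz, PROT_NONE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* Split the region into three VMAs. */
		mprotect(p + psz, psz, PROT_READ);

		/* Move all three VMAs in a single call. */
		char *q = mremap(p, 3 * psz, 3 * psz,
				 MREMAP_MAYMOVE | MREMAP_FIXED, dst);

		if (q == MAP_FAILED)
			printf("mremap: %s\n", strerror(errno));
		else
			printf("moved 3 VMAs to %p\n", (void *)q);
		return 0;
	}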

>
> > +		vrm->multi_vma = true;
> > +		prev_addr = addr;
> > +	}
> > +
> > +	return new_addr;
> > +}
>
> The iterator use looks good.

Thanks!
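
To spell out the shape of the walk for anyone skimming (a condensed
restatement of the hunk above, not new code): each visited VMA is clamped to
the requested range, so head and tail VMAs are moved partially, being split as
necessary:

	for_each_vma_range(vmi, vma, end) {
		/* Clamp to the requested range; partial head/tail VMAs
		 * are moved only where they intersect [start, end).
		 */
		unsigned long addr = max(vma->vm_start, start);
		unsigned long len = min(end, vma->vm_end) - addr;

		/* Destination preserves the VMA's offset within the range. */
		unsigned long dst = new_addr + (addr - start);

		/* ... move [addr, addr + len) to dst ... */
	}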

>
> > +
> >  static unsigned long do_mremap(struct vma_remap_struct *vrm)
> >  {
> >  	struct mm_struct *mm = current->mm;
> > @@ -1761,13 +1849,17 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
> >  		return -EINTR;
> >  	vrm->mmap_locked = true;
> >
> > -	vrm->vma = vma_lookup(current->mm, vrm->addr);
> > -	res = check_prep_vma(vrm);
> > -	if (res)
> > -		goto out;
> > +	if (vrm_move_only(vrm)) {
> > +		res = remap_move(vrm);
> > +	} else {
> > +		vrm->vma = vma_lookup(current->mm, vrm->addr);
> > +		res = check_prep_vma(vrm);
> > +		if (res)
> > +			goto out;
> >
> > -	/* Actually execute mremap. */
> > -	res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> > +		/* Actually execute mremap. */
> > +		res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> > +	}
> >
> >  out:
> >  	failed = IS_ERR_VALUE(res);
> > --
> > 2.50.0
> >

