From: Vlastimil Babka <vbabka@suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R . Howlett" <Liam.Howlett@oracle.com>,
Jann Horn <jannh@google.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Harry Yoo <harry.yoo@oracle.com>,
Yosry Ahmed <yosry.ahmed@linux.dev>
Subject: Re: [PATCH v2 6/7] mm/mremap: refactor move_page_tables(), abstracting state
Date: Mon, 10 Mar 2025 18:05:45 +0100
Message-ID: <40d1c30b-f437-4cf5-a37f-2e3084f03086@suse.cz>
In-Reply-To: <89f2446f70acd41172dadbb3404db9d95415c78c.1741256580.git.lorenzo.stoakes@oracle.com>

On 3/6/25 11:34, Lorenzo Stoakes wrote:
> A lot of state is threaded throughout the page table moving logic within
> the mremap code, including boolean values which control behaviour
> specifically in regard to whether rmap locks need be held over the
> operation and whether the VMA belongs to a temporary stack being moved by
> move_arg_pages() (and consequently, relocate_vma_down()).
>
> As we already transmit state throughout this operation, it is neater and
> more readable to maintain a small state object. We do so in the form of
> pagetable_move_control.
>
> In addition, this allows us to update parameters within the state as we
> manipulate things, for instance with regard to the page table realignment
> logic.
>
> In future I want to add additional functionality to the page table logic,
> so this is an additional motivation for making it easier to do so.
>
> This patch changes move_page_tables() to accept a pointer to a
> pagetable_move_control struct, and performs changes at this level only.
> Further page table logic will be updated in a subsequent patch.
>
> We additionally also take the opportunity to add significant comments
> describing the address realignment logic to make it abundantly clear what
> is going on in this code.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Some nits below.
> ---
> mm/internal.h | 46 ++++++++++++--
> mm/mmap.c | 5 +-
> mm/mremap.c | 172 ++++++++++++++++++++++++++++++++++++--------------
> 3 files changed, 171 insertions(+), 52 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 7a4f81a6edd6..a4608c85a3ba 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -24,6 +24,47 @@
>
> struct folio_batch;
>
> +/*
> + * Maintains state across a page table move. The operation assumes both source
> + * and destination VMAs already exist and are specified by the user.
> + *
> + * Partial moves are permitted, but:
> + * from_vma->vm_start <= from_addr < from_vma->vm_end - len
> + * to_vma->vm_start <= to_addr < to_vma->vm_end - len
Should this rather be expressed using the actual field names?
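E.g. something like this, as a rough sketch using the struct fields declared
below:

 * Partial moves are permitted, but:
 *	old->vm_start <= old_addr < old->vm_end - len_in
 *	new->vm_start <= new_addr < new->vm_end - len_in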
> + *
> + * Must be maintained.
> + *
> + * mmap lock must be held in write and VMA write locks must be held on any VMA
> + * that is visible.
> + *
> + * Use the PAGETABLE_MOVE() macro to initialise this struct.
> + *
> + * NOTE: The page table move is affected by reading from [old_addr, old_end),
> + * and old_addr may be updated for better page table alignment, so len_in
> + * represents the length of the range being copied as specified by the user.
> + */
> +struct pagetable_move_control {
> + struct vm_area_struct *old; /* Source VMA. */
> + struct vm_area_struct *new; /* Destination VMA. */
> + unsigned long old_addr; /* Address from which the move begins. */
> + unsigned long old_end; /* Exclusive address at which old range ends. */
> + unsigned long new_addr; /* Address to move page tables to. */
> + unsigned long len_in; /* Bytes to remap specified by user. */
> +
> + bool need_rmap_locks; /* Do rmap locks need to be taken? */
> + bool for_stack; /* Is this an early temp stack being moved? */
> +};
> +
> +#define PAGETABLE_MOVE(name, old_, new_, old_addr_, new_addr_, len_) \
> + struct pagetable_move_control name = { \
> + .old = old_, \
> + .new = new_, \
> + .old_addr = old_addr_, \
> + .old_end = (old_addr_) + (len_), \
> + .new_addr = new_addr_, \
> + .len_in = len_, \
> + }
> +
> /*
> * The set of flags that only affect watermark checking and reclaim
> * behaviour. This is used by the MM to obey the caller constraints
> @@ -1537,10 +1578,7 @@ extern struct list_lru shadow_nodes;
> } while (0)
>
> /* mremap.c */
> -unsigned long move_page_tables(struct vm_area_struct *vma,
> - unsigned long old_addr, struct vm_area_struct *new_vma,
> - unsigned long new_addr, unsigned long len,
> - bool need_rmap_locks, bool for_stack);
> +unsigned long move_page_tables(struct pagetable_move_control *pmc);
>
> #ifdef CONFIG_UNACCEPTED_MEMORY
> void accept_page(struct page *page);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 15d6cd7cc845..efcc4ca7500d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1694,6 +1694,7 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff);
> struct vm_area_struct *next;
> struct mmu_gather tlb;
> + PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length);
>
> BUG_ON(new_start > new_end);
>
> @@ -1716,8 +1717,8 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> * move the page tables downwards, on failure we rely on
> * process cleanup to remove whatever mess we made.
> */
> - if (length != move_page_tables(vma, old_start,
> - vma, new_start, length, false, true))
> + pmc.for_stack = true;
> + if (length != move_page_tables(&pmc))
> return -ENOMEM;
>
> tlb_gather_mmu(&tlb, mm);
> diff --git a/mm/mremap.c b/mm/mremap.c
> index df550780a450..a4b0124528fa 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -580,8 +580,9 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
> * the VMA that is created to span the source and destination of the move,
> * so we make an exception for it.
> */
> -static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
> - unsigned long mask, bool for_stack)
> +static bool can_align_down(struct pagetable_move_control *pmc,
> + struct vm_area_struct *vma, unsigned long addr_to_align,
> + unsigned long mask)
> {
> unsigned long addr_masked = addr_to_align & mask;
>
> @@ -590,11 +591,11 @@ static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_ali
> * of the corresponding VMA, we can't align down or we will destroy part
> * of the current mapping.
> */
> - if (!for_stack && vma->vm_start != addr_to_align)
> + if (!pmc->for_stack && vma->vm_start != addr_to_align)
> return false;
>
> /* In the stack case we explicitly permit in-VMA alignment. */
> - if (for_stack && addr_masked >= vma->vm_start)
> + if (pmc->for_stack && addr_masked >= vma->vm_start)
> return true;
>
> /*
> @@ -604,54 +605,131 @@ static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_ali
> return find_vma_intersection(vma->vm_mm, addr_masked, vma->vm_start) == NULL;
> }
>
> -/* Opportunistically realign to specified boundary for faster copy. */
> -static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,
> - unsigned long *new_addr, struct vm_area_struct *new_vma,
> - unsigned long mask, bool for_stack)
> +/*
> + * Determine if we are in fact able to realign for efficiency to a higher page
> + * table boundary.
> + */
> +static bool can_realign_addr(struct pagetable_move_control *pmc,
> + unsigned long pagetable_mask)
> {
> + unsigned long align_mask = ~pagetable_mask;
> + unsigned long old_align = pmc->old_addr & align_mask;
> + unsigned long new_align = pmc->new_addr & align_mask;
> + unsigned long pagetable_size = align_mask + 1;
> + unsigned long old_align_next = pagetable_size - old_align;
> +
> + /*
> + * We don't want to have to go hunting for VMAs from the end of the old
> + * VMA to the next page table boundary, also we want to make sure the
> + * operation is worthwhile.
> + *
> + * So ensure that we only perform this realignment if the end of the
> + * range being copied from is at least page table aligned:
AFAIU we need to at least reach (or cross) one page table boundary? Might be
just me, but I'd read that sentence as saying the end has to be aligned to a
page table boundary, and "at least" as meaning it can be naturally aligned to
a higher power-of-two value than pagetable_size (which doesn't matter).
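IOW, if I'm reading the check below correctly, it is equivalent to requiring

	/* the copied range reaches or crosses the next boundary above old_addr */
	old_addr + len_in >= (old_addr & pagetable_mask) + pagetable_size

so perhaps say "reaches (or crosses) a page table boundary" rather than "is at
least page table aligned"?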
> + *
> + * boundary boundary
> + * .<- old_align -> .
> + * . |----------------.-----------|
> + * . | vma . |
> + * . |----------------.-----------|
> + * . <----------------.----------->
> + * . len_in
> + * <------------------------------->
> + * . pagetable_size .
> + * . <---------------->
> + * . old_align_next .
> + */
> + if (pmc->len_in < old_align_next)
> + return false;
> +
> /* Skip if the addresses are already aligned. */
> - if ((*old_addr & ~mask) == 0)
> - return;
> + if (old_align == 0)
> + return false;
>
> /* Only realign if the new and old addresses are mutually aligned. */
> - if ((*old_addr & ~mask) != (*new_addr & ~mask))
> - return;
> + if (old_align != new_align)
> + return false;
>
> /* Ensure realignment doesn't cause overlap with existing mappings. */
> - if (!can_align_down(old_vma, *old_addr, mask, for_stack) ||
> - !can_align_down(new_vma, *new_addr, mask, for_stack))
> + if (!can_align_down(pmc, pmc->old, pmc->old_addr, pagetable_mask) ||
> + !can_align_down(pmc, pmc->new, pmc->new_addr, pagetable_mask))
> + return false;
> +
> + return true;
> +}
> +
> +/*
> + * Opportunistically realign to specified boundary for faster copy.
> + *
> + * Consider an mremap() of a VMA with page table boundaries as below, and no
> + * preceding VMAs from the lower page table boundary to the start of the VMA,
> + * with the end of the range being at least page table aligned:
Same.
> + *
> + * boundary boundary
> + * . |----------------.-----------|
> + * . | vma . |
> + * . |----------------.-----------|
> + * . pmc->old_addr . pmc->old_end
> + * . <---------------------------->
> + * . move these page tables
> + *
> + * If we proceed with moving page tables in this scenario, we will have a lot of
> + * work to do traversing old page tables and establishing new ones in the
> + * destination across multiple lower level page tables.
> + *
> + * The idea here is simply to align pmc->old_addr, pmc->new_addr down to the
> + * page table boundary, so we can simply copy a single page table entry for the
> + * aligned portion of the VMA instead:
> + *
> + * boundary boundary
> + * . |----------------.-----------|
> + * . | vma . |
> + * . |----------------.-----------|
> + * pmc->old_addr . pmc->old_end
> + * <------------------------------------------->
> + * . move these page tables
> + */
> +static void try_realign_addr(struct pagetable_move_control *pmc,
> + unsigned long pagetable_mask)
> +{
> +
> + if (!can_realign_addr(pmc, pagetable_mask))
> return;
>
> - *old_addr = *old_addr & mask;
> - *new_addr = *new_addr & mask;
> + /*
> + * Simply align to page table boundaries. Note that we do NOT update the
> + * pmc->old_end value, and since the move_page_tables() operation spans
As per this comment about not updating old_end...
> + * from [old_addr, old_end) (offsetting new_addr as it is performed),
> + * this simply changes the start of the copy, not the end.
> + */
> + pmc->old_addr &= pagetable_mask;
> + pmc->new_addr &= pagetable_mask;
... and we really don't touch it ...
> }
>
> -unsigned long move_page_tables(struct vm_area_struct *vma,
> - unsigned long old_addr, struct vm_area_struct *new_vma,
> - unsigned long new_addr, unsigned long len,
> - bool need_rmap_locks, bool for_stack)
> +unsigned long move_page_tables(struct pagetable_move_control *pmc)
> {
> unsigned long extent, old_end;
> struct mmu_notifier_range range;
> pmd_t *old_pmd, *new_pmd;
> pud_t *old_pud, *new_pud;
> + unsigned long old_addr, new_addr;
> + struct vm_area_struct *vma = pmc->old;
>
> - if (!len)
> + if (!pmc->len_in)
> return 0;
>
> - old_end = old_addr + len;
> -
> if (is_vm_hugetlb_page(vma))
> - return move_hugetlb_page_tables(vma, new_vma, old_addr,
> - new_addr, len);
> + return move_hugetlb_page_tables(pmc->old, pmc->new, pmc->old_addr,
> + pmc->new_addr, pmc->len_in);
>
> /*
> * If possible, realign addresses to PMD boundary for faster copy.
> * Only realign if the mremap copying hits a PMD boundary.
> */
> - if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
> - try_realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK,
> - for_stack);
> + try_realign_addr(pmc, PMD_MASK);
> + /* These may have been changed. */
> + old_addr = pmc->old_addr;
> + new_addr = pmc->new_addr;
> + old_end = pmc->old_end;
... this line seems unnecessary.
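AFAICS (untested) old_end could instead be initialized at its declaration:

	unsigned long extent, old_end = pmc->old_end;

since try_realign_addr() never modifies pmc->old_end.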
>
> flush_cache_range(vma, old_addr, old_end);
> mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm,