Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: David Hildenbrand <david@redhat.com>
To: Dev Jain <dev.jain@arm.com>, akpm@linux-foundation.org
Cc: Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com,
	vbabka@suse.cz, jannh@google.com, pfalcato@suse.de,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, ryan.roberts@arm.com, mingo@kernel.org,
	libang.li@antgroup.com, maobibo@loongson.cn,
	zhengqi.arch@bytedance.com, baohua@kernel.org,
	anshuman.khandual@arm.com, willy@infradead.org,
	ioworker0@gmail.com, yang@os.amperecomputing.com,
	baolin.wang@linux.alibaba.com, ziy@nvidia.com, hughd@google.com
Subject: Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
Date: Mon, 27 Oct 2025 22:40:49 +0100	[thread overview]
Message-ID: <726dcb51-82a7-49a7-a8e5-49bc3eb05dcf@redhat.com> (raw)
In-Reply-To: <20250610035043.75448-3-dev.jain@arm.com>

On 10.06.25 05:50, Dev Jain wrote:
> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.
> 
> For split folios, there will be no pte batching; nr_ptes will be 1. For
> pagetable splitting, the ptes will still point to the same large folio;
> for arm64, this results in the optimization described above, and for other
> arches (including the general case), a minor improvement is expected due to
> a reduction in the number of function calls.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>   mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
>   1 file changed, 32 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 180b12225368..18b215521ada 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>   	return pte;
>   }
>   
> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t pte, int max_nr)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +	struct folio *folio;
> +
> +	if (max_nr == 1)
> +		return 1;
> +
> +	folio = vm_normal_folio(vma, addr, pte);
> +	if (!folio || !folio_test_large(folio))
> +		return 1;
> +
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> +			       NULL, NULL);
> +}

Dev, I think there is another bug hiding in here. That function ignores
the writable bit, which is not what you need here, in particular for
anonymous folios in some cases.

Later set_ptes() could end up marking ptes writable that were not writable
before, which is bad (at least for anonymous folios, maybe also for pagecache
folios).

I think you really must respect the writable bit through something like
FPB_RESPECT_WRITE.

I patched out the "pte_batch_hint(ptep, pte) == 1" check we have upstream
to make it reproduce on x86_64, but the following reproducer should likely
reproduce on aarch64 without further kernel modifications.


# ./mremap
BUG: Memory modified


#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <stdbool.h>
#include <x86intrin.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <errno.h>
#include <signal.h>
#include <fcntl.h>
#include <sys/wait.h>

static size_t pagesize;
static size_t thpsize;
static int pagemap_fd;

static uint64_t pagemap_get_entry(int fd, char *start)
{
         const unsigned long pfn = (unsigned long)start / getpagesize();
         uint64_t entry;
         int ret;

         ret = pread(fd, &entry, sizeof(entry), pfn * sizeof(entry));
         if (ret != sizeof(entry)) {
                 perror("reading pagemap failed");
                 exit(-1);
         }
         return entry;
}

static bool pagemap_is_populated(int fd, char *start)
{
         return pagemap_get_entry(fd, start) & ((1ULL << 62) | (1ULL << 63));
}

unsigned long pagemap_get_pfn(int fd, char *start)
{
         uint64_t entry = pagemap_get_entry(fd, start);

         /* If present (63th bit), PFN is at bit 0 -- 54. */
         if (entry & (1ULL << 63))
                 return entry & 0x007fffffffffffffull;
         return -1ul;
}

int main(void)
{
         char *mem, *mmap_mem, *tmp, *mremap_mem = MAP_FAILED;
         size_t size, mmap_size;
         int ret;

         pagesize = getpagesize();
         thpsize = 2 * 1024 * 1024ul;
         pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
         if (pagemap_fd < 0) {
                 perror("opening pagemap failed");
                 return -1;
         }

         /* For alignment purposes, we need twice the thp size. */
         mmap_size = 2 * thpsize;
         mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
         if (mmap_mem == MAP_FAILED) {
                 perror("mmap() failed");
                 return -1;
         }

         /* We need a THP-aligned memory area. */
         mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1));

         ret = madvise(mem, thpsize, MADV_HUGEPAGE);
         if (ret) {
                 perror("MADV_HUGEPAGE failed");
                 return -1;
         }

         /*
          * Try to populate a THP. Touch the first sub-page and test if we get
          * another sub-page populated automatically.
          */
         mem[0] = 0;
         if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
                 perror("Did not get a THP populated");
                 return -1;
         }

         /* Share only the first page of the THP. */
         if (madvise(mem, pagesize, MADV_DONTFORK)) {
                 perror("MADV_DONTFORK failed");
                 return -1;
         }

         ret = fork();
         if (ret < 0) {
                 perror("fork() failed");
                 return -1;
         } else if (!ret) {
                 while (true) {
                         char c = *((volatile char *)(mem + pagesize));

                         if (c) {
                                 fprintf(stderr, "BUG: Memory modified\n");
                                 exit(-2);
                         }
                 }
         }

         /* Merge VMAs again. */
         if (madvise(mem, pagesize, MADV_DOFORK)) {
                 perror("MADV_DONTFORK failed");
                 return -1;
         }

         /* Mremap multiple pages. */
         mremap_mem = mmap(NULL, 2 * pagesize, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
         if (mem == MAP_FAILED) {
                 perror("mmap() failed");
                 return -1;
         }
         tmp = mremap(mem, 2 * pagesize, 2 * pagesize, MREMAP_MAYMOVE | MREMAP_FIXED,
                      mremap_mem);
         if (tmp != mremap_mem) {
                 perror("mremap() failed");
                 return -1;
         }

         /* Write into both pages. The child should never see these updates. */
         memset(mremap_mem, 1, 2 * pagesize);

         pause();
}

-- 
Cheers

David / dhildenb

next prev parent reply	other threads:[~2025-10-27 21:41 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-10  3:50 [PATCH v4 0/2] Optimize mremap() for large folios Dev Jain
2025-06-10  3:50 ` [PATCH v4 1/2] mm: Call pointers to ptes as ptep Dev Jain
2025-06-11 13:23   ` David Hildenbrand
2025-06-11 13:25     ` Dev Jain
2025-06-11 13:29       ` Lorenzo Stoakes
2025-06-11 13:31         ` David Hildenbrand
2025-06-12 12:05   ` Pedro Falcato
2025-06-10  3:50 ` [PATCH v4 2/2] mm: Optimize mremap() by PTE batching Dev Jain
2025-06-10  7:03   ` Barry Song
2025-06-10  7:44     ` Dev Jain
2025-06-10  8:11       ` Barry Song
2025-06-16 21:27         ` Ryan Roberts
2025-06-10  8:37   ` Barry Song
2025-06-10 13:18   ` Lorenzo Stoakes
2025-06-11 14:00   ` David Hildenbrand
2025-06-13  4:24     ` Dev Jain
2025-06-17  8:02       ` David Hildenbrand
2025-06-13 12:32     ` Lorenzo Stoakes
2025-06-16 16:13       ` David Hildenbrand
2025-06-12 12:13   ` Pedro Falcato
2025-10-27 21:40   ` David Hildenbrand [this message]
2025-10-28  5:32     ` Dev Jain
2025-10-28  7:14       ` David Hildenbrand
2025-06-10 12:11 ` [PATCH v4 0/2] Optimize mremap() for large folios Lorenzo Stoakes
2025-06-10 12:33   ` Dev Jain

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=726dcb51-82a7-49a7-a8e5-49bc3eb05dcf@redhat.com \
    --to=david@redhat.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=dev.jain@arm.com \
    --cc=hughd@google.com \
    --cc=ioworker0@gmail.com \
    --cc=jannh@google.com \
    --cc=libang.li@antgroup.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=maobibo@loongson.cn \
    --cc=mingo@kernel.org \
    --cc=peterx@redhat.com \
    --cc=pfalcato@suse.de \
    --cc=ryan.roberts@arm.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=yang@os.amperecomputing.com \
    --cc=zhengqi.arch@bytedance.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox