From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Dev Jain <dev.jain@arm.com>, akpm@linux-foundation.org
Cc: lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, ryan.roberts@arm.com,
anshuman.khandual@arm.com, kirill@shutemov.name,
willy@infradead.org
Subject: Re: [PATCH] mm: map maximum pages possible in finish_fault
Date: Wed, 11 Feb 2026 12:33:47 +0100
Message-ID: <3684c55a-6581-4731-b94a-19526f455a1e@kernel.org>
In-Reply-To: <20260206135648.38164-1-dev.jain@arm.com>
On 2/6/26 14:56, Dev Jain wrote:
> Suppose a 64K VMA maps a file/shmem-file at index 0, and the pagecache
> at index 0 contains an order-9 folio. If do_fault_around is able to
> find this folio, filemap_map_pages ultimately maps the first 64K of the
> folio into the pagetable, thus reducing the number of future page
> faults. If fault-around fails to satisfy the fault, or if it is a write
> fault, we use vma->vm_ops->fault to find/create the folio, followed by
> finish_fault to map the folio into the pagetable. On encountering a
> similar large folio crossing the VMA boundary, finish_fault currently
> falls back to mapping only a single page.
>
> Align finish_fault with filemap_map_pages, and map as many pages as
> possible, without crossing VMA/PMD/file boundaries.
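>
> As a rough illustration (hypothetical, 2M-aligned addresses; this just
> mirrors the clamping arithmetic in the patch below): for a 64K VMA
> mapping file offset 0 of a 2M folio, the PMD and folio boundaries lie
> beyond the VMA, so a fault anywhere in the VMA maps the whole VMA in
> one go:
>
> 	/* Sketch with hypothetical example values; not kernel code. */
> 	unsigned long vm_start    = 0x10000000UL;		/* 2M-aligned  */
> 	unsigned long vm_end      = vm_start + 0x10000UL;	/* 64K VMA     */
> 	unsigned long pmd_start   = vm_start & ~0x1fffffUL;	/* == vm_start */
> 	unsigned long folio_start = vm_start;	/* VMA maps file offset 0 */
> 	unsigned long folio_end   = folio_start + 0x200000UL;	/* 2M folio */
>
> 	/* start = max3(vm_start, pmd_start, folio_start) == vm_start */
> 	/* end   = min3(vm_end, pmd_start + 2M, folio_end) == vm_end  */
> 	unsigned long nr_pages = (vm_end - vm_start) >> 12;	/* 16 pages */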
>
> Commit 19773df031bc ("mm/fault: try to map the entire file folio in
> finish_fault()") argues that falling back to per-page faults only
> avoids the RSS accounting, not the RSS inflation itself, since the
> large folio has already been allocated. Combined with the improvements
> below, it makes sense to map as many pages as possible.
>
> We test the patch with the following userspace program. A 2M shmem VMA
> is created and faulted in, with the sysfs knob
> hugepages-2048kB/shmem_enabled set to "always", so that the pagecache
> is populated with a 2M folio. Then a 64K VMA is created, and we fault
> on each of its pages. We then do MADV_DONTNEED to zap the pagetable, so
> that we can fault again in the next iteration. We measure the
> accumulated time taken to fault the VMA.
>
> On arm64,
>
> without patch:
> Total time taken by inner loop: 4701721766 ns
>
> with patch:
> Total time taken by inner loop: 516043507 ns
>
> giving a 9x improvement.
>
> To rule out arm64 contpte interference (contpte actually worsens the
> execution time here: the extra mapped memory is never accessed, yet we
> still incur the overhead of painting the ptes with the cont bit), we
> can change the program to map a 32K VMA and fault on each of its 8
> pages. For this case:
>
> without patch:
> Total time taken by inner loop: 2081356415 ns
>
> with patch:
> Total time taken by inner loop: 408755218 ns
>
> still giving a ~5x improvement.
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <assert.h>
> #include <time.h>
> #include <errno.h>
> #include <string.h>
> #include <sys/syscall.h>
> #include <linux/memfd.h>
> #include <fcntl.h>
>
> #define PMDSIZE (1UL << 21)
> #define PAGESIZE (1UL << 12)
> #define MTHPSIZE (1UL << 16)
>
> #define ITERATIONS 1000000
>
> static int xmemfd_create(const char *name, unsigned int flags)
> {
> #ifdef SYS_memfd_create
> 	return syscall(SYS_memfd_create, name, flags);
> #else
> 	(void)name; (void)flags;
> 	errno = ENOSYS;
> 	return -1;
> #endif
> }
>
> static void die(const char *msg)
> {
> 	fprintf(stderr, "%s: %s (errno=%d)\n", msg, strerror(errno), errno);
> 	exit(1);
> }
>
> int main(void)
> {
> 	/* Create a shmem-backed "file" (anonymous, RAM/swap-backed) */
> 	int fd = xmemfd_create("willitscale-shmem", MFD_CLOEXEC);
> 	if (fd < 0)
> 		die("memfd_create");
>
> 	if (ftruncate(fd, PMDSIZE) != 0)
> 		die("ftruncate");
>
> 	char *c = mmap(NULL, PMDSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> 	if (c == MAP_FAILED)
> 		die("mmap PMDSIZE");
>
> 	assert(!((unsigned long)c & (PMDSIZE - 1)));
>
> 	/* allocate PMD shmem folio */
> 	c[0] = 0;
>
> 	char *ptr = mmap(NULL, MTHPSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> 	if (ptr == MAP_FAILED)
> 		die("mmap MTHPSIZE");
>
> 	assert(!((unsigned long)ptr & (MTHPSIZE - 1)));
>
> 	long total_time_ns = 0;
>
> 	for (int i = 0; i < ITERATIONS; ++i) {
> 		struct timespec start, end;
>
> 		if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
> 			die("clock_gettime start");
>
> 		for (int j = 0; j < 8; ++j) {
> 			ptr[j * PAGESIZE] = 3;
> 		}
>
> 		if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
> 			die("clock_gettime end");
>
> 		long elapsed_ns =
> 			(end.tv_sec - start.tv_sec) * 1000000000L +
> 			(end.tv_nsec - start.tv_nsec);
>
> 		total_time_ns += elapsed_ns;
>
> 		assert(madvise(ptr, MTHPSIZE, MADV_DONTNEED) == 0);
> 	}
>
> 	printf("Total time taken by inner loop: %ld ns\n", total_time_ns);
>
> 	munmap(ptr, MTHPSIZE);
> 	munmap(c, PMDSIZE);
> 	close(fd);
> 	return 0;
> }
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Based on mm-unstable (6873c4e2723d). mm-selftests pass.
>
> mm/memory.c | 72 ++++++++++++++++++++++++++++-------------------------
> 1 file changed, 38 insertions(+), 34 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 79ba525671c7..b3d951573076 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5563,11 +5563,14 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> !(vma->vm_flags & VM_SHARED);
> int type, nr_pages;
> - unsigned long addr;
> - bool needs_fallback = false;
> + unsigned long start_addr;
> + bool single_page_fallback = false;
> + bool try_pmd_mapping = true;
> + pgoff_t file_end;
> + struct address_space *mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
>
> fallback:
> - addr = vmf->address;
> + start_addr = vmf->address;
>
> /* Did we COW the page? */
> if (is_cow)
> @@ -5586,25 +5589,22 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> return ret;
> }
>
> - if (!needs_fallback && vma->vm_file) {
> - struct address_space *mapping = vma->vm_file->f_mapping;
> - pgoff_t file_end;
> -
> + if (!single_page_fallback && mapping) {
> file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
>
> /*
> - * Do not allow to map with PTEs beyond i_size and with PMD
> - * across i_size to preserve SIGBUS semantics.
> + * Do not allow to map with PMD across i_size to preserve
> + * SIGBUS semantics.
> *
> * Make an exception for shmem/tmpfs that for long time
> * intentionally mapped with PMDs across i_size.
> */
> - needs_fallback = !shmem_mapping(mapping) &&
> - file_end < folio_next_index(folio);
> + try_pmd_mapping = shmem_mapping(mapping) ||
> + file_end >= folio_next_index(folio);
> }
>
> if (pmd_none(*vmf->pmd)) {
> - if (!needs_fallback && folio_test_pmd_mappable(folio)) {
> + if (try_pmd_mapping && folio_test_pmd_mappable(folio)) {
> ret = do_set_pmd(vmf, folio, page);
> if (ret != VM_FAULT_FALLBACK)
> return ret;
> @@ -5619,49 +5619,53 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> nr_pages = folio_nr_pages(folio);
>
> /* Using per-page fault to maintain the uffd semantics */
> - if (unlikely(userfaultfd_armed(vma)) || unlikely(needs_fallback)) {
> + if (unlikely(userfaultfd_armed(vma)) || unlikely(single_page_fallback)) {
> nr_pages = 1;
> } else if (nr_pages > 1) {
> - pgoff_t idx = folio_page_idx(folio, page);
> - /* The page offset of vmf->address within the VMA. */
> - pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
> - /* The index of the entry in the pagetable for fault page. */
> - pgoff_t pte_off = pte_index(vmf->address);
> +
> + /* Ensure mapping stays within VMA and PMD boundaries */
> + unsigned long pmd_boundary_start = ALIGN_DOWN(vmf->address, PMD_SIZE);
> + unsigned long pmd_boundary_end = pmd_boundary_start + PMD_SIZE;
> + unsigned long va_of_folio_start = vmf->address - ((vmf->pgoff - folio->index) * PAGE_SIZE);
> + unsigned long va_of_folio_end = va_of_folio_start + nr_pages * PAGE_SIZE;
> + unsigned long end_addr;
> +
> + start_addr = max3(vma->vm_start, pmd_boundary_start, va_of_folio_start);
> + end_addr = min3(vma->vm_end, pmd_boundary_end, va_of_folio_end);
>
> /*
> - * Fallback to per-page fault in case the folio size in page
> - * cache beyond the VMA limits and PMD pagetable limits.
> + * Do not allow to map with PTEs across i_size to preserve
> + * SIGBUS semantics.
> + *
> + * Make an exception for shmem/tmpfs that for long time
> + * intentionally mapped with PMDs across i_size.
> */
> - if (unlikely(vma_off < idx ||
> - vma_off + (nr_pages - idx) > vma_pages(vma) ||
> - pte_off < idx ||
> - pte_off + (nr_pages - idx) > PTRS_PER_PTE)) {
> - nr_pages = 1;
> - } else {
> - /* Now we can set mappings for the whole large folio. */
> - addr = vmf->address - idx * PAGE_SIZE;
> - page = &folio->page;
> - }
> + if (mapping && !shmem_mapping(mapping))
> + end_addr = min(end_addr, va_of_folio_start + (file_end - folio->index) * PAGE_SIZE);
> +
> + nr_pages = (end_addr - start_addr) >> PAGE_SHIFT;
> + page = folio_page(folio, (start_addr - va_of_folio_start) >> PAGE_SHIFT);
> }
>
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - addr, &vmf->ptl);
> + start_addr, &vmf->ptl);
> if (!vmf->pte)
> return VM_FAULT_NOPAGE;
>
> /* Re-check under ptl */
> if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
> - update_mmu_tlb(vma, addr, vmf->pte);
> + update_mmu_tlb(vma, start_addr, vmf->pte);
> ret = VM_FAULT_NOPAGE;
> goto unlock;
> } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> - needs_fallback = true;
> + single_page_fallback = true;
> + try_pmd_mapping = false;
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> goto fallback;
> }
>
> folio_ref_add(folio, nr_pages - 1);
> - set_pte_range(vmf, folio, page, nr_pages, addr);
> + set_pte_range(vmf, folio, page, nr_pages, start_addr);
> type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> add_mm_counter(vma->vm_mm, type, nr_pages);
> ret = 0;
I can see something like that being reasonable, but I think we really
have to clean up this function first before we perform further
complicated changes.
I wonder if the "goto fallback" thing could be avoided by reworking
things. A bit nasty.
The whole "does the whole folio fit into this page table+vma" check
should probably get factored out into a helper that could later tell us
which page+nr_pages range could get mapped into the page table instead
of just telling us "1 page vs. the full thing".
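Something like the following is what I have in mind -- completely
untested sketch, all names made up; the pte_none part would be layered
on top (see below):

static unsigned int fault_folio_mappable_range(struct vm_fault *vmf,
		struct folio *folio, pgoff_t file_end,
		unsigned long *startp, struct page **pagep)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long pmd_start = ALIGN_DOWN(vmf->address, PMD_SIZE);
	unsigned long folio_start = vmf->address -
			(vmf->pgoff - folio->index) * PAGE_SIZE;
	unsigned long folio_end = folio_start +
			folio_nr_pages(folio) * PAGE_SIZE;
	unsigned long start, end;

	/* Clamp the folio range to the VMA and the PMD page table. */
	start = max3(vma->vm_start, pmd_start, folio_start);
	end = min3(vma->vm_end, pmd_start + PMD_SIZE, folio_end);

	/* No PTEs beyond i_size; the caller would exempt shmem/tmpfs. */
	if (file_end < folio_next_index(folio))
		end = min(end, folio_start +
			  (file_end - folio->index) * PAGE_SIZE);

	*startp = start;
	*pagep = folio_page(folio, (start - folio_start) >> PAGE_SHIFT);
	return (end - start) >> PAGE_SHIFT;
}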
There, I would also expect a lockless check for a suitable
pte_none range, similar to how we have it in other code.
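A minimal sketch of that part, assuming the usual ptep_get_lockless() /
pte_none() helpers (pte_range_none() does the same under the PTL):

static unsigned int pte_none_range_lockless(pte_t *pte, unsigned int max_nr)
{
	unsigned int nr;

	/* The result is only a hint and must be re-checked under the PTL. */
	for (nr = 0; nr < max_nr; nr++)
		if (!pte_none(ptep_get_lockless(pte + nr)))
			break;
	return nr;
}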
So I would expect the function flow to be something like:
1) Try mapping with PMD if possible.
2) If not, detect biggest folio range that can be mapped (taking VMA,
PMD-range and actual page table (pte_none) into account) and try to
map that. If there is already something at the faulting pte,
VM_FAULT_NOPAGE.
3) We could retry 2) or just give up and let the caller retry
(VM_FAULT_NOPAGE).
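In very compressed pseudo-kernel-C, reusing the made-up helper from
above and leaving out the refcount/counter updates and error handling,
that flow would be:

	/* 1) Try mapping with PMD if possible. */
	if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
		ret = do_set_pmd(vmf, folio, page);
		if (ret != VM_FAULT_FALLBACK)
			return ret;
	}

	/* 2) Detect the biggest folio range that can be mapped. */
	nr_pages = fault_folio_mappable_range(vmf, folio, file_end,
					      &start_addr, &page);
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, start_addr,
				       &vmf->ptl);
	if (!vmf->pte)
		return VM_FAULT_NOPAGE;
	if (!pte_range_none(vmf->pte, nr_pages)) {
		/* 3) Give up (or retry 2)) and let the caller retry. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return VM_FAULT_NOPAGE;
	}
	set_pte_range(vmf, folio, page, nr_pages, start_addr);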
--
Cheers,
David