linux-mm.kvack.org archive mirror
From: Jan Kara <jack@suse.cz>
To: Usama Arif <usama.arif@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	david@kernel.org, willy@infradead.org, ryan.roberts@arm.com,
	linux-mm@kvack.org, r@hev.cc, jack@suse.cz, ajd@linux.ibm.com,
	apopple@nvidia.com, baohua@kernel.org,
	baolin.wang@linux.alibaba.com, brauner@kernel.org,
	catalin.marinas@arm.com, dev.jain@arm.com, kees@kernel.org,
	kevin.brodsky@arm.com, lance.yang@linux.dev,
	Liam.Howlett@oracle.com, linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Lorenzo Stoakes <ljs@kernel.org>,
	mhocko@suse.com, npache@redhat.com, pasha.tatashin@soleen.com,
	rmclure@linux.ibm.com, rppt@kernel.org, surenb@google.com,
	vbabka@kernel.org, Al Viro <viro@zeniv.linux.org.uk>,
	wilts.infradead.org@quack3.kvack.org, ziy@nvidia.com,
	hannes@cmpxchg.org, kas@kernel.org, shakeel.butt@linux.dev,
	leitao@debian.org, kernel-team@meta.com
Subject: Re: [PATCH v3 2/4] mm: use tiered folio allocation for VM_EXEC readahead
Date: Mon, 13 Apr 2026 13:03:06 +0200	[thread overview]
Message-ID: <aji7zs42th272khtxesk6dfcrgf7ddr5r5n62wgzeqooyexgxf@5ns3i47f5nlg> (raw)
In-Reply-To: <20260402181326.3107102-3-usama.arif@linux.dev>

On Thu 02-04-26 11:08:23, Usama Arif wrote:
> When executable pages are faulted via do_sync_mmap_readahead(), request
> a folio order that enables the best hardware TLB coalescing available:
> 
> - If the VMA is large enough to contain a full PMD, request
>   HPAGE_PMD_ORDER so the folio can be PMD-mapped. This benefits
>   architectures where PMD_SIZE is reasonable (e.g. 2M on x86-64
>   and arm64 with 4K pages). VM_EXEC VMAs are very unlikely to be
>   large enough for 512M pages on arm64 (64K base pages) to take effect.

I'm not sure that assuming PMD_SIZE is never too much for a VMA is a
great strategy. With 16k PAGE_SIZE the PMD would be 32MB, which would
fit within typical .text sizes but already looks like a bit too much?
Mapping with PMD-sized folios brings some benefits, but it also has
costs: parts of the VMA that would never be paged in get pulled into
memory, and LRU tracking now happens at this very large granularity,
making it fairly inefficient (big folios have a much higher chance of
being accessed similarly often, making LRU ordering mostly random). We
are already getting reports from people with small machines (phones
etc.) where the memory overhead of large folios in the page cache is
simply too much. So I'd have greater peace of mind if we capped folio
size at 2MB for now, until we come up with a more sophisticated
heuristic for picking a sensible folio order given the machine size.
Now I'm not really an MM person, so my feeling here may simply be
wrong, but I wanted to voice this concern from what I can see...
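To make the suggestion concrete, the cap could look something like the
sketch below: clamp whatever order the caller wants to at most 2MB worth
of pages, whatever PAGE_SIZE the config uses. cap_exec_order() and
ilog2_u() are made-up stand-ins for illustration, not kernel helpers.

```c
#include <assert.h>

#define SZ_2M	0x200000UL

/* Stand-in for the kernel's ilog2() on a nonzero value. */
static unsigned int ilog2_u(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Hypothetical cap: never let the requested readahead folio order
 * exceed 2MB worth of base pages for the given page shift.
 */
static unsigned int cap_exec_order(unsigned int order, unsigned int page_shift)
{
	unsigned int max_order = ilog2_u(SZ_2M >> page_shift);

	return order < max_order ? order : max_order;
}
```

With 16K pages (page shift 14), HPAGE_PMD_ORDER is 11 (32MB), which the
cap would reduce to order 7 (2MB); with 4K pages HPAGE_PMD_ORDER is 9
(2MB) and passes through unchanged.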

								Honza


> - Otherwise, fall back to exec_folio_order(), which returns the
>   minimum order for hardware PTE coalescing for arm64:
>   - arm64 4K:  order 4 (64K) for contpte (16 PTEs → 1 iTLB entry)
>   - arm64 16K: order 2 (64K) for HPA (4 pages → 1 TLB entry)
>   - arm64 64K: order 5 (2M) for contpte (32 PTEs → 1 iTLB entry)
>   - generic:   order 0 (no coalescing)
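For reference, the tiering listed above can be sketched in plain C,
parameterized by page shift so all three arm64 configs can be checked in
one place (the in-kernel version is a compile-time macro fixed by the
build's PAGE_SIZE; ilog2_u() stands in for the kernel's ilog2()):

```c
#include <assert.h>

#define SZ_64K	0x10000UL
#define SZ_2M	0x200000UL

/* Stand-in for the kernel's ilog2() on a nonzero value. */
static unsigned int ilog2_u(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

static unsigned int exec_folio_order(unsigned int page_shift)
{
	/* 64K base pages: contpte needs a 2M span (32 PTEs). */
	if ((1UL << page_shift) == SZ_64K)
		return ilog2_u(SZ_2M >> page_shift);
	/* 4K/16K base pages: 64K enables contpte/HPA coalescing. */
	return ilog2_u(SZ_64K >> page_shift);
}
```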
> 
> Update the arm64 exec_folio_order() to return ilog2(SZ_2M >>
> PAGE_SHIFT) on 64K page configurations, where the previous SZ_64K
> value collapsed to order 0 (a single page) and provided no coalescing
> benefit.
> 
> Use ~__GFP_RECLAIM so the allocation is opportunistic: if a large
> folio is readily available, use it, otherwise fall back to smaller
> folios without stalling on reclaim or compaction. The existing fallback
> in page_cache_ra_order() handles this naturally.
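The opportunistic part is just mask arithmetic: __GFP_RECLAIM is the
union of the direct-reclaim and kswapd-wakeup bits, so clearing it makes
a failed large-folio allocation return immediately rather than stall.
The bit values in the sketch below are illustrative only, not the
kernel's actual encodings (which vary by version):

```c
#include <assert.h>

typedef unsigned int gfp_t;

#define __GFP_KSWAPD_RECLAIM	((gfp_t)0x001u)	/* illustrative value */
#define __GFP_DIRECT_RECLAIM	((gfp_t)0x002u)	/* illustrative value */
#define __GFP_RECLAIM		(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

/* Drop both reclaim bits from a readahead mask, keep everything else. */
static gfp_t make_opportunistic(gfp_t gfp)
{
	return gfp & ~__GFP_RECLAIM;
}
```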
> 
> The readahead window is already clamped to the VMA boundaries, so
> ra->size naturally caps the folio order via ilog2(ra->size) in
> page_cache_ra_order().
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>  arch/arm64/include/asm/pgtable.h | 16 +++++++++----
>  mm/filemap.c                     | 40 +++++++++++++++++++++++---------
>  mm/internal.h                    |  3 ++-
>  mm/readahead.c                   |  7 +++---
>  4 files changed, 45 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 52bafe79c10a..9ce9f73a6f35 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1591,12 +1591,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>  #define arch_wants_old_prefaulted_pte	cpu_has_hw_af
>  
>  /*
> - * Request exec memory is read into pagecache in at least 64K folios. This size
> - * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
> - * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
> - * pages are in use.
> + * Request exec memory is read into pagecache in folios large enough for
> + * hardware TLB coalescing. On 4K and 16K page configs this is 64K, which
> + * enables contpte mapping (16 × 4K) and HPA coalescing (4 × 16K). On
> + * 64K page configs, contpte requires 2M (32 × 64K).
>   */
> -#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
> +#define exec_folio_order exec_folio_order
> +static inline unsigned int exec_folio_order(void)
> +{
> +	if (PAGE_SIZE == SZ_64K)
> +		return ilog2(SZ_2M >> PAGE_SHIFT);
> +	return ilog2(SZ_64K >> PAGE_SHIFT);
> +}
>  
>  static inline bool pud_sect_supported(void)
>  {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a4ea869b2ca1..7ffea986b3b4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3311,6 +3311,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
>  	struct file *fpin = NULL;
>  	vm_flags_t vm_flags = vmf->vma->vm_flags;
> +	gfp_t gfp = readahead_gfp_mask(mapping);
>  	bool force_thp_readahead = false;
>  	unsigned short mmap_miss;
>  
> @@ -3363,28 +3364,45 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  			ra->size *= 2;
>  		ra->async_size = HPAGE_PMD_NR;
>  		ra->order = HPAGE_PMD_ORDER;
> -		page_cache_ra_order(&ractl, ra);
> +		page_cache_ra_order(&ractl, ra, gfp);
>  		return fpin;
>  	}
>  
>  	if (vm_flags & VM_EXEC) {
>  		/*
> -		 * Allow arch to request a preferred minimum folio order for
> -		 * executable memory. This can often be beneficial to
> -		 * performance if (e.g.) arm64 can contpte-map the folio.
> -		 * Executable memory rarely benefits from readahead, due to its
> -		 * random access nature, so set async_size to 0.
> +		 * Request large folios for executable memory to enable
> +		 * hardware PTE coalescing and PMD mappings:
>  		 *
> -		 * Limit to the boundaries of the VMA to avoid reading in any
> -		 * pad that might exist between sections, which would be a waste
> -		 * of memory.
> +		 *  - If the VMA is large enough for a PMD, request
> +		 *    HPAGE_PMD_ORDER so the folio can be PMD-mapped.
> +		 *  - Otherwise, use exec_folio_order() which returns
> +		 *    the minimum order for hardware TLB coalescing
> +		 *    (e.g. arm64 contpte/HPA).
> +		 *
> +		 * Use ~__GFP_RECLAIM so large folio allocation is
> +		 * opportunistic — if memory isn't readily available,
> +		 * fall back to smaller folios rather than stalling on
> +		 * reclaim or compaction.
> +		 *
> +		 * Executable memory rarely benefits from speculative
> +		 * readahead due to its random access nature, so set
> +		 * async_size to 0.
> +		 *
> +		 * Limit to the boundaries of the VMA to avoid reading
> +		 * in any pad that might exist between sections, which
> +		 * would be a waste of memory.
>  		 */
> +		gfp &= ~__GFP_RECLAIM;
>  		struct vm_area_struct *vma = vmf->vma;
>  		unsigned long start = vma->vm_pgoff;
>  		unsigned long end = start + vma_pages(vma);
>  		unsigned long ra_end;
>  
> -		ra->order = exec_folio_order();
> +		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> +		    vma_pages(vma) >= HPAGE_PMD_NR)
> +			ra->order = HPAGE_PMD_ORDER;
> +		else
> +			ra->order = exec_folio_order();
>  		ra->start = round_down(vmf->pgoff, 1UL << ra->order);
>  		ra->start = max(ra->start, start);
>  		ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
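The start/end arithmetic in this hunk is easy to check in isolation.
Below is a sketch with plain-C stand-ins for the kernel's round_down()
and round_up() (power-of-two alignment assumed, as for folio orders):
e.g. with order 4 (16-page folios) a fault at pgoff 37 rounds down to
32, and a window end at pgoff 104 rounds up to 112.

```c
#include <assert.h>

/* round_down() stand-in: align must be a power of two. */
static unsigned long rdown(unsigned long x, unsigned long align)
{
	return x & ~(align - 1);
}

/* round_up() stand-in: align must be a power of two. */
static unsigned long rup(unsigned long x, unsigned long align)
{
	return (x + align - 1) & ~(align - 1);
}
```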
> @@ -3403,7 +3421,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  
>  	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>  	ractl._index = ra->start;
> -	page_cache_ra_order(&ractl, ra);
> +	page_cache_ra_order(&ractl, ra, gfp);
>  	return fpin;
>  }
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 475bd281a10d..e624cb619057 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -545,7 +545,8 @@ int zap_vma_for_reaping(struct vm_area_struct *vma);
>  int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
>  			   gfp_t gfp);
>  
> -void page_cache_ra_order(struct readahead_control *, struct file_ra_state *);
> +void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
> +			 gfp_t gfp);
>  void force_page_cache_ra(struct readahead_control *, unsigned long nr);
>  static inline void force_page_cache_readahead(struct address_space *mapping,
>  		struct file *file, pgoff_t index, unsigned long nr_to_read)
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 7b05082c89ea..b3dc08cf180c 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -465,7 +465,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
>  }
>  
>  void page_cache_ra_order(struct readahead_control *ractl,
> -		struct file_ra_state *ra)
> +		struct file_ra_state *ra, gfp_t gfp)
>  {
>  	struct address_space *mapping = ractl->mapping;
>  	pgoff_t start = readahead_index(ractl);
> @@ -475,7 +475,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  	pgoff_t mark = index + ra->size - ra->async_size;
>  	unsigned int nofs;
>  	int err = 0;
> -	gfp_t gfp = readahead_gfp_mask(mapping);
>  	unsigned int new_order = ra->order;
>  
>  	trace_page_cache_ra_order(mapping->host, start, ra);
> @@ -626,7 +625,7 @@ void page_cache_sync_ra(struct readahead_control *ractl,
>  readit:
>  	ra->order = 0;
>  	ractl->_index = ra->start;
> -	page_cache_ra_order(ractl, ra);
> +	page_cache_ra_order(ractl, ra, readahead_gfp_mask(ractl->mapping));
>  }
>  EXPORT_SYMBOL_GPL(page_cache_sync_ra);
>  
> @@ -697,7 +696,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>  		ra->size -= end - aligned_end;
>  	ra->async_size = ra->size;
>  	ractl->_index = ra->start;
> -	page_cache_ra_order(ractl, ra);
> +	page_cache_ra_order(ractl, ra, readahead_gfp_mask(ractl->mapping));
>  }
>  EXPORT_SYMBOL_GPL(page_cache_async_ra);
>  
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


Thread overview: 17+ messages
2026-04-02 18:08 [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
2026-04-02 18:08 ` [PATCH v3 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
2026-04-02 18:08 ` [PATCH v3 2/4] mm: use tiered folio allocation " Usama Arif
2026-04-13 11:03   ` Jan Kara [this message]
2026-04-13 11:48     ` Usama Arif
2026-04-02 18:08 ` [PATCH v3 3/4] elf: align ET_DYN base for PTE coalescing and PMD mapping Usama Arif
2026-04-02 18:08 ` [PATCH v3 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area Usama Arif
2026-04-10 11:03 ` [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
2026-04-10 11:55   ` Lorenzo Stoakes
2026-04-10 11:57     ` Lorenzo Stoakes
2026-04-10 12:19       ` Usama Arif
2026-04-10 12:24         ` Lorenzo Stoakes
2026-04-10 13:29           ` Vlastimil Babka (SUSE)
2026-04-10 13:50             ` Lorenzo Stoakes
2026-04-10 14:02           ` David Hildenbrand (Arm)
2026-04-10 12:05     ` Usama Arif
2026-04-10 12:13       ` Lorenzo Stoakes
