From: Klara Modin <klarasmodin@gmail.com>
To: Kairui Song <kasong@tencent.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	 Matthew Wilcox <willy@infradead.org>,
	Hugh Dickins <hughd@google.com>, Chris Li <chrisl@kernel.org>,
	 David Hildenbrand <david@redhat.com>,
	Yosry Ahmed <yosryahmed@google.com>,
	 "Huang, Ying" <ying.huang@linux.alibaba.com>,
	Nhat Pham <nphamcs@gmail.com>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	 Baoquan He <bhe@redhat.com>, Barry Song <baohua@kernel.org>,
	 Kalesh Singh <kaleshsingh@google.com>,
	Kemeng Shi <shikemeng@huaweicloud.com>,
	 Tim Chen <tim.c.chen@linux.intel.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	 linux-kernel@vger.kernel.org
Subject: Re: [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check
Date: Thu, 15 May 2025 11:31:10 +0200	[thread overview]
Message-ID: <iqcrqzvoqt7dcmboh75u5g5cwgzaevlota3izpsfmaeotdnnzw@f7udsv2gghwa>
In-Reply-To: <20250514201729.48420-12-ryncsn@gmail.com>

Hi,

On 2025-05-15 04:17:11 +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Move all mTHP swap-in checks into can_swapin_thp and use it for both the
> pre-IO check and the post-IO check. This way the code is more
> consolidated and later commits become easier to maintain.

From what I can see, can_swapin_thp is gated behind
CONFIG_TRANSPARENT_HUGEPAGE and this fails to build when it's not
enabled.
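
For reference, this is roughly how the pieces end up arranged after the
patch (a simplified sketch based on the hunks quoted below, with the
function bodies elided, not the exact code):

	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
				   unsigned long addr, unsigned int nr_pages)
	{
		/* page table checks; only compiled when THP is enabled */
	}
	#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

	vm_fault_t do_swap_page(struct vm_fault *vmf)
	{
		/*
		 * Compiled unconditionally, so with THP disabled this call
		 * has no declaration or definition to resolve against.
		 */
		if (!can_swapin_thp(vmf, folio_ptep, folio_address, nr))
			goto check_folio;
	}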

> 
> Also clean up the comments while at it. The current comment on
> non_swapcache_batch is not correct: a swap-in that bypasses the swap
> cache won't reach the swap device as long as the entry is cached,
> because it still sets the SWAP_HAS_CACHE flag. If the folio is already
> in the swap cache, a racing swap-in will either fail with -EEXIST from
> swapcache_prepare or see the cached folio.
> 
> The real reason non_swapcache_batch is needed is that if a smaller
> folio is in the swap cache but not mapped, an mTHP swap-in would be
> blocked forever: it won't see the folio due to the index offset, nor
> can it set the SWAP_HAS_CACHE bit, so it has to fall back to an
> order-0 swap-in.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c | 90 ++++++++++++++++++++++++-----------------------------
>  1 file changed, 41 insertions(+), 49 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index f2897d9059f2..1b6e192de6ec 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4319,12 +4319,6 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>  	pgoff_t offset = swp_offset(entry);
>  	int i;
>  
> -	/*
> -	 * While allocating a large folio and doing swap_read_folio, which is
> -	 * the case the being faulted pte doesn't have swapcache. We need to
> -	 * ensure all PTEs have no cache as well, otherwise, we might go to
> -	 * swap devices while the content is in swapcache.
> -	 */
>  	for (i = 0; i < max_nr; i++) {
>  		if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
>  			return i;

> @@ -4334,34 +4328,30 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>  }
>  
>  /*
> - * Check if the PTEs within a range are contiguous swap entries
> - * and have consistent swapcache, zeromap.
> + * Check if the page table is still suitable for large folio swap in.
> + * @vmf: The fault triggering the swap-in.
> + * @ptep: Pointer to the PTE that should be the head of the swap in folio.
> + * @addr: The address corresponding to the PTE.
> + * @nr_pages: Number of pages of the folio that suppose to be swapped in.
>   */
> -static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> +			   unsigned long addr, unsigned int nr_pages)
>  {
> -	unsigned long addr;
> -	swp_entry_t entry;
> -	int idx;
> -	pte_t pte;
> -
> -	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> -	idx = (vmf->address - addr) / PAGE_SIZE;
> -	pte = ptep_get(ptep);
> +	pte_t pte = ptep_get(ptep);
> +	unsigned long addr_end = addr + (PAGE_SIZE * nr_pages);
> +	unsigned long pte_offset = (vmf->address - addr) / PAGE_SIZE;
>  
> -	if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
> +	VM_WARN_ON_ONCE(!IS_ALIGNED(addr, PAGE_SIZE) ||
> +			addr > vmf->address || addr_end <= vmf->address);
> +	if (unlikely(addr < max(addr & PMD_MASK, vmf->vma->vm_start) ||
> +		     addr_end > pmd_addr_end(addr, vmf->vma->vm_end)))
>  		return false;
> -	entry = pte_to_swp_entry(pte);
> -	if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
> -		return false;
> -
>  	/*
> -	 * swap_read_folio() can't handle the case a large folio is hybridly
> -	 * from different backends. And they are likely corner cases. Similar
> -	 * things might be added once zswap support large folios.
> +	 * All swap entries must from the same swap device, in same
> +	 * cgroup, with same exclusiveness, only differs in offset.
>  	 */
> -	if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
> -		return false;
> -	if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
> +	if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -pte_offset)) ||
> +	    swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
>  		return false;
>  
>  	return true;
> @@ -4441,13 +4431,24 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  	 * completely swap entries with contiguous swap offsets.
>  	 */
>  	order = highest_order(orders);
> -	while (orders) {
> -		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> -		if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
> -			break;
> -		order = next_order(&orders, order);
> +	for (; orders; order = next_order(&orders, order)) {
> +		unsigned long nr_pages = 1 << order;
> +		swp_entry_t swap_entry = { .val = ALIGN_DOWN(entry.val, nr_pages) };
> +		addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +		if (!can_swapin_thp(vmf, pte + pte_index(addr), addr, nr_pages))
> +			continue;
> +		/*
> +		 * If there is already a smaller folio in cache, it will
> +		 * conflict with the larger folio in the swap cache layer
> +		 * and block the swap in.
> +		 */
> +		if (unlikely(non_swapcache_batch(swap_entry, nr_pages) != nr_pages))
> +			continue;
> +		/* Zero map doesn't work with large folio yet. */
> +		if (unlikely(swap_zeromap_batch(swap_entry, nr_pages, NULL) != nr_pages))
> +			continue;
> +		break;
>  	}
> -
>  	pte_unmap_unlock(pte, ptl);
>  
>  	/* Try allocating the highest of the remaining orders. */
> @@ -4731,27 +4732,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	page_idx = 0;
>  	address = vmf->address;
>  	ptep = vmf->pte;
> +
>  	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> -		int nr = folio_nr_pages(folio);
> +		unsigned long nr = folio_nr_pages(folio);
>  		unsigned long idx = folio_page_idx(folio, page);
> -		unsigned long folio_start = address - idx * PAGE_SIZE;
> -		unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> -		pte_t *folio_ptep;
> -		pte_t folio_pte;
> +		unsigned long folio_address = address - idx * PAGE_SIZE;
> +		pte_t *folio_ptep = vmf->pte - idx;
>  
> -		if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
> -			goto check_folio;
> -		if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
> -			goto check_folio;
> -
> -		folio_ptep = vmf->pte - idx;
> -		folio_pte = ptep_get(folio_ptep);
> -		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> -		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)

> +		if (!can_swapin_thp(vmf, folio_ptep, folio_address, nr))
>  			goto check_folio;

At this point we're outside CONFIG_TRANSPARENT_HUGEPAGE.
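
One common way to keep this call site building with THP disabled would
be a stub next to the real definition, along these lines (just a sketch
of the usual #else pattern, untested; returning false keeps the
existing order-0 path):

	#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
	static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
					  unsigned long addr, unsigned int nr_pages)
	{
		/* mTHP swap-in is not possible without THP; take the order-0 path. */
		return false;
	}
	#endif

Alternatively, moving can_swapin_thp() out of the #ifdef block would
also work, provided the helpers it uses are available without THP.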

>  
>  		page_idx = idx;
> -		address = folio_start;
> +		address = folio_address;
>  		ptep = folio_ptep;
>  		nr_pages = nr;
>  		entry = folio->swap;
> -- 
> 2.49.0
> 

Regards,
Klara Modin


Thread overview: 56+ messages
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
2025-05-14 20:17 ` [PATCH 01/28] mm, swap: don't scan every fragment cluster Kairui Song
2025-05-14 20:17 ` [PATCH 02/28] mm, swap: consolidate the helper for mincore Kairui Song
2025-05-14 20:17 ` [PATCH 03/28] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
2025-05-14 20:17 ` [PATCH 04/28] mm, swap: split readahead update out of swap cache lookup Kairui Song
2025-05-14 20:17 ` [PATCH 05/28] mm, swap: sanitize swap cache lookup convention Kairui Song
2025-05-19  4:38   ` Barry Song
2025-05-20  3:31     ` Kairui Song
2025-05-20  4:41       ` Barry Song
2025-05-20 19:09         ` Kairui Song
2025-05-20 22:33           ` Barry Song
2025-05-21  2:45             ` Kairui Song
2025-05-21  3:24               ` Barry Song
2025-05-23  2:29               ` Barry Song
2025-05-23 20:01                 ` Kairui Song
2025-05-27  7:58                   ` Barry Song
2025-05-27 15:11                     ` Kairui Song
2025-05-30  8:49                       ` Kairui Song
2025-05-30 19:24                         ` Kairui Song
2025-05-14 20:17 ` [PATCH 06/28] mm, swap: rearrange swap cluster definition and helpers Kairui Song
2025-05-19  6:26   ` Barry Song
2025-05-20  3:50     ` Kairui Song
2025-05-14 20:17 ` [PATCH 07/28] mm, swap: tidy up swap device and cluster info helpers Kairui Song
2025-05-14 20:17 ` [PATCH 08/28] mm, swap: use swap table for the swap cache and switch API Kairui Song
2025-05-14 20:17 ` [PATCH 09/28] mm/swap: rename __read_swap_cache_async to __swapin_cache_alloc Kairui Song
2025-05-14 20:17 ` [PATCH 10/28] mm, swap: add a swap helper for bypassing only read ahead Kairui Song
2025-05-14 20:17 ` [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check Kairui Song
2025-05-15  9:31   ` Klara Modin [this message]
2025-05-15  9:39     ` Kairui Song
2025-05-19  7:08   ` Barry Song
2025-05-19 11:09     ` Kairui Song
2025-05-19 11:57       ` Barry Song
2025-05-14 20:17 ` [PATCH 12/28] mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
2025-05-14 20:17 ` [PATCH 13/28] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
2025-05-14 20:17 ` [PATCH 14/28] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
2025-05-14 20:17 ` [PATCH 15/28] mm, swap: split locked entry freeing into a standalone helper Kairui Song
2025-05-14 20:17 ` [PATCH 16/28] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
2025-05-14 20:17 ` [PATCH 17/28] mm, swap: sanitize swap entry management workflow Kairui Song
2025-05-14 20:17 ` [PATCH 18/28] mm, swap: rename and introduce folio_free_swap_cache Kairui Song
2025-05-14 20:17 ` [PATCH 19/28] mm, swap: clean up and improve swap entries batch freeing Kairui Song
2025-05-14 20:17 ` [PATCH 20/28] mm, swap: check swap table directly for checking cache Kairui Song
2025-06-19 10:38   ` Baoquan He
2025-06-19 10:50     ` Kairui Song
2025-06-20  8:04       ` Baoquan He
2025-05-14 20:17 ` [PATCH 21/28] mm, swap: add folio to swap cache directly on allocation Kairui Song
2025-05-14 20:17 ` [PATCH 22/28] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
2025-05-14 20:17 ` [PATCH 23/28] mm, swap: remove no longer needed _swap_info_get Kairui Song
2025-05-14 20:17 ` [PATCH 24/28] mm, swap: implement helpers for reserving data in swap table Kairui Song
2025-05-15  9:40   ` Klara Modin
2025-05-16  2:35     ` Kairui Song
2025-05-14 20:17 ` [PATCH 25/28] mm/workingset: leave highest 8 bits empty for anon shadow Kairui Song
2025-05-14 20:17 ` [PATCH 26/28] mm, swap: minor clean up for swapon Kairui Song
2025-05-14 20:17 ` [PATCH 27/28] mm, swap: use swap table to track swap count Kairui Song
2025-05-14 20:17 ` [PATCH 28/28] mm, swap: implement dynamic allocation of swap table Kairui Song
2025-05-21 18:36   ` Nhat Pham
2025-05-22  4:13     ` Kairui Song
