From: Kairui Song <ryncsn@gmail.com>
To: Barry Song <21cnbao@gmail.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Baoquan He <bhe@redhat.com>, Chris Li <chrisl@kernel.org>,
Nhat Pham <nphamcs@gmail.com>,
Yosry Ahmed <yosry.ahmed@linux.dev>,
David Hildenbrand <david@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Youngjun Park <youngjun.park@lge.com>,
Hugh Dickins <hughd@google.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Ying Huang <ying.huang@linux.alibaba.com>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
Date: Fri, 21 Nov 2025 10:41:33 +0800 [thread overview]
Message-ID: <CAMgjq7BYNwVSX+=KDxmdFpUZ4gnTDjA7LU_Fi=J6SFUV8pAcUg@mail.gmail.com> (raw)
In-Reply-To: <CAGsJ_4zSqQax-Ct-zz3uMZ_dD7vnvTjWZ_BsnXBnMVpVNNDYUA@mail.gmail.com>
On Fri, Nov 21, 2025 at 8:55 AM Barry Song <21cnbao@gmail.com> wrote:
>
> Hi Kairui,
>
> >
> > + /*
> > + * If a large folio already belongs to anon mapping, then we
> > + * can just go on and map it partially.
>
> this is right.
>
Hi Barry,
Thanks for the review.
> > + * If not, with the large swapin check above failing, the page table
> > + * have changed, so sub pages might got charged to the wrong cgroup,
> > + * or even should be shmem. So we have to free it and fallback.
> > + * Nothing should have touched it, both anon and shmem checks if a
> > + * large folio is fully appliable before use.
>
> I'm curious about one case:
>
> - Process 1: nr_pages are in swap.
> - Process 2: "nr_pages - m" pages are in swap (with m slots already
> unmapped).
>
> Sequence:
>
> 1. Process 1 swap-ins the page, allocates it, and adds it to the
> swapcache — but the rmap hasn’t been added yet.
Yes, whoever wants to use the folio will have to lock it first.
> 2. Process 2 swap-ins the same folio and finds it in the swapcache, but
> it’s not associated with anon_mapping yet.
If P2 found it in the swap cache, it will try to acquire the folio lock.
> What will process 2 do in this situation? Does it go to out_nomap? If so,
> what happens on the second swapin attempt? Will it keep retrying
> indefinitely until Process 1 completes the rmap installation?
P2 will wait on the folio lock, which I think is the right thing to
do. After P1 finishes the rmap installation, P2 wakes up and tries to
map the folio, no busy loop or repeated fault.
If P1 somehow failed to install the rmap (eg. a concurrent partial
unmap made part of the page table invalidated), it will remove the
folio from swap cache then unlock it (the code right below). P2 also
wakes up by then, and sees the invalid folio will then fallback to
order 0.
>
> > + *
> > + * This will be removed once we unify folio allocation in the swap cache
> > + * layer, where allocation of a folio stabilizes the swap entries.
> > + */
> > + if (!folio_test_anon(folio) && folio_test_large(folio) &&
> > + nr_pages != folio_nr_pages(folio)) {
> > + if (!WARN_ON_ONCE(folio_test_dirty(folio)))
> > + swap_cache_del_folio(folio);
> > + goto out_nomap;
> > + }
> > +
> > /*
> > * Check under PT lock (to protect against concurrent fork() sharing
> > * the swap entry concurrently) for certainly exclusive pages.
> > */
> > if (!folio_test_ksm(folio)) {
> > + /*
> > + * The can_swapin_thp check above ensures all PTE have
> > + * same exclusivenss, only check one PTE is fine.
>
> typos? exclusiveness ? Checking just one PTE is fine ?
Nice catch, indeed a typo, will fix it.
>
> > + */
> > exclusive = pte_swp_exclusive(vmf->orig_pte);
> > + if (exclusive)
> > + check_swap_exclusive(folio, entry, nr_pages);
> > if (folio != swapcache) {
> > /*
> > * We have a fresh page that is not exposed to the
> > @@ -4985,18 +4962,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > vmf->orig_pte = pte_advance_pfn(pte, page_idx);
> >
> > /* ksm created a completely new copy */
> > - if (unlikely(folio != swapcache && swapcache)) {
> > + if (unlikely(folio != swapcache)) {
> > folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > folio_add_lru_vma(folio, vma);
> > } else if (!folio_test_anon(folio)) {
> > /*
> > - * We currently only expect small !anon folios which are either
> > - * fully exclusive or fully shared, or new allocated large
> > - * folios which are fully exclusive. If we ever get large
> > - * folios within swapcache here, we have to be careful.
> > + * We currently only expect !anon folios that are fully
> > + * mappable. See the comment after can_swapin_thp above.
> > */
> > - VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
> > - VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > + VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
> > + VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
>
> We have this guard to ensure that a large folio is always added to the
> rmap in one shot, since we only support partial rmap addition for folios
> that have already been mapped before.
>
> It now seems you rely on repeated page faults to ensure the partially
> mapped process runs after the fully mapped one, which doesn’t look ideal
> to me as it may cause priority inversion.
We are mostly relying on folio lock now, I think there is no priority
inversion issue from this part.
There might be priority inversion in the swap cache layer (so we have
a schedule_timeout_uninterruptible workaround in
__swap_cache_prepare_and_add now), that issue is resolved in the
follow up commit "mm, swap: use swap cache as the swap in synchronize
layer". Sorry I can't make it all happen in one place because the
SYNC_IO swapin path has to be removed first for that to happen. That
workaround is not a critical issue besides being ugly so I think
that's fine for intermediate steps.
next prev parent reply other threads:[~2025-11-21 2:42 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-16 18:11 [PATCH v2 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags Kairui Song
2025-11-16 18:11 ` [PATCH v2 01/19] mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio Kairui Song
2025-11-17 4:36 ` Nhat Pham
2025-11-17 8:11 ` Barry Song
2025-11-22 8:39 ` Chris Li
2025-11-16 18:11 ` [PATCH v2 02/19] mm, swap: split swap cache preparation loop into a standalone helper Kairui Song
2025-11-17 8:27 ` Barry Song
2025-11-17 10:04 ` Kairui Song
2025-11-22 8:50 ` Chris Li
2025-11-16 18:11 ` [PATCH v2 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO Kairui Song
2025-11-21 0:55 ` Barry Song
2025-11-21 2:41 ` Kairui Song [this message]
2025-11-21 4:56 ` Barry Song
2025-11-21 5:38 ` Kairui Song
2025-11-16 18:11 ` [PATCH v2 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices Kairui Song
2025-11-16 18:11 ` [PATCH v2 05/19] mm, swap: simplify the code and reduce indention Kairui Song
2025-11-16 18:11 ` [PATCH v2 06/19] mm, swap: free the swap cache after folio is mapped Kairui Song
2025-11-16 18:11 ` [PATCH v2 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
2025-11-16 18:11 ` [PATCH v2 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
2025-11-16 18:11 ` [PATCH v2 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out Kairui Song
2025-11-16 18:11 ` [PATCH v2 10/19] mm, swap: consolidate cluster reclaim and check logic Kairui Song
2025-11-20 6:47 ` YoungJun Park
2025-11-20 15:32 ` Kairui Song
2025-11-21 6:58 ` YoungJun Park
2025-11-20 7:11 ` YoungJun Park
2025-11-20 7:13 ` YoungJun Park
2025-11-20 15:24 ` Kairui Song
2025-11-16 18:11 ` [PATCH v2 11/19] mm, swap: split locked entry duplicating into a standalone helper Kairui Song
2025-11-16 18:11 ` [PATCH v2 12/19] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
2025-11-16 18:11 ` [PATCH v2 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Kairui Song
2025-11-16 18:11 ` [PATCH v2 14/19] mm, swap: sanitize swap entry management workflow Kairui Song
2025-11-17 11:23 ` kernel test robot
2025-11-17 13:05 ` Kairui Song
2025-11-16 18:11 ` [PATCH v2 15/19] mm, swap: add folio to swap cache directly on allocation Kairui Song
2025-11-16 18:11 ` [PATCH v2 16/19] mm, swap: check swap table directly for checking cache Kairui Song
2025-11-16 18:11 ` [PATCH v2 17/19] mm, swap: clean up and improve swap entries freeing Kairui Song
2025-11-16 18:11 ` [PATCH v2 18/19] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
2025-11-17 11:01 ` kernel test robot
2025-11-17 11:22 ` kernel test robot
2025-11-17 13:30 ` Kairui Song
2025-11-16 18:12 ` [PATCH v2 19/19] mm, swap: remove no longer needed _swap_info_get Kairui Song
2025-11-17 3:21 ` [PATCH v2 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags Kairui Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAMgjq7BYNwVSX+=KDxmdFpUZ4gnTDjA7LU_Fi=J6SFUV8pAcUg@mail.gmail.com' \
--to=ryncsn@gmail.com \
--cc=21cnbao@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=bhe@redhat.com \
--cc=chrisl@kernel.org \
--cc=david@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=nphamcs@gmail.com \
--cc=shikemeng@huaweicloud.com \
--cc=willy@infradead.org \
--cc=ying.huang@linux.alibaba.com \
--cc=yosry.ahmed@linux.dev \
--cc=youngjun.park@lge.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox