From: Barry Song <21cnbao@gmail.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Baoquan He <bhe@redhat.com>, Chris Li <chrisl@kernel.org>,
Nhat Pham <nphamcs@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Yosry Ahmed <yosry.ahmed@linux.dev>,
David Hildenbrand <david@redhat.com>,
Youngjun Park <youngjun.park@lge.com>,
Hugh Dickins <hughd@google.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
"Huang, Ying" <ying.huang@linux.alibaba.com>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 06/19] mm, swap: free the swap cache after folio is mapped
Date: Wed, 5 Nov 2025 03:52:25 +0800
Message-ID: <CAGsJ_4z47L4u50xdv9opC4QN6Jtdx6oR7xVx6WCGCtUa_CqMQQ@mail.gmail.com>
In-Reply-To: <CAMgjq7CV14FdHvgGtXwWQPqXKwy4W-5vfR=-hUYkBemeMP=Srw@mail.gmail.com>
On Tue, Nov 4, 2025 at 6:51 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Nov 4, 2025 at 5:15 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > To prevent repeated faults of parallel swapin of the same PTE, remove
> > > the folio from the swap cache after the folio is mapped. So any user
> > > faulting from the swap PTE should see the folio in the swap cache and
> > > wait on it.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > mm/memory.c | 21 +++++++++++----------
> > > 1 file changed, 11 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6c5cd86c4a66..589d6fc3d424 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > > struct folio *folio,
> > > struct vm_area_struct *vma,
> > > + unsigned int extra_refs,
> > > unsigned int fault_flags)
> > > {
> > > if (!folio_test_swapcache(folio))
> > > @@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > > * reference only in case it's likely that we'll be the exclusive user.
> > > */
> > > return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > > - folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > > + folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
> > > }
> > >
> > > static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> > > @@ -4935,15 +4936,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > */
> > > arch_swap_restore(folio_swap(entry, folio), folio);
> > >
> > > - /*
> > > - * Remove the swap entry and conditionally try to free up the swapcache.
> > > - * We're already holding a reference on the page but haven't mapped it
> > > - * yet.
> > > - */
> > > - swap_free_nr(entry, nr_pages);
> > > - if (should_try_to_free_swap(si, folio, vma, vmf->flags))
> > > - folio_free_swap(folio);
> > > -
> > > add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > > add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > > pte = mk_pte(page, vma->vm_page_prot);
> > > @@ -4997,6 +4989,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > arch_do_swap_page_nr(vma->vm_mm, vma, address,
> > > pte, pte, nr_pages);
> > >
> > > + /*
> > > + * Remove the swap entry and conditionally try to free up the
> > > + * swapcache. Do it after mapping so any raced page fault will
> > > + * see the folio in swap cache and wait for us.
> >
> > This seems like the right optimization: it reduces the race window where
> > we might allocate a folio, perform the read, and then attempt to map it,
> > only to find after taking the PTL that the PTE has already changed.
> >
> > Although I am not entirely sure that “any raced page fault will see the folio in
> > swapcache,” it seems there could still be cases where a fault occurs after
> > folio_free_swap(), and thus can’t see the swapcache entry.
> >
> > T1:
> > swap in PF, allocate and add swapcache, map PTE, delete swapcache
> >
> > T2:
> > swap in PF before PTE is changed;
> > ...........................................................;
> > check swapcache after T1 deletes swapcache -> no swapcache found.
>
> Right, that's true. But we will at most have one repeated fault, and
> the time window is much smaller. T2 will see PTE != orig_pte and then
> return just fine.
>
> So this patch only reduces the race time window for potentially
> better performance, and this race is basically harmless anyway. I
> think it's good enough.
Right. What I really disagree with is "Do it after mapping so any raced
page fault will see the folio in swap cache and wait for us". It sounds
like it guarantees no race at all, so I'd rather we change it to
something like "reduce the race window".
Thanks
Barry