From: YoungJun Park <youngjun.park@lge.com>
To: kasong@tencent.com
Cc: linux-mm@kvack.org, Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
Barry Song <baohua@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Alexandre Ghiti <alex@ghiti.fr>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>, Hugh Dickins <hughd@google.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Chuanhua Han <hanchuanhua@oppo.com>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: Re: [PATCH RFC 0/2] mm, swap: fix swapin race that causes inaccurate memcg accounting
Date: Mon, 13 Apr 2026 16:33:25 +0900 [thread overview]
Message-ID: <adycRemx6QmSOX8n@yjaykim-PowerEdge-T330> (raw)
In-Reply-To: <20260407-swap-memcg-fix-v1-0-a473ce2e5bb8@tencent.com>
On Tue, Apr 07, 2026 at 10:55:41PM +0800, Kairui Song via B4 Relay wrote:
> While doing code inspection, I noticed there is a long-existing issue
> THP swapin may got charged into the wrong memcg since commit
> 242d12c981745 ("mm: support large folios swap-in for sync io devices").
> And a recent fix made it a bit worse.
>
> The error seem not serious. The worst that could happen is slightly
> inaccurate memcg accounting as the charge will go to an unexpected but
> somehow still relevant memcg. The chance is seems extremely low.
> This issue will be fixed (and found during the rebase of) swap table P4
> but may worth a separate fix. Sending as RFC first in case I'm missing
> anything, or I'm overlooking the result, or overthinking about it.
>
> And recent commit 9acbe135588e ("mm/swap: fix swap cache memcg
> accounting") extended this issue for ordinary swap too (see patch 1 in
> this series). The chance is still extremely low and doesn't seem to have
> a significant negative result.
>
> The problem occurs when swapin tries to allocate and charge a swapin
> folio without holding any lock or pinning the swap slot first. It's
> possible that the page table or mapping may change. Another thread may
> swap in and free these memory, the swap slots are also freed. Then if
> another mem cgroup faulted these memory again, thing get messy.
>
> Usually, this is still fine since the user of the charged folio -
> swapin, anon or shmem, will double check if the page table or mapping
> is still the same and abort if not. But the PTE or mapping entry could
> got swapped out again using the same swap entry. Now the page table or
> mapping does look the same. But the swapout is done after the resource
> is owned by another cgroup (e.g. by MADV & realloc), then, back to the
> initial caller the start the swapin and charged the folio, it can keep
> using the old charged folio, which means we chaged into a wrong cgroup.
>
> The problem is similar to what we fixed with commit 13ddaf26be324
> ("mm/swap: fix race when skipping swapcache"). There is no data
> corruption since IO is guarded by swap cache or the old HAS_CACHE
> bit in commit 242d12c981745 ("mm: support large folios swap-in for sync
> io devices").
>
> The chance should be extremely low, it requires multiple cgroups to hit
> a set of rare time windows in a row together, so far I haven't found a
> good way to reproduce it, but in theory it is possible, and at least
> looks risky:
>
> CPU0 (memcg0 runnig) | CPU1 (also memcg0 running)
> |
> do_swap_page() of entry X |
> <direct swapin path> |
> <alloc folio A, charge into memcg0> |
> ... interrupted ... |
> | do_swap_page() of same entry X
> | <finish swapin>
> | set_pte_at() - a folio installed
> | <frees the folio with MADV_FREE>
> | <migrate to another *memcg1*>
> | <fault and install another folio>
> | the folio belong to *memcg 1*
> | <swapout the using same entry X>
> ... continue ... | now entry X belongs to *memcg 1*
> pte_same() <- Check pass, PTE seems |
> unchanged, but now |
> belong to memcg1. |
> set_pte_at() <- folio A installed, |
> memcg0 is charged. |
>
> The folio got charged to memcg0, but it really should be charged to
> memcg1 as the PTE / folio is owned by memcg1 before the last swapout.
> Fortunately there is no leak, swap accounting will still be uncharging
> memcg1. memcg0 is not completely irrelevant as it's true that it is now
> memcg1 faulting this folio. Shmem may have similar issue.
>
> Patch 1 fixes this issue for order 0 / non-SYNCHRONOUS_IO swapin, Patch
> 2 fixes this issue for SYNCHRONOUS_IO swapin.
>
> If we consider this problem trivial, I suggest we fix it for order 0
> swapin first since that's a more common and recent issue since a recent
> commit.
>
> SYNCHRONOUS_IO fix seems also good, but it changes the current fallback
> logic. Instead of fallback to next order it will fallback to order 0
> directly. That should be fine though. This issue can be fixed / cleaned
> up in a better way with swap table P4 as demostrated previously by
> allocating the folio in swap cache directly with proper fallback and a
> more compat loop for error handling:
>
> https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-4-104795d19815@tencent.com/
Hello Kairui,
Nice catch!
I have reviewed the proposed patches, and LGTM :D
(For 1/2, flattening the if-statement depth slightly could help readability.
However, since this is planned to be refactored as part of the P4 swap table work,
I think it is fine as is.)
I mostly agree with your rationale.
> memcg0 is not completely irrelevant as it's true that it is now
> memcg1 faulting this folio. Shmem may have similar issue.
That said, I would like to leave one small comment.
My understanding is that if we account based on the folio that was
allocated while running in memcg0 (on CPU 0), then having
set_pte_at() install it with memcg0 already charged may still be
considered acceptable from a acceptable coarse-grained synchronization perspective.
(cuz folio is alloced at the time of "memcg 1 epoch")
Let's think of the situation below
CPU 0 (memcg0) CPU 1
--------------------------- -----------------------------
charge folio to memcg0
allocate / prepare folio
task migrates to memcg1
...
set_pte_at() installs PTE
(folio is already charged to memcg0)
In this flow, the charge follows the allocation context (memcg0),
even though the actual PTE installation happens after migration
to memcg1.
I understand that we cannot strictly guarantee correctness without
fully synchronized migration, so this region inherently has some
ambiguity. In that sense, the patch is addressing a corner of that
problem space.
But, I largely agree with your argument (the rationale is sound,
and the change is not intrusive).
I would have no further concerns if the following hold:
- There is a tangible benefit to modifying this patch.
- There is no meaningful behavioral difference between charging
earlier (current behavior) and charging later (proposed change),
(e.g especially when memcg limits are hit.)
If those assumptions are correct, I am fully on board.
Best Regards,
Youngjun Park
prev parent reply other threads:[~2026-04-13 7:33 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-07 14:55 Kairui Song via B4 Relay
2026-04-07 14:55 ` [PATCH RFC 1/2] mm, swap: fix potential race of charging into the wrong memcg Kairui Song via B4 Relay
2026-04-07 14:55 ` [PATCH RFC 2/2] mm, swap: fix race of charging into the wrong memcg for THP Kairui Song via B4 Relay
2026-04-13 7:33 ` YoungJun Park [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=adycRemx6QmSOX8n@yjaykim-PowerEdge-T330 \
--to=youngjun.park@lge.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=alex@ghiti.fr \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=bhe@redhat.com \
--cc=cgroups@vger.kernel.org \
--cc=chrisl@kernel.org \
--cc=david@kernel.org \
--cc=hanchuanhua@oppo.com \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=kasong@tencent.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=mhocko@kernel.org \
--cc=mhocko@suse.com \
--cc=muchun.song@linux.dev \
--cc=nphamcs@gmail.com \
--cc=roman.gushchin@linux.dev \
--cc=rppt@kernel.org \
--cc=shakeel.butt@linux.dev \
--cc=shikemeng@huaweicloud.com \
--cc=surenb@google.com \
--cc=vbabka@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox