* [PATCH] mm/swap_cgroup: fix kernel BUG in swap_cgroup_record
@ 2026-01-10  6:46 Deepanshu Kartikey
  2026-01-10 23:29 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Deepanshu Kartikey @ 2026-01-10  6:46 UTC (permalink / raw)
  To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm
  Cc: cgroups, linux-mm, Deepanshu Kartikey, syzbot+d97580a8cceb9b03c13e

When using MADV_PAGEOUT, pages can remain in swapcache with their swap
entries assigned. If MADV_PAGEOUT is called again on these pages, they
reuse the same swap entries, causing memcg1_swapout() to call
swap_cgroup_record() with an already-recorded entry.

The existing code assumes swap entries are always being recorded for the
first time (oldid == 0), triggering VM_BUG_ON when it encounters an
already-recorded entry:

  ------------[ cut here ]------------
  kernel BUG at mm/swap_cgroup.c:78!
  Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 0 UID: 0 PID: 6176 Comm: syz.0.30 Not tainted
  RIP: 0010:swap_cgroup_record+0x19c/0x1c0 mm/swap_cgroup.c:78
  Call Trace:
   memcg1_swapout+0x2fa/0x830 mm/memcontrol-v1.c:623
   __remove_mapping+0xac5/0xe30 mm/vmscan.c:773
   shrink_folio_list+0x2786/0x4f40 mm/vmscan.c:1528
   reclaim_folio_list+0xeb/0x4e0 mm/vmscan.c:2208
   reclaim_pages+0x454/0x520 mm/vmscan.c:2245
   madvise_cold_or_pageout_pte_range+0x19a0/0x1ce0 mm/madvise.c:563
   ...
   do_madvise+0x1bc/0x270 mm/madvise.c:2030
   __do_sys_madvise mm/madvise.c:2039

This bug occurs because pages in swapcache can be targeted by
MADV_PAGEOUT multiple times without being swapped in between. Each time,
the same swap entry is reused, but swap_cgroup_record() expects to only
record new, unused entries.

Fix this by checking if the swap entry already has the correct cgroup ID
recorded before attempting to record it. Use the existing
lookup_swap_cgroup_id() to read the current cgroup ID, and return early
from memcg1_swapout() if the entry is already correctly recorded. Only
call swap_cgroup_record() when the entry needs to be set or updated.

This approach avoids unnecessary atomic operations, reference count
manipulations, and statistics updates when the entry is already correct.

Link: https://syzkaller.appspot.com/bug?extid=d97580a8cceb9b03c13e
Reported-by: syzbot+d97580a8cceb9b03c13e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d97580a8cceb9b03c13e
Tested-by: syzbot+d97580a8cceb9b03c13e@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
---
 mm/memcontrol-v1.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 56d27baf93ab..982cfe5af225 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -614,6 +614,7 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
 {
 	struct mem_cgroup *memcg, *swap_memcg;
 	unsigned int nr_entries;
+	unsigned short oldid;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
@@ -630,6 +631,16 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
 	if (!memcg)
 		return;
 
+	/*
+	 * Check if this swap entry is already recorded. This can happen
+	 * when MADV_PAGEOUT is called multiple times on pages that remain
+	 * in swapcache, reusing the same swap entries.
+	 */
+	oldid = lookup_swap_cgroup_id(entry);
+	if (oldid == mem_cgroup_id(memcg))
+		return;
+	VM_WARN_ON_ONCE(oldid != 0);
+
 	/*
 	 * In case the memcg owning these pages has been offlined and doesn't
 	 * have an ID allocated to it anymore, charge the closest online
-- 
2.43.0
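
The guard added by the patch above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the table `model_owner`, its size `MODEL_NR_SLOTS`, and every `model_*` function are hypothetical stand-ins rather than kernel APIs, and the record function returns false where, according to the commit message, the kernel would hit VM_BUG_ON.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_SLOTS 8	/* hypothetical table size for this model */

/* One owner ID per swap slot; 0 means "no cgroup recorded". */
static unsigned short model_owner[MODEL_NR_SLOTS];

/* Stand-in for lookup_swap_cgroup_id(): read the recorded owner. */
static unsigned short model_lookup(size_t slot)
{
	return model_owner[slot];
}

/*
 * Stand-in for swap_cgroup_record(): it expects an unowned slot.
 * Returning false models the case that triggers VM_BUG_ON in the
 * trace from the commit message.
 */
static bool model_record(size_t slot, unsigned short id)
{
	if (model_owner[slot] != 0)
		return false;
	model_owner[slot] = id;
	return true;
}

/* Caller without the guard: always tries to record. */
static bool model_swapout_unguarded(size_t slot, unsigned short id)
{
	return model_record(slot, id);
}

/* Caller with the guard: skip recording if the ID already matches. */
static bool model_swapout_guarded(size_t slot, unsigned short id)
{
	if (model_lookup(slot) == id)
		return true;	/* already recorded, nothing to do */
	return model_record(slot, id);
}
```

In this model, a second `model_swapout_unguarded()` call with the same slot and ID fails, while `model_swapout_guarded()` succeeds. The guard only covers the matching-ID case: a slot that still holds a different ID is rejected by `model_record()` either way.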




* Re: [PATCH] mm/swap_cgroup: fix kernel BUG in swap_cgroup_record
  2026-01-10  6:46 [PATCH] mm/swap_cgroup: fix kernel BUG in swap_cgroup_record Deepanshu Kartikey
@ 2026-01-10 23:29 ` Andrew Morton
  2026-01-12 13:57 ` Michal Hocko
  2026-01-12 15:27 ` Johannes Weiner
  2 siblings, 0 replies; 5+ messages in thread
From: Andrew Morton @ 2026-01-10 23:29 UTC (permalink / raw)
  To: Deepanshu Kartikey
  Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	cgroups, linux-mm, syzbot+d97580a8cceb9b03c13e

On Sat, 10 Jan 2026 12:16:13 +0530 Deepanshu Kartikey <kartikey406@gmail.com> wrote:

> When using MADV_PAGEOUT, pages can remain in swapcache with their swap
> entries assigned. If MADV_PAGEOUT is called again on these pages, they
> reuse the same swap entries, causing memcg1_swapout() to call
> swap_cgroup_record() with an already-recorded entry.
> 
> The existing code assumes swap entries are always being recorded for the
> first time (oldid == 0), triggering VM_BUG_ON when it encounters an
> already-recorded entry:
> 
>   ------------[ cut here ]------------
>   kernel BUG at mm/swap_cgroup.c:78!
>   Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
>   CPU: 0 UID: 0 PID: 6176 Comm: syz.0.30 Not tainted
>   RIP: 0010:swap_cgroup_record+0x19c/0x1c0 mm/swap_cgroup.c:78
>   Call Trace:
>    memcg1_swapout+0x2fa/0x830 mm/memcontrol-v1.c:623
>    __remove_mapping+0xac5/0xe30 mm/vmscan.c:773
>    shrink_folio_list+0x2786/0x4f40 mm/vmscan.c:1528
>    reclaim_folio_list+0xeb/0x4e0 mm/vmscan.c:2208
>    reclaim_pages+0x454/0x520 mm/vmscan.c:2245
>    madvise_cold_or_pageout_pte_range+0x19a0/0x1ce0 mm/madvise.c:563
>    ...
>    do_madvise+0x1bc/0x270 mm/madvise.c:2030
>    __do_sys_madvise mm/madvise.c:2039
> 
> This bug occurs because pages in swapcache can be targeted by
> MADV_PAGEOUT multiple times without being swapped in between. Each time,
> the same swap entry is reused, but swap_cgroup_record() expects to only
> record new, unused entries.
> 
> Fix this by checking if the swap entry already has the correct cgroup ID
> recorded before attempting to record it. Use the existing
> lookup_swap_cgroup_id() to read the current cgroup ID, and return early
> from memcg1_swapout() if the entry is already correctly recorded. Only
> call swap_cgroup_record() when the entry needs to be set or updated.
> 
> This approach avoids unnecessary atomic operations, reference count
> manipulations, and statistics updates when the entry is already correct.

Thanks.  This looks like a fairly old bug and it annoyingly predates a
lot of memcg code movement.

What do people think?  Should we backport this into -stable kernels? 
If so, can some intrepid soul please help figure out what it Fixes:?

Deepanshu, if we do decide to put a cc:stable on this then some -stable
maintainers will complain that the patch alters things in a file which
doesn't exist and they'll hope that you can help.  Which means
backporting the fix into kernels which predate 89ce924f0bd44.





* Re: [PATCH] mm/swap_cgroup: fix kernel BUG in swap_cgroup_record
  2026-01-10  6:46 [PATCH] mm/swap_cgroup: fix kernel BUG in swap_cgroup_record Deepanshu Kartikey
  2026-01-10 23:29 ` Andrew Morton
@ 2026-01-12 13:57 ` Michal Hocko
  2026-01-12 15:27 ` Johannes Weiner
  2 siblings, 0 replies; 5+ messages in thread
From: Michal Hocko @ 2026-01-12 13:57 UTC (permalink / raw)
  To: Deepanshu Kartikey
  Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, cgroups,
	linux-mm, syzbot+d97580a8cceb9b03c13e

On Sat 10-01-26 12:16:13, Deepanshu Kartikey wrote:
> When using MADV_PAGEOUT, pages can remain in swapcache with their swap
> entries assigned. If MADV_PAGEOUT is called again on these pages, they
> reuse the same swap entries, causing memcg1_swapout() to call
> swap_cgroup_record() with an already-recorded entry.
> 
> The existing code assumes swap entries are always being recorded for the
> first time (oldid == 0), triggering VM_BUG_ON when it encounters an
> already-recorded entry:
> 
>   ------------[ cut here ]------------
>   kernel BUG at mm/swap_cgroup.c:78!
>   Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
>   CPU: 0 UID: 0 PID: 6176 Comm: syz.0.30 Not tainted
>   RIP: 0010:swap_cgroup_record+0x19c/0x1c0 mm/swap_cgroup.c:78
>   Call Trace:
>    memcg1_swapout+0x2fa/0x830 mm/memcontrol-v1.c:623
>    __remove_mapping+0xac5/0xe30 mm/vmscan.c:773
>    shrink_folio_list+0x2786/0x4f40 mm/vmscan.c:1528
>    reclaim_folio_list+0xeb/0x4e0 mm/vmscan.c:2208
>    reclaim_pages+0x454/0x520 mm/vmscan.c:2245
>    madvise_cold_or_pageout_pte_range+0x19a0/0x1ce0 mm/madvise.c:563
>    ...
>    do_madvise+0x1bc/0x270 mm/madvise.c:2030
>    __do_sys_madvise mm/madvise.c:2039
> 
> This bug occurs because pages in swapcache can be targeted by
> MADV_PAGEOUT multiple times without being swapped in between. Each time,
> the same swap entry is reused, but swap_cgroup_record() expects to only
> record new, unused entries.

Shouldn't the madvise path avoid paging out swap cache pages instead? IIRC
this is what the normal reclaim path does.

> Fix this by checking if the swap entry already has the correct cgroup ID
> recorded before attempting to record it. Use the existing
> lookup_swap_cgroup_id() to read the current cgroup ID, and return early
> from memcg1_swapout() if the entry is already correctly recorded. Only
> call swap_cgroup_record() when the entry needs to be set or updated.
> 
> This approach avoids unnecessary atomic operations, reference count
> manipulations, and statistics updates when the entry is already correct.
> 
> Link: https://syzkaller.appspot.com/bug?extid=d97580a8cceb9b03c13e
> Reported-by: syzbot+d97580a8cceb9b03c13e@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=d97580a8cceb9b03c13e
> Tested-by: syzbot+d97580a8cceb9b03c13e@syzkaller.appspotmail.com
> Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>

I would use
Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/swap_cgroup: fix kernel BUG in swap_cgroup_record
  2026-01-10  6:46 [PATCH] mm/swap_cgroup: fix kernel BUG in swap_cgroup_record Deepanshu Kartikey
  2026-01-10 23:29 ` Andrew Morton
  2026-01-12 13:57 ` Michal Hocko
@ 2026-01-12 15:27 ` Johannes Weiner
  2026-01-12 16:16   ` Kairui Song
  2 siblings, 1 reply; 5+ messages in thread
From: Johannes Weiner @ 2026-01-12 15:27 UTC (permalink / raw)
  To: Deepanshu Kartikey
  Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm, cgroups,
	linux-mm, syzbot+d97580a8cceb9b03c13e, Kairui Song

On Sat, Jan 10, 2026 at 12:16:13PM +0530, Deepanshu Kartikey wrote:
> When using MADV_PAGEOUT, pages can remain in swapcache with their swap
> entries assigned. If MADV_PAGEOUT is called again on these pages,

This doesn't add up to me - maybe I'm missing something.

memcg1_swapout() is called at the very end of reclaim, from
__remove_mapping(), which *removes the folio from swapcache*. At this
point the folio is exclusive to *that* thread - there are no more
present ptes that another madvise could even be acting on.

How could we reach here twice for the same swap entry?

It seems more likely that we're missing a swapin notification, fail to
clear the swap entry from the cgroup records, and then trip up when
the entry is recycled to a totally different page down the line. No?



* Re: [PATCH] mm/swap_cgroup: fix kernel BUG in swap_cgroup_record
  2026-01-12 15:27 ` Johannes Weiner
@ 2026-01-12 16:16   ` Kairui Song
  0 siblings, 0 replies; 5+ messages in thread
From: Kairui Song @ 2026-01-12 16:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Deepanshu Kartikey, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, akpm, cgroups, linux-mm,
	syzbot+d97580a8cceb9b03c13e

On Mon, Jan 12, 2026 at 11:27 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Sat, Jan 10, 2026 at 12:16:13PM +0530, Deepanshu Kartikey wrote:
> > When using MADV_PAGEOUT, pages can remain in swapcache with their swap
> > entries assigned. If MADV_PAGEOUT is called again on these pages,
>
> This doesn't add up to me - maybe I'm missing something.
>
> memcg1_swapout() is called at the very end of reclaim, from
> __remove_mapping(), which *removes the folio from swapcache*. At this
> point the folio is exclusive to *that* thread - there are no more
> present ptes that another madvise could even be acting on.
>
> How could we reach here twice for the same swap entry?
>
> It seems more likely that we're missing a swapin notification, fail to
> clear the swap entry from the cgroup records, and then trip up when
> the entry is recycled to a totally different page down the line. No?

Thank you so much for Ccing me!

Deepanshu's patch is helpful, but that's not the root cause. I did see
the problem locally some time ago, but I completely forgot about this
one :(

I think the following fix should be good?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 453d654727c1..e8b5b8f514ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -758,8 +758,8 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,

                if (reclaimed && !mapping_exiting(mapping))
                        shadow = workingset_eviction(folio, target_memcg);
-               __swap_cache_del_folio(ci, folio, swap, shadow);
                memcg1_swapout(folio, swap);
+               __swap_cache_del_folio(ci, folio, swap, shadow);
                swap_cluster_unlock_irq(ci);
        } else {
                void (*free_folio)(struct folio *);

---

It's caused by https://lore.kernel.org/linux-mm/20251220-swap-table-p2-v5-12-8862a265a033@tencent.com/

Before that patch, if the folio's swap entries are already freed and
have a swap count of zero, memcg1_swapout records a stale value, but
the put_swap_folio below will clear the cgroup info as it frees the
folio's swap slots. After that commit, put_swap_folio is merged into
__swap_cache_del_folio, so the stale value will stay.
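
The ordering problem described above can be sketched as a userspace model. Everything here is an assumption made for illustration: `toy_owner`, `TOY_NR_SLOTS`, and the `toy_*` functions are hypothetical, and `toy_cache_del_and_free()` only models the case from the explanation where the swap count is already zero, so deleting the folio from the swap cache also frees the slot and clears its cgroup record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_SLOTS 4	/* hypothetical table size for this model */

/* One owner ID per swap slot; 0 means "no cgroup recorded". */
static unsigned short toy_owner[TOY_NR_SLOTS];

/* Record an owner; fails if the slot already carries one. */
static bool toy_record(size_t slot, unsigned short id)
{
	if (toy_owner[slot] != 0)
		return false;
	toy_owner[slot] = id;
	return true;
}

/*
 * Models swap cache deletion when the swap count is already zero:
 * the slot is freed and its cgroup record is cleared with it.
 */
static void toy_cache_del_and_free(size_t slot)
{
	toy_owner[slot] = 0;
}

/* Delete first, then record: the record outlives the freed slot. */
static void toy_reclaim_del_then_record(size_t slot, unsigned short id)
{
	toy_cache_del_and_free(slot);
	(void)toy_record(slot, id);	/* leaves a stale owner behind */
}

/* Record first, then delete: freeing the slot clears the record. */
static void toy_reclaim_record_then_del(size_t slot, unsigned short id)
{
	(void)toy_record(slot, id);
	toy_cache_del_and_free(slot);
}

/* The freed slot is later recycled for an unrelated folio. */
static bool toy_reuse_slot(size_t slot, unsigned short id)
{
	return toy_record(slot, id);
}
```

With the delete-then-record order, the slot keeps its old owner after it has been freed, so a later `toy_reuse_slot()` for a different owner fails. With the record-then-delete order, the slot is clean again and can be recycled.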

