Re: [PATCH] mm/swap: set active flag after adding the folio to activate fbatch
From: Lorenzo Stoakes @ 2025-04-14 11:47 UTC
To: Jinjiang Tu; +Cc: akpm, david, yangge1116, linux-mm, wangkefeng.wang
Hi,
Andrew - could we drop this from mm-new until this gets fixed? Thanks!
This patch is currently breaking mm-new, which now does not boot with
CONFIG_DEBUG_VM.
It seems like you're underflowing an LRU size counter when the folio batch is drained?
Stack trace below:
[ 0.773399] ------------[ cut here ]------------
[ 0.773400] mem_cgroup_update_lru_size(ffff88817e801440, 1, -1): lru_size -1
This is line 1321 in mm/memcontrol.c (I actually had to comment out the
VM_BUG_ON() as it otherwise mangles the output, fun! But if it isn't removed
the kernel obviously just halts):
void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
				int zid, int nr_pages)
{
	struct mem_cgroup_per_node *mz;
	unsigned long *lru_size;
	long size;

	if (mem_cgroup_disabled())
		return;

	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
	lru_size = &mz->lru_zone_size[zid][lru];

	if (nr_pages < 0)
		*lru_size += nr_pages;

	size = *lru_size;
	if (WARN_ONCE(size < 0,
		"%s(%p, %d, %d): lru_size %ld\n",
		__func__, lruvec, lru, nr_pages, size)) {
		VM_BUG_ON(1); <------------------------------------- here
		*lru_size = 0;
	}

	if (nr_pages > 0)
		*lru_size += nr_pages;
}
So, nr_pages = -1, size = -1.

So I guess the problem is that the mz->lru_zone_size[zid][lru] value is now
going negative on delete? So we're underflowing an LRU zone size counter?

This is called from the always-inlined update_lru_size(), which simply passes
nr_pages through...
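For reference, update_lru_size() is a thin wrapper; a rough sketch, paraphrasing
include/linux/mm_inline.h (the exact shape may differ in mm-new):

static __always_inline void update_lru_size(struct lruvec *lruvec,
				enum lru_list lru, enum zone_type zid,
				long nr_pages)
{
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);

	/* Per-node and per-zone counters... */
	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
	__mod_zone_page_state(&pgdat->node_zones[zid],
				NR_ZONE_LRU_BASE + lru, nr_pages);
#ifdef CONFIG_MEMCG
	/* ...and the memcg counter that fires the warning above. */
	mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
#endif
}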
[ 0.773415] WARNING: CPU: 2 PID: 92 at mm/memcontrol.c:1318 mem_cgroup_update_lru_size+0xa1/0xb0
[ 0.773423] Modules linked in:
[ 0.773425] CPU: 2 UID: 0 PID: 92 Comm: 9 Not tainted 6.15.0-rc1+ #8 PREEMPT(voluntary)
[ 0.773426] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 0.773427] RIP: 0010:mem_cgroup_update_lru_size+0xa1/0xb0
[ 0.773429] Code: 00 00 00 00 eb ae c6 05 6f 39 77 01 01 90 89 f1 48 89 fa 41 89 d8 48 c7 c6 80 a7 22 82 48 c7 c7 6a 66 7c 82 e8 c0 23 dc ff 90 <0f> 0b 90 90 eb c9 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90
[ 0.773430] RSP: 0018:ffffc900002aba80 EFLAGS: 00010096
[ 0.773431] RAX: 0000000000000040 RBX: ffffffffffffffff RCX: 0000000000000000
[ 0.773432] RDX: 0000000000000002 RSI: ffffc900002ab938 RDI: 00000000ffffdfff
[ 0.773432] RBP: ffff88817e801498 R08: 00000000ffffdfff R09: ffffffff82b18528
[ 0.773433] R10: 0000000000000003 R11: 0000000000000002 R12: 0000000000000002
[ 0.773433] R13: 00000000ffffffff R14: 0000000000000001 R15: 0000000000000001
[ 0.773435] FS: 00007f5e47c3e880(0000) GS:ffff8882e085a000(0000) knlGS:0000000000000000
[ 0.773437] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.773438] CR2: 00007f5e475fb2f0 CR3: 000000017e802000 CR4: 0000000000750ef0
[ 0.773438] PKRU: 55555554
[ 0.773439] Call Trace:
[ 0.773442] <TASK>
[ 0.773443] lru_activate+0x121/0x320
[ 0.773447] ? __pfx_lru_activate+0x10/0x10
This indicates that the move_fn() below is in fact lru_activate().

With the inlining I can't say precisely what ultimately calls
mem_cgroup_update_lru_size(), but given that the negative sign is the problem,
it seems almost certain to be lruvec_del_folio():
static __always_inline
void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
{
	...
	update_lru_size(lruvec, lru, folio_zonenum(folio),
			-folio_nr_pages(folio)); <-- negative.
}
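Note that in the elided part above, lru comes from folio_lru_list(folio), which
is computed purely from the folio's flags; roughly, paraphrasing
include/linux/mm_inline.h (exact shape may differ):

static __always_inline enum lru_list folio_lru_list(struct folio *folio)
{
	enum lru_list lru;

	VM_BUG_ON_FOLIO(folio_test_active(folio) &&
			folio_test_unevictable(folio), folio);

	if (folio_test_unevictable(folio))
		return LRU_UNEVICTABLE;

	/* Start from the inactive list for this folio type... */
	lru = folio_is_file_lru(folio) ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
	/* ...and shift to the active list if PG_active is set. */
	if (folio_test_active(folio))
		lru += LRU_ACTIVE;

	return lru;
}

So, speculating: if the active flag is now already set by the time the batch is
drained, lruvec_del_folio() accounts the delete against the active list the
folio was never added to, which would fit the lru == 1 (LRU_ACTIVE_ANON for a
shmem folio), nr_pages == -1 underflow above.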
lruvec_del_folio() is called in lru_activate():
static void lru_activate(struct lruvec *lruvec, struct folio *folio)
{
	long nr_pages = folio_nr_pages(folio);
	...
	lruvec_del_folio(lruvec, folio);
	lruvec_add_folio(lruvec, folio);
	...
}
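The add side derives the list the same way; roughly, again paraphrasing
include/linux/mm_inline.h:

static __always_inline
void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
{
	/* The list is picked from the folio's current flags. */
	enum lru_list lru = folio_lru_list(folio);

	if (lru_gen_add_folio(lruvec, folio, false))
		return;

	update_lru_size(lruvec, lru, folio_zonenum(folio),
			folio_nr_pages(folio));
	if (lru != LRU_UNEVICTABLE)
		list_add(&folio->lru, &lruvec->lists[lru]);
}

i.e. both del and add trust the current flags, so presumably setting PG_active
before the drain makes the delete mismatch the list the folio was originally
accounted on.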
[ 0.773448] folio_batch_move_lru+0x56/0xe0
This is:
static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
{
	...
	for (i = 0; i < folio_batch_count(fbatch); i++) {
		struct folio *folio = fbatch->folios[i];

		folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
		move_fn(lruvec, folio); <----------------------------------- here
		folio_set_lru(folio);
	}
	...
}
[ 0.773449] __folio_batch_release+0x2c/0x50
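This is, roughly (paraphrasing mm/swap.c; may differ slightly in mm-new):

void __folio_batch_release(struct folio_batch *fbatch)
{
	/* Drain the per-CPU LRU batches first... */
	if (!fbatch->percpu_pvec_drained) {
		lru_add_drain();
		fbatch->percpu_pvec_drained = true;
	}
	/* ...then drop the references held by this batch. */
	folios_put(fbatch);
}

i.e. the lru_add_drain() here is what drains the per-CPU activate batch and
invokes lru_activate() above.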
[ 0.773450] shmem_undo_range+0x297/0x650
[ 0.773454] shmem_evict_inode+0xed/0x250
[ 0.773455] ? atime_needs_update+0x9b/0x110
[ 0.773456] evict+0xf2/0x260
[ 0.773457] __dentry_kill+0x6c/0x180
[ 0.773458] dput+0xe6/0x1b0
[ 0.773460] __fput+0x12d/0x2a0
[ 0.773461] __x64_sys_close+0x38/0x80
[ 0.773463] do_syscall_64+0x9e/0x1a0
[ 0.773465] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 0.773466] RIP: 0033:0x7f5e4731b83b
[ 0.773467] Code: ff ff c3 0f 1f 40 00 48 8b 15 d1 a4 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 a1 a4 0d 00 f7 d8
[ 0.773468] RSP: 002b:00007ffdf7d82bc8 EFLAGS: 00000297 ORIG_RAX: 0000000000000003
[ 0.773469] RAX: ffffffffffffffda RBX: 00005601862b32a0 RCX: 00007f5e4731b83b
[ 0.773469] RDX: 00007f5e473f4ea0 RSI: 00005601862b3470 RDI: 000000000000002d
[ 0.773470] RBP: 00007ffdf7d82bf0 R08: 00005601862b3010 R09: 0000000000000007
[ 0.773470] R10: 00005601862b3480 R11: 0000000000000297 R12: 0000000000000000
[ 0.773470] R13: 00007f5e473f4ff0 R14: 00007ffdf7d82f60 R15: 00007ffdf7d82d60
[ 0.773471] </TASK>
[ 0.773471] ---[ end trace 0000000000000000 ]---
On Fri, Apr 11, 2025 at 04:28:57PM +0800, Jinjiang Tu wrote:
> We noticed a 12.3% performance regression in the LibMicro pwrite testcase
> due to commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
> adding to LRU batch").
>
> The testcase is executed as follows, and the file is a tmpfs file.
> pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE
>
> This testcase writes 1KB (only one page) to the tmpfs file and repeats
> this step many times. The flame graph shows the performance regression
> comes from folio_mark_accessed() and workingset_activation().
>
> folio_mark_accessed() is called for the same page many times. Before the
> commit, each call would add the page to the activate fbatch. When the
> fbatch was full, it was drained and the page was promoted to the active
> list; folio_mark_accessed() then did nothing on later calls.
>
> But after the commit, the folio's lru flag is cleared once it is added to
> the activate fbatch. From then on, folio_mark_accessed() never calls
> folio_activate() again because the page lacks the lru flag, so the fbatch
> never fills up and the folio is never marked active; every later
> folio_mark_accessed() call invokes workingset_activation(), leading to
> the performance regression.
>
> Besides, repeated workingset_age_nonresident() calls before the folio is
> drained from the activate fbatch lead to an unreasonable
> lruvec->nonresident_age.
>
> To fix it, set the active flag right after clearing the lru flag when
> adding the folio to the activate fbatch.
>
> Fixes: 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> ---
> mm/swap.c | 26 ++++++++++++++++++++++----
> 1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 77b2d5997873..f0de837988b4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>  	folios_put(fbatch);
>  }
>  
> +static void lru_activate(struct lruvec *lruvec, struct folio *folio);
> +
>  static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>  		struct folio *folio, move_fn_t move_fn,
>  		bool on_lru, bool disable_irq)
> @@ -184,6 +186,14 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>  	if (on_lru && !folio_test_clear_lru(folio))
>  		return;
>  
> +	if (move_fn == lru_activate) {
> +		if (folio_test_unevictable(folio)) {
> +			folio_set_lru(folio);
> +			return;
> +		}
> +		folio_set_active(folio);
> +	}
> +
>  	folio_get(folio);
>  
>  	if (disable_irq)
> @@ -299,12 +309,15 @@ static void lru_activate(struct lruvec *lruvec, struct folio *folio)
>  {
>  	long nr_pages = folio_nr_pages(folio);
>  
> -	if (folio_test_active(folio) || folio_test_unevictable(folio))
> -		return;
> -
> +	/*
> +	 * We check that the unevictable flag is not set and set the active
> +	 * flag after we clear the lru flag. The unevictable and active
> +	 * flags cannot be modified before we set the lru flag again.
> +	 */
> +	VM_WARN_ON_ONCE(!folio_test_active(folio));
> +	VM_WARN_ON_ONCE(folio_test_unevictable(folio));
>  
>  	lruvec_del_folio(lruvec, folio);
> -	folio_set_active(folio);
>  	lruvec_add_folio(lruvec, folio);
>  	trace_mm_lru_activate(folio);
>  
> @@ -341,6 +354,11 @@ void folio_activate(struct folio *folio)
>  	if (!folio_test_clear_lru(folio))
>  		return;
>  
> +	if (folio_test_unevictable(folio) || folio_test_active(folio)) {
> +		folio_set_lru(folio);
> +		return;
> +	}
> +	folio_set_active(folio);
>  	lruvec = folio_lruvec_lock_irq(folio);
>  	lru_activate(lruvec, folio);
>  	unlock_page_lruvec_irq(lruvec);
> --
> 2.43.0
>
>