linux-mm.kvack.org archive mirror
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Jinjiang Tu <tujinjiang@huawei.com>
Cc: akpm@linux-foundation.org, david@redhat.com, yangge1116@126.com,
	linux-mm@kvack.org, wangkefeng.wang@huawei.com
Subject: Re: [PATCH] mm/swap: set active flag after adding the folio to activate fbatch
Date: Mon, 14 Apr 2025 12:47:04 +0100	[thread overview]
Message-ID: <aab71ca5-3f87-4e29-af66-be6eaf52c9a1@lucifer.local> (raw)
In-Reply-To: <20250411082857.2426539-1-tujinjiang@huawei.com>

Hi,

Andrew - could we drop this from mm-new until this gets fixed? Thanks!

This patch is currently breaking mm-new, which now does not boot with
CONFIG_DEBUG_VM.

It seems like you're underflowing an LRU size counter?

Stack trace below:

[    0.773399] ------------[ cut here ]------------
[    0.773400] mem_cgroup_update_lru_size(ffff88817e801440, 1, -1): lru_size -1

This is line 1321 in mm/memcontrol.c (I actually had to comment out the
VM_BUG_ON() as it otherwise mangles the output, fun! But if it is not removed,
the kernel obviously just halts):

void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
				int zid, int nr_pages)
{
	struct mem_cgroup_per_node *mz;
	unsigned long *lru_size;
	long size;

	if (mem_cgroup_disabled())
		return;

	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
	lru_size = &mz->lru_zone_size[zid][lru];

	if (nr_pages < 0)
		*lru_size += nr_pages;

	size = *lru_size;
	if (WARN_ONCE(size < 0,
		"%s(%p, %d, %d): lru_size %ld\n",
		__func__, lruvec, lru, nr_pages, size)) {
		VM_BUG_ON(1); <------------------------------------- here
		*lru_size = 0;
	}

	if (nr_pages > 0)
		*lru_size += nr_pages;
}

So, nr_pages = -1, size = -1.

So I guess the problem is that the mz->lru_zone_size[zid][lru] value goes
negative on delete? So we're underflowing the per-zone LRU size counter?

This is called from the always-inlined update_lru_size(), which simply passes
nr_pages through...

[    0.773415] WARNING: CPU: 2 PID: 92 at mm/memcontrol.c:1318 mem_cgroup_update_lru_size+0xa1/0xb0
[    0.773423] Modules linked in:
[    0.773425] CPU: 2 UID: 0 PID: 92 Comm: 9 Not tainted 6.15.0-rc1+ #8 PREEMPT(voluntary)
[    0.773426] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    0.773427] RIP: 0010:mem_cgroup_update_lru_size+0xa1/0xb0
[    0.773429] Code: 00 00 00 00 eb ae c6 05 6f 39 77 01 01 90 89 f1 48 89 fa 41 89 d8 48 c7 c6 80 a7 22 82 48 c7 c7 6a 66 7c 82 e8 c0 23 dc ff 90 <0f> 0b 90 90 eb c9 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90
[    0.773430] RSP: 0018:ffffc900002aba80 EFLAGS: 00010096
[    0.773431] RAX: 0000000000000040 RBX: ffffffffffffffff RCX: 0000000000000000
[    0.773432] RDX: 0000000000000002 RSI: ffffc900002ab938 RDI: 00000000ffffdfff
[    0.773432] RBP: ffff88817e801498 R08: 00000000ffffdfff R09: ffffffff82b18528
[    0.773433] R10: 0000000000000003 R11: 0000000000000002 R12: 0000000000000002
[    0.773433] R13: 00000000ffffffff R14: 0000000000000001 R15: 0000000000000001
[    0.773435] FS:  00007f5e47c3e880(0000) GS:ffff8882e085a000(0000) knlGS:0000000000000000
[    0.773437] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.773438] CR2: 00007f5e475fb2f0 CR3: 000000017e802000 CR4: 0000000000750ef0
[    0.773438] PKRU: 55555554
[    0.773439] Call Trace:
[    0.773442]  <TASK>
[    0.773443]  lru_activate+0x121/0x320
[    0.773447]  ? __pfx_lru_activate+0x10/0x10

This indicates that the move_fn() below is in fact lru_activate().

With inlining I can't say precisely what ultimately calls
mem_cgroup_update_lru_size(), but given that the negative value is the problem,
it seems almost certain to be lruvec_del_folio():

static __always_inline
void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
{
	...

	update_lru_size(lruvec, lru, folio_zonenum(folio),
			-folio_nr_pages(folio)); <-- negative.
}


This is called in lru_activate():

static void lru_activate(struct lruvec *lruvec, struct folio *folio)
{
	long nr_pages = folio_nr_pages(folio);

	...

	lruvec_del_folio(lruvec, folio);
	lruvec_add_folio(lruvec, folio);

	...
}

[    0.773448]  folio_batch_move_lru+0x56/0xe0

This is:

static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
{
	...

	for (i = 0; i < folio_batch_count(fbatch); i++) {
		struct folio *folio = fbatch->folios[i];

		folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
		move_fn(lruvec, folio); <----------------------------------- here

		folio_set_lru(folio);
	}

	...
}

[    0.773449]  __folio_batch_release+0x2c/0x50
[    0.773450]  shmem_undo_range+0x297/0x650
[    0.773454]  shmem_evict_inode+0xed/0x250
[    0.773455]  ? atime_needs_update+0x9b/0x110
[    0.773456]  evict+0xf2/0x260
[    0.773457]  __dentry_kill+0x6c/0x180
[    0.773458]  dput+0xe6/0x1b0
[    0.773460]  __fput+0x12d/0x2a0
[    0.773461]  __x64_sys_close+0x38/0x80
[    0.773463]  do_syscall_64+0x9e/0x1a0
[    0.773465]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[    0.773466] RIP: 0033:0x7f5e4731b83b
[    0.773467] Code: ff ff c3 0f 1f 40 00 48 8b 15 d1 a4 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 a1 a4 0d 00 f7 d8
[    0.773468] RSP: 002b:00007ffdf7d82bc8 EFLAGS: 00000297 ORIG_RAX: 0000000000000003
[    0.773469] RAX: ffffffffffffffda RBX: 00005601862b32a0 RCX: 00007f5e4731b83b
[    0.773469] RDX: 00007f5e473f4ea0 RSI: 00005601862b3470 RDI: 000000000000002d
[    0.773470] RBP: 00007ffdf7d82bf0 R08: 00005601862b3010 R09: 0000000000000007
[    0.773470] R10: 00005601862b3480 R11: 0000000000000297 R12: 0000000000000000
[    0.773470] R13: 00007f5e473f4ff0 R14: 00007ffdf7d82f60 R15: 00007ffdf7d82d60
[    0.773471]  </TASK>
[    0.773471] ---[ end trace 0000000000000000 ]---


On Fri, Apr 11, 2025 at 04:28:57PM +0800, Jinjiang Tu wrote:
> We noticed a 12.3% performance regression for the LibMicro pwrite testcase
> due to commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
> adding to LRU batch").
>
> The testcase is executed as follows, and the file is tmpfs file.
>   pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE
>
> This testcase writes 1KB (a single page) to the tmpfs file and repeats
> this step many times. The flame graph shows the performance regression
> comes from folio_mark_accessed() and workingset_activation().
>
> folio_mark_accessed() is called many times for the same page.
> Before the commit, each call would add the page to the activate fbatch. When
> the fbatch was full, it was drained and the page was promoted to the
> active list. After that, folio_mark_accessed() did nothing on later calls.
>
> But after the commit, the folio's LRU flag is cleared when it is added to
> the activate fbatch. After that, folio_mark_accessed() will never call
> folio_activate() again because the page lacks the LRU flag, so the fbatch
> never fills up and the folio is never marked active. Later
> folio_mark_accessed() calls then always invoke workingset_activation(),
> leading to the performance regression.
>
> Besides, repeated workingset_age_nonresident() calls before the folio is
> drained from the activate fbatch lead to an unreasonable
> lruvec->nonresident_age.
>
> To fix it, set the active flag after clearing the LRU flag when adding
> the folio to the activate fbatch.
>
> Fixes: 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> ---
>  mm/swap.c | 26 ++++++++++++++++++++++----
>  1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 77b2d5997873..f0de837988b4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>  	folios_put(fbatch);
>  }
>
> +static void lru_activate(struct lruvec *lruvec, struct folio *folio);
> +
>  static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>  		struct folio *folio, move_fn_t move_fn,
>  		bool on_lru, bool disable_irq)
> @@ -184,6 +186,14 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>  	if (on_lru && !folio_test_clear_lru(folio))
>  		return;
>
> +	if (move_fn == lru_activate) {
> +		if (folio_test_unevictable(folio)) {
> +			folio_set_lru(folio);
> +			return;
> +		}
> +		folio_set_active(folio);
> +	}
> +
>  	folio_get(folio);
>
>  	if (disable_irq)
> @@ -299,12 +309,15 @@ static void lru_activate(struct lruvec *lruvec, struct folio *folio)
>  {
>  	long nr_pages = folio_nr_pages(folio);
>
> -	if (folio_test_active(folio) || folio_test_unevictable(folio))
> -		return;
> -
> +	/*
> +	 * We check unevictable flag isn't set and set active flag
> +	 * after we clear lru flag. Unevictable and active flag
> +	 * couldn't be modified before we set lru flag again.
> +	 */
> +	VM_WARN_ON_ONCE(!folio_test_active(folio));
> +	VM_WARN_ON_ONCE(folio_test_unevictable(folio));
>
>  	lruvec_del_folio(lruvec, folio);
> -	folio_set_active(folio);
>  	lruvec_add_folio(lruvec, folio);
>  	trace_mm_lru_activate(folio);
>
> @@ -341,6 +354,11 @@ void folio_activate(struct folio *folio)
>  	if (!folio_test_clear_lru(folio))
>  		return;
>
> +	if (folio_test_unevictable(folio) || folio_test_active(folio)) {
> +		folio_set_lru(folio);
> +		return;
> +	}
> +	folio_set_active(folio);
>  	lruvec = folio_lruvec_lock_irq(folio);
>  	lru_activate(lruvec, folio);
>  	unlock_page_lruvec_irq(lruvec);
> --
> 2.43.0
>
>


Thread overview: 4+ messages
2025-04-11  8:28 Jinjiang Tu
2025-04-14 11:47 ` Lorenzo Stoakes [this message]
2025-04-14 21:59   ` Andrew Morton
2025-04-15 10:26   ` David Hildenbrand
