[REGRESSION] Null pointer dereference while shrinking zswap

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [REGRESSION] Null pointer dereference while shrinking zswap
@ 2024-04-16 12:19 Christian Heusel
  2024-04-16 19:18 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Christian Heusel @ 2024-04-16 12:19 UTC (permalink / raw)
  To: Seth Jennings, Dan Streetman, Vitaly Wool, Andrew Morton,
	linux-mm, linux-kernel
  Cc: David Runge, Richard W.M. Jones, Mark W, regressions

[-- Attachment #1: Type: text/plain, Size: 3596 bytes --]

Hello everyone,

while rebuilding a few packages in Arch Linux we have recently come
across a regression in the linux kernel which was made visible by a test
failure in libguestfs[0], where the booted kernel showed a Call Trace
like the following one:

[  218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
[  218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
[  218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
[  218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
[  218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
[  218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
[  218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
[  218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
[  218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
[  218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
[  218.742376] FS:  00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
[  218.742569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
[  218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
[  218.743494] PKRU: 55555554^M
[  218.743593] Call Trace:^M
[  218.743733]  <TASK>^M
[  218.743847]  ? __die+0x23/0x70^M
[  218.743957]  ? page_fault_oops+0x171/0x4e0^M
[  218.744056]  ? free_unref_page+0xf6/0x180^M
[  218.744458]  ? exc_page_fault+0x7f/0x180^M
[  218.744551]  ? asm_exc_page_fault+0x26/0x30^M
[  218.744684]  ? memcg_page_state+0x9/0x30^M
[  218.744779]  zswap_shrinker_count+0x9d/0x110^M
[  218.744896]  do_shrink_slab+0x3a/0x360^M
[  218.744990]  shrink_slab+0xc7/0x3c0^M
[  218.745609]  drop_slab+0x85/0x140^M
[  218.745691]  drop_caches_sysctl_handler+0x7e/0xd0^M
[  218.745799]  proc_sys_call_handler+0x1c0/0x2e0^M
[  218.745912]  vfs_write+0x23d/0x400^M
[  218.745998]  ksys_write+0x6f/0xf0^M
[  218.746080]  do_syscall_64+0x64/0xe0^M
[  218.746169]  ? exit_to_user_mode_prepare+0x132/0x1f0^M
[  218.746873]  entry_SYSCALL_64_after_hwframe+0x6e/0x76^M

The regression is present in the mainline kernel and also was
independently reported to the redhat bugtracker[1].

I have bisected (see log[2]) the regression between v6.9-rc4 and v6.6
and have landed on the following results (removed unrelated test commit)
as remainders since some of the commits were not buildable for me:
- 7108cc3f765c ("mm: memcg: add per-memcg zswap writeback stat")
- a65b0e7607cc ("zswap: make shrinking memcg-aware")
- b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")

I have decided on good/bad commits with the relevant libguestfs tests,
but I think the reproducer in the redhat bugzilla is simpler (although I
only became aware of it during the bisection and therefore didn't test
it myself):

  LIBGUESTFS_MEMSIZE=4096 LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 make -C /build/libguestfs/src/libguestfs-1.52.0/tests -k check TESTS=c-api/tests

I hope I have included everything needed to debug this further, if there
is more to add I'm happy to provide more details!

Cheers,
Christian

[0]: https://github.com/libguestfs/libguestfs/issues/139
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2275252
[2]: https://gist.github.com/christian-heusel/d5095c36b72ae90871e27dfed32ddc46

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-16 12:19 [REGRESSION] Null pointer dereference while shrinking zswap Christian Heusel
@ 2024-04-16 19:18 ` Andrew Morton
  2024-04-16 19:57   ` Christian Heusel
  2024-04-16 22:14 ` Nhat Pham
  2024-04-17  0:33 ` Nhat Pham
  2 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2024-04-16 19:18 UTC (permalink / raw)
  To: Christian Heusel
  Cc: Seth Jennings, Dan Streetman, Vitaly Wool, linux-mm,
	linux-kernel, David Runge, Richard W.M. Jones, Mark W,
	regressions, Johannes Weiner, Yosry Ahmed, Nhat Pham,
	Chengming Zhou

On Tue, 16 Apr 2024 14:19:24 +0200 Christian Heusel <christian@heusel.eu> wrote:

> Hello everyone,
> 
> while rebuilding a few packages in Arch Linux we have recently come
> across a regression in the linux kernel which was made visible by a test
> failure in libguestfs[0], where the booted kernel showed a Call Trace
> like the following one:
> 
> [  218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M

(cc more zswap people)

Fairly old kernel.  We might have fixed it since then, but it would be
good to know what fixed this, for backporting reasons.

> [  218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> [  218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> [  218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> [  218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> [  218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> [  218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> [  218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> [  218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> [  218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> [  218.742376] FS:  00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> [  218.742569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> [  218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> [  218.743494] PKRU: 55555554^M
> [  218.743593] Call Trace:^M
> [  218.743733]  <TASK>^M
> [  218.743847]  ? __die+0x23/0x70^M
> [  218.743957]  ? page_fault_oops+0x171/0x4e0^M
> [  218.744056]  ? free_unref_page+0xf6/0x180^M
> [  218.744458]  ? exc_page_fault+0x7f/0x180^M
> [  218.744551]  ? asm_exc_page_fault+0x26/0x30^M
> [  218.744684]  ? memcg_page_state+0x9/0x30^M
> [  218.744779]  zswap_shrinker_count+0x9d/0x110^M
> [  218.744896]  do_shrink_slab+0x3a/0x360^M
> [  218.744990]  shrink_slab+0xc7/0x3c0^M
> [  218.745609]  drop_slab+0x85/0x140^M
> [  218.745691]  drop_caches_sysctl_handler+0x7e/0xd0^M
> [  218.745799]  proc_sys_call_handler+0x1c0/0x2e0^M
> [  218.745912]  vfs_write+0x23d/0x400^M
> [  218.745998]  ksys_write+0x6f/0xf0^M
> [  218.746080]  do_syscall_64+0x64/0xe0^M
> [  218.746169]  ? exit_to_user_mode_prepare+0x132/0x1f0^M
> [  218.746873]  entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
> 
> The regression is present in the mainline kernel and also was
> independently reported to the redhat bugtracker[1].
> 
> I have bisected (see log[2]) the regression between v6.9-rc4 and v6.6
> and have landed on the following results (removed unrelated test commit)
> as remainders since some of the commits were not buildable for me:
> - 7108cc3f765c ("mm: memcg: add per-memcg zswap writeback stat")
> - a65b0e7607cc ("zswap: make shrinking memcg-aware")
> - b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> 
> I have decided on good/bad commits with the relevant libguestfs tests,
> but I think the reproducer in the redhat bugzilla is simpler (although I
> only became aware of it during the bisection and therefore didn't test
> it myself):
> 
>   LIBGUESTFS_MEMSIZE=4096 LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 make -C /build/libguestfs/src/libguestfs-1.52.0/tests -k check TESTS=c-api/tests
> 
> I hope I have included everything needed to debug this further, if there
> is more to add I'm happy to provide more details!
> 
> Cheers,
> Christian
> 
> [0]: https://github.com/libguestfs/libguestfs/issues/139
> [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2275252
> [2]: https://gist.github.com/christian-heusel/d5095c36b72ae90871e27dfed32ddc46


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-16 19:18 ` Andrew Morton
@ 2024-04-16 19:57   ` Christian Heusel
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Heusel @ 2024-04-16 19:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Seth Jennings, Dan Streetman, Vitaly Wool, linux-mm,
	linux-kernel, David Runge, Richard W.M. Jones, Mark W,
	regressions, Johannes Weiner, Yosry Ahmed, Nhat Pham,
	Chengming Zhou

[-- Attachment #1: Type: text/plain, Size: 783 bytes --]

On 24/04/16 12:18PM, Andrew Morton wrote:
> On Tue, 16 Apr 2024 14:19:24 +0200 Christian Heusel <christian@heusel.eu> wrote:
> > [  218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
>
> (cc more zswap people)
>
> Fairly old kernel.  We might have fixed it since then, but it would be
> good to know what fixed this, for backporting reasons.

Hey Andrew,

I guess you must have missed the end of my previous mail, I have
verified that the bug is still present with with the latest mainline
release and bisected the issue down to a few commits. The trace I was
posting above was just a random one from during the bisection.

Sorry if my formatting was a bit confusing .. :)

Cheers,
Christian

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-16 12:19 [REGRESSION] Null pointer dereference while shrinking zswap Christian Heusel
  2024-04-16 19:18 ` Andrew Morton
@ 2024-04-16 22:14 ` Nhat Pham
  2024-04-16 22:29   ` Christian Heusel
  2024-04-16 23:29   ` Nhat Pham
  2024-04-17  0:33 ` Nhat Pham
  2 siblings, 2 replies; 17+ messages in thread
From: Nhat Pham @ 2024-04-16 22:14 UTC (permalink / raw)
  To: Christian Heusel
  Cc: Seth Jennings, Dan Streetman, Vitaly Wool, Andrew Morton,
	linux-mm, linux-kernel, David Runge, Richard W.M. Jones, Mark W,
	regressions, Johannes Weiner, Yosry Ahmed, Chengming Zhou

On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <christian@heusel.eu> wrote:
>
> Hello everyone,

Thanks for the report, Christian! Looking at it now.

>
> while rebuilding a few packages in Arch Linux we have recently come
> across a regression in the linux kernel which was made visible by a test
> failure in libguestfs[0], where the booted kernel showed a Call Trace
> like the following one:
>
> [  218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M

Is this one of the kernel versions that was broken? That looks a bit
odd, as zswap shrinker landed on 6.8...

> [  218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> [  218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> [  218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> [  218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> [  218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> [  218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> [  218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> [  218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> [  218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> [  218.742376] FS:  00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> [  218.742569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> [  218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> [  218.743494] PKRU: 55555554^M
> [  218.743593] Call Trace:^M
> [  218.743733]  <TASK>^M
> [  218.743847]  ? __die+0x23/0x70^M
> [  218.743957]  ? page_fault_oops+0x171/0x4e0^M
> [  218.744056]  ? free_unref_page+0xf6/0x180^M
> [  218.744458]  ? exc_page_fault+0x7f/0x180^M
> [  218.744551]  ? asm_exc_page_fault+0x26/0x30^M
> [  218.744684]  ? memcg_page_state+0x9/0x30^M
> [  218.744779]  zswap_shrinker_count+0x9d/0x110^M
> [  218.744896]  do_shrink_slab+0x3a/0x360^M
> [  218.744990]  shrink_slab+0xc7/0x3c0^M
> [  218.745609]  drop_slab+0x85/0x140^M
> [  218.745691]  drop_caches_sysctl_handler+0x7e/0xd0^M
> [  218.745799]  proc_sys_call_handler+0x1c0/0x2e0^M
> [  218.745912]  vfs_write+0x23d/0x400^M
> [  218.745998]  ksys_write+0x6f/0xf0^M
> [  218.746080]  do_syscall_64+0x64/0xe0^M
> [  218.746169]  ? exit_to_user_mode_prepare+0x132/0x1f0^M
> [  218.746873]  entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
>
> The regression is present in the mainline kernel and also was
> independently reported to the redhat bugtracker[1].
>
> I have bisected (see log[2]) the regression between v6.9-rc4 and v6.6
> and have landed on the following results (removed unrelated test commit)
> as remainders since some of the commits were not buildable for me:
> - 7108cc3f765c ("mm: memcg: add per-memcg zswap writeback stat")
> - a65b0e7607cc ("zswap: make shrinking memcg-aware")
> - b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-16 22:14 ` Nhat Pham
@ 2024-04-16 22:29   ` Christian Heusel
  2024-04-16 23:29   ` Nhat Pham
  1 sibling, 0 replies; 17+ messages in thread
From: Christian Heusel @ 2024-04-16 22:29 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Seth Jennings, Dan Streetman, Vitaly Wool, Andrew Morton,
	linux-mm, linux-kernel, David Runge, Richard W.M. Jones, Mark W,
	regressions, Johannes Weiner, Yosry Ahmed, Chengming Zhou

[-- Attachment #1: Type: text/plain, Size: 977 bytes --]

On 24/04/16 03:14PM, Nhat Pham wrote:
> Is this one of the kernel versions that was broken? That looks a bit
> odd, as zswap shrinker landed on 6.8...

Yes that is the build at commit b5ba474f3f51 ("zswap: shrink zswap pool
based on memory pressure") which is one of the kernel versions that is
broken, although the version string seems a bit confusing. This is due
to the build script invoking "make -s kernelrelease" which gives the
following:

  $ git checkout b5ba474f3f51
  [...]
  $ make -s kernelrelease
  6.7.0-rc4-00158-gb5ba474f3f51

The commit is contained in v6.8+ only as you say, so its just confusing
in the version string:

  $ git tag --contains b5ba474f3f51 | grep -v next
  v6.8
  v6.8-rc1
  v6.8-rc2
  v6.8-rc3
  v6.8-rc4
  v6.8-rc5
  v6.8-rc6
  v6.8-rc7
  v6.9-rc1
  v6.9-rc2
  v6.9-rc3
  v6.9-rc4

Please see my initial report for the detailed bisection results and the
link there to the bisection log.

Cheers and thanks for looking into this!
Christian

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-16 22:14 ` Nhat Pham
  2024-04-16 22:29   ` Christian Heusel
@ 2024-04-16 23:29   ` Nhat Pham
  2024-04-17  0:22     ` Nhat Pham
  1 sibling, 1 reply; 17+ messages in thread
From: Nhat Pham @ 2024-04-16 23:29 UTC (permalink / raw)
  To: Christian Heusel
  Cc: Seth Jennings, Dan Streetman, Vitaly Wool, Andrew Morton,
	linux-mm, linux-kernel, David Runge, Richard W.M. Jones, Mark W,
	regressions, Johannes Weiner, Yosry Ahmed, Chengming Zhou

On Tue, Apr 16, 2024 at 3:14 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <christian@heusel.eu> wrote:
> >
> > Hello everyone,
>
> Thanks for the report, Christian! Looking at it now.
>
> >
> > while rebuilding a few packages in Arch Linux we have recently come
> > across a regression in the linux kernel which was made visible by a test
> > failure in libguestfs[0], where the booted kernel showed a Call Trace
> > like the following one:
> >
> > [  218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
>
> Is this one of the kernel versions that was broken? That looks a bit
> odd, as zswap shrinker landed on 6.8...

Ah ignore this - I understand the versioning now...

>
> > [  218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> > [  218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> > [  218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> > [  218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> > [  218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> > [  218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> > [  218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> > [  218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> > [  218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> > [  218.742376] FS:  00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> > [  218.742569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> > [  218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> > [  218.743494] PKRU: 55555554^M
> > [  218.743593] Call Trace:^M
> > [  218.743733]  <TASK>^M
> > [  218.743847]  ? __die+0x23/0x70^M
> > [  218.743957]  ? page_fault_oops+0x171/0x4e0^M
> > [  218.744056]  ? free_unref_page+0xf6/0x180^M
> > [  218.744458]  ? exc_page_fault+0x7f/0x180^M
> > [  218.744551]  ? asm_exc_page_fault+0x26/0x30^M
> > [  218.744684]  ? memcg_page_state+0x9/0x30^M
> > [  218.744779]  zswap_shrinker_count+0x9d/0x110^M
> > [  218.744896]  do_shrink_slab+0x3a/0x360^M
> > [  218.744990]  shrink_slab+0xc7/0x3c0^M
> > [  218.745609]  drop_slab+0x85/0x140^M
> > [  218.745691]  drop_caches_sysctl_handler+0x7e/0xd0^M
> > [  218.745799]  proc_sys_call_handler+0x1c0/0x2e0^M
> > [  218.745912]  vfs_write+0x23d/0x400^M
> > [  218.745998]  ksys_write+0x6f/0xf0^M
> > [  218.746080]  do_syscall_64+0x64/0xe0^M
> > [  218.746169]  ? exit_to_user_mode_prepare+0x132/0x1f0^M
> > [  218.746873]  entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
> >
> > The regression is present in the mainline kernel and also was
> > independently reported to the redhat bugtracker[1].
> >
> > I have bisected (see log[2]) the regression between v6.9-rc4 and v6.6
> > and have landed on the following results (removed unrelated test commit)
> > as remainders since some of the commits were not buildable for me:
> > - 7108cc3f765c ("mm: memcg: add per-memcg zswap writeback stat")
> > - a65b0e7607cc ("zswap: make shrinking memcg-aware")
> > - b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")

Hmmm, this smells like a memcg reference counting bug to me. The memcg
was successfully accessed just a couple of LoCs above, but failed
after the stats flushing.

My suspicion right now is the new memcg refcnt dance to select an
memcg for the global, capacity-based, shrinker, introduced in this
patch series. That dance looks solid to me tho - not sure where it
goes wrong. And, if it's a reference counting issue, that should pop
up in different places (i.e other mysterious memcg traces)...


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-16 23:29   ` Nhat Pham
@ 2024-04-17  0:22     ` Nhat Pham
  2024-04-17  3:44       ` Chengming Zhou
  0 siblings, 1 reply; 17+ messages in thread
From: Nhat Pham @ 2024-04-17  0:22 UTC (permalink / raw)
  To: Christian Heusel
  Cc: Seth Jennings, Dan Streetman, Vitaly Wool, Andrew Morton,
	linux-mm, linux-kernel, David Runge, Richard W.M. Jones, Mark W,
	regressions, Johannes Weiner, Yosry Ahmed, Chengming Zhou

On Tue, Apr 16, 2024 at 4:29 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Apr 16, 2024 at 3:14 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <christian@heusel.eu> wrote:
> > >
> > > Hello everyone,
> >
> > Thanks for the report, Christian! Looking at it now.
> >
> > >
> > > while rebuilding a few packages in Arch Linux we have recently come
> > > across a regression in the linux kernel which was made visible by a test
> > > failure in libguestfs[0], where the booted kernel showed a Call Trace
> > > like the following one:
> > >
> > > [  218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
> >
> > Is this one of the kernel versions that was broken? That looks a bit
> > odd, as zswap shrinker landed on 6.8...
>
> Ah ignore this - I understand the versioning now...
>
> >
> > > [  218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> > > [  218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> > > [  218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> > > [  218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> > > [  218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> > > [  218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> > > [  218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> > > [  218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> > > [  218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> > > [  218.742376] FS:  00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> > > [  218.742569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> > > [  218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> > > [  218.743494] PKRU: 55555554^M
> > > [  218.743593] Call Trace:^M
> > > [  218.743733]  <TASK>^M
> > > [  218.743847]  ? __die+0x23/0x70^M
> > > [  218.743957]  ? page_fault_oops+0x171/0x4e0^M
> > > [  218.744056]  ? free_unref_page+0xf6/0x180^M
> > > [  218.744458]  ? exc_page_fault+0x7f/0x180^M
> > > [  218.744551]  ? asm_exc_page_fault+0x26/0x30^M
> > > [  218.744684]  ? memcg_page_state+0x9/0x30^M
> > > [  218.744779]  zswap_shrinker_count+0x9d/0x110^M
> > > [  218.744896]  do_shrink_slab+0x3a/0x360^M
> > > [  218.744990]  shrink_slab+0xc7/0x3c0^M
> > > [  218.745609]  drop_slab+0x85/0x140^M
> > > [  218.745691]  drop_caches_sysctl_handler+0x7e/0xd0^M
> > > [  218.745799]  proc_sys_call_handler+0x1c0/0x2e0^M
> > > [  218.745912]  vfs_write+0x23d/0x400^M
> > > [  218.745998]  ksys_write+0x6f/0xf0^M
> > > [  218.746080]  do_syscall_64+0x64/0xe0^M
> > > [  218.746169]  ? exit_to_user_mode_prepare+0x132/0x1f0^M
> > > [  218.746873]  entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
> > >

Actually, inspecting the code a bit more - can memcg be null here?

Specifically, if mem_cgroup_disabled() is true, can we see null memcg
here? Looks like in this case, mem_cgroup_iter() can return null, and
the first iteration of drop_slab_node() can have null memcg if it's
returned by mem_cgroup_iter().

shrink_slab() will still proceed and call do_shrink_slab() if the
memcg is null - provided that mem_cgroup_disabled() holds:

if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
      return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

Inside zswap_shrink_count(), all the memcg accessors in this area seem
to always check for null memcg (mem_cgroup_lruvec,
mem_cgroup_zswap_writeback_enabled), *except* memcg_page_state, which
is the one line that fail.

If this is all to it, we can simply add a null check or
mem_cgroup_disabled() check here, and use pool stats instead?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-17  0:22     ` Nhat Pham
@ 2024-04-17  3:44       ` Chengming Zhou
  2024-04-17 14:33         ` Johannes Weiner
  0 siblings, 1 reply; 17+ messages in thread
From: Chengming Zhou @ 2024-04-17  3:44 UTC (permalink / raw)
  To: Nhat Pham, Christian Heusel
  Cc: Seth Jennings, Dan Streetman, Vitaly Wool, Andrew Morton,
	linux-mm, linux-kernel, David Runge, Richard W.M. Jones, Mark W,
	regressions, Johannes Weiner, Yosry Ahmed

On 2024/4/17 08:22, Nhat Pham wrote:
> On Tue, Apr 16, 2024 at 4:29 PM Nhat Pham <nphamcs@gmail.com> wrote:
>>
>> On Tue, Apr 16, 2024 at 3:14 PM Nhat Pham <nphamcs@gmail.com> wrote:
>>>
>>> On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <christian@heusel.eu> wrote:
>>>>
>>>> Hello everyone,
>>>
>>> Thanks for the report, Christian! Looking at it now.
>>>
>>>>
>>>> while rebuilding a few packages in Arch Linux we have recently come
>>>> across a regression in the linux kernel which was made visible by a test
>>>> failure in libguestfs[0], where the booted kernel showed a Call Trace
>>>> like the following one:
>>>>
>>>> [  218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
>>>
>>> Is this one of the kernel versions that was broken? That looks a bit
>>> odd, as zswap shrinker landed on 6.8...
>>
>> Ah ignore this - I understand the versioning now...
>>
>>>
>>>> [  218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
>>>> [  218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
>>>> [  218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
>>>> [  218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
>>>> [  218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
>>>> [  218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
>>>> [  218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
>>>> [  218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
>>>> [  218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
>>>> [  218.742376] FS:  00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
>>>> [  218.742569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>>>> [  218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
>>>> [  218.743494] PKRU: 55555554^M
>>>> [  218.743593] Call Trace:^M
>>>> [  218.743733]  <TASK>^M
>>>> [  218.743847]  ? __die+0x23/0x70^M
>>>> [  218.743957]  ? page_fault_oops+0x171/0x4e0^M
>>>> [  218.744056]  ? free_unref_page+0xf6/0x180^M
>>>> [  218.744458]  ? exc_page_fault+0x7f/0x180^M
>>>> [  218.744551]  ? asm_exc_page_fault+0x26/0x30^M
>>>> [  218.744684]  ? memcg_page_state+0x9/0x30^M
>>>> [  218.744779]  zswap_shrinker_count+0x9d/0x110^M
>>>> [  218.744896]  do_shrink_slab+0x3a/0x360^M
>>>> [  218.744990]  shrink_slab+0xc7/0x3c0^M
>>>> [  218.745609]  drop_slab+0x85/0x140^M
>>>> [  218.745691]  drop_caches_sysctl_handler+0x7e/0xd0^M
>>>> [  218.745799]  proc_sys_call_handler+0x1c0/0x2e0^M
>>>> [  218.745912]  vfs_write+0x23d/0x400^M
>>>> [  218.745998]  ksys_write+0x6f/0xf0^M
>>>> [  218.746080]  do_syscall_64+0x64/0xe0^M
>>>> [  218.746169]  ? exit_to_user_mode_prepare+0x132/0x1f0^M
>>>> [  218.746873]  entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
>>>>
> 
> Actually, inspecting the code a bit more - can memcg be null here?
> 
> Specifically, if mem_cgroup_disabled() is true, can we see null memcg
> here? Looks like in this case, mem_cgroup_iter() can return null, and
> the first iteration of drop_slab_node() can have null memcg if it's
> returned by mem_cgroup_iter().
> 
> shrink_slab() will still proceed and call do_shrink_slab() if the
> memcg is null - provided that mem_cgroup_disabled() holds:
> 
> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>       return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> 

Ah, I think your analysis is very right, here memcg is NULL but memcg_page_state
won't check.

> Inside zswap_shrink_count(), all the memcg accessors in this area seem
> to always check for null memcg (mem_cgroup_lruvec,
> mem_cgroup_zswap_writeback_enabled), *except* memcg_page_state, which
> is the one line that fail.
> 
> If this is all to it, we can simply add a null check or
> mem_cgroup_disabled() check here, and use pool stats instead?

Both look ok to me. The VM could only set sc->memcg to NULL when memcg
disabled, right?

Thanks.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-17  3:44       ` Chengming Zhou
@ 2024-04-17 14:33         ` Johannes Weiner
  2024-04-17 15:08           ` Richard W.M. Jones
  2024-04-17 17:18           ` Christian Heusel
  0 siblings, 2 replies; 17+ messages in thread
From: Johannes Weiner @ 2024-04-17 14:33 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Christian Heusel, Seth Jennings, Dan Streetman,
	Vitaly Wool, Andrew Morton, linux-mm, linux-kernel, David Runge,
	Richard W.M. Jones, Mark W, regressions, Yosry Ahmed

On Wed, Apr 17, 2024 at 11:44:45AM +0800, Chengming Zhou wrote:
> On 2024/4/17 08:22, Nhat Pham wrote:
> > On Tue, Apr 16, 2024 at 4:29 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >>
> >> On Tue, Apr 16, 2024 at 3:14 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >>>
> >>> On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <christian@heusel.eu> wrote:
> >>>>
> >>>> Hello everyone,
> >>>
> >>> Thanks for the report, Christian! Looking at it now.
> >>>
> >>>>
> >>>> while rebuilding a few packages in Arch Linux we have recently come
> >>>> across a regression in the linux kernel which was made visible by a test
> >>>> failure in libguestfs[0], where the booted kernel showed a Call Trace
> >>>> like the following one:
> >>>>
> >>>> [  218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
> >>>
> >>> Is this one of the kernel versions that was broken? That looks a bit
> >>> odd, as zswap shrinker landed on 6.8...
> >>
> >> Ah ignore this - I understand the versioning now...
> >>
> >>>
> >>>> [  218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> >>>> [  218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> >>>> [  218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> >>>> [  218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> >>>> [  218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> >>>> [  218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> >>>> [  218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> >>>> [  218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> >>>> [  218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> >>>> [  218.742376] FS:  00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> >>>> [  218.742569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> >>>> [  218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> >>>> [  218.743494] PKRU: 55555554^M
> >>>> [  218.743593] Call Trace:^M
> >>>> [  218.743733]  <TASK>^M
> >>>> [  218.743847]  ? __die+0x23/0x70^M
> >>>> [  218.743957]  ? page_fault_oops+0x171/0x4e0^M
> >>>> [  218.744056]  ? free_unref_page+0xf6/0x180^M
> >>>> [  218.744458]  ? exc_page_fault+0x7f/0x180^M
> >>>> [  218.744551]  ? asm_exc_page_fault+0x26/0x30^M
> >>>> [  218.744684]  ? memcg_page_state+0x9/0x30^M
> >>>> [  218.744779]  zswap_shrinker_count+0x9d/0x110^M
> >>>> [  218.744896]  do_shrink_slab+0x3a/0x360^M
> >>>> [  218.744990]  shrink_slab+0xc7/0x3c0^M
> >>>> [  218.745609]  drop_slab+0x85/0x140^M
> >>>> [  218.745691]  drop_caches_sysctl_handler+0x7e/0xd0^M
> >>>> [  218.745799]  proc_sys_call_handler+0x1c0/0x2e0^M
> >>>> [  218.745912]  vfs_write+0x23d/0x400^M
> >>>> [  218.745998]  ksys_write+0x6f/0xf0^M
> >>>> [  218.746080]  do_syscall_64+0x64/0xe0^M
> >>>> [  218.746169]  ? exit_to_user_mode_prepare+0x132/0x1f0^M
> >>>> [  218.746873]  entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
> >>>>
> > 
> > Actually, inspecting the code a bit more - can memcg be null here?
> > 
> > Specifically, if mem_cgroup_disabled() is true, can we see null memcg
> > here? Looks like in this case, mem_cgroup_iter() can return null, and
> > the first iteration of drop_slab_node() can have null memcg if it's
> > returned by mem_cgroup_iter().
> > 
> > shrink_slab() will still proceed and call do_shrink_slab() if the
> > memcg is null - provided that mem_cgroup_disabled() holds:
> > 
> > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> >       return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> > 
> 
> Ah, I think your analysis is very right, here memcg is NULL but memcg_page_state
> won't check.

+1.

I could reproduce the NULL crash locally with cgroup_disable=memory,
the shrinker enabled, and echo 3 >/proc/sys/vm/drop_caches.

> > Inside zswap_shrink_count(), all the memcg accessors in this area seem
> > to always check for null memcg (mem_cgroup_lruvec,
> > mem_cgroup_zswap_writeback_enabled), *except* memcg_page_state, which
> > is the one line that fail.
> > 
> > If this is all to it, we can simply add a null check or
> > mem_cgroup_disabled() check here, and use pool stats instead?
> 
> Both look ok to me. The VM could only set sc->memcg to NULL when memcg
> disabled, right?

Christian, can you please test the below patch on top of current
upstream?

diff --git a/mm/zswap.c b/mm/zswap.c
index caed028945b0..6f8850c44b61 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1331,15 +1331,22 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	if (!gfp_has_io_fs(sc->gfp_mask))
 		return 0;
 
-#ifdef CONFIG_MEMCG_KMEM
-	mem_cgroup_flush_stats(memcg);
-	nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
-	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
-#else
-	/* use pool stats instead of memcg stats */
-	nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
-	nr_stored = atomic_read(&zswap_nr_stored);
-#endif
+	/*
+	 * For memcg, use the cgroup-wide ZSWAP stats since we don't
+	 * have them per-node and thus per-lruvec. Careful if memcg is
+	 * runtime-disabled: we can get sc->memcg == NULL, which is ok
+	 * for the lruvec, but not for memcg_page_state().
+	 *
+	 * Without memcg, use the zswap pool-wide metrics.
+	 */
+	if (!mem_cgroup_disabled()) {
+		mem_cgroup_flush_stats(memcg);
+		nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
+		nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
+	} else {
+		nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
+		nr_stored = atomic_read(&zswap_nr_stored);
+	}
 
 	if (!nr_stored)
 		return 0;


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-17 14:33         ` Johannes Weiner
@ 2024-04-17 15:08           ` Richard W.M. Jones
  2024-04-17 17:18           ` Christian Heusel
  1 sibling, 0 replies; 17+ messages in thread
From: Richard W.M. Jones @ 2024-04-17 15:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, Nhat Pham, Christian Heusel, Seth Jennings,
	Dan Streetman, Vitaly Wool, Andrew Morton, linux-mm,
	linux-kernel, David Runge, Mark W, regressions, Yosry Ahmed

On Wed, Apr 17, 2024 at 10:33:24AM -0400, Johannes Weiner wrote:
> I could reproduce the NULL crash locally with cgroup_disable=memory,
> the shrinker enabled, and echo 3 >/proc/sys/vm/drop_caches.

Can confirm that libguestfs uses cgroup_disable=memory for its
appliance (as it saves some RAM and we don't need cgroups).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-17 14:33         ` Johannes Weiner
  2024-04-17 15:08           ` Richard W.M. Jones
@ 2024-04-17 17:18           ` Christian Heusel
  2024-04-18 12:40             ` Johannes Weiner
  1 sibling, 1 reply; 17+ messages in thread
From: Christian Heusel @ 2024-04-17 17:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, Nhat Pham, Seth Jennings, Dan Streetman,
	Vitaly Wool, Andrew Morton, linux-mm, linux-kernel, David Runge,
	Richard W.M. Jones, Mark W, regressions, Yosry Ahmed

[-- Attachment #1: Type: text/plain, Size: 358 bytes --]

On 24/04/17 10:33AM, Johannes Weiner wrote:
>
> Christian, can you please test the below patch on top of current
> upstream?
>

Hey Johannes,

I have applied your patch on top of 6.9-rc4 and it did solve the crash for
me, thanks for hacking together a fix so quickly! 🤗

Tested-By: Christian Heusel <christian@heusel.eu>

Cheers,
Christian

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-17 17:18           ` Christian Heusel
@ 2024-04-18 12:40             ` Johannes Weiner
  2024-04-18 14:25               ` Linux regression tracking (Thorsten Leemhuis)
  2024-04-18 20:09               ` Yosry Ahmed
  0 siblings, 2 replies; 17+ messages in thread
From: Johannes Weiner @ 2024-04-18 12:40 UTC (permalink / raw)
  To: Christian Heusel
  Cc: Chengming Zhou, Nhat Pham, Seth Jennings, Dan Streetman,
	Vitaly Wool, Andrew Morton, linux-mm, linux-kernel, David Runge,
	Richard W.M. Jones, Mark W, regressions, Yosry Ahmed

On Wed, Apr 17, 2024 at 07:18:14PM +0200, Christian Heusel wrote:
> On 24/04/17 10:33AM, Johannes Weiner wrote:
> >
> > Christian, can you please test the below patch on top of current
> > upstream?
> >
> 
> Hey Johannes,
> 
> I have applied your patch on top of 6.9-rc4 and it did solve the crash for
> me, thanks for hacking together a fix so quickly! 🤗
> 
> Tested-By: Christian Heusel <christian@heusel.eu>

Thanks for confirming it, and sorry about the breakage.

Andrew, can you please use the updated changelog below?

---

From 52f67f5fab6a743c2aedfc8e04a582a9d1025c28 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 18 Apr 2024 08:26:28 -0400
Subject: [PATCH] mm: zswap: fix shrinker NULL crash with cgroup_disable=memory

Christian reports a NULL deref in zswap that he bisected down to the
zswap shrinker. The issue also cropped up in the bug trackers of
libguestfs [1] and the Red Hat bugzilla [2].

The problem is that when memcg is disabled with the boot time flag,
the zswap shrinker might get called with sc->memcg == NULL. This is
okay in many places, like the lruvec operations. But it crashes in
memcg_page_state() - which is only used due to the non-node accounting
of cgroup's the zswap memory to begin with.

Nhat spotted that the memcg can be NULL in the memcg-disabled case,
and I was then able to reproduce the crash locally as well.

[1] https://github.com/libguestfs/libguestfs/issues/139
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252

Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
Cc: stable@vger.kernel.org	[v6.8]
Link: https://lkml.kernel.org/r/20240417143324.GA1055428@cmpxchg.org
Reported-by: Christian Heusel <christian@heusel.eu>
Debugged-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Tested-By: Christian Heusel <christian@heusel.eu>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/zswap.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index caed028945b0..6f8850c44b61 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1331,15 +1331,22 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	if (!gfp_has_io_fs(sc->gfp_mask))
 		return 0;
 
-#ifdef CONFIG_MEMCG_KMEM
-	mem_cgroup_flush_stats(memcg);
-	nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
-	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
-#else
-	/* use pool stats instead of memcg stats */
-	nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
-	nr_stored = atomic_read(&zswap_nr_stored);
-#endif
+	/*
+	 * For memcg, use the cgroup-wide ZSWAP stats since we don't
+	 * have them per-node and thus per-lruvec. Careful if memcg is
+	 * runtime-disabled: we can get sc->memcg == NULL, which is ok
+	 * for the lruvec, but not for memcg_page_state().
+	 *
+	 * Without memcg, use the zswap pool-wide metrics.
+	 */
+	if (!mem_cgroup_disabled()) {
+		mem_cgroup_flush_stats(memcg);
+		nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
+		nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
+	} else {
+		nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
+		nr_stored = atomic_read(&zswap_nr_stored);
+	}
 
 	if (!nr_stored)
 		return 0;
-- 
2.44.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-18 12:40             ` Johannes Weiner
@ 2024-04-18 14:25               ` Linux regression tracking (Thorsten Leemhuis)
  2024-04-18 20:09               ` Yosry Ahmed
  1 sibling, 0 replies; 17+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-04-18 14:25 UTC (permalink / raw)
  To: Johannes Weiner, Christian Heusel
  Cc: Chengming Zhou, Nhat Pham, Seth Jennings, Dan Streetman,
	Vitaly Wool, Andrew Morton, linux-mm, linux-kernel, David Runge,
	Richard W.M. Jones, Mark W, regressions, Yosry Ahmed

On 18.04.24 14:40, Johannes Weiner wrote:
> On Wed, Apr 17, 2024 at 07:18:14PM +0200, Christian Heusel wrote:
>> On 24/04/17 10:33AM, Johannes Weiner wrote:
> Christian reports a NULL deref in zswap that he bisected down to the
> zswap shrinker. The issue also cropped up in the bug trackers of
> libguestfs [1] and the Red Hat bugzilla [2].
> 
> The problem is that when memcg is disabled with the boot time flag,
> the zswap shrinker might get called with sc->memcg == NULL. This is
> okay in many places, like the lruvec operations. But it crashes in
> memcg_page_state() - which is only used due to the non-node accounting
> of cgroup's the zswap memory to begin with.
> 
> Nhat spotted that the memcg can be NULL in the memcg-disabled case,
> and I was then able to reproduce the crash locally as well.

Thx for the fix. Nitpicking:

> [1] https://github.com/libguestfs/libguestfs/issues/139
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252

FWIW, those should ideally look like this:

Link: https://github.com/libguestfs/libguestfs/issues/139 [1]
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2275252 [2]

> Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> Cc: stable@vger.kernel.org	[v6.8]
> Link: https://lkml.kernel.org/r/20240417143324.GA1055428@cmpxchg.org
> Reported-by: Christian Heusel <christian@heusel.eu>

And here checkpatch.pl should have complained that the above line should
ideally be followed by a Link or Closes tag to the report, e.g.:

Closes:
https://lore.kernel.org/all/3iccc6vjl5gminut3lvpl4va2lbnsgku5ei2d7ylftoofy3n2v@gcfdvtsq6dx2/

Which in this case would be nice, as I'm tracking this regression, hence
regzbot will then track the patch and consider the regression resolved
once the fix lands in mainline.

Ciao, Thorsten


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-18 12:40             ` Johannes Weiner
  2024-04-18 14:25               ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-04-18 20:09               ` Yosry Ahmed
  2024-04-19 14:22                 ` Johannes Weiner
  1 sibling, 1 reply; 17+ messages in thread
From: Yosry Ahmed @ 2024-04-18 20:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christian Heusel, Chengming Zhou, Nhat Pham, Seth Jennings,
	Dan Streetman, Vitaly Wool, Andrew Morton, linux-mm,
	linux-kernel, David Runge, Richard W.M. Jones, Mark W,
	regressions

On Thu, Apr 18, 2024 at 5:40 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Apr 17, 2024 at 07:18:14PM +0200, Christian Heusel wrote:
> > On 24/04/17 10:33AM, Johannes Weiner wrote:
> > >
> > > Christian, can you please test the below patch on top of current
> > > upstream?
> > >
> >
> > Hey Johannes,
> >
> > I have applied your patch on top of 6.9-rc4 and it did solve the crash for
> > me, thanks for hacking together a fix so quickly! 🤗
> >
> > Tested-By: Christian Heusel <christian@heusel.eu>
>
> Thanks for confirming it, and sorry about the breakage.
>
> Andrew, can you please use the updated changelog below?
>
> ---
>
> From 52f67f5fab6a743c2aedfc8e04a582a9d1025c28 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu, 18 Apr 2024 08:26:28 -0400
> Subject: [PATCH] mm: zswap: fix shrinker NULL crash with cgroup_disable=memory
>
> Christian reports a NULL deref in zswap that he bisected down to the
> zswap shrinker. The issue also cropped up in the bug trackers of
> libguestfs [1] and the Red Hat bugzilla [2].
>
> The problem is that when memcg is disabled with the boot time flag,
> the zswap shrinker might get called with sc->memcg == NULL. This is
> okay in many places, like the lruvec operations. But it crashes in
> memcg_page_state() - which is only used due to the non-node accounting
> of cgroup's the zswap memory to begin with.
>
> Nhat spotted that the memcg can be NULL in the memcg-disabled case,
> and I was then able to reproduce the crash locally as well.
>
> [1] https://github.com/libguestfs/libguestfs/issues/139
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252
>
> Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> Cc: stable@vger.kernel.org      [v6.8]
> Link: https://lkml.kernel.org/r/20240417143324.GA1055428@cmpxchg.org
> Reported-by: Christian Heusel <christian@heusel.eu>
> Debugged-by: Nhat Pham <nphamcs@gmail.com>
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Tested-By: Christian Heusel <christian@heusel.eu>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for fixing this. A couple of comments/questions below, but anyway LGTM:

Acked-by: Yosry Ahmed <yosryahmed@google.com>

> ---
>  mm/zswap.c | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index caed028945b0..6f8850c44b61 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1331,15 +1331,22 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>         if (!gfp_has_io_fs(sc->gfp_mask))
>                 return 0;
>
> -#ifdef CONFIG_MEMCG_KMEM
> -       mem_cgroup_flush_stats(memcg);
> -       nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> -       nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> -#else
> -       /* use pool stats instead of memcg stats */
> -       nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
> -       nr_stored = atomic_read(&zswap_nr_stored);
> -#endif
> +       /*
> +        * For memcg, use the cgroup-wide ZSWAP stats since we don't
> +        * have them per-node and thus per-lruvec. Careful if memcg is
> +        * runtime-disabled: we can get sc->memcg == NULL, which is ok
> +        * for the lruvec, but not for memcg_page_state().
> +        *
> +        * Without memcg, use the zswap pool-wide metrics.
> +        */
> +       if (!mem_cgroup_disabled()) {

With the current shrinker code, it seems like we cannot get sc->memcg
== NULL unless mem_cgroup_disabled() is true indeed. However, maybe
it's better to check if sc->memcg is not NULL directly instead?

This would be more resilient in case the shrinker code changes and
passing sc->memcg == NULL becomes possible in other cases (e.g. during
global shrinking). It seems like other shrinkers do this, for example
see count_shadow_nodes() and deferred_split_count().

I am also wondering if we should also check !mem_cgroup_is_root()
here? We can avoid the expensive global flush and use the global stats
directly in this case. I could also send a follow up patch for this if
that's preferred.

> +               mem_cgroup_flush_stats(memcg);
> +               nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> +               nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> +       } else {
> +               nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
> +               nr_stored = atomic_read(&zswap_nr_stored);
> +       }
>
>         if (!nr_stored)
>                 return 0;
> --
> 2.44.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-18 20:09               ` Yosry Ahmed
@ 2024-04-19 14:22                 ` Johannes Weiner
  2024-04-19 19:10                   ` Yosry Ahmed
  0 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2024-04-19 14:22 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Christian Heusel, Chengming Zhou, Nhat Pham, Seth Jennings,
	Dan Streetman, Vitaly Wool, Andrew Morton, linux-mm,
	linux-kernel, David Runge, Richard W.M. Jones, Mark W,
	regressions

On Thu, Apr 18, 2024 at 01:09:22PM -0700, Yosry Ahmed wrote:
> On Thu, Apr 18, 2024 at 5:40 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Apr 17, 2024 at 07:18:14PM +0200, Christian Heusel wrote:
> > > On 24/04/17 10:33AM, Johannes Weiner wrote:
> > > >
> > > > Christian, can you please test the below patch on top of current
> > > > upstream?
> > > >
> > >
> > > Hey Johannes,
> > >
> > > I have applied your patch on top of 6.9-rc4 and it did solve the crash for
> > > me, thanks for hacking together a fix so quickly! 🤗
> > >
> > > Tested-By: Christian Heusel <christian@heusel.eu>
> >
> > Thanks for confirming it, and sorry about the breakage.
> >
> > Andrew, can you please use the updated changelog below?
> >
> > ---
> >
> > From 52f67f5fab6a743c2aedfc8e04a582a9d1025c28 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Thu, 18 Apr 2024 08:26:28 -0400
> > Subject: [PATCH] mm: zswap: fix shrinker NULL crash with cgroup_disable=memory
> >
> > Christian reports a NULL deref in zswap that he bisected down to the
> > zswap shrinker. The issue also cropped up in the bug trackers of
> > libguestfs [1] and the Red Hat bugzilla [2].
> >
> > The problem is that when memcg is disabled with the boot time flag,
> > the zswap shrinker might get called with sc->memcg == NULL. This is
> > okay in many places, like the lruvec operations. But it crashes in
> > memcg_page_state() - which is only used due to the non-node accounting
> > of cgroup's the zswap memory to begin with.
> >
> > Nhat spotted that the memcg can be NULL in the memcg-disabled case,
> > and I was then able to reproduce the crash locally as well.
> >
> > [1] https://github.com/libguestfs/libguestfs/issues/139
> > [2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252
> >
> > Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> > Cc: stable@vger.kernel.org      [v6.8]
> > Link: https://lkml.kernel.org/r/20240417143324.GA1055428@cmpxchg.org
> > Reported-by: Christian Heusel <christian@heusel.eu>
> > Debugged-by: Nhat Pham <nphamcs@gmail.com>
> > Suggested-by: Nhat Pham <nphamcs@gmail.com>
> > Tested-By: Christian Heusel <christian@heusel.eu>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Thanks for fixing this. A couple of comments/questions below, but anyway LGTM:
> 
> Acked-by: Yosry Ahmed <yosryahmed@google.com>
> 
> > ---
> >  mm/zswap.c | 25 ++++++++++++++++---------
> >  1 file changed, 16 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index caed028945b0..6f8850c44b61 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1331,15 +1331,22 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> >         if (!gfp_has_io_fs(sc->gfp_mask))
> >                 return 0;
> >
> > -#ifdef CONFIG_MEMCG_KMEM
> > -       mem_cgroup_flush_stats(memcg);
> > -       nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > -       nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> > -#else
> > -       /* use pool stats instead of memcg stats */
> > -       nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
> > -       nr_stored = atomic_read(&zswap_nr_stored);
> > -#endif
> > +       /*
> > +        * For memcg, use the cgroup-wide ZSWAP stats since we don't
> > +        * have them per-node and thus per-lruvec. Careful if memcg is
> > +        * runtime-disabled: we can get sc->memcg == NULL, which is ok
> > +        * for the lruvec, but not for memcg_page_state().
> > +        *
> > +        * Without memcg, use the zswap pool-wide metrics.
> > +        */
> > +       if (!mem_cgroup_disabled()) {
> 
> With the current shrinker code, it seems like we cannot get sc->memcg
> == NULL unless mem_cgroup_disabled() is true indeed. However, maybe
> it's better to check if sc->memcg is not NULL directly instead?
> 
> This would be more resilient in case the shrinker code changes and
> passing sc->memcg == NULL becomes possible in other cases (e.g. during
> global shrinking). It seems like other shrinkers do this, for example
> see count_shadow_nodes() and deferred_split_count().

Eh, I'm not sure it's better or worse, so it's a bit hard to care. We
shouldn't get NULL here when memcg is enabled, and if somebody
introduces that bug it's better to catch it early than run into subtle
priority inversions when the kernel is deployed to millions of hosts.

> I am also wondering if we should also check !mem_cgroup_is_root()
> here? We can avoid the expensive global flush and use the global stats
> directly in this case. I could also send a follow up patch for this if
> that's preferred.

I'd rather not proliferate more memcg internals in this code. If this
is a concern, optimizing it in the flush and stat functions would make
more sense. Reclaim already flushes the subtree before getting here,
so odds are good this is a no-op in most cases.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-19 14:22                 ` Johannes Weiner
@ 2024-04-19 19:10                   ` Yosry Ahmed
  0 siblings, 0 replies; 17+ messages in thread
From: Yosry Ahmed @ 2024-04-19 19:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christian Heusel, Chengming Zhou, Nhat Pham, Seth Jennings,
	Dan Streetman, Vitaly Wool, Andrew Morton, linux-mm,
	linux-kernel, David Runge, Richard W.M. Jones, Mark W,
	regressions

On Fri, Apr 19, 2024 at 7:22 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Apr 18, 2024 at 01:09:22PM -0700, Yosry Ahmed wrote:
> > On Thu, Apr 18, 2024 at 5:40 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Apr 17, 2024 at 07:18:14PM +0200, Christian Heusel wrote:
> > > > On 24/04/17 10:33AM, Johannes Weiner wrote:
> > > > >
> > > > > Christian, can you please test the below patch on top of current
> > > > > upstream?
> > > > >
> > > >
> > > > Hey Johannes,
> > > >
> > > > I have applied your patch on top of 6.9-rc4 and it did solve the crash for
> > > > me, thanks for hacking together a fix so quickly! 🤗
> > > >
> > > > Tested-By: Christian Heusel <christian@heusel.eu>
> > >
> > > Thanks for confirming it, and sorry about the breakage.
> > >
> > > Andrew, can you please use the updated changelog below?
> > >
> > > ---
> > >
> > > From 52f67f5fab6a743c2aedfc8e04a582a9d1025c28 Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Date: Thu, 18 Apr 2024 08:26:28 -0400
> > > Subject: [PATCH] mm: zswap: fix shrinker NULL crash with cgroup_disable=memory
> > >
> > > Christian reports a NULL deref in zswap that he bisected down to the
> > > zswap shrinker. The issue also cropped up in the bug trackers of
> > > libguestfs [1] and the Red Hat bugzilla [2].
> > >
> > > The problem is that when memcg is disabled with the boot time flag,
> > > the zswap shrinker might get called with sc->memcg == NULL. This is
> > > okay in many places, like the lruvec operations. But it crashes in
> > > memcg_page_state() - which is only used due to the non-node accounting
> > > of cgroup's the zswap memory to begin with.
> > >
> > > Nhat spotted that the memcg can be NULL in the memcg-disabled case,
> > > and I was then able to reproduce the crash locally as well.
> > >
> > > [1] https://github.com/libguestfs/libguestfs/issues/139
> > > [2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252
> > >
> > > Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> > > Cc: stable@vger.kernel.org      [v6.8]
> > > Link: https://lkml.kernel.org/r/20240417143324.GA1055428@cmpxchg.org
> > > Reported-by: Christian Heusel <christian@heusel.eu>
> > > Debugged-by: Nhat Pham <nphamcs@gmail.com>
> > > Suggested-by: Nhat Pham <nphamcs@gmail.com>
> > > Tested-By: Christian Heusel <christian@heusel.eu>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > Thanks for fixing this. A couple of comments/questions below, but anyway LGTM:
> >
> > Acked-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > > ---
> > >  mm/zswap.c | 25 ++++++++++++++++---------
> > >  1 file changed, 16 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index caed028945b0..6f8850c44b61 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1331,15 +1331,22 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> > >         if (!gfp_has_io_fs(sc->gfp_mask))
> > >                 return 0;
> > >
> > > -#ifdef CONFIG_MEMCG_KMEM
> > > -       mem_cgroup_flush_stats(memcg);
> > > -       nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > > -       nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> > > -#else
> > > -       /* use pool stats instead of memcg stats */
> > > -       nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
> > > -       nr_stored = atomic_read(&zswap_nr_stored);
> > > -#endif
> > > +       /*
> > > +        * For memcg, use the cgroup-wide ZSWAP stats since we don't
> > > +        * have them per-node and thus per-lruvec. Careful if memcg is
> > > +        * runtime-disabled: we can get sc->memcg == NULL, which is ok
> > > +        * for the lruvec, but not for memcg_page_state().
> > > +        *
> > > +        * Without memcg, use the zswap pool-wide metrics.
> > > +        */
> > > +       if (!mem_cgroup_disabled()) {
> >
> > With the current shrinker code, it seems like we cannot get sc->memcg
> > == NULL unless mem_cgroup_disabled() is true indeed. However, maybe
> > it's better to check if sc->memcg is not NULL directly instead?
> >
> > This would be more resilient in case the shrinker code changes and
> > passing sc->memcg == NULL becomes possible in other cases (e.g. during
> > global shrinking). It seems like other shrinkers do this, for example
> > see count_shadow_nodes() and deferred_split_count().
>
> Eh, I'm not sure it's better or worse, so it's a bit hard to care. We
> shouldn't get NULL here when memcg is enabled, and if somebody
> introduces that bug it's better to catch it early than run into subtle
> priority inversions when the kernel is deployed to millions of hosts.

No strong opinion here, I just thought consistency with other
shrinkers would be nice.

>
> > I am also wondering if we should also check !mem_cgroup_is_root()
> > here? We can avoid the expensive global flush and use the global stats
> > directly in this case. I could also send a follow up patch for this if
> > that's preferred.
>
> I'd rather not proliferate more memcg internals in this code. If this
> is a concern, optimizing it in the flush and stat functions would make
> more sense. Reclaim already flushes the subtree before getting here,
> so odds are good this is a no-op in most cases.

I agree that with per-memcg thresholding, it is very unlikely that
this flush does anything.

FWIW, if it turns out to be a problem, optimizing it in the flush
functions would not be as straightforward, because the memcg code is
not aware that zswap has up-to-date global versions of the stats.

Anyway, let's table this until someone actually complains, just seemed
to me like a low hanging fruit.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Null pointer dereference while shrinking zswap
  2024-04-16 12:19 [REGRESSION] Null pointer dereference while shrinking zswap Christian Heusel
  2024-04-16 19:18 ` Andrew Morton
  2024-04-16 22:14 ` Nhat Pham
@ 2024-04-17  0:33 ` Nhat Pham
  2 siblings, 0 replies; 17+ messages in thread
From: Nhat Pham @ 2024-04-17  0:33 UTC (permalink / raw)
  To: Christian Heusel
  Cc: Seth Jennings, Dan Streetman, Vitaly Wool, Andrew Morton,
	linux-mm, linux-kernel, David Runge, Richard W.M. Jones, Mark W,
	regressions

On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <christian@heusel.eu> wrote:
>
> Hello everyone,
>
> while rebuilding a few packages in Arch Linux we have recently come
> across a regression in the linux kernel which was made visible by a test
> failure in libguestfs[0], where the booted kernel showed a Call Trace
> like the following one:
>
> [  218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
> [  218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> [  218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> [  218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> [  218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> [  218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> [  218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> [  218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> [  218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> [  218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> [  218.742376] FS:  00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> [  218.742569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> [  218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> [  218.743494] PKRU: 55555554^M
> [  218.743593] Call Trace:^M
> [  218.743733]  <TASK>^M
> [  218.743847]  ? __die+0x23/0x70^M
> [  218.743957]  ? page_fault_oops+0x171/0x4e0^M
> [  218.744056]  ? free_unref_page+0xf6/0x180^M
> [  218.744458]  ? exc_page_fault+0x7f/0x180^M
> [  218.744551]  ? asm_exc_page_fault+0x26/0x30^M
> [  218.744684]  ? memcg_page_state+0x9/0x30^M
> [  218.744779]  zswap_shrinker_count+0x9d/0x110^M
> [  218.744896]  do_shrink_slab+0x3a/0x360^M
> [  218.744990]  shrink_slab+0xc7/0x3c0^M
> [  218.745609]  drop_slab+0x85/0x140^M
> [  218.745691]  drop_caches_sysctl_handler+0x7e/0xd0^M
> [  218.745799]  proc_sys_call_handler+0x1c0/0x2e0^M
> [  218.745912]  vfs_write+0x23d/0x400^M
> [  218.745998]  ksys_write+0x6f/0xf0^M
> [  218.746080]  do_syscall_64+0x64/0xe0^M
> [  218.746169]  ? exit_to_user_mode_prepare+0x132/0x1f0^M
> [  218.746873]  entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
>
> The regression is present in the mainline kernel and also was
> independently reported to the redhat bugtracker[1].
>
> I have bisected (see log[2]) the regression between v6.9-rc4 and v6.6
> and have landed on the following results (removed unrelated test commit)
> as remainders since some of the commits were not buildable for me:
> - 7108cc3f765c ("mm: memcg: add per-memcg zswap writeback stat")
> - a65b0e7607cc ("zswap: make shrinking memcg-aware")
> - b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
>
> I have decided on good/bad commits with the relevant libguestfs tests,
> but I think the reproducer in the redhat bugzilla is simpler (although I
> only became aware of it during the bisection and therefore didn't test
> it myself):
>
>   LIBGUESTFS_MEMSIZE=4096 LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 make -C /build/libguestfs/src/libguestfs-1.52.0/tests -k check TESTS=c-api/tests
>

I have a suspect for the bug:
https://lore.kernel.org/all/CAKEwX=MWPUf1NMGdn+1AkRdOUf25ifAbPyoP9zppPTx3U3Tv2Q@mail.gmail.com/

Feel free to fact check me, but let me see if I can reproduce this bug
on my own setting based on this analysis and send a fix accordingly :)


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-04-19 19:10 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-16 12:19 [REGRESSION] Null pointer dereference while shrinking zswap Christian Heusel
2024-04-16 19:18 ` Andrew Morton
2024-04-16 19:57   ` Christian Heusel
2024-04-16 22:14 ` Nhat Pham
2024-04-16 22:29   ` Christian Heusel
2024-04-16 23:29   ` Nhat Pham
2024-04-17  0:22     ` Nhat Pham
2024-04-17  3:44       ` Chengming Zhou
2024-04-17 14:33         ` Johannes Weiner
2024-04-17 15:08           ` Richard W.M. Jones
2024-04-17 17:18           ` Christian Heusel
2024-04-18 12:40             ` Johannes Weiner
2024-04-18 14:25               ` Linux regression tracking (Thorsten Leemhuis)
2024-04-18 20:09               ` Yosry Ahmed
2024-04-19 14:22                 ` Johannes Weiner
2024-04-19 19:10                   ` Yosry Ahmed
2024-04-17  0:33 ` Nhat Pham

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox