linux-mm.kvack.org archive mirror
* next-20241001: WARNING: at mm/list_lru.c:77 list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
@ 2024-10-02 11:10 Naresh Kamboju
  2024-10-02 11:24 ` Dan Carpenter
  0 siblings, 1 reply; 8+ messages in thread
From: Naresh Kamboju @ 2024-10-02 11:10 UTC (permalink / raw)
  To: open list, lkft-triage, Linux Regressions, linux-mm
  Cc: Andrew Morton, Arnd Bergmann, Anders Roxell, Dan Carpenter

The following kernel warnings have been occurring on arm64 DUT and qemu-arm64
running Linux next-20240930, next-20241001 and next-20241002 while
booting the kernel.

This is an intermittent warning noticed on the following arm64 targets:
 - Juno-r2
 - Dragonboard-410c
 - Qemu-arm64

First seen on next-20240930

  Good: next-20240927
  BAD:  next-20240930..next-20241002

Since this is an intermittent problem, it is hard to bisect.

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>

Warning log:
----------
<4>[   26.293906] ------------[ cut here ]------------
<4>[ 26.295948] WARNING: CPU: 1 PID: 1 at mm/list_lru.c:77 list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
<4>[   26.299608] Modules linked in: fuse drm backlight ip_tables x_tables
<4>[   26.308212] CPU: 1 UID: 0 PID: 1 Comm: systemd Not tainted 6.12.0-rc1-next-20241001 #1
<4>[   26.310552] Hardware name: linux,dummy-virt (DT)
<4>[   26.313304] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
<4>[ 26.315519] pc : list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
<4>[ 26.316457] lr : list_lru_del (mm/list_lru.c:76 mm/list_lru.c:200)
<4>[   26.317603] sp : ffff80008002b950
<4>[   26.319015] x29: ffff80008002b950 x28: fff00000c0540240 x27: 0000000000000000
<4>[   26.321155] x26: fff00000c2dce690 x25: 8000000000000000 x24: 0000000000000000
<4>[   26.322653] x23: fff00000c0c4e900 x22: fff00000c12f4478 x21: fff00000c12f4458
<4>[   26.324697] x20: fff00000c1b14800 x19: fff00000c0542088 x18: 0000000000000000
<4>[   26.326121] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
<4>[   26.327590] x14: 0000000000000000 x13: fff00000c146b940 x12: 0000000000000005
<4>[   26.329087] x11: 0000000000000000 x10: 0000000000000402 x9 : 0000000000000003
<4>[   26.330650] x8 : ffffffffffffffff x7 : 0000000023d53570 x6 : 0000000023d53570
<4>[   26.332484] x5 : 00000000000f000c x4 : ffffc1ffc3032e20 x3 : fff00000c2f70800
<4>[   26.334759] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000001
<4>[   26.338095] Call trace:
<4>[ 26.339907] list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
<4>[ 26.340990] list_lru_del_obj (mm/list_lru.c:221)
<4>[ 26.341972] d_lru_del (fs/dcache.c:463)
<4>[ 26.342794] to_shrink_list (fs/dcache.c:477 fs/dcache.c:887)
<4>[ 26.343615] select_collect (fs/dcache.c:0)
<4>[ 26.344524] d_walk (fs/dcache.c:1278)
<4>[ 26.345384] shrink_dcache_parent (include/linux/list.h:373 fs/dcache.c:1511)
<4>[ 26.346512] d_invalidate (fs/dcache.c:1617)
<4>[ 26.347451] proc_invalidate_siblings_dcache (fs/proc/inode.c:143)
<4>[ 26.348744] proc_flush_pid (fs/proc/base.c:3480)
<4>[ 26.349747] release_task (kernel/exit.c:281)
<4>[ 26.350810] wait_consider_task (kernel/exit.c:1253 kernel/exit.c:1477)
<4>[ 26.352093] __do_wait (kernel/exit.c:1617 kernel/exit.c:1651)
<4>[ 26.353151] do_wait (kernel/exit.c:1693)
<4>[ 26.353958] __arm64_sys_waitid (kernel/exit.c:1775 kernel/exit.c:1788 kernel/exit.c:1783 kernel/exit.c:1783)
<4>[ 26.359772] invoke_syscall (arch/arm64/kernel/syscall.c:50)
<4>[ 26.360706] el0_svc_common (include/linux/thread_info.h:127 arch/arm64/kernel/syscall.c:140)
<4>[ 26.361477] do_el0_svc (arch/arm64/kernel/syscall.c:152)
<4>[ 26.362218] el0_svc (arch/arm64/kernel/entry-common.c:165 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713)
<4>[ 26.363014] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:765)
<4>[ 26.364138] el0t_64_sync (arch/arm64/kernel/entry.S:598)
<4>[   26.365321] ---[ end trace 0000000000000000 ]---

Boot log links:
--------
  - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241001/testrun/25235075/suite/log-parser-boot/test/check-kernel-exception-warning-cpu-pid-at-mmlist_lruc-list_lru_del/log
  - https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2mp2m5m4PnjJgdix32h7pIGe63Y/logs?format=html

Test results history:
----------
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241002/testrun/25242215/suite/log-parser-boot/test/check-kernel-exception-warning-cpu-pid-at-mmlist_lruc-list_lru_del/history/

metadata:
----
  git describe: next-20241001
  git repo: https://gitlab.com/Linaro/lkft/mirrors/next/linux-next
  git sha: 77df9e4bb2224d8ffbddec04c333a9d7965dad6c
  kernel config: https://storage.tuxsuite.com/public/linaro/lkft/builds/2mp2jhmSKhlF6c0x1SBsJFyBbTq/config
  build url: https://storage.tuxsuite.com/public/linaro/lkft/builds/2mp2jhmSKhlF6c0x1SBsJFyBbTq/
  toolchain: clang-19 and gcc-13

Steps to reproduce:
---------
- https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2mp2m5m4PnjJgdix32h7pIGe63Y/reproducer
- https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2mp2m5m4PnjJgdix32h7pIGe63Y/tux_plan

--
Linaro LKFT
https://lkft.linaro.org


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: next-20241001: WARNING: at mm/list_lru.c:77 list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
  2024-10-02 11:10 next-20241001: WARNING: at mm/list_lru.c:77 list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200) Naresh Kamboju
@ 2024-10-02 11:24 ` Dan Carpenter
  2024-10-02 11:25   ` Dan Carpenter
  0 siblings, 1 reply; 8+ messages in thread
From: Dan Carpenter @ 2024-10-02 11:24 UTC (permalink / raw)
  To: Naresh Kamboju, Kairui Song
  Cc: open list, lkft-triage, Linux Regressions, linux-mm,
	Andrew Morton, Arnd Bergmann, Anders Roxell

[-- Attachment #1: Type: text/plain, Size: 5269 bytes --]

Let's add Kairui Song to the CC list.

One simple thing is that we should add a READ_ONCE() to the comparison.  Naresh,
could you test the attached diff?  I don't know that it will fix it but it's
worth checking the easy stuff first.

regards,
dan carpenter

On Wed, Oct 02, 2024 at 04:40:36PM +0530, Naresh Kamboju wrote:
> The following kernel warnings have been occurring on arm64 DUT and qemu-arm64
> running Linux next-20240930, next-20241001 and next-20241002 while
> booting the kernel.
> 
> This is an intermittent warning noticed on arm64
>  - Juno-r2
>  - Dragonboard-410c
>  - Qemu-arm64
> 
> First seen on next-20240930
> 
>   Good: next-20240927
>   BAD:  next-20240930..next-20241002
> 
> Since this is an intermittent problem hard to bisect.
> 
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> 
> Warning log:
> ----------
> <4>[   26.293906] ------------[ cut here ]------------
> <4>[ 26.295948] WARNING: CPU: 1 PID: 1 at mm/list_lru.c:77
> list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
> <4>[   26.299608] Modules linked in: fuse drm backlight ip_tables x_tables
> <4>[   26.308212] CPU: 1 UID: 0 PID: 1 Comm: systemd Not tainted
> 6.12.0-rc1-next-20241001 #1
> <4>[   26.310552] Hardware name: linux,dummy-virt (DT)
> <4>[   26.313304] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT
> -SSBS BTYPE=--)
> <4>[ 26.315519] pc : list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
> <4>[ 26.316457] lr : list_lru_del (mm/list_lru.c:76 mm/list_lru.c:200)
> <4>[   26.317603] sp : ffff80008002b950
> <4>[   26.319015] x29: ffff80008002b950 x28: fff00000c0540240 x27:
> 0000000000000000
> <4>[   26.321155] x26: fff00000c2dce690 x25: 8000000000000000 x24:
> 0000000000000000
> <4>[   26.322653] x23: fff00000c0c4e900 x22: fff00000c12f4478 x21:
> fff00000c12f4458
> <4>[   26.324697] x20: fff00000c1b14800 x19: fff00000c0542088 x18:
> 0000000000000000
> <4>[   26.326121] x17: 0000000000000000 x16: 0000000000000000 x15:
> 0000000000000000
> <4>[   26.327590] x14: 0000000000000000 x13: fff00000c146b940 x12:
> 0000000000000005
> <4>[   26.329087] x11: 0000000000000000 x10: 0000000000000402 x9 :
> 0000000000000003
> <4>[   26.330650] x8 : ffffffffffffffff x7 : 0000000023d53570 x6 :
> 0000000023d53570
> <4>[   26.332484] x5 : 00000000000f000c x4 : ffffc1ffc3032e20 x3 :
> fff00000c2f70800
> <4>[   26.334759] x2 : 0000000000000000 x1 : 0000000000000000 x0 :
> 0000000000000001
> <4>[   26.338095] Call trace:
> <4>[ 26.339907] list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
> <4>[ 26.340990] list_lru_del_obj (mm/list_lru.c:221)
> <4>[ 26.341972] d_lru_del (fs/dcache.c:463)
> <4>[ 26.342794] to_shrink_list (fs/dcache.c:477 fs/dcache.c:887)
> <4>[ 26.343615] select_collect (fs/dcache.c:0)
> <4>[ 26.344524] d_walk (fs/dcache.c:1278)
> <4>[ 26.345384] shrink_dcache_parent (include/linux/list.h:373 fs/dcache.c:1511)
> <4>[ 26.346512] d_invalidate (fs/dcache.c:1617)
> <4>[ 26.347451] proc_invalidate_siblings_dcache (fs/proc/inode.c:143)
> <4>[ 26.348744] proc_flush_pid (fs/proc/base.c:3480)
> <4>[ 26.349747] release_task (kernel/exit.c:281)
> <4>[ 26.350810] wait_consider_task (kernel/exit.c:1253 kernel/exit.c:1477)
> <4>[ 26.352093] __do_wait (kernel/exit.c:1617 kernel/exit.c:1651)
> <4>[ 26.353151] do_wait (kernel/exit.c:1693)
> <4>[ 26.353958] __arm64_sys_waitid (kernel/exit.c:1775
> kernel/exit.c:1788 kernel/exit.c:1783 kernel/exit.c:1783)
> <4>[ 26.359772] invoke_syscall (arch/arm64/kernel/syscall.c:50)
> <4>[ 26.360706] el0_svc_common (include/linux/thread_info.h:127
> arch/arm64/kernel/syscall.c:140)
> <4>[ 26.361477] do_el0_svc (arch/arm64/kernel/syscall.c:152)
> <4>[ 26.362218] el0_svc (arch/arm64/kernel/entry-common.c:165
> arch/arm64/kernel/entry-common.c:178
> arch/arm64/kernel/entry-common.c:713)
> <4>[ 26.363014] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:765)
> <4>[ 26.364138] el0t_64_sync (arch/arm64/kernel/entry.S:598)
> <4>[   26.365321] ---[ end trace 0000000000000000 ]---
> 
> boot Log links,
> --------
>   - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241001/testrun/25235075/suite/log-parser-boot/test/check-kernel-exception-warning-cpu-pid-at-mmlist_lruc-list_lru_del/log
>   - https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2mp2m5m4PnjJgdix32h7pIGe63Y/logs?format=html
> 
> Test results history:
> ----------
> - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241002/testrun/25242215/suite/log-parser-boot/test/check-kernel-exception-warning-cpu-pid-at-mmlist_lruc-list_lru_del/history/
> 
> metadata:
> ----
>   git describe: next-20241001
>   git repo: https://gitlab.com/Linaro/lkft/mirrors/next/linux-next
>   git sha: 77df9e4bb2224d8ffbddec04c333a9d7965dad6c
>   kernel config:
> - https://storage.tuxsuite.com/public/linaro/lkft/builds/2mp2jhmSKhlF6c0x1SBsJFyBbTq/config
>   build url: https://storage.tuxsuite.com/public/linaro/lkft/builds/2mp2jhmSKhlF6c0x1SBsJFyBbTq/
>   toolchain: clang-19 and gcc-13
> 
> Steps to reproduce:
> ---------
> - https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2mp2m5m4PnjJgdix32h7pIGe63Y/reproducer
> - https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2mp2m5m4PnjJgdix32h7pIGe63Y/tux_plan
> 
> --
> Linaro LKFT
> https://lkft.linaro.org

[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 420 bytes --]

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 79c2d21504a2..a9a8b02e056a 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -74,7 +74,7 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 		else
 			spin_lock(&l->lock);
 		if (likely(READ_ONCE(l->nr_items) != LONG_MIN)) {
-			WARN_ON(l->nr_items < 0);
+			WARN_ON(READ_ONCE(l->nr_items) < 0);
 			rcu_read_unlock();
 			return l;
 		}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: next-20241001: WARNING: at mm/list_lru.c:77 list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
  2024-10-02 11:24 ` Dan Carpenter
@ 2024-10-02 11:25   ` Dan Carpenter
  2024-10-02 11:28     ` Dan Carpenter
  0 siblings, 1 reply; 8+ messages in thread
From: Dan Carpenter @ 2024-10-02 11:25 UTC (permalink / raw)
  To: Naresh Kamboju, Kairui Song
  Cc: open list, lkft-triage, Linux Regressions, linux-mm,
	Andrew Morton, Arnd Bergmann, Anders Roxell

On Wed, Oct 02, 2024 at 02:24:20PM +0300, Dan Carpenter wrote:
> Let's add Kairui Song to the  CC list.
> 
> One simple thing is that we should add a READ_ONCE() to the comparison.  Naresh,
> could you test the attached diff?  I don't know that it will fix it but it's
> worth checking the easy stuff first.
> 

Actually that's not right.  Let me write a different patch.

regards,
dan carpenter



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: next-20241001: WARNING: at mm/list_lru.c:77 list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
  2024-10-02 11:25   ` Dan Carpenter
@ 2024-10-02 11:28     ` Dan Carpenter
  2024-10-02 18:58       ` Kairui Song
  2024-10-03  5:38       ` Naresh Kamboju
  0 siblings, 2 replies; 8+ messages in thread
From: Dan Carpenter @ 2024-10-02 11:28 UTC (permalink / raw)
  To: Naresh Kamboju, Kairui Song
  Cc: open list, lkft-triage, Linux Regressions, linux-mm,
	Andrew Morton, Arnd Bergmann, Anders Roxell

On Wed, Oct 02, 2024 at 02:25:34PM +0300, Dan Carpenter wrote:
> On Wed, Oct 02, 2024 at 02:24:20PM +0300, Dan Carpenter wrote:
> > Let's add Kairui Song to the  CC list.
> > 
> > One simple thing is that we should add a READ_ONCE() to the comparison.  Naresh,
> > could you test the attached diff?  I don't know that it will fix it but it's
> > worth checking the easy stuff first.
> > 
> 
> Actually that's not right.  Let me write a different patch.

Try this one.

regards,
dan carpenter

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 79c2d21504a2..2c429578ed31 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -65,6 +65,7 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 		       bool irq, bool skip_empty)
 {
 	struct list_lru_one *l;
+	long nr_items;
 	rcu_read_lock();
 again:
 	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
@@ -73,8 +74,9 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 			spin_lock_irq(&l->lock);
 		else
 			spin_lock(&l->lock);
-		if (likely(READ_ONCE(l->nr_items) != LONG_MIN)) {
-			WARN_ON(l->nr_items < 0);
+		nr_items = READ_ONCE(l->nr_items);
+		if (likely(nr_items != LONG_MIN)) {
+			WARN_ON(nr_items < 0);
 			rcu_read_unlock();
 			return l;
 		}


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: next-20241001: WARNING: at mm/list_lru.c:77 list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
  2024-10-02 11:28     ` Dan Carpenter
@ 2024-10-02 18:58       ` Kairui Song
  2024-10-09 16:51         ` Dan Carpenter
  2024-10-03  5:38       ` Naresh Kamboju
  1 sibling, 1 reply; 8+ messages in thread
From: Kairui Song @ 2024-10-02 18:58 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Naresh Kamboju, open list, lkft-triage, Linux Regressions,
	linux-mm, Andrew Morton, Arnd Bergmann, Anders Roxell

On Wed, Oct 2, 2024 at 7:28 PM Dan Carpenter <dan.carpenter@linaro.org> wrote:
>
> On Wed, Oct 02, 2024 at 02:25:34PM +0300, Dan Carpenter wrote:
> > On Wed, Oct 02, 2024 at 02:24:20PM +0300, Dan Carpenter wrote:
> > > Let's add Kairui Song to the  CC list.
> > >
> > > One simple thing is that we should add a READ_ONCE() to the comparison.  Naresh,
> > > could you test the attached diff?  I don't know that it will fix it but it's
> > > worth checking the easy stuff first.
> > >
> >
> > Actually that's not right.  Let me write a different patch.
>
> Try this one.
>
> regards,
> dan carpenter
>
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 79c2d21504a2..2c429578ed31 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -65,6 +65,7 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
>                        bool irq, bool skip_empty)
>  {
>         struct list_lru_one *l;
> +       long nr_items;
>         rcu_read_lock();
>  again:
>         l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> @@ -73,8 +74,9 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
>                         spin_lock_irq(&l->lock);
>                 else
>                         spin_lock(&l->lock);
> -               if (likely(READ_ONCE(l->nr_items) != LONG_MIN)) {
> -                       WARN_ON(l->nr_items < 0);
> +               nr_items = READ_ONCE(l->nr_items);
> +               if (likely(nr_items != LONG_MIN)) {
> +                       WARN_ON(nr_items < 0);
>                         rcu_read_unlock();
>                         return l;
>                 }
>

Thanks. The warning is a newly added sanity check, and I'm not sure
whether this WARN_ON was triggered by an existing list_lru leak or by a
new issue.

Unfortunately, so far I can't reproduce it locally on my ARM machine,
although according to the description it should be easily reproducible.
If the WARN only triggered once, and only during boot, maybe some static
data wasn't initialized correctly? Or maybe the enablement of memcg
caused a list_lru leak (mem_cgroup_from_slab_obj changed from returning
NULL to returning the actual memcg, so an item added to rootcg earlier
would then be removed from the actual memcg, which looks like a real
race). If it's the latter case, then it's an existing issue caught by
the new sanity check.
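
To make the second hypothesis concrete, here is a minimal userspace
sketch. It is not kernel code: struct lru_one_sketch and the sketch_*
helpers are made-up names for illustration, and the real list_lru code
does more than bump a counter. It only shows how adding an item to one
list and deleting it from another would drive the second list's
nr_items negative while the first list keeps a stale count.

```c
/* Hypothetical stand-in for a per-memcg list; only the counter is modelled. */
struct lru_one_sketch {
	long nr_items;
};

/* Model of "add an item": the owning list's counter goes up. */
static void sketch_add(struct lru_one_sketch *l)
{
	l->nr_items++;
}

/* Model of "delete an item": the counter of the list we THINK owns it goes down. */
static void sketch_del(struct lru_one_sketch *l)
{
	l->nr_items--;
}

/*
 * The item is added while the lookup resolves to the root list, then
 * deleted after the lookup starts resolving to the memcg list.
 */
static void sketch_mismatched_del(struct lru_one_sketch *root,
				  struct lru_one_sketch *memcg)
{
	sketch_add(root);   /* accounted to root */
	sketch_del(memcg);  /* unaccounted from memcg */
}
```

Starting from two empty lists, the memcg counter ends at -1 (which a
check like WARN_ON(nr_items < 0) would flag) and the root counter stays
at 1, i.e. a leaked count.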

The READ_ONCE patch may be worth trying, I'll also try to do more
debugging on this and try to send a fix later.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: next-20241001: WARNING: at mm/list_lru.c:77 list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
  2024-10-02 11:28     ` Dan Carpenter
  2024-10-02 18:58       ` Kairui Song
@ 2024-10-03  5:38       ` Naresh Kamboju
  1 sibling, 0 replies; 8+ messages in thread
From: Naresh Kamboju @ 2024-10-03  5:38 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Kairui Song, open list, lkft-triage, Linux Regressions, linux-mm,
	Andrew Morton, Arnd Bergmann, Anders Roxell

On Wed, 2 Oct 2024 at 16:58, Dan Carpenter <dan.carpenter@linaro.org> wrote:
>
> On Wed, Oct 02, 2024 at 02:25:34PM +0300, Dan Carpenter wrote:
> > On Wed, Oct 02, 2024 at 02:24:20PM +0300, Dan Carpenter wrote:
> > > Let's add Kairui Song to the  CC list.
> > >
> > > One simple thing is that we should add a READ_ONCE() to the comparison.  Naresh,
> > > could you test the attached diff?  I don't know that it will fix it but it's
> > > worth checking the easy stuff first.
> > >
> >
> > Actually that's not right.  Let me write a different patch.
>
> Try this one.
>

Thanks for the patch,

I have applied this patch and testing is in progress.
Since last night, the tests running in a loop have not hit the reported warning.


> regards,
> dan carpenter
>
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 79c2d21504a2..2c429578ed31 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -65,6 +65,7 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
>                        bool irq, bool skip_empty)
>  {
>         struct list_lru_one *l;
> +       long nr_items;
>         rcu_read_lock();
>  again:
>         l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> @@ -73,8 +74,9 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
>                         spin_lock_irq(&l->lock);
>                 else
>                         spin_lock(&l->lock);
> -               if (likely(READ_ONCE(l->nr_items) != LONG_MIN)) {
> -                       WARN_ON(l->nr_items < 0);
> +               nr_items = READ_ONCE(l->nr_items);
> +               if (likely(nr_items != LONG_MIN)) {
> +                       WARN_ON(nr_items < 0);
>                         rcu_read_unlock();
>                         return l;
>                 }

- Naresh


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: next-20241001: WARNING: at mm/list_lru.c:77 list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
  2024-10-02 18:58       ` Kairui Song
@ 2024-10-09 16:51         ` Dan Carpenter
  2024-10-09 18:02           ` Kairui Song
  0 siblings, 1 reply; 8+ messages in thread
From: Dan Carpenter @ 2024-10-09 16:51 UTC (permalink / raw)
  To: Kairui Song
  Cc: Naresh Kamboju, open list, lkft-triage, Linux Regressions,
	linux-mm, Andrew Morton, Arnd Bergmann, Anders Roxell

On Thu, Oct 03, 2024 at 02:58:19AM +0800, Kairui Song wrote:
> On Wed, Oct 2, 2024 at 7:28 PM Dan Carpenter <dan.carpenter@linaro.org> wrote:
> >
> > On Wed, Oct 02, 2024 at 02:25:34PM +0300, Dan Carpenter wrote:
> > > On Wed, Oct 02, 2024 at 02:24:20PM +0300, Dan Carpenter wrote:
> > > > Let's add Kairui Song to the  CC list.
> > > >
> > > > One simple thing is that we should add a READ_ONCE() to the comparison.  Naresh,
> > > > could you test the attached diff?  I don't know that it will fix it but it's
> > > > worth checking the easy stuff first.
> > > >
> > >
> > > Actually that's not right.  Let me write a different patch.
> >
> > Try this one.
> >
> > regards,
> > dan carpenter
> >
> > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > index 79c2d21504a2..2c429578ed31 100644
> > --- a/mm/list_lru.c
> > +++ b/mm/list_lru.c
> > @@ -65,6 +65,7 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
> >                        bool irq, bool skip_empty)
> >  {
> >         struct list_lru_one *l;
> > +       long nr_items;
> >         rcu_read_lock();
> >  again:
> >         l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> > @@ -73,8 +74,9 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
> >                         spin_lock_irq(&l->lock);
> >                 else
> >                         spin_lock(&l->lock);
> > -               if (likely(READ_ONCE(l->nr_items) != LONG_MIN)) {
> > -                       WARN_ON(l->nr_items < 0);
> > +               nr_items = READ_ONCE(l->nr_items);
> > +               if (likely(nr_items != LONG_MIN)) {
> > +                       WARN_ON(nr_items < 0);
> >                         rcu_read_unlock();
> >                         return l;
> >                 }
> >
> 
> Thanks. The warning is a newly added sanity check, I'm not sure if this
> WARN_ON triggered by an existing list_lru leak or if it's a new issue.
> 
> And unfortunately so far I can't reproduce it locally on my ARM
> machine, it should be easily reproducible according to the
> description. And if the WARN only triggered once, and only during
> boot, maybe some static data wasn't initialized correctly?

I have a config where it printed twice and the second time wasn't during boot.

https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241009/testrun/25363339/suite/boot/test/gcc-13-lkftconfig-rcutorture/log

> Or the enablement of memcg caused some list_lru leak
> (mem_cgroup_from_slab_obj changed from returning NULL to returning
> actual memcg, so an item added to rootcg before will then be removed
> from actual memcg, seems a real race). If it's the latter case, then
> it's an existing issue caught by the new sanity check.
> 
> The READ_ONCE patch may be worth trying, I'll also try to do more
> debugging on this and try to send a fix later.

The READ_ONCE() patch *seemed* to work, but the bug is intermittent so maybe it
just changed the timing or something.  Still, I feel from a correctness
perspective the READ_ONCE() thing is probably correct, right?

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: next-20241001: WARNING: at mm/list_lru.c:77 list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200)
  2024-10-09 16:51         ` Dan Carpenter
@ 2024-10-09 18:02           ` Kairui Song
  0 siblings, 0 replies; 8+ messages in thread
From: Kairui Song @ 2024-10-09 18:02 UTC (permalink / raw)
  To: Dan Carpenter, Andrew Morton
  Cc: Naresh Kamboju, open list, lkft-triage, Linux Regressions,
	linux-mm, Arnd Bergmann, Anders Roxell

On Thu, Oct 10, 2024 at 12:51 AM Dan Carpenter <dan.carpenter@linaro.org> wrote:
>
> On Thu, Oct 03, 2024 at 02:58:19AM +0800, Kairui Song wrote:
> > On Wed, Oct 2, 2024 at 7:28 PM Dan Carpenter <dan.carpenter@linaro.org> wrote:
> > >
> > > On Wed, Oct 02, 2024 at 02:25:34PM +0300, Dan Carpenter wrote:
> > > > On Wed, Oct 02, 2024 at 02:24:20PM +0300, Dan Carpenter wrote:
> > > > > Let's add Kairui Song to the  CC list.
> > > > >
> > > > > One simple thing is that we should add a READ_ONCE() to the comparison.  Naresh,
> > > > > could you test the attached diff?  I don't know that it will fix it but it's
> > > > > worth checking the easy stuff first.
> > > > >
> > > >
> > > > Actually that's not right.  Let me write a different patch.
> > >
> > > Try this one.
> > >
> > > regards,
> > > dan carpenter
> > >
> > > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > > index 79c2d21504a2..2c429578ed31 100644
> > > --- a/mm/list_lru.c
> > > +++ b/mm/list_lru.c
> > > @@ -65,6 +65,7 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
> > >                        bool irq, bool skip_empty)
> > >  {
> > >         struct list_lru_one *l;
> > > +       long nr_items;
> > >         rcu_read_lock();
> > >  again:
> > >         l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> > > @@ -73,8 +74,9 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
> > >                         spin_lock_irq(&l->lock);
> > >                 else
> > >                         spin_lock(&l->lock);
> > > -               if (likely(READ_ONCE(l->nr_items) != LONG_MIN)) {
> > > -                       WARN_ON(l->nr_items < 0);
> > > +               nr_items = READ_ONCE(l->nr_items);
> > > +               if (likely(nr_items != LONG_MIN)) {
> > > +                       WARN_ON(nr_items < 0);
> > >                         rcu_read_unlock();
> > >                         return l;
> > >                 }
> > >
> >
> > Thanks. The warning is a newly added sanity check, I'm not sure if this
> > WARN_ON triggered by an existing list_lru leak or if it's a new issue.
> >
> > And unfortunately so far I can't reproduce it locally on my ARM
> > machine, it should be easily reproducible according to the
> > description. And if the WARN only triggered once, and only during
> > boot, maybe some static data wasn't initialized correctly?
>
> I have a config where it printed twice and the second time wasn't during boot.
>
> https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241009/testrun/25363339/suite/boot/test/gcc-13-lkftconfig-rcutorture/log
>
> > Or the enablement of memcg caused some list_lru leak
> > (mem_cgroup_from_slab_obj changed from returning NULL to returning
> > actual memcg, so an item added to rootcg before will then be removed
> > from actual memcg, seems a real race). If it's the latter case, then
> > it's an existing issue caught by the new sanity check.
> >
> > The READ_ONCE patch may be worth trying, I'll also try to do more
> > debugging on this and try to send a fix later.
>
> The READ_ONCE() patch *seemed* to work, but the bug is intermittent so maybe it
> just changed the timing or something.  Still, I feel from a correctness
> perspective the READ_ONCE() thing is probably correct, right?
>

Yes, the READ_ONCE fix is absolutely correct.

I'm not sure whether it's possible in theory for the compiler or CPU to
use an old value for the `WARN` but a newly read value for the `if` above.
If it is possible, this READ_ONCE will prevent it from happening.

I think we should just merge the READ_ONCE fix, and see if any more
tests expose this issue again.
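
For illustration, the single-snapshot pattern from the patch can be
sketched in userspace C roughly as follows. This is only a sketch under
assumptions: READ_ONCE_SKETCH is a simplified stand-in for the kernel's
READ_ONCE() (it relies on the GCC/Clang __typeof__ extension), and the
struct, enum, and function names are hypothetical. LONG_MIN is treated
simply as "the sentinel value from the diff".

```c
#include <limits.h>

/*
 * Simplified stand-in for the kernel's READ_ONCE(): the volatile cast
 * makes the compiler emit exactly one load for this expression.
 */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for the list; only the counter is modelled. */
struct lru_one_sketch {
	long nr_items;
};

enum sketch_result {
	SKETCH_SENTINEL, /* counter equals LONG_MIN */
	SKETCH_NEGATIVE, /* counter is negative but not the sentinel */
	SKETCH_OK,       /* counter is zero or positive */
};

/*
 * Load nr_items once into a local, then run both comparisons against
 * that local. Both checks are then guaranteed to see the same value,
 * which is what the second version of the patch does.
 */
static enum sketch_result sketch_check(struct lru_one_sketch *l)
{
	long nr_items = READ_ONCE_SKETCH(l->nr_items);

	if (nr_items == LONG_MIN)
		return SKETCH_SENTINEL;
	if (nr_items < 0)
		return SKETCH_NEGATIVE;
	return SKETCH_OK;
}
```

With two separate reads of l->nr_items, the sentinel comparison and the
negative check could in principle observe different values; with one
snapshot they cannot.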


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-10-09 18:03 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-02 11:10 next-20241001: WARNING: at mm/list_lru.c:77 list_lru_del (mm/list_lru.c:212 mm/list_lru.c:200) Naresh Kamboju
2024-10-02 11:24 ` Dan Carpenter
2024-10-02 11:25   ` Dan Carpenter
2024-10-02 11:28     ` Dan Carpenter
2024-10-02 18:58       ` Kairui Song
2024-10-09 16:51         ` Dan Carpenter
2024-10-09 18:02           ` Kairui Song
2024-10-03  5:38       ` Naresh Kamboju
