* WARNING: CPU: 0 PID: 11655 at mm/page_counter.c:62

From: Andrei Vagin @ 2019-06-19 2:08 UTC
To: Roman Gushchin, linux-mm

Hello,

We run CRIU tests on linux-next kernels, and today we found this
warning in the kernel log:

[ 381.345960] WARNING: CPU: 0 PID: 11655 at mm/page_counter.c:62 page_counter_cancel+0x26/0x30
[ 381.345992] Modules linked in:
[ 381.345998] CPU: 0 PID: 11655 Comm: kworker/0:8 Not tainted 5.2.0-rc5-next-20190618+ #1
[ 381.346001] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 381.346010] Workqueue: memcg_kmem_cache kmemcg_workfn
[ 381.346013] RIP: 0010:page_counter_cancel+0x26/0x30
[ 381.346017] Code: 1f 44 00 00 0f 1f 44 00 00 48 89 f0 53 48 f7 d8 f0 48 0f c1 07 48 29 f0 48 89 c3 48 89 c6 e8 61 ff ff ff 48 85 db 78 02 5b c3 <0f> 0b 5b c3 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 ff 74 41 41 55
[ 381.346019] RSP: 0018:ffffb3b34319f990 EFLAGS: 00010086
[ 381.346022] RAX: fffffffffffffffc RBX: fffffffffffffffc RCX: 0000000000000004
[ 381.346024] RDX: 0000000000000000 RSI: fffffffffffffffc RDI: ffff9c2cd7165270
[ 381.346026] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000001
[ 381.346028] R10: 00000000000000c8 R11: ffff9c2cd684e660 R12: 00000000fffffffc
[ 381.346030] R13: 0000000000000002 R14: 0000000000000006 R15: ffff9c2c8ce1f200
[ 381.346033] FS: 0000000000000000(0000) GS:ffff9c2cd8200000(0000) knlGS:0000000000000000
[ 381.346039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 381.346041] CR2: 00000000007be000 CR3: 00000001cdbfc005 CR4: 00000000001606f0
[ 381.346043] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 381.346045] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 381.346047] Call Trace:
[ 381.346054] page_counter_uncharge+0x1d/0x30
[ 381.346065] __memcg_kmem_uncharge_memcg+0x39/0x60
[ 381.346071] __free_slab+0x34c/0x460
[ 381.346079] deactivate_slab.isra.80+0x57d/0x6d0
[ 381.346088] ? add_lock_to_list.isra.36+0x9c/0xf0
[ 381.346095] ? __lock_acquire+0x252/0x1410
[ 381.346106] ? cpumask_next_and+0x19/0x20
[ 381.346110] ? slub_cpu_dead+0xd0/0xd0
[ 381.346113] flush_cpu_slab+0x36/0x50
[ 381.346117] ? slub_cpu_dead+0xd0/0xd0
[ 381.346125] on_each_cpu_mask+0x51/0x70
[ 381.346131] ? ksm_migrate_page+0x60/0x60
[ 381.346134] on_each_cpu_cond_mask+0xab/0x100
[ 381.346143] __kmem_cache_shrink+0x56/0x320
[ 381.346150] ? ret_from_fork+0x3a/0x50
[ 381.346157] ? unwind_next_frame+0x73/0x480
[ 381.346176] ? __lock_acquire+0x252/0x1410
[ 381.346188] ? kmemcg_workfn+0x21/0x50
[ 381.346196] ? __mutex_lock+0x99/0x920
[ 381.346199] ? kmemcg_workfn+0x21/0x50
[ 381.346205] ? kmemcg_workfn+0x21/0x50
[ 381.346216] __kmemcg_cache_deactivate_after_rcu+0xe/0x40
[ 381.346220] kmemcg_cache_deactivate_after_rcu+0xe/0x20
[ 381.346223] kmemcg_workfn+0x31/0x50
[ 381.346230] process_one_work+0x23c/0x5e0
[ 381.346241] worker_thread+0x3c/0x390
[ 381.346248] ? process_one_work+0x5e0/0x5e0
[ 381.346252] kthread+0x11d/0x140
[ 381.346255] ? kthread_create_on_node+0x60/0x60
[ 381.346261] ret_from_fork+0x3a/0x50
[ 381.346275] irq event stamp: 10302
[ 381.346278] hardirqs last enabled at (10301): [<ffffffffb2c1a0b9>] _raw_spin_unlock_irq+0x29/0x40
[ 381.346282] hardirqs last disabled at (10302): [<ffffffffb2182289>] on_each_cpu_mask+0x49/0x70
[ 381.346287] softirqs last enabled at (10262): [<ffffffffb2191f4a>] cgroup_idr_replace+0x3a/0x50
[ 381.346290] softirqs last disabled at (10260): [<ffffffffb2191f2d>] cgroup_idr_replace+0x1d/0x50
[ 381.346293] ---[ end trace b324ba73eb3659f0 ]---

All logs are here:
https://travis-ci.org/avagin/linux/builds/546601278

The problem is probably in the "[PATCH v7 00/10] mm: reparent slab
memory on cgroup removal" series.

Thanks,
Andrei
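[Note: the check firing at mm/page_counter.c:62 is the underflow guard in
page_counter_cancel(). In v5.2-era sources it reads approximately as below
(paraphrased from memory; verify against the tree). RAX/RBX =
fffffffffffffffc in the register dump above is the new counter value as a
signed long, i.e. -4 pages: one order-2 slab page uncharged past zero.

    void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
    {
            long new;

            /* drop nr_pages from this level of the hierarchy only */
            new = atomic_long_sub_return(nr_pages, &counter->usage);
            propagate_protected_usage(counter, new);
            /* more uncharges than charges? this is the WARN seen above */
            WARN_ON_ONCE(new < 0);
    }
]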
* Re: WARNING: CPU: 0 PID: 11655 at mm/page_counter.c:62

From: Roman Gushchin @ 2019-06-19 3:41 UTC
To: Andrei Vagin; +Cc: linux-mm

Hi Andrei!

Thank you for the report!

I guess the problem is caused by a race between drain_all_stock() in
mem_cgroup_css_offline() and kmem_cache reparenting: some portion of the
charge isn't propagated to the parent level in time, causing the
imbalance. If so, it's not a huge problem, but definitely something to
fix.

I'm on PTO and traveling this week without a reliable internet
connection, so I will send out a fix on Sunday or early next week.

Thanks!

Sent from my iPhone

> On Jun 18, 2019, at 19:08, Andrei Vagin <avagin@gmail.com> wrote:
>
> Hello,
>
> We run CRIU tests on linux-next kernels, and today we found this
> warning in the kernel log:
>
> [ 381.345960] WARNING: CPU: 0 PID: 11655 at mm/page_counter.c:62
> page_counter_cancel+0x26/0x30
>
> [... full register dump and call trace trimmed; see the report above ...]
>
> All logs are here:
> https://travis-ci.org/avagin/linux/builds/546601278
>
> The problem is probably in the "[PATCH v7 00/10] mm: reparent slab
> memory on cgroup removal" series.
>
> Thanks,
> Andrei
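[Note: the imbalance described above can be illustrated with a toy
userspace model (all names hypothetical; this is not kernel code): if the
charge side skips a given level of the hierarchy but, after reparenting,
an uncharge lands on that level anyway, its counter underflows -- matching
the -4 in the register dump above.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct counter {
            atomic_long usage;
            bool is_root;
    };

    static void charge(struct counter *c, long pages)
    {
            if (c->is_root)         /* kmem charging skips the root memcg */
                    return;
            atomic_fetch_add(&c->usage, pages);
    }

    static void uncharge(struct counter *c, long pages)
    {
            /* unconditional uncharge, i.e. without the fix below */
            long new = atomic_fetch_sub(&c->usage, pages) - pages;

            if (new < 0)            /* analogue of the WARN at page_counter.c:62 */
                    fprintf(stderr, "underflow: %ld\n", new);
    }

    int main(void)
    {
            struct counter child = { .is_root = false };
            struct counter root  = { .is_root = true };

            charge(&child, 4);      /* order-2 slab page charged to a child memcg */
            /* the child memcg dies; its kmem_cache is reparented to root */
            uncharge(&root, 4);     /* the uncharge now hits root: usage goes to -4 */
            return 0;
    }
]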
* Re: WARNING: CPU: 0 PID: 11655 at mm/page_counter.c:62

From: Roman Gushchin @ 2019-06-19 21:19 UTC
To: Andrei Vagin; +Cc: linux-mm

On Tue, Jun 18, 2019 at 07:08:26PM -0700, Andrei Vagin wrote:
> Hello,
>
> We run CRIU tests on linux-next kernels, and today we found this
> warning in the kernel log:

Hello, Andrei!

Can you, please, check if the following patch fixes the problem?

Thanks a lot!

--

diff --git a/mm/slab.h b/mm/slab.h
index a4c9b9d042de..7667dddb6492 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -326,7 +326,8 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	memcg = READ_ONCE(s->memcg_params.memcg);
 	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
 	mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
-	memcg_kmem_uncharge_memcg(page, order, memcg);
+	if (!mem_cgroup_is_root(memcg))
+		memcg_kmem_uncharge_memcg(page, order, memcg);
 	rcu_read_unlock();
 
 	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
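[Note: why a single mem_cgroup_is_root() check is enough: the kmem charge
path already bails out for the root memcg, so root's page counters never
see kmem charges. After a dying cgroup's kmem_cache is reparented to
root_mem_cgroup, an unconditional uncharge therefore hits a counter that
was never charged and drives it negative. The v5.2-era charge side looks
approximately like this (paraphrased from memory; verify against the
tree):

    int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
    {
            struct mem_cgroup *memcg;
            int ret = 0;

            if (memcg_kmem_bypass())
                    return 0;

            memcg = get_mem_cgroup_from_current();
            /* the root memcg is skipped on the charge side ... */
            if (!mem_cgroup_is_root(memcg)) {
                    ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
                    if (!ret)
                            __SetPageKmemcg(page);
            }
            css_put(&memcg->css);
            return ret;
    }

The patch makes memcg_uncharge_slab() symmetric with this.]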
* Re: WARNING: CPU: 0 PID: 11655 at mm/page_counter.c:62

From: Andrei Vagin @ 2019-06-19 23:41 UTC
To: Roman Gushchin; +Cc: linux-mm

On Wed, Jun 19, 2019 at 2:19 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Jun 18, 2019 at 07:08:26PM -0700, Andrei Vagin wrote:
> > Hello,
> >
> > We run CRIU tests on linux-next kernels, and today we found this
> > warning in the kernel log:
>
> Hello, Andrei!
>
> Can you, please, check if the following patch fixes the problem?

All my tests passed: https://travis-ci.org/avagin/linux/builds/547940031

Tested-by: Andrei Vagin <avagin@gmail.com>

Thanks,
Andrei

> Thanks a lot!
>
> [... patch trimmed; see the previous message ...]
* Re: WARNING: CPU: 0 PID: 11655 at mm/page_counter.c:62

From: Roman Gushchin @ 2019-06-20 1:32 UTC
To: Andrei Vagin; +Cc: linux-mm

On Wed, Jun 19, 2019 at 04:41:05PM -0700, Andrei Vagin wrote:
> On Wed, Jun 19, 2019 at 2:19 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Tue, Jun 18, 2019 at 07:08:26PM -0700, Andrei Vagin wrote:
> > > Hello,
> > >
> > > We run CRIU tests on linux-next kernels, and today we found this
> > > warning in the kernel log:
> >
> > Hello, Andrei!
> >
> > Can you, please, check if the following patch fixes the problem?
>
> All my tests passed: https://travis-ci.org/avagin/linux/builds/547940031
>
> Tested-by: Andrei Vagin <avagin@gmail.com>

Thank you very much! I'll send the proper patch soon. It's a bit
different from the one you've tested (I realized that vmstats for
root_mem_cgroup should be handled differently too), so I won't add your
Tested-by for now; let's wait for the tests to pass with the actual
patch.

Thank you!

Roman