CGroup unused allocated slab objects will not get released

* CGroup unused allocated slab objects will not get released
@ 2019-09-18 20:31 Saeed Karimabadi (skarimab)
  2019-09-18 22:23 ` Roman Gushchin
  0 siblings, 1 reply; 4+ messages in thread
From: Saeed Karimabadi (skarimab) @ 2019-09-18 20:31 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, Tejun Heo, Li Zefan, Johannes Weiner,
	cgroups, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm
  Cc: xe-linux-external(mailer list)

Hi Kernel Maintainers,

We are chasing an issue where the slab allocator does not release task_struct slab objects allocated for cgroups,
and we are wondering whether this is a known issue or expected behavior.
If we stress test the system by spawning many tasks in different cgroups, the number of active
task_struct objects increases, but the kernel never releases that memory afterwards, even when the system
returns to an idle state with far fewer running processes.
To test this, we prepared a bash script that creates 1000 cgroups and spawns 100,000 bash
tasks. The full script and its test results are available on GitHub:

https://github.com/saeedsk/slab-allocator-test

Here is a quick snapshot of the test results before and after running many concurrent tasks in different cgroups:

------------- system initial statistics -------------
Slab:             419196 kB
SReclaimable:     123788 kB
SUnreclaim:       295408 kB
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
task_struct          735    990   5888    5    8 : tunables    0    0    0 : slabdata    198    198      0
Number of running processes before starting the test : 334

...... loading 100,000 time bounded tasks with 1000 mem cgroups .............. 
..... wait until are tasks are complete , normally within next 5 seconds ........

------------- after tasks are loaded and completed running  -------------
Slab:             948932 kB
SReclaimable:     125816 kB
SUnreclaim:       823116 kB
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
task_struct        11404  11665   5888    5    8 : tunables    0    0    0 : slabdata   2333   2333      0
Number of running processes when the test is completed : 334
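The growth between the two snapshots can be computed directly from the meminfo fields. The following is a minimal sketch, not part of the original test script; the helper name parse_meminfo_kb is hypothetical, and it only handles simple "Name: <number> kB" lines.

```python
import re

def parse_meminfo_kb(text):
    """Return {field: kB} for 'Name:   <number> kB' lines.

    Thousands separators are tolerated; lines in any other shape are skipped.
    """
    fields = {}
    for line in text.splitlines():
        m = re.match(r"^(\w+):\s+([\d,]+)\s+kB\s*$", line.strip())
        if m:
            fields[m.group(1)] = int(m.group(2).replace(",", ""))
    return fields

# Values copied from the two snapshots in this message.
before = parse_meminfo_kb("""
Slab:             419196 kB
SReclaimable:     123788 kB
SUnreclaim:       295408 kB
""")
after = parse_meminfo_kb("""
Slab:             948932 kB
SReclaimable:     125816 kB
SUnreclaim:       823116 kB
""")

# Per-field growth in kB between the two snapshots.
growth = {name: after[name] - before[name] for name in before}
for name, delta in growth.items():
    print(f"{name}: +{delta} kB")
```

On these numbers almost all of the Slab growth falls under SUnreclaim rather than SReclaimable.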

As shown above, the number of active task_struct objects increased from 735 to 11404
during the test. The system keeps 11404 task_struct objects while idle, when only 334 tasks are running.
Such a large number of active task_struct objects is not normal, and a large fraction of that memory could be
released back to the system memory pool. If we write to each slab cache's shrink sysfs entry, the kernel releases
the deactivated objects and frees the related memory, but it does not do so automatically, as we had
expected.
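To put the task_struct lines above into memory terms, here is a minimal sketch that parses one /proc/slabinfo line in the version 2.x layout shown in this message and estimates the pages held by the cache. The names SlabLine, parse_slabinfo_line and footprint_kb are hypothetical, and the 4 KiB page size is an assumption about the test machine.

```python
from collections import namedtuple

SlabLine = namedtuple(
    "SlabLine",
    "name active_objs num_objs objsize objperslab pagesperslab active_slabs num_slabs",
)

def parse_slabinfo_line(line):
    """Parse 'name <nums> : tunables ... : slabdata ...' into a SlabLine."""
    left, _tunables, slabdata = line.split(":")
    name, *nums = left.split()
    active_objs, num_objs, objsize, objperslab, pagesperslab = map(int, nums)
    sd = slabdata.split()  # ['slabdata', active_slabs, num_slabs, sharedavail]
    return SlabLine(name, active_objs, num_objs, objsize, objperslab,
                    pagesperslab, int(sd[1]), int(sd[2]))

def footprint_kb(slab, page_kb=4):
    """Memory held by the cache's slabs; page_kb=4 is an assumption."""
    return slab.num_slabs * slab.pagesperslab * page_kb

before = parse_slabinfo_line(
    "task_struct          735    990   5888    5    8 : tunables    0    0    0 : slabdata    198    198      0")
after = parse_slabinfo_line(
    "task_struct        11404  11665   5888    5    8 : tunables    0    0    0 : slabdata   2333   2333      0")

print("active objects:", before.active_objs, "->", after.active_objs)
print("footprint (kB):", footprint_kb(before), "->", footprint_kb(after))
```

Under the 4 KiB page assumption this gives 6336 kB of task_struct slabs before the test and 74656 kB after it.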

The following command releases those zombie objects:
# for file in /sys/kernel/slab/*; do echo 1 > "$file/shrink"; done
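The same loop can be written with per-cache error reporting, which helps when some caches have no writable shrink entry. This is only a sketch of the shell one-liner above; the function name shrink_slab_caches is hypothetical, and on a real system it has to run as root against /sys/kernel/slab.

```python
import os

def shrink_slab_caches(base="/sys/kernel/slab"):
    """Write '1' to every <base>/<cache>/shrink entry.

    Returns (shrunk, failed): cache names that accepted the write and
    cache names for which the write raised an OSError.
    """
    shrunk, failed = [], []
    for name in sorted(os.listdir(base)):
        path = os.path.join(base, name, "shrink")
        try:
            with open(path, "w") as f:
                f.write("1\n")
            shrunk.append(name)
        except OSError:
            failed.append(name)
    return shrunk, failed

# Usage on a SLUB system, as root:
#   shrunk, failed = shrink_slab_caches()
#   print(len(shrunk), "caches shrunk;", len(failed), "failed")
```

Taking the base directory as a parameter keeps the helper testable against a scratch directory instead of sysfs.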

We know that some slab caches are supposed to remain allocated until the system really needs the memory.
So in one test we tried to consume all available system memory, in the hope that the kernel would release the
memory above, but that did not happen: the out-of-memory killer started killing processes and the slab
allocator released no memory.

In recent systemd releases, cgroup memory accounting is enabled by default, and systemd
creates multiple cgroups to run different software daemons. Although we call this test
a stress test, the same situation may arise during normal system boot, when systemd
loads and runs multiple daemon instances in different cgroups.
This issue only manifests itself when cgroups are actively in use. I've confirmed that it is present
in kernel v4.19.66, kernel v5.0.0 (Ubuntu 19.04) and the latest kernel release, v5.3.0.
Any comments or hints would be greatly appreciated.
Here is the related kernel configuration used for this test:

$ grep SLAB  .config
# CONFIG_SLAB is not set
CONFIG_SLAB_MERGE_DEFAULT=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
# CONFIG_SLAB_FREELIST_HARDENED is not set

$ grep SLUB  .config
CONFIG_SLUB_DEBUG=y
# CONFIG_SLUB_MEMCG_SYSFS_ON is not set
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set

$ grep KMEM  .config
CONFIG_MEMCG_KMEM=y
# CONFIG_DEVKMEM is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
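To check another kernel for the same combination of options, the grep output above can be parsed mechanically. The sketch below is illustrative only; parse_kconfig is a hypothetical helper that understands the two line shapes shown above ("CONFIG_X=value" and "# CONFIG_X is not set").

```python
import re

def parse_kconfig(text):
    """Map CONFIG_* names to their value, or None for '# CONFIG_X is not set'."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if m:
            opts[m.group(1)] = None
            continue
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if m:
            opts[m.group(1)] = m.group(2)
    return opts

# A subset of the lines quoted in this message.
opts = parse_kconfig("""
# CONFIG_SLAB is not set
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_SLUB_MEMCG_SYSFS_ON is not set
CONFIG_MEMCG_KMEM=y
""")

# The setup reported here: SLUB allocator with kernel memory accounting built in.
slub_with_kmem = opts.get("CONFIG_SLUB") == "y" and opts.get("CONFIG_MEMCG_KMEM") == "y"
print("SLUB with memcg kmem accounting:", slub_with_kmem)
```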

Thanks,
Saeed Karimabadi
Cisco Systems Inc.


* Re: CGroup unused allocated slab objects will not get released
  2019-09-18 20:31 CGroup unused allocated slab objects will not get released Saeed Karimabadi (skarimab)
@ 2019-09-18 22:23 ` Roman Gushchin
  2019-09-18 23:48   ` Saeed Karimabadi (skarimab)
  0 siblings, 1 reply; 4+ messages in thread
From: Roman Gushchin @ 2019-09-18 22:23 UTC (permalink / raw)
  To: Saeed Karimabadi (skarimab)
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, Tejun Heo, Li Zefan, Johannes Weiner,
	cgroups, Michal Hocko, Vladimir Davydov,
	xe-linux-external(mailer list)

On Wed, Sep 18, 2019 at 08:31:18PM +0000, Saeed Karimabadi (skarimab) wrote:
> Hi  Kernel Maintainers,
> 
> We are chasing an issue where the slab allocator does not release task_struct slab objects allocated for cgroups,
> and we are wondering whether this is a known issue or expected behavior.
> If we stress test the system by spawning many tasks in different cgroups, the number of active
> task_struct objects increases, but the kernel never releases that memory afterwards, even when the system
> returns to an idle state with far fewer running processes.

Hi Saeed!

I've recently proposed a new slab memory cgroup controller, which aims to solve
the problem you're describing: https://lwn.net/Articles/798605/ . It also generally
reduces the amount of memory used by slabs.

I've been told that not all e-mails in the patchset reached lkml,
so, please, find the original patchset here:
  https://github.com/rgushchin/linux/tree/new_slab.rfc
and its backport to the 5.3 release here:
  https://github.com/rgushchin/linux/tree/new_slab.rfc.v5.3

If you can try it on your setup, I'd appreciate it a lot, and it could also
help with merging it upstream soon.

Thank you!

Roman



* RE: CGroup unused allocated slab objects will not get released
  2019-09-18 22:23 ` Roman Gushchin
@ 2019-09-18 23:48   ` Saeed Karimabadi (skarimab)
  2019-09-19  0:33     ` Roman Gushchin
  0 siblings, 1 reply; 4+ messages in thread
From: Saeed Karimabadi (skarimab) @ 2019-09-18 23:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, Tejun Heo, Li Zefan, Johannes Weiner,
	cgroups, Michal Hocko, Vladimir Davydov,
	xe-linux-external(mailer list)

Hi Roman,

Thanks for your prompt reply and for sharing your patch.
I built kernel 5.3.0 with your patch and can confirm that it fixes the problem I was describing.
I used QEMU for this test, and the script ran 1000 tasks concurrently in 100 different cgroups.
I'm wondering whether your code has gone through any long-term regression testing?
Do you see any simple patch that could fix this excessive memory usage in older kernels, such as the 4.x versions?

Here is more detailed information about the test results:

******************************************************************************
Your proposed patch, backported to kernel 5.3.0:
  https://github.com/rgushchin/linux/tree/new_slab.rfc.v5.3
------------- Before running the script -------------
Slab:              42756 kB
SReclaimable:      25408 kB
SUnreclaim:        17348 kB
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
task_struct          102    200   3200   10    8 : tunables    0    0    0 : slabdata     20     20      0
------------- After running the script -------------
Slab:              43736 kB
SReclaimable:      25484 kB
SUnreclaim:        18252 kB
task_struct          149    220   3200   10    8 : tunables    0    0    0 : slabdata     22     22      0

******************************************************************************
Vanilla kernel 5.3.0:
------------- Before running the script -------------
Slab:              34704 kB
SReclaimable:      19956 kB
SUnreclaim:        14748 kB
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
task_struct           99    130   3200   10    8 : tunables    0    0    0 : slabdata     13     13      0
------------- After running the script -------------
Slab:              59388 kB
SReclaimable:      23580 kB
SUnreclaim:        35808 kB
task_struct         1174   1230   3200   10    8 : tunables    0    0    0 : slabdata    123    123      0
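The two runs can be compared side by side by looking at how much each figure grew. The following is a small sketch using only the numbers quoted above; the dictionary layout and the growth helper are hypothetical.

```python
# (before, after) pairs copied from the two result blocks above.
runs = {
    "patched 5.3.0": {"active_objs": (102, 149), "sunreclaim_kb": (17348, 18252)},
    "vanilla 5.3.0": {"active_objs": (99, 1174), "sunreclaim_kb": (14748, 35808)},
}

def growth(pair):
    """Difference between the 'after' and 'before' value of a pair."""
    before, after = pair
    return after - before

for kernel, stats in runs.items():
    print(f"{kernel}: task_struct active objects +{growth(stats['active_objs'])}, "
          f"SUnreclaim +{growth(stats['sunreclaim_kb'])} kB")
```

On these figures the patched kernel retains 47 additional active task_struct objects after the run, against 1075 on the vanilla kernel.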

Regards,
Saeed



* Re: CGroup unused allocated slab objects will not get released
  2019-09-18 23:48   ` Saeed Karimabadi (skarimab)
@ 2019-09-19  0:33     ` Roman Gushchin
  0 siblings, 0 replies; 4+ messages in thread
From: Roman Gushchin @ 2019-09-19  0:33 UTC (permalink / raw)
  To: Saeed Karimabadi (skarimab)
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, Tejun Heo, Li Zefan, Johannes Weiner,
	cgroups, Michal Hocko, Vladimir Davydov,
	xe-linux-external(mailer list)

Hi Saeed!

On Wed, Sep 18, 2019 at 11:48:19PM +0000, Saeed Karimabadi (skarimab) wrote:
> Hi Roman,
> 
> Thanks for your prompt reply and for sharing your patch.
> I built kernel 5.3.0 with your patch and can confirm that it fixes the problem I was describing.
> I used QEMU for this test, and the script ran 1000 tasks concurrently in 100 different cgroups.
> I'm wondering whether your code has gone through any long-term regression testing?

Thank you for testing it!
We've tested it on different fb production workloads, and it was doing great.
There were significant memory savings and no noticeable CPU regression in
all tested environments.
If you have any tests you can run and share the results of, I'd appreciate it.

> Do you see any simple patch that could fix this excessive memory usage in older kernels, such as the 4.x versions?

This patchset is definitely too heavy to backport to 4.x. As a workaround
you can disable the kernel memory accounting using a boot option, if it's
acceptable.
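The reply does not name the boot option. Assuming it refers to cgroup.memory=nokmem, which is described in the kernel's kernel-parameters documentation, the sketch below checks whether a kernel command line carries it. The function name kmem_accounting_disabled is hypothetical.

```python
def kmem_accounting_disabled(cmdline):
    """True if a cgroup.memory= parameter on the command line lists 'nokmem'.

    Assumption: the workaround meant in the reply is cgroup.memory=nokmem;
    the option value is treated as a comma-separated list.
    """
    for param in cmdline.split():
        key, sep, value = param.partition("=")
        if key == "cgroup.memory" and sep and "nokmem" in value.split(","):
            return True
    return False

# Usage on a running system:
#   with open("/proc/cmdline") as f:
#       print(kmem_accounting_disabled(f.read()))
```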

Thanks!


