From: Roman Gushchin <guro@fb.com>
To: Greg Thelen <gthelen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Shakeel Butt <shakeelb@google.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Kernel Team <Kernel-team@fb.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>, Rik van Riel <riel@surriel.com>,
Christoph Lameter <cl@linux.com>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>
Subject: Re: [PATCH v4 0/7] mm: reparent slab memory on cgroup removal
Date: Wed, 5 Jun 2019 17:33:00 +0000
Message-ID: <20190605173256.GB10098@tower.DHCP.thefacebook.com>
In-Reply-To: <xr93ef48v5ub.fsf@gthelen.svl.corp.google.com>

On Wed, Jun 05, 2019 at 12:39:24AM -0700, Greg Thelen wrote:
> Roman Gushchin <guro@fb.com> wrote:
>
> > # Why do we need this?
> >
> > We've noticed that the number of dying cgroups is steadily growing on most
> > of our hosts in production. The subsequent investigation revealed an issue
> > in userspace memory reclaim code [1], accounting of kernel stacks [2],
> > and also the main reason: slab objects.
> >
> > The underlying problem is quite simple: any page charged
> > to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless
> > all charged pages are gone. If a slab object is actively used by other cgroups,
> > it won't be reclaimed, and will prevent the origin cgroup from being reclaimed.
> >
> > Slab objects, first of all the vfs cache, are shared between cgroups which are
> > using the same underlying fs, and what's even more important, they are shared
> > between multiple generations of the same workload. So if something runs
> > periodically, every time in a new cgroup (like how systemd works), we do
> > accumulate multiple dying cgroups.
> >
> > Strictly speaking pagecache isn't different here, but there is a key difference:
> > we disable protection and apply some extra pressure on LRUs of dying cgroups,
> > and these LRUs contain all charged pages.
> >
> > My experiments show that with kernel memory accounting disabled the number
> > of dying cgroups stabilizes at a relatively small number (~100, depends on
> > memory pressure and cgroup creation rate), and with kernel memory accounting
> > enabled it grows pretty steadily up to several thousands.
> >
> > Memory cgroups are quite complex and big objects (mostly due to percpu stats),
> > so this leads to noticeable memory losses. Memory occupied by dying cgroups
> > is measured in hundreds of megabytes. I've even seen a host with more than 100Gb
> > of memory wasted on dying cgroups. It leads to a degradation of performance
> > as uptime grows, and generally limits the usage of cgroups.
> >
> > My previous attempt [3] to fix the problem by applying extra pressure on slab
> > shrinker lists caused regressions with xfs and ext4, and has been reverted [4].
> > The subsequent attempts to find the right balance [5, 6] were not successful.
> >
> > So instead of trying to find a balance that may not exist, let's reparent
> > the accounted slabs to the parent cgroup on cgroup removal.
> >
> >
> > # Implementation approach
> >
> > There is however a significant problem with reparenting of slab memory:
> > there is no list of charged pages. Some of them are in shrinker lists,
> > but not all. Introducing a new list is really not an option.
> >
> > But fortunately there is a way forward: every slab page has a stable pointer
> > to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> > instead of slab pages.
> >
> > It's actually simpler and cheaper, but requires some underlying changes:
> > 1) Make kmem_caches hold a single reference to the memory cgroup,
> > instead of a separate reference for every slab page.
> > 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
> > page->kmem_cache->memcg indirection instead. It's used only on
> > slab page release, so it shouldn't be a big issue.
> > 3) Introduce a refcounter for non-root slab caches. It's required to
> > be able to destroy kmem_caches when they become empty and release
> > the associated memory cgroup.
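The three changes listed above can be illustrated with a toy userspace model.
This is only a sketch under stated assumptions, not the kernel code: the
classes MemCgroup, KmemCache and SlabPage and all their fields are
hypothetical stand-ins, and locking, percpu counters and RCU are ignored.
It shows a cache holding one reference to its cgroup, pages reaching the
cgroup through the cache, and the reference moving to the parent on removal.

```python
class MemCgroup:
    """Toy stand-in for a memory cgroup; `refs` models its reference count."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.refs = 0
        self.online = True

    def released(self):
        # A removed cgroup can be freed only once nothing references it.
        return not self.online and self.refs == 0


class KmemCache:
    """Toy per-cgroup slab cache holding ONE reference to its cgroup."""

    def __init__(self, memcg):
        self.memcg = memcg
        memcg.refs += 1      # single reference, taken once per cache
        self.nr_pages = 0    # counter of live slab pages in this cache

    def alloc_page(self):
        self.nr_pages += 1
        return SlabPage(self)

    def reparent(self):
        # Move the cache's single reference from the dying cgroup to its parent.
        old, new = self.memcg, self.memcg.parent
        new.refs += 1
        old.refs -= 1
        self.memcg = new


class SlabPage:
    """A slab page keeps only a stable pointer to its cache."""

    def __init__(self, cache):
        self.cache = cache

    @property
    def memcg(self):
        # Indirection: page -> cache -> cgroup, no per-page cgroup pointer.
        return self.cache.memcg


root = MemCgroup("root")
child = MemCgroup("job-1", parent=root)
cache = KmemCache(child)
pages = [cache.alloc_page() for _ in range(3)]

# Remove the cgroup while its slab pages are still in use elsewhere.
child.online = False
cache.reparent()
```

In this model the removed cgroup ends up with no references even though all
three pages are still alive, because the pages never pinned it directly.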
> >
> > There is a bonus: currently we do release empty kmem_caches on cgroup
> > removal, however all others wait for the memory cgroup to be released.
> > These refactorings allow kmem_caches to be released as soon as they
> > become inactive and free.
> >
> > Some additional implementation details are provided in the corresponding
> > commit messages.
> >
> > # Results
> >
> > Below is the average number of dying cgroups on two groups of our production
> > hosts. They run some sort of web frontend workload, and the memory pressure
> > is moderate. As we can see, with the kernel memory reparenting the number
> > stabilizes in the 60s range; however with the original version it grows almost
> > linearly and doesn't show any signs of plateauing. The difference in slab
> > and percpu usage between patched and unpatched versions also grows linearly.
> > In 7 days it exceeded 200Mb.
> >
> > day              0    1    2    3     4     5     6     7
> > original        56  362  628  752  1070  1250  1490  1560
> > patched         23   46   51   55    60    57    67    69
> > mem diff(Mb)    22   74  123  152   164   182   214   241
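As a quick check of the "almost linearly" claim, the average daily growth can
be computed from the table. This is a rough sketch: the numbers are copied
from the table above, while the helper avg_daily_growth() is a hypothetical
name and a plain endpoint average, not a proper fit.

```python
# Values copied from the table (day 0 through day 7).
original = [56, 362, 628, 752, 1070, 1250, 1490, 1560]
patched = [23, 46, 51, 55, 60, 57, 67, 69]
mem_diff_mb = [22, 74, 123, 152, 164, 182, 214, 241]


def avg_daily_growth(series):
    """Average per-day increase between the first and the last sample."""
    return (series[-1] - series[0]) / (len(series) - 1)


original_rate = avg_daily_growth(original)
patched_rate = avg_daily_growth(patched)
mem_rate = avg_daily_growth(mem_diff_mb)

print(f"original: {original_rate:.1f} dying cgroups/day")
print(f"patched:  {patched_rate:.1f} dying cgroups/day")
print(f"mem diff: {mem_rate:.1f} Mb/day")
```

By this estimate the original kernel gains roughly 215 dying cgroups per day
against roughly 7 for the patched one, and the memory difference grows by
about 31 Mb per day.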
>
> No objection to the idea, but a question...

Hi Greg!

> In patched kernel, does slabinfo (or similar) show the list of reparented
> slab caches? A pile of zombie kmem_caches is certainly better than a
> pile of zombie mem_cgroups. But it still seems like it might cause
> degradation - does cache_reap() walk an ever growing set of zombie
> caches?
It's not a pile of zombie kmem_caches vs a pile of zombie mem_cgroups.
It's a smaller pile of zombie kmem_caches vs a larger pile of zombie kmem_caches
*and* a pile of zombie mem_cgroups. The patchset makes the number of zombie
kmem_caches smaller, not bigger.

Re slabinfo and other debug interfaces: I do not change anything here.

>
> We've found it useful to add a slabinfo_full file which includes zombie
> kmem_caches with their memcg_name. This can help hunt down zombies.
I'm not sure we need to add a permanent debug interface, because something like
drgn ( https://github.com/osandov/drgn ) can be used instead.
If you think that we lack some necessary debug interfaces, I'm totally open
here, but it's not a part of this patchset. Let's talk about them separately.
Thank you for looking into it!
Roman