From: Roman Gushchin <guro@fb.com>
To: Muchun Song <songmuchun@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>,
Matthew Wilcox <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Shakeel Butt <shakeelb@google.com>,
Yang Shi <shy828301@gmail.com>, <alexs@kernel.org>,
Alexander Duyck <alexander.h.duyck@linux.intel.com>,
Wei Yang <richard.weiyang@gmail.com>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
Date: Fri, 30 Apr 2021 20:10:46 -0700
Message-ID: <YIzGtpAwO+2YOPA6@carbon.dhcp.thefacebook.com>
In-Reply-To: <CAMZfGtXawtMT4JfBtDLZ+hES4iEHFboe2UgJee_s-NhZR5faAw@mail.gmail.com>
On Fri, Apr 30, 2021 at 04:32:39PM +0800, Muchun Song wrote:
> On Fri, Apr 30, 2021 at 11:27 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> > > On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> > > > On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > > > > In our server, we found a suspected memory leak problem. The kmalloc-32
> > > > > consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> > > > > memory.
> > > > >
> > > > > > After in-depth analysis, we found that the memory consumption of the
> > > > > > kmalloc-32 slab cache is caused by list_lru_one allocations.
> > > > >
> > > > > crash> p memcg_nr_cache_ids
> > > > > memcg_nr_cache_ids = $2 = 24574
> > > > >
> > > > > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > > > > can be calculated with the following formula.
> > > > >
> > > > > num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > > > >
> > > > > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> > > > >
> > > > > crash> list super_blocks | wc -l
> > > > > 952
> > > >
> > > > The more I see people trying to work around this, the more I think
> > > > that the way memcgs have been grafted into the list_lru is back to
> > > > front.
> > > >
> > > > > We currently allocate scope for every memcg to be able to be tracked
> > > > > on every node on every superblock instantiated in the system,
> > > > > regardless of whether that superblock is even accessible to that memcg.
> > > >
> > > > These huge memcg counts come from container hosts where memcgs are
> > > > confined to just a small subset of the total number of superblocks
> > > > > that are instantiated at any given point in time.
> > > >
> > > > IOWs, for these systems with huge container counts, list_lru does
> > > > not need the capability of tracking every memcg on every superblock.
> > > >
> > > > What it comes down to is that the list_lru is only needed for a
> > > > > given memcg if that memcg is instantiating and freeing objects on a
> > > > given list_lru.
> > > >
> > > > Which makes me think we should be moving more towards "add the memcg
> > > > to the list_lru at the first insert" model rather than "instantiate
> > > > all at memcg init time just in case". The model we originally came
> > > > > up with for supporting memcgs is really starting to show its limits,
> > > > > and we should address those limitations rather than hack more
> > > > complexity into the system that does nothing to remove the
> > > > limitations that are causing the problems in the first place.
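As a rough illustration of the "add the memcg at the first insert" model described above, here is a userspace sketch. All names (lazy_lru, lru_slot and so on) are hypothetical, plain calloc() stands in for kernel allocation, and locking is ignored entirely.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-memcg LRU state; only a counter for illustration. */
struct lru_slot {
	long nr_items;
};

struct lazy_lru {
	size_t nr_ids;            /* number of possible memcg ids */
	size_t nr_allocated;      /* slots actually instantiated */
	struct lru_slot **slots;  /* NULL until a memcg first inserts */
};

int lazy_lru_init(struct lazy_lru *lru, size_t nr_ids)
{
	lru->slots = calloc(nr_ids, sizeof(*lru->slots));
	if (!lru->slots)
		return -1;
	lru->nr_ids = nr_ids;
	lru->nr_allocated = 0;
	return 0;
}

/* Instantiate the slot for memcg_id on first use, then count the item. */
int lazy_lru_add(struct lazy_lru *lru, size_t memcg_id)
{
	if (memcg_id >= lru->nr_ids)
		return -1;
	if (!lru->slots[memcg_id]) {
		struct lru_slot *slot = calloc(1, sizeof(*slot));

		if (!slot)
			return -1;
		lru->slots[memcg_id] = slot;
		lru->nr_allocated++;
	}
	lru->slots[memcg_id]->nr_items++;
	return 0;
}

void lazy_lru_destroy(struct lazy_lru *lru)
{
	for (size_t i = 0; i < lru->nr_ids; i++)
		free(lru->slots[i]);
	free(lru->slots);
	lru->slots = NULL;
}
```

Note that the flat pointer array in this sketch still scales with the number of ids; a sparse index would be needed to remove that cost too. It also allocates inside the insert path, which is not acceptable in the kernel as-is.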
> > >
> > > I totally agree.
> > >
> > > It looks like the initial implementation of the whole kernel memory accounting
> > > and memcg-aware shrinkers was based on the idea that the number of memory
> > > cgroups is relatively small and stable.
> >
> > Yes, that was one of the original assumptions - tens to maybe low
> > hundreds of memcgs at most. The other was that memcgs weren't NUMA
> > aware, and so would only need a single LRU list per memcg. Hence,
> > even with "lots" of memcgs and superblocks, the total overhead
> > wasn't that great.
> >
> > Then came "memcgs need to be NUMA aware" because of the size of the
> > machines they were being used for resource management in, and that
> > greatly increased the per-memcg, per LRU overhead. Now we're talking
> > about needing to support a couple of orders of magnitude more memcgs
> > and superblocks than were originally designed for.
> >
> > So, really, we're way beyond the original design scope of this
> > subsystem now.
>
> Got it. So it is better to allocate the structure of the list_lru_node
> dynamically. We should only allocate it when it is really demanded.
> But allocating memory with GFP_ATOMIC in list_lru_add() is
> not a good idea, so we should allocate the memory outside of
> list_lru_add(). I can propose an approach that may work.
>
> Before we start, we should know the following rules of list lrus.
>
> - Only objects allocated with __GFP_ACCOUNT need to allocate
> the struct list_lru_node.
> - The caller allocating the memory must know which list_lru the
> object will be inserted into.
>
> So we can allocate struct list_lru_node when allocating the
> object instead of allocating it in list_lru_add(). It is easy, because
> we already know the list_lru and memcg which the object belongs
> to. So we can introduce a new helper to allocate the object and
> list_lru_node, like below.
>
> void *list_lru_kmem_cache_alloc(struct list_lru *lru, struct kmem_cache *s,
>                                 gfp_t gfpflags)
> {
>         void *ret = kmem_cache_alloc(s, gfpflags);
>
>         if (ret && (gfpflags & __GFP_ACCOUNT)) {
>                 struct mem_cgroup *memcg = mem_cgroup_from_obj(ret);
>
>                 if (mem_cgroup_is_root(memcg))
>                         return ret;
>
>                 /*
>                  * Allocate the per-memcg list_lru_node; if it is
>                  * already allocated, do nothing.
>                  */
>                 memcg_list_lru_node_alloc(lru, memcg,
>                                 page_to_nid(virt_to_page(ret)), gfpflags);
>         }
>
>         return ret;
> }
>
> If the user wants to insert the allocated object into its lru list in
> the future, they should use list_lru_kmem_cache_alloc() instead of
> kmem_cache_alloc(). I have looked at the code closely. There are 3
> different kmem_caches that need to use this new API to allocate memory:
> inode_cachep, dentry_cache and radix_tree_node_cachep. I think they are
> easy to migrate.
>
> Hi Roman and Dave,
>
> What do you think about this approach? If there is no problem, I can provide
> a preliminary patchset within a week.
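The quoted helper only names memcg_list_lru_node_alloc() and does not show it. As a loose userspace analogy of its "allocate if absent, otherwise do nothing" behaviour, one could imagine something like the sketch below. The names, the use of C11 atomics and calloc() are all assumptions made for illustration; this is not the proposed kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-memcg, per-node list_lru structure. */
struct lru_node {
	long nr_items;
};

/*
 * Populate *slot exactly once. Returns 0 if the slot is populated on
 * return (by this caller or by a concurrent one), -1 on allocation
 * failure. Concurrent callers may both allocate, but only one
 * allocation is published; the loser frees its copy.
 */
int lru_node_alloc_once(_Atomic(struct lru_node *) *slot)
{
	struct lru_node *expected = NULL;
	struct lru_node *fresh;

	/* Fast path: already allocated, do nothing. */
	if (atomic_load(slot))
		return 0;

	fresh = calloc(1, sizeof(*fresh));
	if (!fresh)
		return -1;

	/* Publish only if the slot is still empty. */
	if (!atomic_compare_exchange_strong(slot, &expected, fresh))
		free(fresh);

	return 0;
}
```

Since this runs at object allocation time rather than in the insert path, the allocation can sleep, which is the point of moving it out of list_lru_add().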
At first glance it looks similar to what Bharata proposed, but with some
additional tricks. It would be nice to find common ground here. I think it's
the right direction.

More broadly, I believe we might need some more fundamental changes, but I don't
have a specific recipe yet. I need to think more about it.
Thanks!