From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=xn6i=J3=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.8 required=3.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,
	SPF_PASS autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id A46DAC43460
	for <linux-mm@archiver.kernel.org>; Fri, 30 Apr 2021 08:33:21 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id CFA9661107
	for <linux-mm@archiver.kernel.org>; Fri, 30 Apr 2021 08:33:20 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org CFA9661107
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bytedance.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 35EEE6B0098; Fri, 30 Apr 2021 04:33:20 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 324288D0002; Fri, 30 Apr 2021 04:33:20 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 1C5176B009A; Fri, 30 Apr 2021 04:33:20 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0232.hostedemail.com [216.40.44.232])
	by kanga.kvack.org (Postfix) with ESMTP id 026986B0098
	for <linux-mm@kvack.org>; Fri, 30 Apr 2021 04:33:19 -0400 (EDT)
Received: from smtpin31.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay01.hostedemail.com (Postfix) with ESMTP id B20FA180ACC20
	for <linux-mm@kvack.org>; Fri, 30 Apr 2021 08:33:19 +0000 (UTC)
X-FDA: 78088368918.31.18B12A7
Received: from mail-pl1-f176.google.com (mail-pl1-f176.google.com [209.85.214.176])
	by imf02.hostedemail.com (Postfix) with ESMTP id BAC1340002F2
	for <linux-mm@kvack.org>; Fri, 30 Apr 2021 08:32:47 +0000 (UTC)
Received: by mail-pl1-f176.google.com with SMTP id v20so8942148plo.10
        for <linux-mm@kvack.org>; Fri, 30 Apr 2021 01:33:18 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=bytedance-com.20150623.gappssmtp.com; s=20150623;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=vssGAg/sTeYEvytxFlNmd3yOz+/EYCcIJtC9IEXz/LY=;
        b=wKpaF83caKmgfTQby3hfgOAh/Ux0dEU14brSn+SpDTvpFHnaLYMXAQtY+CkM2sJ1hm
         5sAZ6e1ipywqUUsKBXYp2HpQHlaYXdPUEbiD57C40+g5FHfcJ94RfDXJXkKtdVSVooqN
         +eADBHUdGYNYI15CjJvkdTgVziispoVKjB5owhGw1da+F4FtejMmPvOMircVYp+R1BYM
         GOoOb2h3vLyQvj49V/Tpz6tGrNFUQPtd3DrA7jmTBBRHFpR2mIeAY4U5sR+Qv0Cbgtdb
         9zuoaYsE6Mvaz3R+9yl1El9SO6gjgQOafdJGWiSenEDpQLEa3bEgWxukO4+vrVF6ll+5
         Xk1Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=vssGAg/sTeYEvytxFlNmd3yOz+/EYCcIJtC9IEXz/LY=;
        b=g8X1hkUXt+w1Lv32O4fmEA9ffASFsH+HMJPupe74Y3XHkr/sD4swCOtOdgD6p0qOsO
         hSSqv1yAKuOvg6nVTl7FHocI81fQoSFXluSIeT/n+vNams+d0fUj3e9cyQ7XgiamnBIK
         keroilRIer30475eDfy7fXaplKuBIBqpBRkvMmuLDjaKRL7DdbH7kKEJybACdO5ahCHc
         CCM90XTI4vdeEX4bzXEUlUa8QMo0LajWGZcy8kM1cFiDXWY/5stwn747NkrqTIjfKbid
         djaTGaqZX8UHmiAgt/A9CxHXncExRETjLn/Psox/znM49W3FYJ3aZw0QmcQ4zMQW0iXI
         w0Ng==
X-Gm-Message-State: AOAM532QvWgiAQiIvDpiMTm6y6MoJTgiXAn3LY3dl5iB7qpB1xd1Dt4u
	Cdux0NxnzmVvJPEc5Ti+/QVfdjbAK0NnO+R0NcwFCg==
X-Google-Smtp-Source: ABdhPJzg7Llraxup10A/y5Z2duNM5R/9xOPCS/zXXaOR8iNf5hQQzFA3K3MrlSTtTnNT2BruWrZqyQ0kubOmtrezjeo=
X-Received: by 2002:a17:90b:148c:: with SMTP id js12mr14128064pjb.147.1619771597250;
 Fri, 30 Apr 2021 01:33:17 -0700 (PDT)
MIME-Version: 1.0
References: <20210428094949.43579-1-songmuchun@bytedance.com>
 <20210430004903.GF1872259@dread.disaster.area> <YItf3GIUs2skeuyi@carbon.dhcp.thefacebook.com>
 <20210430032739.GG1872259@dread.disaster.area>
In-Reply-To: <20210430032739.GG1872259@dread.disaster.area>
From: Muchun Song <songmuchun@bytedance.com>
Date: Fri, 30 Apr 2021 16:32:39 +0800
Message-ID: <CAMZfGtXawtMT4JfBtDLZ+hES4iEHFboe2UgJee_s-NhZR5faAw@mail.gmail.com>
Subject: Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory
 cgroup removal
To: Dave Chinner <david@fromorbit.com>, Roman Gushchin <guro@fb.com>
Cc: Matthew Wilcox <willy@infradead.org>, Andrew Morton <akpm@linux-foundation.org>, 
	Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, 
	Vladimir Davydov <vdavydov.dev@gmail.com>, Shakeel Butt <shakeelb@google.com>, 
	Yang Shi <shy828301@gmail.com>, alexs@kernel.org, 
	Alexander Duyck <alexander.h.duyck@linux.intel.com>, Wei Yang <richard.weiyang@gmail.com>, 
	linux-fsdevel <linux-fsdevel@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, 
	Linux Memory Management List <linux-mm@kvack.org>
Content-Type: text/plain; charset="UTF-8"
Authentication-Results: imf02.hostedemail.com;
	dkim=pass header.d=bytedance-com.20150623.gappssmtp.com header.s=20150623 header.b=wKpaF83c;
	dmarc=pass (policy=none) header.from=bytedance.com;
	spf=pass (imf02.hostedemail.com: domain of songmuchun@bytedance.com designates 209.85.214.176 as permitted sender) smtp.mailfrom=songmuchun@bytedance.com
X-Stat-Signature: 5i5rmz4watiy37t8qw7rhd3diriw9xtc
X-Rspamd-Queue-Id: BAC1340002F2
X-Rspamd-Server: rspam05
Received-SPF: none (bytedance.com>: No applicable sender policy available) receiver=imf02; identity=mailfrom; envelope-from="<songmuchun@bytedance.com>"; helo=mail-pl1-f176.google.com; client-ip=209.85.214.176
X-HE-DKIM-Result: pass/pass
X-HE-Tag: 1619771567-85817
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Fri, Apr 30, 2021 at 11:27 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> > On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > > > In our server, we found a suspected memory leak problem. The kmalloc-32
> > > > consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> > > > memory.
> > > >
> > > > After our in-depth analysis, the memory consumption of kmalloc-32 slab
> > > > cache is the cause of list_lru_one allocation.
> > > >
> > > >   crash> p memcg_nr_cache_ids
> > > >   memcg_nr_cache_ids = $2 = 24574
> > > >
> > > > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > > > can be calculated with the following formula.
> > > >
> > > >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > > >
> > > > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> > > >
> > > >   crash> list super_blocks | wc -l
> > > >   952
> > >
> > > The more I see people trying to work around this, the more I think
> > > that the way memcgs have been grafted into the list_lru is back to
> > > front.
> > >
> > > We currently allocate scope for every memcg to be able to tracked on
> > > every not on every superblock instantiated in the system, regardless
> > > of whether that superblock is even accessible to that memcg.
> > >
> > > These huge memcg counts come from container hosts where memcgs are
> > > confined to just a small subset of the total number of superblocks
> > > that instantiated at any given point in time.
> > >
> > > IOWs, for these systems with huge container counts, list_lru does
> > > not need the capability of tracking every memcg on every superblock.
> > >
> > > What it comes down to is that the list_lru is only needed for a
> > > given memcg if that memcg is instatiating and freeing objects on a
> > > given list_lru.
> > >
> > > Which makes me think we should be moving more towards "add the memcg
> > > to the list_lru at the first insert" model rather than "instantiate
> > > all at memcg init time just in case". The model we originally came
> > > up with for supprting memcgs is really starting to show it's limits,
> > > and we should address those limitations rahter than hack more
> > > complexity into the system that does nothing to remove the
> > > limitations that are causing the problems in the first place.
> >
> > I totally agree.
> >
> > It looks like the initial implementation of the whole kernel memory accounting
> > and memcg-aware shrinkers was based on the idea that the number of memory
> > cgroups is relatively small and stable.
>
> Yes, that was one of the original assumptions - tens to maybe low
> hundreds of memcgs at most. The other was that memcgs weren't NUMA
> aware, and so would only need a single LRU list per memcg. Hence the
> total overhead even with "lots" of memcgsi and superblocks the
> overhead wasn't that great.
>
> Then came "memcgs need to be NUMA aware" because of the size of the
> machines they were being use for resrouce management in, and that
> greatly increased the per-memcg, per LRU overhead. Now we're talking
> about needing to support a couple of orders of magnitude more memcgs
> and superblocks than were originally designed for.
>
> So, really, we're way beyond the original design scope of this
> subsystem now.

Got it. So it is better to allocate the struct list_lru_node
dynamically, only when it is actually needed. But allocating memory
with GFP_ATOMIC in list_lru_add() is not a good idea, so we should
allocate the memory outside of list_lru_add(). I can propose an
approach that may work.

Before we start, we should note the following rules of list lrus.

- Only objects allocated with __GFP_ACCOUNT need the struct
  list_lru_node to be allocated.
- The caller that allocates the object must know which list_lru the
  object will be inserted into.

So we can allocate the struct list_lru_node when allocating the
object, instead of allocating it in list_lru_add(). This is easy,
because at that point we already know the list_lru and the memcg that
the object belongs to. So we can introduce a new helper that allocates
both the object and the list_lru_node, like below.

void *list_lru_kmem_cache_alloc(struct list_lru *lru, struct kmem_cache *s,
                                gfp_t gfpflags)
{
        void *ret = kmem_cache_alloc(s, gfpflags);

        if (ret && (gfpflags & __GFP_ACCOUNT)) {
                struct mem_cgroup *memcg = mem_cgroup_from_obj(ret);

                if (mem_cgroup_is_root(memcg))
                        return ret;

                /*
                 * Allocate the per-memcg list_lru_node. If it is
                 * already allocated, do nothing.
                 */
                memcg_list_lru_node_alloc(lru, memcg,
                                          page_to_nid(virt_to_page(ret)),
                                          gfpflags);
        }

        return ret;
}
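
To make the intended "allocate on first use, do nothing if already
allocated" behaviour concrete, here is a small userspace model in C.
It is only a sketch under stated assumptions: every model_* name is
made up for illustration and is not a kernel API, the flat slot table
merely stands in for whatever lookup structure the real code would
use, and the 32-byte size is the kmalloc-32 figure from the report
quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: bytes per per-memcg list, taken from the kmalloc-32 report. */
#define MODEL_ONE_SIZE 32

/* Hypothetical userspace stand-in for a memcg-aware list_lru. */
struct model_lru {
        size_t nr_nodes;
        size_t nr_memcgs;
        void **one;             /* nr_nodes * nr_memcgs slots, NULL until first use */
        size_t nr_allocated;    /* number of slots populated so far */
};

struct model_lru *model_lru_create(size_t nr_nodes, size_t nr_memcgs)
{
        struct model_lru *lru = calloc(1, sizeof(*lru));

        if (!lru)
                return NULL;
        lru->one = calloc(nr_nodes * nr_memcgs, sizeof(*lru->one));
        if (!lru->one) {
                free(lru);
                return NULL;
        }
        lru->nr_nodes = nr_nodes;
        lru->nr_memcgs = nr_memcgs;
        return lru;
}

/* Returns 0 on success (including "already allocated"), -1 on failure. */
int model_list_lru_node_alloc(struct model_lru *lru, size_t nid,
                              size_t memcg_id)
{
        void **slot;

        assert(nid < lru->nr_nodes && memcg_id < lru->nr_memcgs);
        slot = &lru->one[nid * lru->nr_memcgs + memcg_id];
        if (*slot)
                return 0;       /* already allocated, nothing to do */
        *slot = calloc(1, MODEL_ONE_SIZE);
        if (!*slot)
                return -1;
        lru->nr_allocated++;
        return 0;
}

/* Cost of the current scheme: one list per (node, memcg id) pair. */
size_t model_eager_bytes(const struct model_lru *lru)
{
        return lru->nr_nodes * lru->nr_memcgs * MODEL_ONE_SIZE;
}

/* Cost in this model when lists are only allocated on first use. */
size_t model_lazy_bytes(const struct model_lru *lru)
{
        return lru->nr_allocated * MODEL_ONE_SIZE;
}

void model_lru_destroy(struct model_lru *lru)
{
        size_t i;

        for (i = 0; i < lru->nr_nodes * lru->nr_memcgs; i++)
                free(lru->one[i]);
        free(lru->one);
        free(lru);
}
```

With 4 nodes and 24574 ids, model_eager_bytes() returns 3145472, which
is the ~3MB per list_lru mentioned in the report. In the same model, a
memcg that only touches two nodes of one list_lru costs 64 bytes, and
repeated calls for the same (node, memcg) pair allocate nothing more.
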

If the user wants to insert the allocated object into its lru list in
the future, the user should use list_lru_kmem_cache_alloc() instead of
kmem_cache_alloc(). I have looked at the code closely. There are 3
different kmem_caches that need to use this new API to allocate
memory: inode_cachep, dentry_cache and radix_tree_node_cachep. I think
that it is easy to migrate them.

Hi Roman and Dave,

What do you think about this approach? If there is no problem, I can provide
a preliminary patchset within a week.

Thanks.

>
> > With systemd creating a separate cgroup
> > for everything including short-living processes it simple not true anymore.
>
> Yeah, that too. Everything is much more dynamic these days...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com