From mboxrd@z Thu Jan 1 00:00:00 1970
From: Muchun Song <muchun.song@linux.dev>
Subject: Re: [PATCH 5/7] mm/list_lru: simplify reparenting and initial allocation
Date: Fri, 19 Jul 2024 10:45:43 +0800
To: Kairui Song <ryncsn@gmail.com>
Cc: Linux Memory Management List <linux-mm@kvack.org>, Andrew Morton <akpm@linux-foundation.org>, Matthew Wilcox <willy@infradead.org>, Johannes Weiner <hannes@cmpxchg.org>, Roman Gushchin <roman.gushchin@linux.dev>, Waiman Long <longman@redhat.com>, Shakeel Butt <shakeelb@google.com>, Nhat Pham <nphamcs@gmail.com>, Michal Hocko <mhocko@suse.com>, Chengming Zhou <zhouchengming@bytedance.com>, Qi Zheng <zhengqi.arch@bytedance.com>, Chris Li <chrisl@kernel.org>, Yosry Ahmed <yosryahmed@google.com>, "Huang, Ying" <ying.huang@intel.com>
References: <20240624175313.47329-1-ryncsn@gmail.com> <20240624175313.47329-6-ryncsn@gmail.com>

> On Jul 18, 2024, at 19:49, Kairui Song wrote:
> 
> On Wed, Jul 17, 2024 at 11:05 AM Muchun Song <muchun.song@linux.dev> wrote:
>>> On Jun 25, 2024, at 01:53, Kairui Song wrote:
>>> 
>>> From: Kairui Song
>>> 
>>> Currently, there is a lot of code for detecting reparent racing
>>> using kmemcg_id as the synchronization flag. And an intermediate
>>> table is required to record and compare the kmemcg_id.
>>> 
>>> We can simplify this by just checking the cgroup css status, and
>>> skipping if the cgroup is being offlined. On the reparenting side,
>>> ensure no more allocation is ongoing and no further allocation will
>>> occur by using the XArray lock as a barrier.
>>> 
>>> Combined with an O(n^2) top-down walk for the allocation, we get rid
>>> of the intermediate table allocation completely. Despite being O(n^2),
>>> it should actually be faster because it's not practical to have a
>>> very deep cgroup hierarchy.
>> 
>> I kind of agree with you. But just instinctually.
>> 
>>> 
>>> This also avoids changing kmemcg_id before reparenting, so cgroups
>>> keep a stable index for list_lru_memcg. After this change it's
>>> possible that a dying cgroup will see a NULL value in the XArray
>>> corresponding to its kmemcg_id, because the kmemcg_id will point
>>> to an empty slot. In such a case, just fall back to using its parent.
>>> 
>>> As a result the code is simpler, and the following test also showed a
>>> performance gain (6 test runs):
>>> 
>>> mkdir /tmp/test-fs
>>> modprobe brd rd_nr=1 rd_size=16777216
>>> mkfs.xfs /dev/ram0
>>> mount -t xfs /dev/ram0 /tmp/test-fs
>>> worker() {
>>>     echo TEST-CONTENT > "/tmp/test-fs/$1"
>>> }
>>> do_test() {
>>>     for i in $(seq 1 2048); do
>>>         (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
>>>         worker "$i" &
>>>     done; wait
>>>     echo 1 > /proc/sys/vm/drop_caches
>>> }
>>> mkdir -p /sys/fs/cgroup/benchmark
>>> echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
>>> for i in $(seq 1 2048); do
>>>     rmdir "/sys/fs/cgroup/benchmark/$i" &>/dev/null
>>>     mkdir -p "/sys/fs/cgroup/benchmark/$i"
>>> done
>>> time do_test
>>> 
>>> Before:
>>> real 0m5.932s user 0m2.366s sys 0m5.583s
>>> real 0m5.939s user 0m2.347s sys 0m5.597s
>>> real 0m6.149s user 0m2.398s sys 0m5.761s
>>> real 0m5.945s user 0m2.403s sys 0m5.547s
>>> real 0m5.925s user 0m2.293s sys 0m5.651s
>>> real 0m6.017s user 0m2.367s sys 0m5.686s
>>> 
>>> After:
>>> real 0m5.712s user 0m2.343s sys 0m5.307s
>>> real 0m5.885s user 0m2.326s sys 0m5.518s
>>> real 0m5.694s user 0m2.347s sys 0m5.264s
>>> real 0m5.865s user 0m2.300s sys 0m5.545s
>>> real 0m5.748s user 0m2.273s sys 0m5.424s
>>> real 0m5.756s user 0m2.318s sys 0m5.398s
>>> 
>>> Signed-off-by: Kairui Song
>>> ---
>>> mm/list_lru.c | 182 ++++++++++++++++++++++----------------------
>>> mm/zswap.c    |   7 +-
>>> 2 files changed, 81 insertions(+), 108 deletions(-)
>>> 
>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>>> index 4c619857e916..ac8aec8451dd 100644
>>> --- a/mm/list_lru.c
>>> +++ b/mm/list_lru.c
>>> @@ -59,6 +59,20 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
>>> 	}
>>> 	return &lru->node[nid].lru;
>>> }
>>> +
>>> +static inline struct list_lru_one *
>>> +list_lru_from_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg)
>>> +{
>>> +	struct list_lru_one *l;
>>> +again:
>>> +	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>> +	if (likely(l))
>>> +		return l;
>>> +
>>> +	memcg = parent_mem_cgroup(memcg);
>>> +	WARN_ON(!css_is_dying(&memcg->css));
>>> +	goto again;
>> 
>> Just want to confirm this will only loop max twice, right?
> 
> If the memcg is dying, it finds the nearest parent that is still
> active and returns the parent's lru. If the parent is also dying, it
> tries the parent's parent. It may repeat multiple times until it
> reaches the root cgroup, but that is very unlikely.

Right.

> 
>> 
>>> +}
>>> #else
>>> static void list_lru_register(struct list_lru *lru)
>>> {
>>> @@ -83,6 +97,12 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
>>> {
>>> 	return &lru->node[nid].lru;
>>> }
>>> +
>>> +static inline struct list_lru_one *
>>> +list_lru_from_memcg(struct list_lru *lru, int nid, int idx)
>>> +{
>>> +	return &lru->node[nid].lru;
>>> +}
>>> #endif /* CONFIG_MEMCG_KMEM */
>>> 
>>> bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
>>> @@ -93,7 +113,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
>>> 
>>> 	spin_lock(&nlru->lock);
>>> 	if (list_empty(item)) {
>>> -		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>> +		l = list_lru_from_memcg(lru, nid, memcg);
>>> 		list_add_tail(item, &l->list);
>>> 		/* Set shrinker bit if the first element was added */
>>> 		if (!l->nr_items++)
>>> @@ -124,7 +144,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
>>> 
>>> 	spin_lock(&nlru->lock);
>>> 	if (!list_empty(item)) {
>>> -		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>> +		l = list_lru_from_memcg(lru, nid, memcg);
>>> 		list_del_init(item);
>>> 		l->nr_items--;
>>> 		nlru->nr_items--;
>>> @@ -339,20 +359,6 @@ static struct list_lru_memcg *memcg_init_list_lru_one(gfp_t gfp)
>>> 	return mlru;
>>> }
>>> 
>>> -static void memcg_list_lru_free(struct list_lru *lru, int src_idx)
>>> -{
>>> -	struct list_lru_memcg *mlru = xa_erase_irq(&lru->xa, src_idx);
>>> -
>>> -	/*
>>> -	 * The __list_lru_walk_one() can walk the list of this node.
>>> -	 * We need kvfree_rcu() here. And the walking of the list
>>> -	 * is under lru->node[nid]->lock, which can serve as a RCU
>>> -	 * read-side critical section.
>>> -	 */
>>> -	if (mlru)
>>> -		kvfree_rcu(mlru, rcu);
>>> -}
>>> -
>>> static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
>>> {
>>> 	if (memcg_aware)
>>> @@ -377,22 +383,18 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
>>> }
>>> 
>>> static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
>>> -					 int src_idx, struct mem_cgroup *dst_memcg)
>>> +					 struct list_lru_one *src,
>>> +					 struct mem_cgroup *dst_memcg)
>>> {
>>> 	struct list_lru_node *nlru = &lru->node[nid];
>>> -	int dst_idx = dst_memcg->kmemcg_id;
>>> -	struct list_lru_one *src, *dst;
>>> +	struct list_lru_one *dst;
>>> 
>>> 	/*
>>> 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
>>> 	 * we have to use IRQ-safe primitives here to avoid deadlock.
>>> 	 */
>>> 	spin_lock_irq(&nlru->lock);
>>> -
>>> -	src = list_lru_from_memcg_idx(lru, nid, src_idx);
>>> -	if (!src)
>>> -		goto out;
>>> -	dst = list_lru_from_memcg_idx(lru, nid, dst_idx);
>>> +	dst = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(dst_memcg));
>>> 
>>> 	list_splice_init(&src->list, &dst->list);
>>> 
>>> @@ -401,46 +403,45 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
>>> 		set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
>>> 		src->nr_items = 0;
>>> 	}
>>> -out:
>>> 	spin_unlock_irq(&nlru->lock);
>>> }
>>> 
>>> void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>>> {
>>> -	struct cgroup_subsys_state *css;
>>> 	struct list_lru *lru;
>>> -	int src_idx = memcg->kmemcg_id, i;
>>> -
>>> -	/*
>>> -	 * Change kmemcg_id of this cgroup and all its descendants to the
>>> -	 * parent's id, and then move all entries from this cgroup's list_lrus
>>> -	 * to ones of the parent.
>>> -	 */
>>> -	rcu_read_lock();
>>> -	css_for_each_descendant_pre(css, &memcg->css) {
>>> -		struct mem_cgroup *child;
>>> -
>>> -		child = mem_cgroup_from_css(css);
>>> -		WRITE_ONCE(child->kmemcg_id, parent->kmemcg_id);
>>> -	}
>>> -	rcu_read_unlock();
>>> +	int i;
>>> 
>>> -	/*
>>> -	 * With kmemcg_id set to parent, holding the lru lock below can
>>> -	 * prevent list_lru_{add,del,isolate} from touching the lru, safe
>>> -	 * to reparent.
>>> -	 */
>>> 	mutex_lock(&list_lrus_mutex);
>>> 	list_for_each_entry(lru, &memcg_list_lrus, list) {
>>> +		struct list_lru_memcg *mlru;
>>> +		XA_STATE(xas, &lru->xa, memcg->kmemcg_id);
>>> +
>>> +		/*
>>> +		 * Lock the Xarray to ensure no on going allocation and
>> 
>> ongoing? I'd like to rewrite the comment here, like:
>> 
>> The XArray lock is used to make sure future observers will see that the
>> current memory cgroup is dying (i.e. css_is_dying() will return true),
>> which is significant for memcg_list_lru_alloc() to make sure it will not
>> allocate a new mlru for this dead memory cgroup, which we just free below.
> 
> Good suggestion.
> 
>> 
>>> +		 * further allocation will see css_is_dying().
>>> +		 */
>>> +		xas_lock_irq(&xas);
>>> +		mlru = xas_load(&xas);
>>> +		if (mlru)
>>> +			xas_store(&xas, NULL);
>>> +		xas_unlock_irq(&xas);
>> 
>> The above block could be replaced with xa_erase_irq(), simpler, right?
> 
> I wanted to reuse the `xas` here, it will be used by later commits and
> this helps reduce total stack usage. Might be trivial; I can use
> xa_erase_irq() here for easier coding, and expand it back to this
> again later.

Yes.

> 
>> 
>>> +		if (!mlru)
>>> +			continue;
>>> +
>>> +		/*
>>> +		 * With the Xarray value set to NULL, holding the lru lock below
>>> +		 * prevents list_lru_{add,del,isolate} from touching the lru,
>>> +		 * safe to reparent.
>>> +		 */
>>> 		for_each_node(i)
>>> -			memcg_reparent_list_lru_node(lru, i, src_idx, parent);
>>> +			memcg_reparent_list_lru_node(lru, i, &mlru->node[i], parent);
>>> 
>>> 		/*
>>> 		 * Here all list_lrus corresponding to the cgroup are guaranteed
>>> 		 * to remain empty, we can safely free this lru, any further
>>> 		 * memcg_list_lru_alloc() call will simply bail out.
>>> 		 */
>>> -		memcg_list_lru_free(lru, src_idx);
>>> +		kvfree_rcu(mlru, rcu);
>>> 	}
>>> 	mutex_unlock(&list_lrus_mutex);
>>> }
>>> @@ -456,78 +457,51 @@ static inline bool memcg_list_lru_allocated(struct mem_cgroup *memcg,
>>> int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
>>> 			 gfp_t gfp)
>>> {
>>> -	int i;
>>> 	unsigned long flags;
>>> -	struct list_lru_memcg_table {
>>> -		struct list_lru_memcg *mlru;
>>> -		struct mem_cgroup *memcg;
>>> -	} *table;
>>> +	struct list_lru_memcg *mlru;
>>> +	struct mem_cgroup *pos, *parent;
>>> 	XA_STATE(xas, &lru->xa, 0);
>>> 
>>> 	if (!list_lru_memcg_aware(lru) || memcg_list_lru_allocated(memcg, lru))
>>> 		return 0;
>>> 
>>> 	gfp &= GFP_RECLAIM_MASK;
>>> -	table = kmalloc_array(memcg->css.cgroup->level, sizeof(*table), gfp);
>>> -	if (!table)
>>> -		return -ENOMEM;
>>> -
>>> 	/*
>>> 	 * Because the list_lru can be reparented to the parent cgroup's
>>> 	 * list_lru, we should make sure that this cgroup and all its
>>> 	 * ancestors have allocated list_lru_memcg.
>>> 	 */
>>> -	for (i = 0; memcg; memcg = parent_mem_cgroup(memcg), i++) {
>>> -		if (memcg_list_lru_allocated(memcg, lru))
>>> -			break;
>>> -
>>> -		table[i].memcg = memcg;
>>> -		table[i].mlru = memcg_init_list_lru_one(gfp);
>>> -		if (!table[i].mlru) {
>>> -			while (i--)
>>> -				kfree(table[i].mlru);
>>> -			kfree(table);
>>> -			return -ENOMEM;
>>> +	do {
>>> +		/*
>>> +		 * Keep finding the farest parent that wasn't populated
>>> +		 * until found memcg itself.
>>> +		 */
>>> +		pos = memcg;
>>> +		parent = parent_mem_cgroup(pos);
>>> +		while (parent && !memcg_list_lru_allocated(parent, lru)) {
>> 
>> The check of @parent is unnecessary since memcg_list_lru_allocated() always
>> returns true for the root memory cgroup. Right?
> 
> Right, I just wrote this to be less error-prone, as
> memcg_list_lru_allocated() can't handle NULL. But yeah, parent = NULL
> shouldn't happen here, so it can be removed.
> 
>> 
>>> +			pos = parent;
>>> +			parent = parent_mem_cgroup(pos);
>>> 		}
>>> -	}
>>> 
>>> -	xas_lock_irqsave(&xas, flags);
>>> -	while (i--) {
>>> -		int index = READ_ONCE(table[i].memcg->kmemcg_id);
>>> -		struct list_lru_memcg *mlru = table[i].mlru;
>>> -
>>> -		xas_set(&xas, index);
>>> -retry:
>>> -		if (unlikely(index < 0 || xas_error(&xas) || xas_load(&xas))) {
>>> -			kfree(mlru);
>>> -		} else {
>>> -			xas_store(&xas, mlru);
>>> -			if (xas_error(&xas) == -ENOMEM) {
>>> +		mlru = memcg_init_list_lru_one(gfp);
>>> +		do {
>>> +			bool alloced = false;
>>> +
>>> +			xas_set(&xas, pos->kmemcg_id);
>>> +			xas_lock_irqsave(&xas, flags);
>>> +			if (!css_is_dying(&pos->css) && !xas_load(&xas)) {
>>> +				xas_store(&xas, mlru);
>>> +				alloced = true;
>>> +			}
>>> +			if (!alloced || xas_error(&xas)) {
>>> 				xas_unlock_irqrestore(&xas, flags);
>>> -				if (xas_nomem(&xas, gfp))
>>> -					xas_set_err(&xas, 0);
>>> -				xas_lock_irqsave(&xas, flags);
>>> -				/*
>>> -				 * The xas lock has been released, this memcg
>>> -				 * can be reparented before us. So reload
>>> -				 * memcg id. More details see the comments
>>> -				 * in memcg_reparent_list_lrus().
>>> -				 */
>>> -				index = READ_ONCE(table[i].memcg->kmemcg_id);
>>> -				if (index < 0)
>>> -					xas_set_err(&xas, 0);
>>> -				else if (!xas_error(&xas) && index != xas.xa_index)
>>> -					xas_set(&xas, index);
>>> -				goto retry;
>>> +				kfree(mlru);
>>> +				goto out;
>> 
>> 1) If xas_error() returns true, we should continue since xas_nomem() will
>> handle the -ENOMEM case for us.
> 
> Oh, my bad, I think I lost an intermediate commit for this part and
> ended up exiting too early in the error case; it also needs to check
> the !mlru case above and return -ENOMEM, will update this part.
> 
>> 
>> 2) If xas_load() returns a non-NULL pointer, meaning a concurrent thread
>> has assigned a new mlru for us, we should continue since we might not have
>> assigned a mlru for the leaf memory cgroup. Suppose the following case:
>> 
>>         A
>>        / \
>>       B   C
>> 
>> If there are two threads allocating mlru for A, then either B or C will
>> not allocate a mlru. Here is a code snippet FYI.
> 
> Nice find, I forgot the case where multiple children could race for one
> parent.
> 
>> 
>> Thanks.
>> 
>> ```c
>> struct list_lru_memcg *mlru = NULL;
>> 
>> do {
>>         /* ... */
>>         if (!mlru)
>>                 mlru = memcg_init_list_lru_one(gfp);
>> 
>>         xas_set(&xas, pos->kmemcg_id);
>>         do {
>>                 xas_lock_irqsave(&xas, flags);
>>                 if (css_is_dying(&pos->css) || xas_load(&xas))
>>                         goto unlock;
>>                 xas_store(&xas, mlru);
>>                 if (!xas_error(&xas))
>>                         mlru = NULL;
>> unlock:
>>                 xas_unlock_irqrestore(&xas, flags);
>>         } while (xas_nomem(&xas, gfp));
>> } while (pos != memcg);
>> 
>> if (mlru)
>>         kfree(mlru);
>> ```
> 
> Good suggestion. One more thing: should it abort the iteration if
> the memcg is dead? Considering handling the memcg_init_list_lru_one()
> failure, the loop could be this?

Agree.

> 
> +	mlru = NULL;
> +	do {
> +		pos = memcg;
> +		parent = parent_mem_cgroup(pos);
> +		while (!memcg_list_lru_allocated(parent, lru)) {
> +			pos = parent;
> +			parent = parent_mem_cgroup(pos);
> +		}
> +
> +		mlru = mlru ?: memcg_init_list_lru_one(gfp);
> +		if (!mlru)
> +			return -ENOMEM;
> +		xas_set(&xas, pos->kmemcg_id);
> +		do {
> +			xas_lock_irqsave(&xas, flags);
> +			if (!xas_load(&xas) && !css_is_dying(&pos->css)) {
> +				xas_store(&xas, mlru);
> +				if (!xas_error(&xas))
> +					mlru = NULL;
> +			}
> +			xas_unlock_irqrestore(&xas, flags);
> +		} while (xas_nomem(&xas, gfp));
> +	} while (!css_is_dying(&pos->css) && pos != memcg);
> +
> +	if (mlru)
> +		kfree(mlru);