From mboxrd@z Thu Jan 1 00:00:00 1970
From: Muchun Song <muchun.song@linux.dev>
Subject: Re: [PATCH 5/7] mm/list_lru: simplify reparenting and initial allocation
Date: Fri, 19 Jul 2024 10:45:43 +0800
To: Kairui Song <ryncsn@gmail.com>
Cc: Linux Memory Management List <linux-mm@kvack.org>, Andrew Morton <akpm@linux-foundation.org>, Matthew Wilcox <willy@infradead.org>, Johannes Weiner <hannes@cmpxchg.org>, Roman Gushchin <roman.gushchin@linux.dev>, Waiman Long <longman@redhat.com>, Shakeel Butt <shakeelb@google.com>, Nhat Pham <nphamcs@gmail.com>, Michal Hocko <mhocko@suse.com>, Chengming Zhou <zhouchengming@bytedance.com>, Qi Zheng <zhengqi.arch@bytedance.com>, Chris Li <chrisl@kernel.org>, Yosry Ahmed <yosryahmed@google.com>, "Huang, Ying" <ying.huang@intel.com>
References: <20240624175313.47329-1-ryncsn@gmail.com> <20240624175313.47329-6-ryncsn@gmail.com>

> On Jul 18, 2024, at 19:49, Kairui Song wrote:
> 
> On Wed, Jul 17, 2024 at 11:05 AM Muchun Song <muchun.song@linux.dev> wrote:
>>> On Jun 25, 2024, at 01:53, Kairui Song wrote:
>>> 
>>> From: Kairui Song
>>> 
>>> Currently, there is a lot of code for detecting reparent racing
>>> using kmemcg_id as the synchronization flag. And an intermediate
>>> table is required to record and compare the kmemcg_id.
>>> 
>>> We can simplify this by just checking the cgroup css status, and
>>> skipping if the cgroup is being offlined. On the reparenting side,
>>> ensure no more allocation is ongoing and no further allocation will
>>> occur by using the XArray lock as a barrier.
>>> 
>>> Combined with an O(n^2) top-down walk for the allocation, we get rid
>>> of the intermediate table allocation completely. Despite being O(n^2),
>>> it should actually be faster because it's not practical to have a
>>> very deep cgroup hierarchy.
>> 
>> I kind of agree with you. But just instinctually.
>> 
>>> 
>>> This also avoids changing kmemcg_id before reparenting, so cgroups
>>> keep a stable index for list_lru_memcg. After this change it's
>>> possible that a dying cgroup will see a NULL value in the XArray
>>> corresponding to its kmemcg_id, because the kmemcg_id will point
>>> to an empty slot. In such a case, just fall back to using its parent.
>>> 
>>> As a result the code is simpler, and the following test also showed a
>>> performance gain (6 test runs):
>>> 
>>> mkdir /tmp/test-fs
>>> modprobe brd rd_nr=1 rd_size=16777216
>>> mkfs.xfs /dev/ram0
>>> mount -t xfs /dev/ram0 /tmp/test-fs
>>> worker() {
>>>     echo TEST-CONTENT > "/tmp/test-fs/$1"
>>> }
>>> do_test() {
>>>     for i in $(seq 1 2048); do
>>>         (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
>>>         worker "$i" &
>>>     done; wait
>>>     echo 1 > /proc/sys/vm/drop_caches
>>> }
>>> mkdir -p /sys/fs/cgroup/benchmark
>>> echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
>>> for i in $(seq 1 2048); do
>>>     rmdir "/sys/fs/cgroup/benchmark/$i" &>/dev/null
>>>     mkdir -p "/sys/fs/cgroup/benchmark/$i"
>>> done
>>> time do_test
>>> 
>>> Before:
>>> real 0m5.932s user 0m2.366s sys 0m5.583s
>>> real 0m5.939s user 0m2.347s sys 0m5.597s
>>> real 0m6.149s user 0m2.398s sys 0m5.761s
>>> real 0m5.945s user 0m2.403s sys 0m5.547s
>>> real 0m5.925s user 0m2.293s sys 0m5.651s
>>> real 0m6.017s user 0m2.367s sys 0m5.686s
>>> 
>>> After:
>>> real 0m5.712s user 0m2.343s sys 0m5.307s
>>> real 0m5.885s user 0m2.326s sys 0m5.518s
>>> real 0m5.694s user 0m2.347s sys 0m5.264s
>>> real 0m5.865s user 0m2.300s sys 0m5.545s
>>> real 0m5.748s user 0m2.273s sys 0m5.424s
>>> real 0m5.756s user 0m2.318s sys 0m5.398s
>>> 
>>> Signed-off-by: Kairui Song
>>> ---
>>> mm/list_lru.c | 182 ++++++++++++++++++++++----------------------
>>> mm/zswap.c    |   7 +-
>>> 2 files changed, 81 insertions(+), 108 deletions(-)
>>> 
>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>>> index 4c619857e916..ac8aec8451dd 100644
>>> --- a/mm/list_lru.c
>>> +++ b/mm/list_lru.c
>>> @@ -59,6 +59,20 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
>>> 	}
>>> 	return &lru->node[nid].lru;
>>> }
>>> +
>>> +static inline struct list_lru_one *
>>> +list_lru_from_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg)
>>> +{
>>> +	struct list_lru_one *l;
>>> +again:
>>> +	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>> +	if (likely(l))
>>> +		return l;
>>> +
>>> +	memcg = parent_mem_cgroup(memcg);
>>> +	WARN_ON(!css_is_dying(&memcg->css));
>>> +	goto again;
>> 
>> Just want to confirm this will only loop max twice, right?
> 
> If the memcg is dying, it finds the nearest parent that is still
> active and returns the parent's lru. If the parent is also dying, it
> tries the parent's parent. It may repeat multiple times until it
> reaches the root cgroup, but that is very unlikely.

Right.

> 
>> 
>>> +}
>>> #else
>>> static void list_lru_register(struct list_lru *lru)
>>> {
>>> @@ -83,6 +97,12 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
>>> {
>>> 	return &lru->node[nid].lru;
>>> }
>>> +
>>> +static inline struct list_lru_one *
>>> +list_lru_from_memcg(struct list_lru *lru, int nid, int idx)
>>> +{
>>> +	return &lru->node[nid].lru;
>>> +}
>>> #endif /* CONFIG_MEMCG_KMEM */
>>> 
>>> bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
>>> @@ -93,7 +113,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
>>> 
>>> 	spin_lock(&nlru->lock);
>>> 	if (list_empty(item)) {
>>> -		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>> +		l = list_lru_from_memcg(lru, nid, memcg);
>>> 		list_add_tail(item, &l->list);
>>> 		/* Set shrinker bit if the first element was added */
>>> 		if (!l->nr_items++)
>>> @@ -124,7 +144,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
>>> 
>>> 	spin_lock(&nlru->lock);
>>> 	if (!list_empty(item)) {
>>> -		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>> +		l = list_lru_from_memcg(lru, nid, memcg);
>>> 		list_del_init(item);
>>> 		l->nr_items--;
>>> 		nlru->nr_items--;
>>> @@ -339,20 +359,6 @@ static struct list_lru_memcg *memcg_init_list_lru_one(gfp_t gfp)
>>> 	return mlru;
>>> }
>>> 
>>> -static void memcg_list_lru_free(struct list_lru *lru, int src_idx)
>>> -{
>>> -	struct list_lru_memcg *mlru = xa_erase_irq(&lru->xa, src_idx);
>>> -
>>> -	/*
>>> -	 * The __list_lru_walk_one() can walk the list of this node.
>>> -	 * We need kvfree_rcu() here. And the walking of the list
>>> -	 * is under lru->node[nid]->lock, which can serve as a RCU
>>> -	 * read-side critical section.
>>> -	 */
>>> -	if (mlru)
>>> -		kvfree_rcu(mlru, rcu);
>>> -}
>>> -
>>> static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
>>> {
>>> 	if (memcg_aware)
>>> @@ -377,22 +383,18 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
>>> }
>>> 
>>> static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
>>> -					 int src_idx, struct mem_cgroup *dst_memcg)
>>> +					 struct list_lru_one *src,
>>> +					 struct mem_cgroup *dst_memcg)
>>> {
>>> 	struct list_lru_node *nlru = &lru->node[nid];
>>> -	int dst_idx = dst_memcg->kmemcg_id;
>>> -	struct list_lru_one *src, *dst;
>>> +	struct list_lru_one *dst;
>>> 
>>> 	/*
>>> 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
>>> 	 * we have to use IRQ-safe primitives here to avoid deadlock.
>>> 	 */
>>> 	spin_lock_irq(&nlru->lock);
>>> -
>>> -	src = list_lru_from_memcg_idx(lru, nid, src_idx);
>>> -	if (!src)
>>> -		goto out;
>>> -	dst = list_lru_from_memcg_idx(lru, nid, dst_idx);
>>> +	dst = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(dst_memcg));
>>> 
>>> 	list_splice_init(&src->list, &dst->list);
>>> 
>>> @@ -401,46 +403,45 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
>>> 		set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
>>> 		src->nr_items = 0;
>>> 	}
>>> -out:
>>> 	spin_unlock_irq(&nlru->lock);
>>> }
>>> 
>>> void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>>> {
>>> -	struct cgroup_subsys_state *css;
>>> 	struct list_lru *lru;
>>> -	int src_idx = memcg->kmemcg_id, i;
>>> -
>>> -	/*
>>> -	 * Change kmemcg_id of this cgroup and all its descendants to the
>>> -	 * parent's id, and then move all entries from this cgroup's list_lrus
>>> -	 * to ones of the parent.
>>> -	 */
>>> -	rcu_read_lock();
>>> -	css_for_each_descendant_pre(css, &memcg->css) {
>>> -		struct mem_cgroup *child;
>>> -
>>> -		child = mem_cgroup_from_css(css);
>>> -		WRITE_ONCE(child->kmemcg_id, parent->kmemcg_id);
>>> -	}
>>> -	rcu_read_unlock();
>>> +	int i;
>>> 
>>> -	/*
>>> -	 * With kmemcg_id set to parent, holding the lru lock below can
>>> -	 * prevent list_lru_{add,del,isolate} from touching the lru, safe
>>> -	 * to reparent.
>>> -	 */
>>> 	mutex_lock(&list_lrus_mutex);
>>> 	list_for_each_entry(lru, &memcg_list_lrus, list) {
>>> +		struct list_lru_memcg *mlru;
>>> +		XA_STATE(xas, &lru->xa, memcg->kmemcg_id);
>>> +
>>> +		/*
>>> +		 * Lock the Xarray to ensure no on going allocation and
>> 
>> ongoing? I'd like to rewrite the comment here, like:
>> 
>> The XArray lock is used to make sure future observers will see that the
>> current memory cgroup is dying (i.e. css_is_dying() will return true),
>> which is significant for memcg_list_lru_alloc() to make sure it will not
>> allocate a new mlru for this dead memory cgroup, which we just free below.
> 
> Good suggestion.
> 
>> 
>>> +		 * further allocation will see css_is_dying().
>>> +		 */
>>> +		xas_lock_irq(&xas);
>>> +		mlru = xas_load(&xas);
>>> +		if (mlru)
>>> +			xas_store(&xas, NULL);
>>> +		xas_unlock_irq(&xas);
>> 
>> The above block could be replaced with xa_erase_irq(), simpler, right?
> 
> I wanted to reuse the `xas` here, it will be used by later commits and
> this helps reduce total stack usage. Might be trivial; I can use
> xa_erase_irq() here for easier coding, and expand it back to this
> again later.

Yes.

> 
>> 
>>> +		if (!mlru)
>>> +			continue;
>>> +
>>> +		/*
>>> +		 * With the Xarray value set to NULL, holding the lru lock below
>>> +		 * prevents list_lru_{add,del,isolate} from touching the lru,
>>> +		 * safe to reparent.
>>> +		 */
>>> 		for_each_node(i)
>>> -			memcg_reparent_list_lru_node(lru, i, src_idx, parent);
>>> +			memcg_reparent_list_lru_node(lru, i, &mlru->node[i], parent);
>>> 
>>> 		/*
>>> 		 * Here all list_lrus corresponding to the cgroup are guaranteed
>>> 		 * to remain empty, we can safely free this lru, any further
>>> 		 * memcg_list_lru_alloc() call will simply bail out.
>>> 		 */
>>> -		memcg_list_lru_free(lru, src_idx);
>>> +		kvfree_rcu(mlru, rcu);
>>> 	}
>>> 	mutex_unlock(&list_lrus_mutex);
>>> }
>>> @@ -456,78 +457,51 @@ static inline bool memcg_list_lru_allocated(struct mem_cgroup *memcg,
>>> int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
>>> 			 gfp_t gfp)
>>> {
>>> -	int i;
>>> 	unsigned long flags;
>>> -	struct list_lru_memcg_table {
>>> -		struct list_lru_memcg *mlru;
>>> -		struct mem_cgroup *memcg;
>>> -	} *table;
>>> +	struct list_lru_memcg *mlru;
>>> +	struct mem_cgroup *pos, *parent;
>>> 	XA_STATE(xas, &lru->xa, 0);
>>> 
>>> 	if (!list_lru_memcg_aware(lru) || memcg_list_lru_allocated(memcg, lru))
>>> 		return 0;
>>> 
>>> 	gfp &= GFP_RECLAIM_MASK;
>>> -	table = kmalloc_array(memcg->css.cgroup->level, sizeof(*table), gfp);
>>> -	if (!table)
>>> -		return -ENOMEM;
>>> -
>>> 	/*
>>> 	 * Because the list_lru can be reparented to the parent cgroup's
>>> 	 * list_lru, we should make sure that this cgroup and all its
>>> 	 * ancestors have allocated list_lru_memcg.
>>> 	 */
>>> -	for (i = 0; memcg; memcg = parent_mem_cgroup(memcg), i++) {
>>> -		if (memcg_list_lru_allocated(memcg, lru))
>>> -			break;
>>> -
>>> -		table[i].memcg = memcg;
>>> -		table[i].mlru = memcg_init_list_lru_one(gfp);
>>> -		if (!table[i].mlru) {
>>> -			while (i--)
>>> -				kfree(table[i].mlru);
>>> -			kfree(table);
>>> -			return -ENOMEM;
>>> +	do {
>>> +		/*
>>> +		 * Keep finding the farest parent that wasn't populated
>>> +		 * until found memcg itself.
>>> +		 */
>>> +		pos = memcg;
>>> +		parent = parent_mem_cgroup(pos);
>>> +		while (parent && !memcg_list_lru_allocated(parent, lru)) {
>> 
>> The check of @parent is unnecessary since memcg_list_lru_allocated() always
>> returns true for the root memory cgroup. Right?
> 
> Right, I just wrote this to be less error-prone, as
> memcg_list_lru_allocated() can't handle NULL. But yeah, parent = NULL
> shouldn't happen here, so it can be removed.
> 
>> 
>>> +			pos = parent;
>>> +			parent = parent_mem_cgroup(pos);
>>> 		}
>>> -	}
>>> 
>>> -	xas_lock_irqsave(&xas, flags);
>>> -	while (i--) {
>>> -		int index = READ_ONCE(table[i].memcg->kmemcg_id);
>>> -		struct list_lru_memcg *mlru = table[i].mlru;
>>> -
>>> -		xas_set(&xas, index);
>>> -retry:
>>> -		if (unlikely(index < 0 || xas_error(&xas) || xas_load(&xas))) {
>>> -			kfree(mlru);
>>> -		} else {
>>> -			xas_store(&xas, mlru);
>>> -			if (xas_error(&xas) == -ENOMEM) {
>>> +		mlru = memcg_init_list_lru_one(gfp);
>>> +		do {
>>> +			bool alloced = false;
>>> +
>>> +			xas_set(&xas, pos->kmemcg_id);
>>> +			xas_lock_irqsave(&xas, flags);
>>> +			if (!css_is_dying(&pos->css) && !xas_load(&xas)) {
>>> +				xas_store(&xas, mlru);
>>> +				alloced = true;
>>> +			}
>>> +			if (!alloced || xas_error(&xas)) {
>>> 				xas_unlock_irqrestore(&xas, flags);
>>> -				if (xas_nomem(&xas, gfp))
>>> -					xas_set_err(&xas, 0);
>>> -				xas_lock_irqsave(&xas, flags);
>>> -				/*
>>> -				 * The xas lock has been released, this memcg
>>> -				 * can be reparented before us. So reload
>>> -				 * memcg id. More details see the comments
>>> -				 * in memcg_reparent_list_lrus().
>>> -				 */
>>> -				index = READ_ONCE(table[i].memcg->kmemcg_id);
>>> -				if (index < 0)
>>> -					xas_set_err(&xas, 0);
>>> -				else if (!xas_error(&xas) && index != xas.xa_index)
>>> -					xas_set(&xas, index);
>>> -				goto retry;
>>> +				kfree(mlru);
>>> +				goto out;
>> 
>> 1) If xas_error() returns true, we should continue since xas_nomem() will
>> handle the -ENOMEM case for us.
> 
> Oh, my bad, I think I lost an intermediate commit for this part and
> ended up exiting too early in the error case; it also needs to check
> the !mlru case above and return -ENOMEM, will update this part.
> 
>> 
>> 2) If xas_load() returns a non-NULL pointer, meaning a concurrent thread
>> has assigned a new mlru for us, we should continue since we might not have
>> assigned a mlru for the leaf memory cgroup. Suppose the following case:
>> 
>>         A
>>        / \
>>       B   C
>> 
>> If there are two threads allocating mlru for A, then either B or C will
>> not allocate a mlru. Here is a code snippet FYI.
> 
> Nice find, I forgot the case where multiple children could race for one
> parent.
> 
>> 
>> Thanks.
>> 
>> ```c
>> struct list_lru_memcg *mlru = NULL;
>> 
>> do {
>>         /* ... */
>>         if (!mlru)
>>                 mlru = memcg_init_list_lru_one(gfp);
>> 
>>         xas_set(&xas, pos->kmemcg_id);
>>         do {
>>                 xas_lock_irqsave(&xas, flags);
>>                 if (css_is_dying(&pos->css) || xas_load(&xas))
>>                         goto unlock;
>>                 xas_store(&xas, mlru);
>>                 if (!xas_error(&xas))
>>                         mlru = NULL;
>> unlock:
>>                 xas_unlock_irqrestore(&xas, flags);
>>         } while (xas_nomem(&xas, gfp));
>> } while (pos != memcg);
>> 
>> if (mlru)
>>         kfree(mlru);
>> ```
> 
> Good suggestion. One more thing: should it abort the iteration if
> the memcg is dead? Considering handling the memcg_init_list_lru_one()
> failure, the loop could be this?

Agree.

> 
> +	mlru = NULL;
> +	do {
> +		pos = memcg;
> +		parent = parent_mem_cgroup(pos);
> +		while (!memcg_list_lru_allocated(parent, lru)) {
> +			pos = parent;
> +			parent = parent_mem_cgroup(pos);
> +		}
> +
> +		mlru = mlru ?: memcg_init_list_lru_one(gfp);
> +		if (!mlru)
> +			return -ENOMEM;
> +		xas_set(&xas, pos->kmemcg_id);
> +		do {
> +			xas_lock_irqsave(&xas, flags);
> +			if (!xas_load(&xas) && !css_is_dying(&pos->css)) {
> +				xas_store(&xas, mlru);
> +				if (!xas_error(&xas))
> +					mlru = NULL;
> +			}
> +			xas_unlock_irqrestore(&xas, flags);
> +		} while (xas_nomem(&xas, gfp));
> +	} while (!css_is_dying(&pos->css) && pos != memcg);
> +
> +	if (mlru)
> +		kfree(mlru);