Date: Mon, 14 Jul 2025 18:01:12 -0700
From: Shakeel Butt
To: Tejun Heo
Cc: Paul E. McKenney, Andrew Morton, JP Kobryn, Johannes Weiner, Ying Huang,
    Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
    Michal Koutný, bpf@vger.kernel.org, linux-mm@kvack.org,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Meta kernel team
Subject: Re: [PATCH v3] cgroup: llist: avoid memory tears for llist_node
References: <20250704180804.3598503-1-shakeel.butt@linux.dev>
In-Reply-To: <20250704180804.3598503-1-shakeel.butt@linux.dev>

Hi Tejun, any comments or concerns on this patch?

On Fri, Jul 04, 2025 at 11:08:04AM -0700, Shakeel Butt wrote:
> Before the commit 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi
> safe"), the struct llist_node is expected to be private to the one
> inserting the node to the lockless list or the one removing the node
> from the lockless list. After the mentioned commit, the llist_node in
> the rstat code is per-cpu shared between the stacked contexts i.e.
> process, softirq, hardirq & nmi. It is possible the compiler may tear
> the loads or stores of llist_node. Let's avoid that.
>
> KCSAN reported the following race:
>
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 60 UID: 0 PID: 5425 ... 6.16.0-rc3-next-20250626 #1 NONE
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: ...
> ==================================================================
> ==================================================================
> BUG: KCSAN: data-race in css_rstat_flush / css_rstat_updated
> write to 0xffffe8fffe1c85f0 of 8 bytes by task 1061 on cpu 1:
> css_rstat_flush+0x1b8/0xeb0
> __mem_cgroup_flush_stats+0x184/0x190
> flush_memcg_stats_dwork+0x22/0x50
> process_one_work+0x335/0x630
> worker_thread+0x5f1/0x8a0
> kthread+0x197/0x340
> ret_from_fork+0xd3/0x110
> ret_from_fork_asm+0x11/0x20
> read to 0xffffe8fffe1c85f0 of 8 bytes by task 3551 on cpu 15:
> css_rstat_updated+0x81/0x180
> mod_memcg_lruvec_state+0x113/0x2d0
> __mod_lruvec_state+0x3d/0x50
> lru_add+0x21e/0x3f0
> folio_batch_move_lru+0x80/0x1b0
> __folio_batch_add_and_move+0xd7/0x160
> folio_add_lru_vma+0x42/0x50
> do_anonymous_page+0x892/0xe90
> __handle_mm_fault+0xfaa/0x1520
> handle_mm_fault+0xdc/0x350
> do_user_addr_fault+0x1dc/0x650
> exc_page_fault+0x5c/0x110
> asm_exc_page_fault+0x22/0x30
> value changed: 0xffffe8fffe18e0d0 -> 0xffffe8fffe1c85f0
>
> $ ./scripts/faddr2line vmlinux css_rstat_flush+0x1b8/0xeb0
> css_rstat_flush+0x1b8/0xeb0:
> init_llist_node at include/linux/llist.h:86
> (inlined by) llist_del_first_init at include/linux/llist.h:308
> (inlined by) css_process_update_tree at kernel/cgroup/rstat.c:148
> (inlined by) css_rstat_updated_list at kernel/cgroup/rstat.c:258
> (inlined by) css_rstat_flush at kernel/cgroup/rstat.c:389
>
> $ ./scripts/faddr2line vmlinux css_rstat_updated+0x81/0x180
> css_rstat_updated+0x81/0x180:
> css_rstat_updated at kernel/cgroup/rstat.c:90 (discriminator 1)
>
> These are expected races and a simple READ_ONCE/WRITE_ONCE resolves these
> reports. However, let's add comments to explain the race and the need for
> memory barriers if stronger guarantees are needed.
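
For anyone following the thread without the llist code at hand, here is
a minimal user-space sketch of what the READ_ONCE/WRITE_ONCE change
amounts to. The two macros below are simplified, scalar-only stand-ins
that I am assuming for illustration (they rely on the GCC/Clang
__typeof__ extension); they are not the kernel definitions. The
llist_node_selfcheck() helper is hypothetical and only walks through the
"points at itself means off-list" convention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Assumption: simplified user-space stand-ins for the kernel macros.
 * The volatile access asks the compiler to emit one full-width load or
 * store instead of splitting (tearing), fusing or repeating it.
 */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct llist_node {
	struct llist_node *next;
};

/* A node whose next pointer points at itself is not on any list. */
static inline void init_llist_node(struct llist_node *node)
{
	WRITE_ONCE(node->next, node);
}

/* May run concurrently with init_llist_node() from another context. */
static inline bool llist_on_list(const struct llist_node *node)
{
	return READ_ONCE(node->next) != node;
}

/* Hypothetical single-threaded walk-through, not a kernel function. */
static inline void llist_node_selfcheck(void)
{
	struct llist_node n, other;

	init_llist_node(&n);
	assert(!llist_on_list(&n));

	/* Linking the node to anything else marks it as queued. */
	WRITE_ONCE(n.next, &other);
	assert(llist_on_list(&n));
}
```

Note that this only addresses tearing of the pointer itself; it gives no
ordering against the per-cpu stats.
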
>
> More specifically the rstat updater and the flusher can race and cause a
> scenario where the stats updater skips adding the css to the lockless
> list but the flusher might not see those updates done by the skipped
> updater. This is a benign race and the subsequent flusher will flush those
> stats and at the moment there aren't any rstat users which are not fine
> with this kind of race. However some future user might want a
> stricter guarantee, so let's add appropriate comments to ease the job of
> future users.
>
> Signed-off-by: Shakeel Butt
> Reviewed-by: Paul E. McKenney
> Fixes: 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi safe")
> ---
>
> Changes since v2:
> - Removed data_race() as explained and requested by Paul.
> - Squashed into one patch.
> http://lore.kernel.org/20250703200012.3734798-1-shakeel.butt@linux.dev
>
> Changes since v1:
> - Added comments explaining race and the need for memory barrier as
>   requested by Tejun
> - Added comments as a separate patch.
> http://lore.kernel.org/20250626190550.4170599-1-shakeel.butt@linux.dev
>
>  include/linux/llist.h |  6 +++---
>  kernel/cgroup/rstat.c | 28 +++++++++++++++++++++++++++-
>  2 files changed, 30 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/llist.h b/include/linux/llist.h
> index 27b17f64bcee..607b2360c938 100644
> --- a/include/linux/llist.h
> +++ b/include/linux/llist.h
> @@ -83,7 +83,7 @@ static inline void init_llist_head(struct llist_head *list)
>   */
>  static inline void init_llist_node(struct llist_node *node)
>  {
> -	node->next = node;
> +	WRITE_ONCE(node->next, node);
>  }
>
>  /**
> @@ -97,7 +97,7 @@ static inline void init_llist_node(struct llist_node *node)
>   */
>  static inline bool llist_on_list(const struct llist_node *node)
>  {
> -	return node->next != node;
> +	return READ_ONCE(node->next) != node;
>  }
>
>  /**
> @@ -220,7 +220,7 @@ static inline bool llist_empty(const struct llist_head *head)
>
>  static inline struct llist_node *llist_next(struct llist_node *node)
>  {
> -	return node->next;
> +	return READ_ONCE(node->next);
>  }
>
>  /**
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index c8a48cf83878..981e2f77ad4e 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -60,6 +60,12 @@ static inline struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
>   * Atomically inserts the css in the ss's llist for the given cpu. This is
>   * reentrant safe i.e. safe against softirq, hardirq and nmi. The ss's llist
>   * will be processed at the flush time to create the update tree.
> + *
> + * NOTE: if the user needs the guarantee that the updater either adds itself in
> + * the lockless list or the concurrent flusher flushes its updated stats, a
> + * memory barrier is needed before the call to css_rstat_updated() i.e. a
> + * barrier after updating the per-cpu stats and before calling
> + * css_rstat_updated().
>   */
>  __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
>  {
> @@ -86,7 +92,12 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
>  		return;
>
>  	rstatc = css_rstat_cpu(css, cpu);
> -	/* If already on list return. */
> +	/*
> +	 * If already on list return. This check is racy and smp_mb() is needed
> +	 * to pair it with the smp_mb() in css_process_update_tree() if the
> +	 * guarantee that the updated stats are visible to concurrent flusher is
> +	 * needed.
> +	 */
>  	if (llist_on_list(&rstatc->lnode))
>  		return;
>
> @@ -148,6 +159,21 @@ static void css_process_update_tree(struct cgroup_subsys *ss, int cpu)
>  	while ((lnode = llist_del_first_init(lhead))) {
>  		struct css_rstat_cpu *rstatc;
>
> +		/*
> +		 * smp_mb() is needed here (more specifically in between
> +		 * init_llist_node() and per-cpu stats flushing) if the
> +		 * guarantee is required by an rstat user where either the
> +		 * updater should add itself on the lockless list or the
> +		 * flusher flushes the stats updated by the updater who has
> +		 * observed that it is already on the list. The
> +		 * corresponding barrier pair for this one should be before
> +		 * css_rstat_updated() by the user.
> +		 *
> +		 * For now, there aren't any such users, so not adding the
> +		 * barrier here but if such a use-case arises, please add
> +		 * smp_mb() here.
> +		 */
> +
>  		rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
>  		__css_process_update_tree(rstatc->owner, cpu);
>  	}
> --
> 2.47.1
>
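
To make the barrier pairing described in the comments above concrete,
here is a rough user-space sketch using C11 atomics. Everything in it is
an assumption for illustration: struct pcpu_stat, stat_update(),
stat_flush() and stat_selfcheck() are hypothetical names, the on_list
flag stands in for llist_on_list(&rstatc->lnode), and
atomic_thread_fence(memory_order_seq_cst) plays the role smp_mb() would
play in the kernel. It is not what the patch adds (the patch adds only
comments), just the shape a future user would need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-cpu state: unflushed stats plus an "on list" flag. */
struct pcpu_stat {
	atomic_long pending;	/* stats not yet flushed */
	atomic_bool on_list;	/* stand-in for llist_on_list() */
};

/* Updater: publish the stats, then decide whether to queue the node. */
static inline bool stat_update(struct pcpu_stat *s, long delta)
{
	atomic_fetch_add_explicit(&s->pending, delta, memory_order_relaxed);

	/* Barrier between the stats update and the on-list check. */
	atomic_thread_fence(memory_order_seq_cst);

	if (atomic_load_explicit(&s->on_list, memory_order_relaxed))
		return false;	/* already queued, the flusher will see delta */

	/* exchange() so only one of several racing contexts claims the node */
	return !atomic_exchange_explicit(&s->on_list, true,
					 memory_order_relaxed);
}

/* Flusher: take the node off the list first, then read the stats. */
static inline long stat_flush(struct pcpu_stat *s)
{
	atomic_store_explicit(&s->on_list, false, memory_order_relaxed);

	/* Pairs with the fence in stat_update(). */
	atomic_thread_fence(memory_order_seq_cst);

	return atomic_exchange_explicit(&s->pending, 0, memory_order_relaxed);
}

/* Hypothetical single-threaded walk-through of the two paths. */
static inline void stat_selfcheck(void)
{
	struct pcpu_stat s;

	atomic_init(&s.pending, 0);
	atomic_init(&s.on_list, false);

	assert(stat_update(&s, 5));	/* first update queues the node */
	assert(!stat_update(&s, 3));	/* second one finds it queued */
	assert(stat_flush(&s) == 8);	/* flusher sees both updates */
}
```

With both fences in place, either the updater observes the node as
off-list and queues it again, or the flusher observes the updated stats.
Without them both sides can miss each other, which is the benign race
the commit message describes.
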