From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1BBF8C369DC for ; Wed, 30 Apr 2025 13:14:46 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 23F7E6B00A8; Wed, 30 Apr 2025 09:14:45 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 1C6666B00A9; Wed, 30 Apr 2025 09:14:45 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 01C866B00BA; Wed, 30 Apr 2025 09:14:44 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id D1C9A6B00A8 for ; Wed, 30 Apr 2025 09:14:44 -0400 (EDT) Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id B6E9C5FE85 for ; Wed, 30 Apr 2025 13:14:44 +0000 (UTC) X-FDA: 83390754888.08.3B1484A Received: from out-186.mta0.migadu.com (out-186.mta0.migadu.com [91.218.175.186]) by imf10.hostedemail.com (Postfix) with ESMTP id DA8B7C0017 for ; Wed, 30 Apr 2025 13:14:42 +0000 (UTC) Authentication-Results: imf10.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=K26iuG03; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf10.hostedemail.com: domain of yosry.ahmed@linux.dev designates 91.218.175.186 as permitted sender) smtp.mailfrom=yosry.ahmed@linux.dev ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1746018883; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=xGXCoPebQpBN7d9KiYFtD7+T3Ar2fmtuRlVtn6J5UqE=; b=ngfxU3GzDCotLdN8YwyIpIBp00mufnEX9EYWwuwQxq7JC1FDaiv5kVs71cG5HHTMyL0anw 28SZspg5PMnEWA6Mph2403vSk+5z2JF8NjQRZafoWHxqi3a4UcDYedtSk+5mwOsLzz1ZyL 93FBSGEU0nRB/HpsLt5ACrZLG4YTO0E= ARC-Authentication-Results: i=1; imf10.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=K26iuG03; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf10.hostedemail.com: domain of yosry.ahmed@linux.dev designates 91.218.175.186 as permitted sender) smtp.mailfrom=yosry.ahmed@linux.dev ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1746018883; a=rsa-sha256; cv=none; b=E9XN49eohppLxldiKeF1BYjxz4LAg2K5STBeHCWio+BBP9b9QtjFdiKRDoQPs8B+S0A3r9 o8VN9zW7x95ts0Aoxn9JUKlxgyQYoPcQEYHZ3OHNVXJTg02dU9j3ek4rs40jyQPUqRFBkj PBhFUrPkhIsmcnVZ0SYHhKJYbm8k620= Date: Wed, 30 Apr 2025 06:14:28 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1746018880; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=xGXCoPebQpBN7d9KiYFtD7+T3Ar2fmtuRlVtn6J5UqE=; b=K26iuG03G/2tOpBPupr8ClbpU5zNWopJDq0PAhEgaELU8dRYdMheVhHr6NTj1zdc3qyIW0 6rtFGC0gP0CaewrqmmHxLZvu9x/hPZ+fXqT21/8EomGkxlSBaN4dsJJVWvbcoM+QQebbgN KxD9MnRiaZiCYAx9FLSjMwzDl0VlE3w= X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yosry Ahmed To: Shakeel Butt Cc: Tejun Heo , Andrew Morton , Alexei Starovoitov , Johannes Weiner , Michal Hocko , Roman Gushchin , Muchun Song , Michal =?utf-8?Q?Koutn=C3=BD?= , Vlastimil Babka , Sebastian Andrzej Siewior , JP Kobryn , bpf@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Meta kernel team Subject: Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe Message-ID: References: <20250429061211.1295443-1-shakeel.butt@linux.dev> <20250429061211.1295443-4-shakeel.butt@linux.dev> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20250429061211.1295443-4-shakeel.butt@linux.dev> X-Migadu-Flow: FLOW_OUT X-Stat-Signature: oxd3okxwbd4zihg8qcwrj6wgztngktrw X-Rspamd-Queue-Id: DA8B7C0017 X-Rspam-User: X-Rspamd-Server: rspam05 X-HE-Tag: 1746018882-411246 X-HE-Meta: U2FsdGVkX1/aPs/W/dBnjbARZIlIox30poTqZSxiBYVXSXLtMzc9M04NnEq0IxG+2eoocDW5aoW3lv9soZbqEBY5Jw4lQWRnvbVxNGnXLjFINUb8xZpNgBVSjodJGx6UuiYxlGkNprAidRQr/8dMfqEuDwTSd8FMCtRUiqxEb3l2ogrCfjWUrQOTJ8ZZcZpA5sqA/FlXslaTXDTeKWNyUtBYLb8z4RwJpNmcpx7yCoxCd2Y6FfPoKBR+z9OGu2cQHzxtZLEbESLWqbrlRKpMsxdqodxac8HS8dOdQhC1DykJPSJ401TNRIOr/IuW6+43MF8slnh+zPRNGYHaV8qI/ZTUHemwVajRX/mbQtBUgGzGQnFzZOcCbHpCLI/0KkNHWlRDyFpJAPK8dYCjMdxrWfcCS1OOwJqA05XJ7341WGgBjtIaYQkeopisIjUb7iNXeR2Mg1snimUQKeQYvSNKtLsq+zmfj6sVC6qTrmCgHdmJ24DouEI78uyvY4tbpOL/ifcXT0mHqe+mR4Ob1DJGMGmyxMIs/JRY1cHWc0m2Exu5qsrrUkxkpmVQA8SZy1p1TsUt0Xli5li2NKlTEh39j+drS7OvQWTTHTdDbu8HjEIdQrGeVcSwRIC0kT1F6YoTYV96rMgWM9J+GYsIWLfTcqIAoIuryUQR+yHnFEeoCBY6M7o1xv0zcDWKBT/o/73AuHIJa+8JVWDZOBPDczyUSvPHoHVL9exkrVve7y/3ulMrf+Tp3Dy6OscHnyHuAZY/09u3TDpVY6oyPkYmgE6WXotSv39eMCV/EdxMZcig4Drs/+wUGnBsZrN3XgwWe8kyRjuOgbaptBURnyuXLsnJ+nWOGC87/ePp+gbG4dNsVXj1WSZvjcq5FTwzGbvFb+u4560ahXvEtXK30DND78wdPI18e+8QL7HKP3TWfVra991J7axiGbXAkcZ1+9sPBUXSWGbbn8eHX2IC1royBV5 ab2bghPA Mj9luXIPtmEUbFwDa3eeXcFSExUWgUNXvRRYq8ygIpQCt36q+W70zvI25D3+AqZZgdqvMrwWn9DzA+qA+zdvmQOAGq/suwxuvxdobQKTwYh04hrE5zYW7Dh9uemvPFIPLmCrlf39erJ9+8Pa6vqmIeMvjQ3BUN5PjScA7fETJytNKje/ElTpTMJGYeXNkblsU4D2ejLNRT4NbfML5aHaDq04h2XBOwFqsnYo0CY9wWvhzMhA= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Apr 28, 2025 at 11:12:09PM -0700, Shakeel Butt wrote: > To make css_rstat_updated() able to safely run in nmi context, it can > not spin on locks and rather has to do trylock on the per-cpu per-ss raw > spinlock. This patch implements the backlog mechanism to handle the > failure in acquiring the per-cpu per-ss raw spinlock. > > Each subsystem provides a per-cpu lockless list on which the kernel > stores the css given to css_rstat_updated() on trylock failure. These > lockless lists serve as backlog. On cgroup stats flushing code path, the > kernel first processes all the per-cpu lockless backlog lists of the > given ss and then proceeds to flush the update stat trees. > > With css_rstat_updated() being nmi safe, the memch stats can and will be > converted to be nmi safe to enable nmi safe mem charging. > > Signed-off-by: Shakeel Butt > --- > kernel/cgroup/rstat.c | 99 +++++++++++++++++++++++++++++++++---------- > 1 file changed, 76 insertions(+), 23 deletions(-) > [..] > @@ -153,6 +160,51 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu) > > css = parent; > } > +} > + > +static void css_process_backlog(struct cgroup_subsys *ss, int cpu) > +{ > + struct llist_head *lhead = ss_lhead_cpu(ss, cpu); > + struct llist_node *lnode; > + > + while ((lnode = llist_del_first_init(lhead))) { > + struct css_rstat_cpu *rstatc; > + > + rstatc = container_of(lnode, struct css_rstat_cpu, lnode); > + __css_rstat_updated(rstatc->owner, cpu); > + } > +} > + > +/** > + * css_rstat_updated - keep track of updated rstat_cpu > + * @css: target cgroup subsystem state > + * @cpu: cpu on which rstat_cpu was updated > + * > + * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching > + * rstat_cpu->updated_children list. See the comment on top of > + * css_rstat_cpu definition for details. > + */ > +__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu) > +{ > + unsigned long flags; > + > + /* > + * Speculative already-on-list test. This may race leading to > + * temporary inaccuracies, which is fine. > + * > + * Because @parent's updated_children is terminated with @parent > + * instead of NULL, we can tell whether @css is on the list by > + * testing the next pointer for NULL. > + */ > + if (data_race(css_rstat_cpu(css, cpu)->updated_next)) > + return; > + > + if (!_css_rstat_cpu_trylock(css, cpu, &flags)) { IIUC this trylock will only fail if a BPF program runs in NMI context and tries to update cgroup stats, interrupting a context that is already holding the lock (i.e. updating or flushing stats). How often does this happen in practice tho? Is it worth the complexity? I wonder if it's better if we make css_rstat_updated() inherently lockless instead. What if css_rstat_updated() always just adds to a lockless tree, and we defer constructing the proper tree to the flushing side? This should make updates generally faster and avoids locking or disabling interrupts in the fast path. We essentially push more work to the flushing side. We may be able to consolidate some of the code too if all the logic manipulating the tree is on the flushing side. WDYT? Am I missing something here? > + css_add_to_backlog(css, cpu); > + return; > + } > + > + __css_rstat_updated(css, cpu); > > _css_rstat_cpu_unlock(css, cpu, flags, true); > } > @@ -255,6 +307,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list( > > flags = _css_rstat_cpu_lock(root, cpu, false); > > + css_process_backlog(root->ss, cpu); > /* Return NULL if this subtree is not on-list */ > if (!rstatc->updated_next) > goto unlock_ret; > -- > 2.47.1 >