From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 2 Apr 2025 17:00:34 -0700
From: Shakeel Butt <shakeel.butt@linux.dev>
To: Sweet Tea Dorminy
Cc: Andrew Morton, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
 Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu, David Rientjes,
 Christian König, Johannes Weiner, Sweet Tea Dorminy, Lorenzo Stoakes,
 "Liam R. Howlett", Suren Baghdasaryan, Vlastimil Babka, Christian Brauner,
 Wei Yang, David Hildenbrand, Miaohe Lin, Al Viro, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 Yu Zhao, Roman Gushchin, Mateusz Guzik
Subject: Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of percpu_counters
Message-ID: <2m3wwqpha2jlo4zjn6xbucahfufej75gbaxxgh4j4h67pgrw7b@diodkog7ygk3>
References: <20250331223516.7810-2-sweettea-kernel@dorminy.me>
In-Reply-To: <20250331223516.7810-2-sweettea-kernel@dorminy.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
On Mon, Mar 31, 2025 at 06:35:14PM -0400, Sweet Tea Dorminy wrote:
> [Resend as requested as RFC and minus prereq-patch-id junk]
>
> Recently, several internal services had an RSS usage regression as part of a
> kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
> read RSS statistics in a backup watchdog process to monitor and decide if
> they'd overrun their memory budget.

Any reason these applications are not using memcg stats/usage instead of
RSS? RSS is not the only memory consumption for these applications.

> Now, however, a representative service
> with five threads, expected to use about a hundred MB of memory, on a 250-cpu
> machine had memory usage tens of megabytes different from the expected amount
> -- this constituted a significant percentage of inaccuracy, causing the
> watchdog to act.

Do these 5 threads jump all over the 250 CPUs?

> This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") [1]. Previously, the memory error was bounded by
> 64*nr_threads pages, a very livable megabyte. Now, however, as a result of
> scheduler decisions moving the threads around the CPUs, the memory error could
> be as large as a gigabyte.

Applications with tens of thousands of threads are very common at Google,
so the inaccuracy of the old per-thread scheme should be comparable for
such applications.

> This is a really tremendous inaccuracy for any few-threaded program on a
> large machine and impedes monitoring significantly. These stat counters are
> also used to make OOM killing decisions, so this additional inaccuracy could
> make a big difference in OOM situations -- either resulting in the wrong
> process being killed, or in less memory being returned from an OOM-kill than
> expected.
> Finally, while the change to percpu_counter does significantly improve the
> accuracy over the previous per-thread error for many-threaded services, it does
> also have performance implications - up to 12% slower for short-lived processes
> and 9% increased system time in make test workloads [2].
>
> A previous attempt to address this regression by Peng Zhang [3] used a hybrid
> approach with delayed allocation of percpu memory for rss_stats, showing
> promising improvements of 2-4% for process operations and 6.7% for page
> faults.
>
> This RFC takes a different direction by replacing percpu_counters with a
> more efficient set of per-NUMA-node atomics. The approach:
>
> - Uses one atomic per node up to a bound to reduce cross-node updates.
> - Keeps a similar batching mechanism, with a smaller batch size.
> - Eliminates the use of a spin lock during batch updates, bounding stat
>   update latency.
> - Reduces percpu memory usage and thus thread startup time.

That one atomic per node will easily become a bottleneck for applications
with a lot of threads, particularly on systems with a lot of CPUs per NUMA
node.

> Most importantly, this bounds the total error to 32 times the number of NUMA
> nodes, significantly smaller than previous error bounds.
>
> On a 112-core machine, lmbench showed comparable results before and after this
> patch. However, on a 224 core machine, performance improvements were

How many CPUs per node do each of these machines have?

> significant over percpu_counter:
> - Pagefault latency improved by 8.91%

The fork results below are understandable, since percpu counter allocation
is involved, but the page fault latency improvement above needs some
explanation.

> - Process fork latency improved by 6.27%
> - Process fork/execve latency improved by 6.06%
> - Process fork/exit latency improved by 6.58%
>
> will-it-scale also showed significant improvements on these machines.

Are these the process benchmarks or the thread ones?
> [1] https://lore.kernel.org/all/20221024052841.3291983-1-shakeelb@google.com/
> [2] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3/
> [3] https://lore.kernel.org/all/20240418142008.2775308-1-zhangpeng362@huawei.com/
>
> Signed-off-by: Sweet Tea Dorminy
> Cc: Yu Zhao
> Cc: Roman Gushchin
> Cc: Shakeel Butt
> Cc: Mathieu Desnoyers
> Cc: Mateusz Guzik
> Cc: Lorenzo Stoakes
>
> ---
>
> This is mostly a resend of an earlier patch, where I made an utter hash
> of specifying a base commit (and forgot to update my commit text to not
> call it an RFC, and forgot to update my email to the one I use for
> upstream work...). This is based on akpm/mm-unstable as of today.
>
> v1 can be found at
> https://lore.kernel.org/lkml/20250325221550.396212-1-sweettea-kernel@dorminy.me/
>
> Some interesting ideas came out of that discussion: Mathieu Desnoyers
> has a design doc for an improved percpu counter, multi-level, with
> constant drift, at
> https://lore.kernel.org/lkml/a89cb4d9-088e-4ed6-afde-f1b097de8db9@efficios.com/
> and would like performance comparisons against just reducing the batch
> size in the existing code;

You can do the experiments with different batch sizes in the existing code
without waiting for Mathieu's multi-level percpu counter.

> and Mateusz Guzik would also like a more general solution and is also
> working to fix the performance issues by caching mm state. Finally,
> Lorenzo Stoakes nacks, as it's too speculative and needs more
> discussion.
>
> I think the important part is that this improves accuracy; the current
> scheme is difficult to use on many-cored machines. It improves
> performance, but there are tradeoffs; but it tightly bounds the
> inaccuracy so that decisions can actually be reasonably made with the
> resulting numbers.
>
> This patch assumes that intra-NUMA node atomic updates are very cheap

This statement/assumption needs experimental data.
> and that assigning CPUs to an atomic counter by numa_node_id() % 16 is
> suitably balanced. However, if each atomic were shared by only, say, eight
> CPUs from the same NUMA node, this would further reduce atomic contention
> at the cost of more memory and more complicated assignment of CPU to
> atomic index. I don't think that additional complication is worth it given
> that this scheme seems to get good performance, but it might be. I do need
> to actually test the impact on a many-cores-one-NUMA-node machine, and I
> look forward to testing out Mathieu's hierarchical percpu counter with
> bounded error.

I am still not buying the 'good performance' point. To me, we might need to
go with a reduced batch size in the existing approach or the multi-level
approach from Mathieu (I still have to see Mateusz's and Kairui's
proposals).