From: Mateusz Guzik <mjguzik@gmail.com>
Date: Thu, 3 Apr 2025 16:31:28 +0200
Subject: Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of percpu_counters
To: Kairui Song
Cc: Sweet Tea Dorminy, Andrew Morton, Steven Rostedt, Masami Hiramatsu,
 Mathieu Desnoyers, Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu,
 David Rientjes, Christian König, Shakeel Butt, Johannes Weiner,
 Lorenzo Stoakes, "Liam R. Howlett", Suren Baghdasaryan, Vlastimil Babka,
 Christian Brauner, Wei Yang, David Hildenbrand, Miaohe Lin, Al Viro,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-trace-kernel@vger.kernel.org, Yu Zhao, Roman Gushchin
On Tue, Apr 1, 2025 at 5:27 AM Kairui Song wrote:
>
> On Tue, Apr 1, 2025 at 6:36 AM Sweet Tea Dorminy
> wrote:
> >
> > [Resend as requested as RFC and minus prereq-patch-id junk]
> >
> > Recently, several internal services had an RSS usage regression as part
> > of a kernel upgrade. Previously, they were on a pre-6.2 kernel and were
> > able to read RSS statistics in a backup watchdog process to monitor and
> > decide if they'd overrun their memory budget. Now, however, a
> > representative service with five threads, expected to use about a
> > hundred MB of memory, on a 250-cpu machine had memory usage tens of
> > megabytes different from the expected amount -- a significant relative
> > inaccuracy, which caused the watchdog to act.
> >
> > This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
> > percpu_counter") [1]. Previously, the memory error was bounded by
> > 64*nr_threads pages, a very livable megabyte. Now, however, as a result
> > of scheduler decisions moving the threads around the CPUs, the memory
> > error could be as large as a gigabyte.
> >
> > This is a tremendous inaccuracy for any few-threaded program on a
> > large machine and impedes monitoring significantly. These stat
> > counters are also used to make OOM killing decisions, so this
> > additional inaccuracy could make a big difference in OOM situations --
> > either resulting in the wrong process being killed, or in less memory
> > being returned from an OOM-kill than expected.
> >
> > Finally, while the change to percpu_counter does significantly improve
> > accuracy over the previous per-thread error for many-threaded
> > services, it also has performance implications -- up to 12% slower for
> > short-lived processes and 9% increased system time in make test
> > workloads [2].
> >
> > A previous attempt to address this regression by Peng Zhang [3] used a
> > hybrid approach with delayed allocation of percpu memory for
> > rss_stats, showing promising improvements of 2-4% for process
> > operations and 6.7% for page faults.
> >
> > This RFC takes a different direction by replacing percpu_counters with
> > a more efficient set of per-NUMA-node atomics. The approach:
> >
> > - Uses one atomic per node, up to a bound, to reduce cross-node updates.
> > - Keeps a similar batching mechanism, with a smaller batch size.
> > - Eliminates the use of a spin lock during batch updates, bounding
> >   stat update latency.
> > - Reduces percpu memory usage and thus thread startup time.
> >
> > Most importantly, this bounds the total error to 32 times the number
> > of NUMA nodes, significantly smaller than the previous error bounds.
> >
> > On a 112-core machine, lmbench showed comparable results before and
> > after this patch.
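The batching scheme in the list above can be sketched in userspace C roughly as follows. This is illustrative only: the names (rss_node, NODE_BATCH, rss_add) are invented, and the actual patch's bookkeeping, which yields the 32 * nr_nodes total bound, differs from this simplified thread-local stash.

```c
/* Sketch of a per-NUMA-node counter with a small batched stash.
 * Updates accumulate locally and are folded into the current node's
 * atomic once a batch fills; reads sum only nr_nodes atomics instead
 * of walking every CPU.  Not the patch's actual code. */
#include <stdatomic.h>

#define MAX_NODES  8
#define NODE_BATCH 32	/* small batch, bounding per-updater error */

static _Atomic long rss_node[MAX_NODES];
static _Thread_local long rss_pending;

/* hot path: no lock, one atomic RMW at most every NODE_BATCH pages */
static void rss_add(int node, long pages)
{
	rss_pending += pages;
	if (rss_pending >= NODE_BATCH || rss_pending <= -NODE_BATCH) {
		/* flush to whichever node the task currently runs on;
		 * the global sum stays correct either way */
		atomic_fetch_add_explicit(&rss_node[node], rss_pending,
					  memory_order_relaxed);
		rss_pending = 0;
	}
}

/* read side: O(nr_nodes), not O(nr_cpus) */
static long rss_read(int nr_nodes)
{
	long sum = 0;

	for (int n = 0; n < nr_nodes; n++)
		sum += atomic_load_explicit(&rss_node[n],
					    memory_order_relaxed);
	return sum;
}
```

The key design point is that the read side only touches a handful of cachelines (one per node), while writers still avoid a shared spin lock.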
> > However, on a 224-core machine, performance improvements were
> > significant over percpu_counter:
> > - Pagefault latency improved by 8.91%
> > - Process fork latency improved by 6.27%
> > - Process fork/execve latency improved by 6.06%
> > - Process fork/exit latency improved by 6.58%
> >
> > will-it-scale also showed significant improvements on these machines.
> >
> > [1] https://lore.kernel.org/all/20221024052841.3291983-1-shakeelb@google.com/
> > [2] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3/
> > [3] https://lore.kernel.org/all/20240418142008.2775308-1-zhangpeng362@huawei.com/
>
> Hi, thanks for the idea.
>
> I'd like to mention my previous work on this:
> https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/
>
> Basically it uses one global percpu counter instead of a per-task one,
> and flushes each CPU's sub-counter on context_switch (only if
> next->active_mm != current->active_mm; no flush for a switch to IRQ or
> a kthread). More like a percpu stash.
>
> The benchmark looks great and the fast path is super fast (just a
> this_cpu_add). context_switch is also fine because the scheduler tries
> to keep a task on the same CPU to make better use of cache. And it can
> leverage a cpu bitmap, as tlb shootdown does, to optimize the whole
> thing.
>
> The error and total memory consumption are both lower than in the
> current design too.

Note there are two unrelated components in that patchset:
- one per-cpu instance of rss counters which is rolled up on context
  switches, avoiding the costly counter alloc/free on mm
  creation/teardown
- cpu iteration in get_mm_counter

The allocation problem is fixable without abandoning the counters, see
my other e-mail (tl;dr: let mm's hanging out in slab caches *keep* the
counters). This aspect has to be solved anyway due to mm_alloc_cid().
Providing a way to sort it out covers *both* the rss counters and the
cid thing.
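The "percpu stash" scheme Kairui describes above, a per-CPU delta flushed whenever the CPU switches to a different mm, can be sketched as follows. This is a simplified userspace illustration, not the actual patch: struct names and the flush condition are invented, and the read side is included because it is what the rest of this reply focuses on.

```c
/* Illustrative model of the stash-on-context-switch idea. */
#include <stddef.h>

#define NR_CPUS 4

struct mm {
	long rss;		/* authoritative counter, updated lazily */
};

struct cpu_stash {
	struct mm *active_mm;	/* mm the stash currently belongs to */
	long delta;		/* pending pages, flushed on mm switch */
};

static struct cpu_stash stash[NR_CPUS];

/* hot path: in the kernel this would be a single this_cpu_add() */
static void rss_add(int cpu, long pages)
{
	stash[cpu].delta += pages;
}

/* called from context_switch() when next->active_mm != prev->active_mm */
static void rss_switch_mm(int cpu, struct mm *next)
{
	struct cpu_stash *s = &stash[cpu];

	if (s->active_mm && s->active_mm != next) {
		s->active_mm->rss += s->delta;	/* fold pending delta */
		s->delta = 0;
	}
	s->active_mm = next;
}

/* exact read: has to walk every CPU to pick up unflushed deltas */
static long rss_read(struct mm *mm)
{
	long sum = mm->rss;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (stash[cpu].active_mm == mm)
			sum += stash[cpu].delta;
	return sum;
}
```

Note that rss_read() is where the cost discussed in the reply below lives: the write path is a plain add, but an exact read walks all CPUs.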
In your patchset the accuracy increase comes at the expense of walking
all CPUs on every read, while a big part of the point of using percpu
counters is to maintain a good enough approximation somewhere so that
such a walk is not necessary. Indeed the stock kernel fails to achieve
that at the moment and, as you can see, there is discussion of how to
tackle it. It is a general percpu counter problem.

I verified that get_mm_counter is issued in particular on mmap and
munmap. On high core count boxes (hundreds of cores) the mandatory
all-CPU walk has to be a problem, especially if a given process is also
highly multi-threaded and mmap/munmap heavy. Thus I think your patchset
would also benefit from some form of distribution of the counter beyond
just the per-cpu stashes and the one centralized value.

At the same time, if RSS accuracy is your only concern and you don't
care about walking the CPUs, then you could modify the current code to
do that as well. Or to put it differently: while it may be that
changing the scheme to keep a local copy makes sense, the patchset is
definitely not committable in the proposed form -- it really wants
better quality caching of the state.

-- 
Mateusz Guzik