From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 27 Mar 2025 00:36:07 +0100
From: Mateusz Guzik <mjguzik@gmail.com>
To: Sweet Tea Dorminy
Cc: Andrew Morton, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu, David Rientjes,
	Jani Nikula, Sweet Tea Dorminy, Johannes Weiner, Christian Brauner,
	Lorenzo Stoakes, Suren Baghdasaryan, "Liam R. Howlett", Wei Yang,
	David Hildenbrand, Miaohe Lin, Al Viro, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	Yu Zhao, Roman Gushchin, Greg Thelen
Subject: Re: [PATCH] mm: use per-numa-node atomics instead of percpu_counters
In-Reply-To: <20250325221550.396212-1-sweettea-kernel@dorminy.me>
References: <20250325221550.396212-1-sweettea-kernel@dorminy.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

On Tue, Mar 25, 2025 at 06:15:49PM -0400, Sweet Tea Dorminy wrote:
> From: Sweet Tea Dorminy
>
> This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") [1]. Previously, the memory error was bounded by
> 64*nr_threads pages, a very livable megabyte. Now, however, as a result
> of scheduler decisions moving the threads around the CPUs, the memory
> error could be as large as a gigabyte.
>
> This is a really tremendous inaccuracy for any few-threaded program on
> a large machine and impedes monitoring significantly. These stat
> counters are also used to make OOM killing decisions, so this
> additional inaccuracy could make a big difference in OOM situations --
> either resulting in the wrong process being killed, or in less memory
> being returned from an OOM-kill than expected.
>
> Finally, while the change to percpu_counter does significantly improve
> the accuracy over the previous per-thread error for many-threaded
> services, it does also have performance implications - up to 12% slower
> for short-lived processes and 9% increased system time in make test
> workloads [2].
>
> A previous attempt to address this regression by Peng Zhang [3] used a
> hybrid approach with delayed allocation of percpu memory for rss_stats,
> showing promising improvements of 2-4% for process operations and 6.7%
> for page faults.
>
> This RFC takes a different direction by replacing percpu_counters with
> a more efficient set of per-NUMA-node atomics. The approach:
>
> - Uses one atomic per node up to a bound to reduce cross-node updates.
> - Keeps a similar batching mechanism, with a smaller batch size.
> - Eliminates the use of a spin lock during batch updates, bounding stat
>   update latency.
> - Reduces percpu memory usage and thus thread startup time.
>
> Most importantly, this bounds the total error to 32 times the number of
> NUMA nodes, significantly smaller than previous error bounds.
>
> On a 112-core machine, lmbench showed comparable results before and
> after this patch. However, on a 224-core machine, performance
> improvements were significant over percpu_counter:
> - Pagefault latency improved by 8.91%
> - Process fork latency improved by 6.27%
> - Process fork/execve latency improved by 6.06%
> - Process fork/exit latency improved by 6.58%
>
> will-it-scale also showed significant improvements on these machines.
>

The problem on fork/exec/exit stems from back-to-back trips to the
per-cpu allocator every time an mm is allocated/freed (which happens for
each of these syscalls) -- they end up serializing on the same global
spinlock.

On the alloc side this is mm_alloc_cid() followed by
percpu_counter_init_many(). Even if you eliminate the counters for rss,
you are still paying for CID. While this scales better than the stock
kernel, it still leaves perf on the table.

Per our discussion on IRC there is WIP to eliminate both cases by
caching the state in the mm. This depends on adding a dtor for SLUB to
undo the work done in the ctor. Harry did the work on that front, though
it has not been submitted to -next yet.
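To make the caching idea concrete, here is a userspace analogue (all
names are made up for illustration; this is not the actual SLUB
interface): the expensive state is allocated once when the slot is
constructed and only released when the slot itself is torn down, so
alloc/free cycles on the hot path never go back to the allocator:

```c
/* Userspace sketch of "ctor allocates once, dtor frees once". */
#include <assert.h>
#include <stdlib.h>

struct mm_like {
	long *rss_stat;	/* stands in for the percpu/CID state */
	int in_use;
};

/* ctor: runs once per slot, pays the allocation cost up front */
static int mm_ctor(struct mm_like *slot, int ncounters)
{
	slot->rss_stat = calloc((size_t)ncounters, sizeof(long));
	slot->in_use = 0;
	return slot->rss_stat ? 0 : -1;
}

/* dtor: the only place the expensive state is released */
static void mm_dtor(struct mm_like *slot)
{
	free(slot->rss_stat);
	slot->rss_stat = NULL;
}

/* hot-path alloc/free: no allocator call, the state is already there */
static struct mm_like *mm_alloc(struct mm_like *slot)
{
	slot->in_use = 1;
	return slot;
}

static void mm_free(struct mm_like *slot)
{
	slot->in_use = 0;	/* keep rss_stat for the next user */
}
```

The point being that the global allocator lock is only taken on ctor and
dtor, not on every fork/exit pair.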
There is a highly-inefficient sanity-check loop in check_mm(). Instead
of walking the entire list 4 times with toggling interrupts in-between,
it can do the walk once.

So that's for the fork/execve/exit triplets.

As for the page fault latency, your patch adds atomics to the fast path.
Even absent any competition for cachelines with other CPUs this will be
slower to execute than the current primitive. I suspect you are
observing a speed-up with your change because you end up landing in the
slowpath a lot, and that sucker is globally serialized on a spinlock --
this has to hurt.

Per my other message in the thread, and a later IRC discussion, this is
fixable by adding intermediate counters, in the same spirit you did
here. I'll note though that NUMA nodes can be incredibly core-y and that
granularity may be way too coarse.

That aside, there are globally-locked lists mms are cycling in and out
of which could also get the "stay there while cached" treatment.

All in all I claim that:
1. fork/execve/exit tests will do better than they are doing with your
   patch if going to the percpu allocator gets eliminated altogether (in
   your patch it is not, for mm_alloc_cid() and the freeing
   counterpart), along with unscrewing the loop in check_mm().
2. fault handling will be faster than it is with your patch *if*
   something like per-NUMA-node state gets added for the slowpath -- the
   stock fast path is faster than yours, while the stock slowpath is way
   slower. You can get the best of both worlds on this one. Hell, it may
   be that your patch as-is can be easily repurposed to decentralize the
   main percpu counter? I mean, perhaps there is no need for any fancy
   hierarchical structure.

I can commit to providing a viable patch for sorting out the
fork/execve/exit side, but it is going to take about a week. You do have
a PoC in the meantime (too ugly to share publicly :>).

So that's my take on it. Note I'm not a maintainer of any of this, but I
did some work on the thing in the past.
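PS: in case it helps, here is a single-threaded userspace model of what
I mean in point 2 -- a plain per-CPU delta on the fast path, flushed
into a per-node atomic once it exceeds a batch. Names and batch sizes
are made up; the real thing would use this_cpu ops for the delta and
some flush into the global count:

```c
/* Two-level counter sketch: cheap local delta, per-node atomic slowpath. */
#include <assert.h>
#include <stdatomic.h>

#define PCPU_BATCH	32	/* arbitrary for illustration */
#define NR_NODES	2

struct tiered_counter {
	long cpu_delta;			/* per-CPU fast path, no atomics */
	_Atomic long node[NR_NODES];	/* per-node intermediate counters */
};

/* fast path: plain add, as the stock percpu_counter fast path does */
static void tc_add(struct tiered_counter *tc, int node, long v)
{
	tc->cpu_delta += v;
	if (tc->cpu_delta >= PCPU_BATCH || tc->cpu_delta <= -PCPU_BATCH) {
		/* slow path: one relaxed atomic on the local node,
		 * no globally serialized spinlock */
		atomic_fetch_add_explicit(&tc->node[node], tc->cpu_delta,
					  memory_order_relaxed);
		tc->cpu_delta = 0;
	}
}

/* approximate read: sum the node counters plus the local delta;
 * the error is bounded by the unflushed deltas, not by nr_cpus * batch
 * of a single global counter */
static long tc_read(struct tiered_counter *tc)
{
	long sum = tc->cpu_delta;

	for (int n = 0; n < NR_NODES; n++)
		sum += atomic_load_explicit(&tc->node[n],
					    memory_order_relaxed);
	return sum;
}
```

Again, just a sketch of the shape, not a claim about the final form.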