From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 27 Mar 2025 00:36:07 +0100
From: Mateusz Guzik <mjguzik@gmail.com>
To: Sweet Tea Dorminy
Cc: Andrew Morton, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu, David Rientjes,
	Jani Nikula, Sweet Tea Dorminy, Johannes Weiner, Christian Brauner,
	Lorenzo Stoakes, Suren Baghdasaryan, "Liam R. Howlett", Wei Yang,
	David Hildenbrand, Miaohe Lin, Al Viro, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	Yu Zhao, Roman Gushchin, Greg Thelen
Subject: Re: [PATCH] mm: use per-numa-node atomics instead of percpu_counters
In-Reply-To: <20250325221550.396212-1-sweettea-kernel@dorminy.me>
References: <20250325221550.396212-1-sweettea-kernel@dorminy.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

On Tue, Mar 25, 2025 at 06:15:49PM -0400, Sweet Tea Dorminy wrote:
> From: Sweet Tea Dorminy
>
> This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") [1]. Previously, the memory error was bounded by
> 64*nr_threads pages, a very livable megabyte. Now, however, as a result
> of scheduler decisions moving the threads around the CPUs, the memory
> error could be as large as a gigabyte.
>
> This is a really tremendous inaccuracy for any few-threaded program on
> a large machine and impedes monitoring significantly. These stat
> counters are also used to make OOM killing decisions, so this
> additional inaccuracy could make a big difference in OOM situations --
> either resulting in the wrong process being killed, or in less memory
> being returned from an OOM-kill than expected.
>
> Finally, while the change to percpu_counter does significantly improve
> the accuracy over the previous per-thread error for many-threaded
> services, it does also have performance implications - up to 12% slower
> for short-lived processes and 9% increased system time in make test
> workloads [2].
>
> A previous attempt to address this regression by Peng Zhang [3] used a
> hybrid approach with delayed allocation of percpu memory for rss_stats,
> showing promising improvements of 2-4% for process operations and 6.7%
> for page faults.
>
> This RFC takes a different direction by replacing percpu_counters with
> a more efficient set of per-NUMA-node atomics. The approach:
>
> - Uses one atomic per node up to a bound to reduce cross-node updates.
> - Keeps a similar batching mechanism, with a smaller batch size.
> - Eliminates the use of a spin lock during batch updates, bounding stat
>   update latency.
> - Reduces percpu memory usage and thus thread startup time.
>
> Most importantly, this bounds the total error to 32 times the number of
> NUMA nodes, significantly smaller than previous error bounds.
>
> On a 112-core machine, lmbench showed comparable results before and
> after this patch. However, on a 224-core machine, performance
> improvements were significant over percpu_counter:
> - Pagefault latency improved by 8.91%
> - Process fork latency improved by 6.27%
> - Process fork/execve latency improved by 6.06%
> - Process fork/exit latency improved by 6.58%
>
> will-it-scale also showed significant improvements on these machines.
>

The problem on fork/exec/exit stems from back-to-back trips to the
per-cpu allocator every time an mm is allocated/freed (which happens for
each of these syscalls) -- they end up serializing on the same global
spinlock.

On the alloc side this is mm_alloc_cid() followed by
percpu_counter_init_many(). Even if you eliminate the counters for rss,
you are still paying for CID. While this scales better than the stock
kernel, it still leaves perf on the table.

Per our discussion on IRC there is WIP to eliminate both cases by
caching the state in the mm. This depends on adding a dtor for SLUB to
undo the work done in the ctor. Harry did the work on that front, though
it has not been submitted to -next yet.
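To make the caching idea concrete, here is a userspace analogue (all
names are made up for illustration; this is not the actual SLUB
interface): the expensive state is allocated once when the slot is
constructed and only released when the slot itself is torn down, so
alloc/free cycles on the hot path never go back to the allocator:

```c
/* Userspace sketch of "ctor allocates once, dtor frees once". */
#include <assert.h>
#include <stdlib.h>

struct mm_like {
	long *rss_stat;	/* stands in for the percpu/CID state */
	int in_use;
};

/* ctor: runs once per slot, pays the allocation cost up front */
static int mm_ctor(struct mm_like *slot, int ncounters)
{
	slot->rss_stat = calloc((size_t)ncounters, sizeof(long));
	slot->in_use = 0;
	return slot->rss_stat ? 0 : -1;
}

/* dtor: the only place the expensive state is released */
static void mm_dtor(struct mm_like *slot)
{
	free(slot->rss_stat);
	slot->rss_stat = NULL;
}

/* hot-path alloc/free: no allocator call, the state is already there */
static struct mm_like *mm_alloc(struct mm_like *slot)
{
	slot->in_use = 1;
	return slot;
}

static void mm_free(struct mm_like *slot)
{
	slot->in_use = 0;	/* keep rss_stat for the next user */
}
```

The point being that the global allocator lock is only taken on ctor and
dtor, not on every fork/exit pair.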
There is a highly-inefficient sanity-check loop in check_mm(). Instead
of walking the entire list 4 times with toggling interrupts in-between,
it can do the walk once.

So that's for the fork/execve/exit triplets.

As for the page fault latency, your patch adds atomics to the fast path.
Even absent any competition for cachelines with other CPUs this will be
slower to execute than the current primitive. I suspect you are
observing a speed-up with your change because you end up landing in the
slowpath a lot, and that sucker is globally serialized on a spinlock --
this has to hurt.

Per my other message in the thread, and a later IRC discussion, this is
fixable by adding intermediate counters, in the same spirit you did
here. I'll note though that NUMA nodes can be incredibly core-y and that
granularity may be way too coarse.

That aside, there are globally-locked lists mms are cycling in and out
of which could also get the "stay there while cached" treatment.

All in all I claim that:
1. fork/execve/exit tests will do better than they are doing with your
   patch if going to the percpu allocator gets eliminated altogether (in
   your patch it is not, for mm_alloc_cid() and the freeing
   counterpart), along with unscrewing the loop in check_mm().
2. fault handling will be faster than it is with your patch *if*
   something like per-NUMA-node state gets added for the slowpath -- the
   stock fast path is faster than yours, while the stock slowpath is way
   slower. You can get the best of both worlds on this one. Hell, it may
   be that your patch as-is can be easily repurposed to decentralize the
   main percpu counter? I mean, perhaps there is no need for any fancy
   hierarchical structure.

I can commit to providing a viable patch for sorting out the
fork/execve/exit side, but it is going to take about a week. You do have
a PoC in the meantime (too ugly to share publicly :>).

So that's my take on it. Note I'm not a maintainer of any of this, but I
did some work on the thing in the past.
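PS: in case it helps, here is a single-threaded userspace model of what
I mean in point 2 -- a plain per-CPU delta on the fast path, flushed
into a per-node atomic once it exceeds a batch. Names and batch sizes
are made up; the real thing would use this_cpu ops for the delta and
some flush into the global count:

```c
/* Two-level counter sketch: cheap local delta, per-node atomic slowpath. */
#include <assert.h>
#include <stdatomic.h>

#define PCPU_BATCH	32	/* arbitrary for illustration */
#define NR_NODES	2

struct tiered_counter {
	long cpu_delta;			/* per-CPU fast path, no atomics */
	_Atomic long node[NR_NODES];	/* per-node intermediate counters */
};

/* fast path: plain add, as the stock percpu_counter fast path does */
static void tc_add(struct tiered_counter *tc, int node, long v)
{
	tc->cpu_delta += v;
	if (tc->cpu_delta >= PCPU_BATCH || tc->cpu_delta <= -PCPU_BATCH) {
		/* slow path: one relaxed atomic on the local node,
		 * no globally serialized spinlock */
		atomic_fetch_add_explicit(&tc->node[node], tc->cpu_delta,
					  memory_order_relaxed);
		tc->cpu_delta = 0;
	}
}

/* approximate read: sum the node counters plus the local delta;
 * the error is bounded by the unflushed deltas, not by nr_cpus * batch
 * of a single global counter */
static long tc_read(struct tiered_counter *tc)
{
	long sum = tc->cpu_delta;

	for (int n = 0; n < NR_NODES; n++)
		sum += atomic_load_explicit(&tc->node[n],
					    memory_order_relaxed);
	return sum;
}
```

Again, just a sketch of the shape, not a claim about the final form.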