From: Mateusz Guzik <mjguzik@gmail.com>
Date: Sat, 29 Nov 2025 06:57:21 +0100
Subject: Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
To: Jan Kara
Cc: Mathieu Desnoyers, Gabriel Krisman Bertazi, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Shakeel Butt, Michal Hocko, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, "Liam R. Howlett", Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner
References: <20251127233635.4170047-1-krisman@suse.de>
Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Thomas Gleixner Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: DDE3A40004 X-Rspamd-Server: rspam02 X-Stat-Signature: npqkdo3k8acg8iydie5hnzn7p3sqx968 X-Rspam-User: X-HE-Tag: 1764395855-54506 X-HE-Meta: U2FsdGVkX1+x0ceOWMxsTvjXn+sQWXeRYqECZTwk2A39oqGQkE+yquu/CQvkf/MUtcy8WBWBpitNMh1THSD++YNO3YM8jG01tdp++hGdudXxqd3sSsJEWEbGAUT+Ae4cVfQPplatPWrl0Y2hsstAdGoMIdq9SEI3M0EghjPrUV4xqC8FBfTxRzw7gbQcEU4F48cr0xU+y+ZzQMVblXoBo8chCgY+C0SHwOlxSqozVY8DT4KIGOc7xgWH4A2Uu9uWnjbPifsoaN9zNIYr/cu1+3IY/6FmTWPechR1+KnJxZWdn0u67SrM8yCEm7F7L9iyZyOyg79P6Cl1jOu0qZFOpZ/PwJw/kysxoi4K5nu7qT1M3pkwsDgB0fpN5pU6X5+34ONz6njAhynBWld1EMLANEh9L24lyALWlbruQIh4IiAhZcNSDcQb+pnwTJDmXl7AGj0ul9Nl8KZ+wAraMLG1+/IFwYZgpN4+s80gQtPz/x8ue4rQU7KQOR2WIXcdy4Df4/0hpK+lDs6a2ySYIuuT5CezJepOdm+APfiImqMrRSMPazpGh4JgSPzYZ2Za1Zmg6GRvGh0eJqPAITLyKznuRrTZZbqzAPyFXcYSV41zWTBbnODCo9Ut+Q1SVvL+wyXQaBYMb7XunhQXGYZ+5YLH36zYdIQ79f14AziE87+f6e1yy69QCpGoloXsaHyvQSH8A0OhzOEkrx+ZGdl/FIs7bx9+8R37Hl+YELgtVzLUUEdroed67pfkfzSCn62NDeKo3u9DQJaYM8CiMxDkQmsXUM+0zR1VzMDh3t/1JxdcFV7guHeH/do1m/hPW5jflffQxR4a6alVzv+1GHr23tx2mpsrZg/h9Wz4ibJmKtFFiGl9/8j35Od6fEsVn5gYt7AuTyJ3FjblGdCm7rg+2Tn5IFDkPqzuwLyAbvdBau8Z9qPkXNG2bUJj272F0gxNtCSDozk5SxUL6T0ZbUjY7lz Jip//JlL 12CEU4VQLoEQtUtBatBoYaOQ8DzJwUUosh6nFyj133r1G9Hgz464O/R50yvEYGZkqbL9ZyoXX83GuCikJ0srOOLasCIhSnpwz797qOhXcUuzrpFkTCP+6DAoaRY4Rc5myMhiWvaCNJyHsQUEJueJUf5626tSvPIDB1ibFOPAMzuTw3DcSTtHjmhUC61cvvS8cSqRfUjTpi/JLv1FZUV0focY2Ph+lhQznVZ5+UnjwSRDcd6ZD1Qbc2En3iHLoRkckRqclZUs+E4P3/NrRhwagT1sJqoU4FolP4ENzRfaxshMlgHSO0oI++i6iaQkOMgiMbOCT5kRXuQwbpLJymwX3m0UA/ZG1BCWkmbPmSVlt9QYwLKSfFI00xHe0spZxpwHT4bWzD7fv4hwpTinOWWYmum4oqoh5o8wWy+N/zDXQcElUzUY= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Nov 28, 2025 at 9:10=E2=80=AFPM Jan Kara wrote: > On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote: > > What would really reduce memory allocation overhead on fork > > is to move all those fields into a top level > > "struct mm_percpu_struct" as a first step. This would > > merge 3 per-cpu allocations into one when forking a new > > task. > > > > Then the second step is to create a mm_percpu_struct > > cache to bypass the per-cpu allocator. > > > > I suspect that by doing just that we'd get most of the > > performance benefits provided by the single-threaded special-case > > proposed here. > > I don't think so. Because in the profiles I have been doing for these > loads the biggest cost wasn't actually the per-cpu allocation itself but > the cost of zeroing the allocated counter for many CPUs (and then the > counter summarization on exit) and you're not going to get rid of that wi= th > just reshuffling per-cpu fields and adding slab allocator in front. > The entire ordeal has been discussed several times already. I'm rather disappointed there is a new patchset posted which does not address any of it and goes straight to special-casing single-threaded operation. The major claims (by me anyway) are: 1. single-threaded operation for fork + exec suffers avoidable overhead even without the rss counter problem, which are tractable with the same kind of thing which would sort out the multi-threaded problem 2. 
2. unfortunately there is an increasing number of multi-threaded (and
often short-lived) processes (example: lld, the linker from the LLVM
project; more broadly, plenty of things written in Rust, where people
think threading == performance)

Bottom line is, solutions like the one proposed in the patchset are at
best a stopgap, and even then they leave performance on the table for the
very case they are optimizing for. The pragmatic way forward (as I see it
anyway) is to fix up the multi-threaded case and see whether special-casing
single-threaded operation is still justifiable afterwards. Given that the
current patchset has to resort to atomics in certain cases, there is some
error-proneness and runtime overhead associated with it going beyond
merely checking if the process is single-threaded, which puts an
additional question mark on it.

Now to business:

You mentioned the rss loops are a problem. I agree, but they can be
largely damage-controlled. More importantly, there are 2 loops of the sort
already happening even with the patchset at hand.

mm_alloc_cid() results in one loop in the percpu allocator to zero out the
area, then mm_init_cid() performs the following:

	for_each_possible_cpu(i) {
		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);

		pcpu_cid->cid = MM_CID_UNSET;
		pcpu_cid->recent_cid = MM_CID_UNSET;
		pcpu_cid->time = 0;
	}

There is no way this is not visible already at 256 threads.

Preferably some magic would be done to init this on first use on a given
CPU. There is some bitmap tracking CPU presence, maybe this can be tackled
on top of it.

But for the sake of argument let's say that's too expensive or perhaps not
feasible. Even then, the walk can be done *once* by telling the percpu
allocator to refrain from zeroing the memory.

Which brings me to rss counters. In the current kernel that's *another*
loop over everything to zero it out. But it does not have to be that way.

Suppose the bitmap shenanigans mentioned above are a no-go for these as
well. Instead, the code could reach out to the percpu allocator to
allocate memory for both cid and rss (as mentioned by Mathieu), but have
it returned uninitialized and loop over it once, sorting out both cid and
rss in the same body. This should be drastically faster than the current
code.

But one may observe it is an invariant that the values sum up to 0 on
process exit. So if one were to make sure the values are all 0s the first
time the area is handed out by the percpu allocator, and then cache the
area somewhere for future allocs/frees of mm, there would be no need to do
the zeroing on alloc at all.

On the free side, summing up rss counters in check_mm() is only there for
debugging purposes. Suppose it is useful enough that it needs to stay.
Even then, as implemented right now, this is just slow for no reason:

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		long x = percpu_counter_sum(&mm->rss_stat[i]);
		[snip]
	}

That's *four* loops, with the extra overhead of an irq-trip for every
single one. This can be patched up to only do one loop, possibly even with
irqs enabled the entire time (see the sketch below). Doing the loop is
still slower than not doing it, but this may be just fast enough to
obsolete ideas like the one in the proposed patchset.

While per-cpu level caching for all possible allocations seems like the
easiest way out, it in fact does *NOT* fully solve the problem -- you are
still going to globally serialize in lru_gen_add_mm() (and the del part),
pgd_alloc() and other places.
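To make the "one loop" part concrete, a minimal sketch of what I mean for
check_mm() (untested, it pokes at percpu_counter internals directly and
assumes nothing else can touch the counters at that point, so treat it as
an illustration rather than a patch):

	long sums[NR_MM_COUNTERS];
	int i, cpu;

	/* start from the central counts */
	for (i = 0; i < NR_MM_COUNTERS; i++)
		sums[i] = mm->rss_stat[i].count;

	/* fold in the per-cpu deviations for all counters in a single pass */
	for_each_possible_cpu(cpu) {
		for (i = 0; i < NR_MM_COUNTERS; i++)
			sums[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);
	}

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		if (unlikely(sums[i]))
			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
				 mm, resident_page_types[i], sums[i]);
	}

This still walks all possible CPUs, but it is one cache-friendly pass with
no lock/irq round-trips instead of NR_MM_COUNTERS separate ones, and it
keeps the debugging value of the check.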
Or, to put the caching point differently: per-cpu caching of mm_struct
itself makes no sense in the current kernel (with the patchset or not),
because on the way to finishing the alloc or free you are going to
globally serialize several times, and *that* is the issue to fix in the
long run. You can make the problematic locks fine-grained (and
consequently alleviate the scalability aspect), but you are still going to
suffer the overhead of taking them.

As far as I'm concerned, the real long-term solution(tm) would make the
cached mm's retain the expensive-to-sort-out state -- list presence,
percpu memory and whatever else. To that end I see 2 feasible approaches:

1. a dedicated allocator with coarse granularity

Instead of per-cpu, you could have an instance for every n threads (let's
say 8 or whatever). This would pose a tradeoff between total memory usage
and scalability outside of a microbenchmark setting. You are still going
to serialize in some cases, but only once on alloc and once on free, not
several times, and you are still cheaper in the single-threaded case. This
is faster all around. (A rough sketch of the shape of this is at the end
of this mail.)

2. dtor support in the slub allocator

ctor does the hard work and dtor undoes it. There is an unfinished
patchset by Harry which implements the idea[1].

There is a serious concern about deadlock potential stemming from running
arbitrary dtor code during memory reclaim. I already described elsewhere
how, with a little bit of discipline supported by lockdep, this is a
non-issue (tl;dr: add spinlocks marked as "leaf" (you can't take any other
locks while holding one and you have to disable interrupts) + mark dtors
as only allowed to hold a leaf spinlock, et voila, code guaranteed not to
deadlock). But then all code trying to cache state which is to be undone
by the dtor has to be patched to facilitate it. Again, bugs in the area
would be sorted out by lockdep. The good news is that folks were
apparently open to punting reclaim of such memory to a workqueue, which
completely alleviates that concern anyway.

As it happens, when fork + exit is involved there are numerous other
bottlenecks which overshadow the above, but that's a rant for another day.
Here we can pretend for a minute they are solved.

[1] https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2-wip?ref_type=heads
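PS: to illustrate the shape of option 1, below is a very rough sketch. All
names here (mm_cached_state, mm_state_pool, mm_cached_state_build()) are
made up for illustration, I'm reading "every n threads" as "every n
possible CPUs", and locking/irq-context details are glossed over; the only
point is that alloc and free each take one lock once, and the
expensive-to-rebuild state survives across mm lifetimes:

	#include <linux/cache.h>
	#include <linux/list.h>
	#include <linux/percpu.h>
	#include <linux/smp.h>
	#include <linux/spinlock.h>

	#define MM_STATE_CPUS_PER_POOL	8

	/* everything expensive to recreate: percpu rss/cid area and friends */
	struct mm_cached_state {
		struct list_head node;
		void __percpu *pcpu_area;
	};

	struct mm_state_pool {
		spinlock_t lock;
		struct list_head cached;
		int nr;
	} ____cacheline_aligned_in_smp;

	/* one pool per MM_STATE_CPUS_PER_POOL possible CPUs, set up at boot */
	static struct mm_state_pool *mm_state_pools;

	/* slow path: percpu alloc + the one-time init discussed above; not shown */
	static struct mm_cached_state *mm_cached_state_build(void);

	static struct mm_state_pool *mm_state_pool_mine(void)
	{
		/* the current CPU is only a placement hint, no need to pin it */
		return &mm_state_pools[raw_smp_processor_id() / MM_STATE_CPUS_PER_POOL];
	}

	static struct mm_cached_state *mm_cached_state_get(void)
	{
		struct mm_state_pool *pool = mm_state_pool_mine();
		struct mm_cached_state *s = NULL;

		/* one lock round-trip on alloc... */
		spin_lock(&pool->lock);
		if (pool->nr) {
			s = list_first_entry(&pool->cached, struct mm_cached_state, node);
			list_del(&s->node);
			pool->nr--;
		}
		spin_unlock(&pool->lock);

		if (!s)
			s = mm_cached_state_build();
		return s;
	}

	static void mm_cached_state_put(struct mm_cached_state *s)
	{
		struct mm_state_pool *pool = mm_state_pool_mine();

		/* ...and one on free; the initialized state is kept intact for reuse */
		spin_lock(&pool->lock);
		list_add(&s->node, &pool->cached);
		pool->nr++;
		spin_unlock(&pool->lock);
	}

A real version would want a cap on the per-pool count plus some way to
shrink under memory pressure, but those don't change the structure of the
thing: one serialization point on each of alloc and free, and no
re-zeroing or re-initialization of the percpu area in the common case.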