From: Mateusz Guzik <mjguzik@gmail.com>
Date: Mon, 24 Nov 2025 05:03:39 +0100
Subject: Re: [PATCH 0/3] further damage-control lack of clone scalability
To: Matthew Wilcox
Cc: oleg@redhat.com, brauner@kernel.org, linux-kernel@vger.kernel.org,
    akpm@linux-foundation.org, linux-mm@kvack.org
On Sun, Nov 23, 2025 at 11:33 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> Ultimately, in order to make this scale, CPUs need to stop sharing the
> locks (in the common case anyway). To that end PID space needs to get
> partitioned, with ranges allocated to small sets of CPUs (say 8 or 10
> or similar -- small enough for contention to not matter outside of a
> microbenchmark). The ID space on Linux is enormous, so this should not
> be a problem.
> The only potential issue is that PIDs will no longer be sequential,
> which is userspace-visible (I mean, suppose a wanker knows for a fact
> nobody is forking and insists on taking advantage of it (by Hyrum's
> law), then his two forks in a row have predictable IDs). Anyhow, then
> most real workloads should be able to allocate IDs without depleting
> the range on whatever they happen to be running on (but deallocating
> may still need to look at other CPU ranges).
>
> With something like a bitmap + hash this is trivially achievable --
> just have a set of bitmaps and have each assigned a 'base' which you
> add/subtract on the id obtained from the given bitmap. The hash is a
> non-problem.

So I had a bit of a think and came up with something; it boils down to
special-casing the last level of a (quasi-)radix tree. It provides the
scalability I was looking for, albeit with some uglification.

This is a sketch. Note that as part of the allocation policy the
kernel must postpone PID reuse -- you can't just free/alloc in a tiny
range and call it a day.

Part of the issue for me is that there are 32 levels of allowed
namespaces. The stock code relocks pidmap_lock for every single one(!)
on alloc, which is just terrible. Suppose one introduces per-namespace
locks: that is still 32 lock trips to grab a pid and it still does not
solve the scalability problem. For my taste that is questionable at
best, but at the same time this is what the kernel is already doing,
so let's pretend for a minute the relocks are not a concern.

The solution is based on a (quasi-)radix tree where the datum is a
pointer to a struct containing a spinlock, a bitmap and an array of
pids. It will likely be an xarray, but I'm going to keep typing
"radix" for the time being as this is the concept which matters. The
struct covers a carved-out id range and can fit -- say -- 250 entries
or so, or whatever else fits in a size considered fine(tm).
The actual pid is obtained by adding the radix id (which serves as a
prefix) to the offset into the array.

In order to alloc a pid for a given namespace, the calling CPU checks
whether it already has a range carved out. If so, it locks the thing
and looks for a free pid. Absent a free pid, or a range in the first
place, it goes to the xarray to get space. This avoids synchronisation
with other CPUs for those ~250 forks (modulo a thread with an id from
this range exiting on a different CPU), which sorts out the
scalability problem in practical settings. Of course, once someone
notices that no IDs from the range are in use anymore *and* the last
id was handed out at some point, the range gets freed.

But you still have to do the locking for every ns. So let's say that
is in fact a problem and it would be most preferred if the CPU could
take *one* lock and stick with it for all namespaces, all while
retaining scalability.

Instead, every CPU could have its own pidmap_lock. The struct with the
bitmap + pid array would hold a pointer to a spinlock, referring to
the pidmap lock of whichever CPU allocated the range. Et voila: allocs
still get away with one lock acquire in the common case where there
are free ids in all the namespaces which need to be touched.
Contention only shows up on the xarray locks if you ran out of space
*or* on the pidmap lock if someone is freeing an id.

Hell, this can probably be made lockless to further speed it up if
need be. However, lockless or not, the key point is that most allocs
will *not* be bouncing any cachelines.