From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 24 Apr 2025 12:28:37 +0100
From: Pedro Falcato <pfalcato@suse.de>
To: Harry Yoo
Cc: Vlastimil Babka, Christoph Lameter, David Rientjes, Andrew Morton,
 Dennis Zhou, Tejun Heo, Mateusz Guzik, Jamal Hadi Salim, Cong Wang,
 Jiri Pirko, Vlad Buslov, Yevgeny Kliteynik, Jan Kara, Byungchul Park,
 linux-mm@kvack.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the
 percpu allocator scalability problem
References: <20250424080755.272925-1-harry.yoo@oracle.com>
In-Reply-To: <20250424080755.272925-1-harry.yoo@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> Overview
> ========
> 
> The slab destructor feature existed in the early days of the slab
> allocator(s). It was removed by commit c59def9f222d ("Slab allocators:
> Drop support for destructors") in 2007 due to a lack of serious use
> cases at that time.
> 
> Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> constructor/destructor pair to mitigate the global serialization point
> (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> percpu memory during its lifetime.
> 
> Consider mm_struct: it allocates two percpu regions (mm_cid and
> rss_stat), so each allocate-free cycle requires two expensive
> acquire/release operations on that mutex.
> 
> We can mitigate this contention by retaining the percpu regions after
> the object is freed and releasing them only when the backing slab pages
> are freed.
> 
> How to do this with slab constructors and destructors: the constructor
> allocates percpu memory, and the destructor frees it when the slab pages
> are reclaimed; this slightly alters the constructor's semantics, as it
> can now fail.

I really, really, really, really don't like this. We're opening a
Pandora's box of slab deadlocks and other subtle locking issues. IMO the
best solution there would be, what, failing dtors? Which says a lot
about the whole situation...
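To make the lifecycle under discussion concrete, here is a minimal
userspace sketch of the proposed scheme. Every name in it is invented
for illustration; nothing here is a real slab or percpu API. The ctor
runs once when a slab is populated (and may fail, forcing a rollback),
object alloc/free cycles reuse the retained side allocation untouched,
and only slab teardown runs the dtor:

```c
#include <stdlib.h>

/*
 * Toy model: each object in a "slab" owns a side allocation, standing
 * in for the mm_cid/rss_stat percpu regions of mm_struct.
 */
struct obj {
	int *side;	/* stands in for the percpu regions */
	int in_use;
};

static int ctor_calls, dtor_calls;

static int ctor(struct obj *o)
{
	ctor_calls++;
	o->side = malloc(sizeof(*o->side));	/* "percpu" alloc; may fail */
	return o->side ? 0 : -1;
}

static void dtor(struct obj *o)
{
	dtor_calls++;
	free(o->side);
	o->side = NULL;
}

/* Populating a slab: a failing ctor must undo the ctors already run. */
static int populate_slab(struct obj *objs, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (ctor(&objs[i]) != 0)
			goto undo;
	return 0;
undo:
	while (--i >= 0)	/* dtor everything constructed so far */
		dtor(&objs[i]);
	return -1;
}

/* Object alloc/free never touch the side allocation... */
static struct obj *cache_alloc(struct obj *objs, int n)
{
	for (int i = 0; i < n; i++)
		if (!objs[i].in_use) {
			objs[i].in_use = 1;
			return &objs[i];
		}
	return NULL;
}

static void cache_free(struct obj *o)
{
	o->in_use = 0;		/* side allocation retained for reuse */
}

/* ...slab teardown is the only place the dtor runs. */
static void destroy_slab(struct obj *objs, int n)
{
	for (int i = 0; i < n; i++)
		dtor(&objs[i]);
}
```

The point of the model is the asymmetry: pcpu_alloc_mutex-style costs
are paid only on slab population/teardown, not on every object
alloc/free, which is exactly where the claimed wins come from.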
Case in point: what happens if you allocate a slab and start ->ctor()-ing
objects, and then one of the ctors fails? We need to free the slab, but
not without ->dtor()-ing everything back (AIUI this is not handled in
this series yet). Besides this complication, if failing dtors were added
into the mix, we'd be left with a half-initialized slab(!!) in the
middle of the cache, waiting to get freed without being able to.

Then there are obviously other problems, like: whatever you're calling
must not ever require the slab allocator (directly or indirectly) and
must not do direct reclaim (ever!), at the risk of a deadlock. The pcpu
allocator is already a no-go (AIUI!) because of such issues.

Then there's the separate (but adjacent, particularly as we're
considering this series for its performance improvements) issue that the
ctor() and dtor() interfaces are terrible, in the sense that they do not
let you batch in any way, shape or form (requiring us to lock/unlock
many times, allocate many times, etc.). If this is done for performance,
I would prefer a superior ctor/dtor interface that takes something like
a slab iterator and lets you batch these operations.

The ghost of 1992 Solaris still haunts us...

> This series is functional (although not compatible with MM debug
> features yet), but still far from perfect. I'm actively refining it and
> would appreciate early feedback before I improve it further. :)
> 
> This series is based on slab/for-next [2].
> 
> Performance Improvement
> =======================
> 
> I measured the benefit of this series for two different users:
> exec() and tc filter insertion/removal.
> 
> exec() throughput
> -----------------
> 
> The performance of exec() is important when short-lived processes are
> frequently created, for example in shell-heavy workloads and when
> running many test cases [3].
> 
> I measured exec() throughput with a microbenchmark:
> - a 33% exec() throughput gain on a 2-socket machine with 192 CPUs,
> - a 4.56% gain on a desktop with 24 hardware threads, and
> - even a 4% gain in single-threaded exec() throughput.
> 
> Further investigation showed that this was due to the overhead of
> acquiring/releasing pcpu_alloc_mutex and contention on it.
> 
> See patch 7 for more detail on the experiment.
> 
> Traffic Filter Insertion and Removal
> ------------------------------------
> 
> Each tc filter allocates three percpu memory regions per tc_action
> object, so frequently inserting and removing filters contends heavily
> on the same mutex.
> 
> In the Linux kernel tools/testing tc-filter benchmark (see patch 4 for
> more detail), I observed a 26% reduction in system time and much less
> contention on pcpu_alloc_mutex with this series.
> 
> I saw in old mailing list threads that Mellanox (now NVIDIA) engineers
> cared about tc filter insertion rate; these changes may still benefit
> workloads they run today.

The performance improvements are obviously fantastic, but I do wonder if
things could be fixed by just fixing the underlying problems, instead of
papering over them with slab allocator magic and dubious object
lifecycles. In this case, the big issue is that the pcpu allocator does
not scale well.

-- 
Pedro