From: Fedorov Nikita <fedorov.nikita@h-partners.com>
To: Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Arnd Bergmann, Peter Zijlstra, Boqun Feng,
	Waiman Long, Darren Hart, Davidlohr Bueso, Andrew Morton,
	David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn,
	Rakie Kim, Gregory Price, Ying Huang, Alistair Popple,
	Anatoly Stepanov
CC: Nikita Fedorov
Subject: [RFC PATCH v3 0/7]
Date: Thu, 16 Apr 2026 00:44:52 +0800
Message-ID: <20260415164459.2904963-1-fedorov.nikita@h-partners.com>
X-Mailer: git-send-email 2.34.1
Changes since v2:
- split the patch into a smaller patch series, though it is still hard
  to split the hqlock internal logic itself
- removed some unused code
- rebased onto Linux v7.0
- allocate hq-spinlock metadata with kvmalloc instead of memblock
- added contention detection and verified that there is no performance
  degradation in the low-contention scenario

[Motivation]

In high-contention cases, the existing Linux kernel spinlock
implementations can become inefficient on modern NUMA systems due to
frequent and expensive cross-NUMA cacheline transfers. This happens for
the following reasons:
- on "contender enqueue", each lock contender updates a shared lock
  structure
- on "MCS handoff", a cross-NUMA cacheline transfer occurs when two
  contenders are from different NUMA nodes

Previous work on a NUMA-aware spinlock in the Linux kernel is the CNA
lock:
https://lore.kernel.org/lkml/20210514200743.3026725-1-alex.kogan@oracle.com/

It reduces cross-NUMA cacheline traffic during handoff, but not during
enqueuing. The CNA design also requires the first contender to do
additional work during global spinning, and it keeps threads from all
nodes other than the first one in a single secondary queue. In our
measurements, we only saw benefits from using it on Kunpeng; on x86
platforms, CNA behaved the same as a regular qspinlock. Thus, there is
still quite a lot of potential for optimization.

The HQ-lock has a completely different design concept: a hybrid of a
cohort lock and a queued spinlock. To try the HQ-lock in some subsystem,
change the lock initialization code from spin_lock_init() to
spin_lock_init_hq(), or change the DEFINE_SPINLOCK() macro to
DEFINE_SPINLOCK_HQ() if the lock is static. A dedicated bit in the lock
structure is used to distinguish between the two lock types.
[Performance measurements]

Performance measurements were done on x86 (AMD EPYC) and arm64
(Kunpeng 920) platforms with the following scenarios:
- Locktorture benchmark
- Memcached + memtier benchmark
- Nginx + wrk benchmark

[Locktorture]

NPS stands for "nodes per socket".

AMD EPYC 9654, 192 cores (x2 hyper-threads), 2 sockets
Locktorture, 60 sec., average gain (single lock):

+-------------------------+-------+-------+-------+--------+
| Total contender threads | 1 NPS | 2 NPS | 4 NPS | 12 NPS |
+-------------------------+-------+-------+-------+--------+
|                       8 |   19% |   21% |   12% |    12% |
|                      16 |   13% |   18% |   34% |    75% |
|                      32 |    8% |   14% |   25% |   112% |
|                      64 |   11% |   12% |   30% |   152% |
|                     128 |    9% |   17% |   37% |   163% |
|                     256 |    2% |   16% |   40% |   168% |
|                     384 |   -1% |   14% |   44% |   186% |
+-------------------------+-------+-------+-------+--------+

Fairness factor:

+-------------------------+-------+-------+-------+--------+
| Total contender threads | 1 NPS | 2 NPS | 4 NPS | 12 NPS |
+-------------------------+-------+-------+-------+--------+
|                       8 |  0.54 |  0.57 |  0.57 |   0.55 |
|                      16 |  0.52 |  0.53 |  0.60 |   0.58 |
|                      32 |  0.53 |  0.53 |  0.53 |   0.61 |
|                      64 |  0.52 |  0.56 |  0.54 |   0.56 |
|                     128 |  0.51 |  0.54 |  0.54 |   0.53 |
|                     256 |  0.52 |  0.52 |  0.52 |   0.52 |
|                     384 |  0.51 |  0.51 |  0.51 |   0.51 |
+-------------------------+-------+-------+-------+--------+

Kunpeng 920 (arm64), 96 cores (no MT), 2 sockets, 4 NUMA nodes
Locktorture, 60 sec.:

+-------------------------+--------------+
| Total contender threads | Average gain |
+-------------------------+--------------+
|                       8 |          93% |
|                      16 |         142% |
|                      32 |         129% |
|                      64 |         152% |
|                      96 |         158% |
+-------------------------+--------------+

[Memcached]

AMD EPYC 9654, 192 cores (x2 hyper-threads), 2 sockets, NPS=4
Memtier + memcached, 1:1 R/W ratio:

+---------+-----------------+-------------------+
| Workers | Throughput gain | Latency reduction |
+---------+-----------------+-------------------+
|      32 |              1% |               -1% |
|      64 |              1% |               -1% |
|     128 |              3% |               -4% |
|     256 |              7% |               -6% |
|     384 |             10% |               -8% |
+---------+-----------------+-------------------+

Kunpeng 920 (arm64), 96 cores (no MT), 2 sockets, 4 NUMA nodes
Memtier + memcached, 1:1 R/W ratio:

+---------+-----------------+-------------------+
| Workers | Throughput gain | Latency reduction |
+---------+-----------------+-------------------+
|      32 |              4% |               -3% |
|      64 |              6% |               -6% |
|      80 |              8% |               -7% |
|      96 |              8% |               -8% |
+---------+-----------------+-------------------+

[Nginx]

Kunpeng 920 (arm64), 96 cores (no MT), 2 sockets, 4 NUMA nodes
Nginx + wrk benchmark, single file (lockref spinlock contention case):

+---------+-----------------+
| Workers | Throughput gain |
+---------+-----------------+
|      32 |              1% |
|      64 |             68% |
|      80 |             72% |
|      96 |             78% |
+---------+-----------------+

Although this is a single-file test, it relates to real-life cases in
which some HTML pages (index.html, etc.) are accessed much more
frequently than others.
[Low contention remarks]

After adding the contention detection scheme, we see no performance
degradation in the low-contention scenario (< 8 threads): the throughput
of the HQ-spinlock is equal to that of the qspinlock, while the
improvement in the high-contention case remains practically the same as
reported above.

Previous version:
https://lore.kernel.org/lkml/20251206062106.2109014-1-stepanov.anatoly@huawei.com/

Anatoly Stepanov (7):
  kernel: add hq-spinlock types
  hq-spinlock: implement inner logic
  hq-spinlock: add contention detection
  hq-spinlock: add hq-spinlock tunables and debug statistics
  kernel: introduce general hq-spinlock support
  lockref: use hq-spinlock
  futex: use hq-spinlock for hash buckets

 arch/arm64/include/asm/qspinlock.h       |  37 +
 arch/x86/include/asm/hq-spinlock.h       |  34 +
 arch/x86/include/asm/paravirt-spinlock.h |   3 +-
 arch/x86/include/asm/qspinlock.h         |   6 +-
 include/asm-generic/qspinlock.h          |  23 +-
 include/asm-generic/qspinlock_types.h    |  44 +-
 include/linux/lockref.h                  |   2 +-
 include/linux/spinlock.h                 |  26 +
 include/linux/spinlock_types.h           |  26 +
 include/linux/spinlock_types_raw.h       |  20 +
 kernel/Kconfig.locks                     |  29 +
 kernel/futex/core.c                      |   2 +-
 kernel/locking/hqlock_core.h             | 850 +++++++++++++++++++++++
 kernel/locking/hqlock_meta.h             | 487 +++++++++++++
 kernel/locking/hqlock_proc.h             | 164 +++++
 kernel/locking/hqlock_types.h            | 122 ++++
 kernel/locking/qspinlock.c               |  65 +-
 kernel/locking/qspinlock.h               |   4 +-
 kernel/locking/spinlock_debug.c          |  20 +
 mm/mempolicy.c                           |   4 +
 20 files changed, 1939 insertions(+), 29 deletions(-)
 create mode 100644 arch/arm64/include/asm/qspinlock.h
 create mode 100644 arch/x86/include/asm/hq-spinlock.h
 create mode 100644 kernel/locking/hqlock_core.h
 create mode 100644 kernel/locking/hqlock_meta.h
 create mode 100644 kernel/locking/hqlock_proc.h
 create mode 100644 kernel/locking/hqlock_types.h

-- 
2.34.1