From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 24 Feb 2026 10:52:28 +0800
From: Ming Lei <ming.lei@redhat.com>
To: Vlastimil Babka, Andrew Morton
Cc: ming.lei@redhat.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-block@vger.kernel.org
Subject: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Hello Vlastimil and MM guys,

The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
'slab/for-7.0/sheaves' into slab/for-next") introduces a severe performance
regression for workloads with persistent cross-CPU alloc/free patterns. In
the ublk null target benchmark, IOPS drops from ~36M on v6.19 to ~13M (a
~64% drop).
Bisecting within the sheaves series is blocked by a kernel panic at
17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
paths"), so the exact first bad commit could not be identified.

Reproducer
==========

Hardware: NUMA machine with >= 32 CPUs
Kernel:   v7.0-rc (with slab/for-7.0/sheaves merged)

    # build the kublk selftest
    make -C tools/testing/selftests/ublk/

    # create a ublk null target device with 16 queues
    tools/testing/selftests/ublk/kublk add -t null -q 16

    # run the fio t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
    taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0

    # cleanup
    tools/testing/selftests/ublk/kublk del -n 0

Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
Bad:  815c8e35511d ("Merge branch 'slab/for-7.0/sheaves' into slab/for-next")

perf profile (bad kernel)
=========================

~47% of CPU time is spent in bio allocation hitting the SLUB slow path, with
massive spinlock contention on the node partial list lock:

  +   47.65%  1.21%  io_uring  [k] bio_alloc_bioset
  -   44.87%  0.45%  io_uring  [k] kmem_cache_alloc_noprof
     - 44.41% kmem_cache_alloc_noprof
        - 43.89% ___slab_alloc
           + 41.16% get_from_any_partial
             0.91% get_from_partial_node
           + 0.87% alloc_from_new_slab
           + 0.65% allocate_slab
  -   44.70%  0.21%  io_uring  [k] mempool_alloc_noprof
     - 44.49% mempool_alloc_noprof
        - 44.43% kmem_cache_alloc_noprof
           - 43.90% ___slab_alloc
              + 41.18% get_from_any_partial
                0.90% get_from_partial_node
              + 0.87% alloc_from_new_slab
              + 0.65% allocate_slab
  +   41.23%  0.10%  io_uring  [k] get_from_any_partial
  +   40.82%  0.48%  io_uring  [k] __raw_spin_lock_irqsave
  -   40.75%  0.20%  io_uring  [k] get_from_partial_node
     - 40.56% get_from_partial_node
        - 38.83% __raw_spin_lock_irqsave
             38.65% native_queued_spin_lock_slowpath

Analysis
========

The ublk null target workload exposes a cross-CPU slab allocation pattern:
bios are allocated on the io_uring submitter CPU during block layer
submission, but freed on a different CPU, namely the ublk daemon thread that
runs the completion
via io_uring_cmd_complete_in_task() task work. The completion CPU stays in
the same LLC or NUMA node as the submission CPU.

This cross-CPU alloc/free pattern is not unique to ublk. The block layer's
default rq_affinity=1 setting completes requests on a CPU sharing an LLC with
the submission CPU, which similarly causes bios to be freed on a different
CPU than the one that allocated them. The ublk null target simply makes this
pattern more pronounced and measurable, because all the overhead is in the
bio alloc/free path with no actual I/O.

**The following is from AI, just for reference**

The result is that the allocating CPU's per-CPU slab caches are continuously
drained without being replenished by local frees. The bio layer's own per-CPU
cache (bio_alloc_cache) suffers the same mismatch: freed bios go to the
completion CPU's cache via bio_put_percpu_cache(), leaving the submitter
CPUs' caches empty and falling through to mempool_alloc() ->
kmem_cache_alloc() -> the SLUB slow path.

In v6.19, SLUB handled this with a 3-tier allocation hierarchy:

  Tier 1: CPU slab freelist      lock-free (cmpxchg)
  Tier 2: CPU partial slab list  lock-free (per-CPU local_lock)
  Tier 3: Node partial list      kmem_cache_node->list_lock

The CPU partial slab list (Tier 2) was the critical buffer. It was populated
during __slab_free() -> put_cpu_partial() and provided a lock-free pool of
partial slabs per CPU. Even when the CPU slab was exhausted, the CPU partial
list could supply more slabs without touching any shared lock.

The sheaves architecture replaces this with a 2-tier hierarchy:

  Tier 1: Per-CPU sheaf      lock-free (local_lock)
  Tier 2: Node partial list  kmem_cache_node->list_lock

The intermediate lock-free tier is gone. When the per-CPU sheaf is empty and
the spare sheaf is also empty, every refill must go through the node partial
list, requiring kmem_cache_node->list_lock. With 16 CPUs simultaneously
allocating bios and all hitting empty sheaves, this creates a thundering
herd on the node list_lock.
When the local node's partial list is also depleted (objects freed on remote
nodes accumulate there instead), get_from_any_partial() kicks in to search
other NUMA nodes, compounding the contention with cross-NUMA list_lock
acquisition. This explains the 41% in get_from_any_partial ->
native_queued_spin_lock_slowpath seen in the profile.

The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
__refill_objects_any()") uses spin_trylock for the cross-NUMA refill, but it
does not address the fundamental architectural issue: the missing lock-free
intermediate caching tier that the CPU partial list provided.

Thanks,
Ming