From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 25 Feb 2026 16:19:49 +0800
From: Hao Li <hao.li@linux.dev>
To: Harry Yoo
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	surenb@google.com
Subject: Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On Wed, Feb 25, 2026 at 04:19:41PM +0900, Harry Yoo wrote:
> On Wed, Feb 25, 2026 at 03:06:46PM +0800, Hao Li wrote:
> > On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> > > On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > > > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > > > Hi Harry,
> > > > >
> > > > > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > > > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > > > > Hello Vlastimil and MM guys,
> > > > > >
> > > > > > Hi Ming, thanks for the report!
> > > > > >
> > > > > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > > > > performance regression for workloads with persistent cross-CPU
> > > > > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > > > > drop).
> > > > > > >
> > > > > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > > > > paths"), so the exact first bad commit could not be identified.
> > > > > >
> > > > > > Ouch. Why did it crash?
> > > > >
> > > > > [   16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > > > > [   16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > > > > [   16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > > > > [   16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > > > > [   16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > > > > [   16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > > > > [   16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > > > > [   16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > > > > [   16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > > > > [   16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > > > > [   16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > > > > [   16.162445] FS:  00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > > > > [   16.162446] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [   16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > > > > [   16.162447] PKRU: 55555554
> > > > > [   16.162448] Call Trace:
> > > > > [   16.162450]  <TASK>
> > > > > [   16.162452]  kmem_cache_free+0x410/0x490
> > > > > [   16.162454]  do_readlinkat+0x14e/0x180
> > > > > [   16.162459]  __x64_sys_readlinkat+0x1c/0x30
> > > > > [   16.162461]  do_syscall_64+0x7e/0x6b0
> > > > > [   16.162465]  ? post_alloc_hook+0xb9/0x140
> > > > > [   16.162468]  ? get_page_from_freelist+0x478/0x720
> > > > > [   16.162470]  ? path_openat+0xb3/0x2a0
> > > > > [   16.162472]  ? __alloc_frozen_pages_noprof+0x192/0x350
> > > > > [   16.162474]  ? count_memcg_events+0xd6/0x210
> > > > > [   16.162476]  ? memcg1_commit_charge+0x7a/0xa0
> > > > > [   16.162479]  ? mod_memcg_lruvec_state+0xe7/0x2d0
> > > > > [   16.162481]  ? charge_memcg+0x48/0x80
> > > > > [   16.162482]  ? lruvec_stat_mod_folio+0x85/0xd0
> > > > > [   16.162484]  ? __folio_mod_stat+0x2d/0x90
> > > > > [   16.162487]  ? set_ptes.isra.0+0x36/0x80
> > > > > [   16.162490]  ? do_anonymous_page+0x100/0x4a0
> > > > > [   16.162492]  ? __handle_mm_fault+0x45d/0x6f0
> > > > > [   16.162493]  ? count_memcg_events+0xd6/0x210
> > > > > [   16.162494]  ? handle_mm_fault+0x212/0x340
> > > > > [   16.162495]  ? do_user_addr_fault+0x2b4/0x7b0
> > > > > [   16.162500]  ? irqentry_exit+0x6d/0x540
> > > > > [   16.162502]  ? exc_page_fault+0x7e/0x1a0
> > > > > [   16.162503]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > >
> > > > For this problem, I have a hypothesis which is inspired by a comment in the
> > > > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> > > >
> > > > /*
> > > >  * get a single object from the slab. This might race against __slab_free(),
> > > >  * which however has to take the list_lock if it's about to make the slab fully
> > > >  * free.
> > > >  */
> > > >
> > > > My understanding is that this comment is pointing out a possible race between
> > > > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > > > n->list_lock when it is about to make the slab fully free, and
> > > > get_from_partial_node() also takes the same lock, the two paths should be
> > > > mutually excluded by the lock and thus safe.
> > > >
> > > > However, I'm wondering if there could be another race window. Suppose CPU0's
> > > > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > > > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > > > to the same slab via __slab_free().
> > > > CPU1 would observe was_full == 1 (due to the
> > > > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> > > > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > > > n->list_lock, trying to add this slab to the CPU partial list.
> > >
> > > If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> > > for CPU0 to release the lock. And CPU0 will remove the slab from the
> > > partial list before releasing the lock. Or am I missing something?
> > >
> > > > In that case,
> > > > both paths would operate on the same union field in struct slab, which might
> > > > lead to list corruption.
> > >
> > > Not sure how the scenario you describe could happen:
> > >
> > >   CPU 0                            CPU1
> > >   - get_from_partial_node()
> > >     -> spin_lock(&n->list_lock)
> > >                                    - __slab_free()
> > >                                      -> __slab_update_freelist(),
> > >                                         slab becomes full
> > >                                      -> was_full == 1
> > >                                      -> spin_lock(&n->list_lock)
> >
> > In __slab_free(), if was_full == 1, then the condition
> > !(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
> > enter the "if" block and therefore n->list_lock is not acquired.
> > Does that sound right?
>
> Nah, you're right. Just slipped my mind. No need to acquire the lock
> if it was full, because that means it's not on the partial list.

Exactly.

> Hmm... but the logic has been there for a very long time.

Yes.

> Looks like we broke a premise for the percpu slab caching layer
> to work correctly, while transitioning to sheaves.
>
> I think the new behavior introduced during the sheaves transition is that
> SLUB can now allocate objects from slabs without freezing them. Allocating
> objects from a slab without freezing it seems to confuse the free path...

I feel it's not a big issue.
I think the root cause of this issue is as follows: before this commit,
get_partial_node() would first remove the slab from the node partial list and
then return the slab to the upper layer for freezing and object allocation.
Therefore, when __slab_free() encountered a slab marked as was_full, that slab
was no longer on the node list, avoiding races with the list operations.
However, after this commit, get_from_partial_node() first allocates an object
from the slab and only then removes the slab from the node list. In the window
between these two steps, __slab_free() might encounter a slab marked as
was_full and want to add it to the CPU partial list while, at the same time,
another CPU is trying to remove the same slab from the node list, leading to a
race condition.

> But not sure if we could "fix" that because the percpu partial slab
> caching layer is gone anyway :)

Yes, this bug has already disappeared with subsequent patches...

By the way, to allow Ming Lei to continue the bisect process, maybe we should
come up with a temporary workaround, such as:

	} else if (IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) {
		spin_lock_irqsave(&n->list_lock, flags);
		/*
		 * Let this empty critical section push back put_cpu_partial(),
		 * ensuring its execution happens after the critical section of
		 * the get_from_partial_node() running in parallel.
		 */
		spin_unlock_irqrestore(&n->list_lock, flags);
		/*
		 * If we started with a full slab then put it onto the
		 * per cpu partial list.
		 */
		put_cpu_partial(s, slab, 1);
		stat(s, CPU_PARTIAL_FREE);
	}

-- 
Thanks,
Hao