From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <72361910-240f-4aa2-a695-117e1b14a804@linux.dev>
Date: Tue, 24 Oct 2023 10:20:49 +0800
MIME-Version: 1.0
Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs
Content-Language: en-US
To: Vlastimil Babka, cl@linux.com, penberg@kernel.org
Cc: rientjes@google.com, iamjoonsoo.kim@lge.com,
 akpm@linux-foundation.org, roman.gushchin@linux.dev, 42.hyeyoo@gmail.com,
 willy@infradead.org, pcc@google.com, tytso@mit.edu, maz@kernel.org,
 ruansy.fnst@fujitsu.com, vishal.moola@gmail.com, lrh2000@pku.edu.cn,
 hughd@google.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 Chengming Zhou
References: <20231021144317.3400916-1-chengming.zhou@linux.dev>
 <4134b039-fa99-70cd-3486-3d0c7632e4a3@suse.cz>
From: Chengming Zhou
In-Reply-To: <4134b039-fa99-70cd-3486-3d0c7632e4a3@suse.cz>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 2023/10/23 23:46, Vlastimil Babka wrote:
> On 10/21/23 16:43, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou
>
> Hi!
>
>> Changes in RFC v2:
>>  - Reuse PG_workingset bit to keep track of whether slub is on the
>>    per-node partial list, as suggested by Matthew Wilcox.
>>  - Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
>>    is caused by leak of partial slabs when get_partial_node().
>>  - Add a patch to simplify acquire_slab().
>>  - Reorder patches a little.
>>  - v1: https://lore.kernel.org/all/20231017154439.3036608-1-chengming.zhou@linux.dev/
>>
>> 1. Problem
>> ==========
>> Now we have to freeze the slab when get from the node partial list, and
>> unfreeze the slab when put to the node partial list. Because we need to
>> rely on the node list_lock to synchronize the "frozen" bit changes.
>>
>> This implementation has some drawbacks:
>>
>>  - Alloc path: twice cmpxchg_double.
>>    It has to get some partial slabs from node when the allocator has used
>>    up the CPU partial slabs. So it freeze the slab (one cmpxchg_double)
>>    with node list_lock held, put those frozen slabs on its CPU partial
>>    list. Later ___slab_alloc() will cmpxchg_double try-loop again if that
>>    slab is picked to use.
>>
>>  - Alloc path: amplified contention on node list_lock.
>>    Since we have to synchronize the "frozen" bit changes under the node
>>    list_lock, the contention of slab (struct page) can be transferred
>>    to the node list_lock. On machine with many CPUs in one node, the
>>    contention of list_lock will be amplified by all CPUs' alloc path.
>>
>>    The current code has to workaround this problem by avoiding using
>>    cmpxchg_double try-loop, which will just break and return when
>>    contention of page encountered and the first cmpxchg_double failed.
>>    But this workaround has its own problem.
>
> I'd note here: For more context, see 9b1ea29bc0d7 ("Revert "mm, slub:
> consider rest of partial list if acquire_slab() fails"")

Good, will add it.

>
>>  - Free path: redundant unfreeze.
>>    __slab_free() will freeze and cache some slabs on its partial list,
>>    and flush them to the node partial list when exceed, which has to
>>    unfreeze those slabs again under the node list_lock. Actually we
>>    don't need to freeze slab on CPU partial list, in which case we
>>    can save the unfreeze cmpxchg_double operations in flush path.
>>
>> 2. Solution
>> ===========
>> We solve these problems by leaving slabs unfrozen when moving out of
>> the node partial list and on CPU partial list, so "frozen" bit is 0.
>>
>> These partial slabs won't be manipulate concurrently by alloc path,
>> the only racer is free path, which may manipulate its list when !inuse.
>> So we need to introduce another synchronization way to avoid it, we
>> reuse PG_workingset to keep track of whether the slab is on node partial
>> list or not, only in that case we can manipulate the slab list.
>>
>> The slab will be delay frozen when it's picked to actively use by the
>> CPU, it becomes full at the same time, in which case we still need to
>> rely on "frozen" bit to avoid manipulating its list. So the slab will
>> be frozen only when activate use and be unfrozen only when deactivate.
>
> Interesting solution! I wonder if we could go a bit further and remove
> acquire_slab() completely. Because AFAICS even after your changes,
> acquire_slab() is still attempted including freezing the slab, which means
> still doing an cmpxchg_double under the list_lock, and now also handling the
> special case when it failed, but we at least filled percpu partial lists.
> What if we only filled the partial list without freezing, and then froze the
> first slab outside of the list_lock?

Good idea. We can return one slab and put the other slabs on the CPU
partial list. Then we can remove acquire_slab() completely and no longer
need to handle the failure case. The code will be cleaner, too.

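To make the idea above more concrete, here is a rough userspace model of
"take slabs off the node partial list without freezing, return the first
one, and freeze it only after the lock is dropped". It is only a sketch,
not kernel code: toy_slab, toy_node, take_partials() and freeze_for_use()
are made-up names, the locking is only implied by comments, and the
on_node_partial field merely stands in for the reused PG_workingset flag.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct slab: only what the sketch needs. */
struct toy_slab {
	struct toy_slab *next;	/* link on whichever list holds the slab */
	int frozen;		/* models the "frozen" bit */
	int on_node_partial;	/* models the reused PG_workingset flag */
};

/* Toy node: singly linked partial list; the list_lock is implied. */
struct toy_node {
	struct toy_slab *partial;
	unsigned long nr_partial;
};

/*
 * Meant to run with the (imaginary) list_lock held: detach up to max
 * slabs from the node partial list. The first one is returned, the
 * rest are pushed onto *cpu_partial. No slab is frozen here.
 */
static struct toy_slab *take_partials(struct toy_node *n,
				      struct toy_slab **cpu_partial,
				      unsigned int max)
{
	struct toy_slab *first = NULL;

	while (n->partial && max--) {
		struct toy_slab *slab = n->partial;

		n->partial = slab->next;
		n->nr_partial--;
		slab->on_node_partial = 0;

		if (!first) {
			slab->next = NULL;
			first = slab;
		} else {
			slab->next = *cpu_partial;
			*cpu_partial = slab;
		}
	}
	return first;
}

/* Meant to run after the list_lock is dropped: freeze the picked slab. */
static void freeze_for_use(struct toy_slab *slab)
{
	assert(!slab->frozen && !slab->on_node_partial);
	slab->frozen = 1;
}
```

In the real allocator the freeze step would still be a cmpxchg on the
slab, but in this model it happens once, outside the node lock, and only
for the slab that is actually going to be used.
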
>
> Or more precisely, instead of returning the acquired "object" we would
> return the first slab removed from partial list. I think it would simplify
> the code a bit, and further reduce list_lock holding times.

OK, I will do this in the next version. But I found that we still have
to return the object in the
"IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)" case, where we
need to allocate a single object under the node list_lock. Maybe we can
use "struct partial_context" to return the object in this case?

 struct partial_context {
-	struct slab **slab;
 	gfp_t flags;
 	unsigned int orig_size;
+	void *object;
 };

Then we can change all the get_partial interfaces to return a slab.
Does this approach sound OK to you?

>
> I'll also point out a few more details, but it's not a full detailed review
> as the suggestion above, and another for 4/5, could mean a rather
> significant change for v3.

Thank you!

>
> Thanks!
>
>> 3. Testing
>> ==========
>> We just did some simple testing on a server with 128 CPUs (2 nodes) to
>> compare performance for now.
>>
>>  - perf bench sched messaging -g 5 -t -l 100000
>>    baseline    RFC
>>    7.042s      6.966s
>>    7.022s      7.045s
>>    7.054s      6.985s
>>
>>  - stress-ng --rawpkt 128 --rawpkt-ops 100000000
>>    baseline    RFC
>>    2.42s       2.15s
>>    2.45s       2.16s
>>    2.44s       2.17s
>>
>> It shows above there is about 10% improvement on stress-ng rawpkt
>> testcase, although no much improvement on perf sched bench testcase.
>>
>> Thanks for any comment and code review!
>>
>> Chengming Zhou (6):
>>   slub: Keep track of whether slub is on the per-node partial list
>>   slub: Prepare __slab_free() for unfrozen partial slab out of node
>>     partial list
>>   slub: Don't freeze slabs for cpu partial
>>   slub: Simplify acquire_slab()
>>   slub: Introduce get_cpu_partial()
>>   slub: Optimize deactivate_slab()
>>
>>  include/linux/page-flags.h |   2 +
>>  mm/slab.h                  |  19 +++
>>  mm/slub.c                  | 245 +++++++++++++++++++------------------
>>  3 files changed, 150 insertions(+), 116 deletions(-)
>>
>
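
As a rough illustration of the partial_context idea above (return the
slab from the get_partial-style helpers, and hand back a single object
through the context only in the debug/TINY-like case), a userspace
sketch might look like the following. Everything here is hypothetical:
toy_slab, toy_partial_context and toy_get_partial() are invented for the
example, gfp_t is replaced by a plain unsigned int, and the alloc_single
parameter only models the
"IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)" condition.

```c
#include <assert.h>
#include <stddef.h>

/* Toy slab: just a freelist where each free object points to the next. */
struct toy_slab {
	void *freelist;		/* first free object, NULL if none */
};

/* Mirrors the shape of the proposed struct partial_context (simplified). */
struct toy_partial_context {
	unsigned int flags;	/* stands in for gfp_t */
	unsigned int orig_size;
	void *object;		/* set only when a single object is allocated */
};

/*
 * Hypothetical get_partial-style helper: it always returns the slab.
 * When alloc_single is set, one object is also taken from the slab
 * and passed back to the caller through pc->object.
 */
static struct toy_slab *toy_get_partial(struct toy_slab *candidate,
					struct toy_partial_context *pc,
					int alloc_single)
{
	assert(pc != NULL);

	pc->object = NULL;
	if (!candidate || !candidate->freelist)
		return NULL;

	if (alloc_single) {
		pc->object = candidate->freelist;
		/* advance the freelist to the next free object */
		candidate->freelist = *(void **)pc->object;
	}
	return candidate;
}
```

With this shape, the common caller only looks at the returned slab,
while the single-object case reads pc->object after the call.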