Message-ID: <208e2b4a-cff5-432e-a330-74a8fc3a22cc@suse.cz>
Date: Tue, 22 Apr 2025 17:02:47 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH RFC v3 2/8] slab: add opt-in caching layer of percpu sheaves
Content-Language: en-US
To: Suren Baghdasaryan
Cc: "Liam R. Howlett", Christoph Lameter, David Rientjes, Roman Gushchin,
 Harry Yoo, Uladzislau Rezki, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, rcu@vger.kernel.org,
 maple-tree@lists.infradead.org
References: <20250317-slub-percpu-caches-v3-0-9d9884d8b643@suse.cz>
 <20250317-slub-percpu-caches-v3-2-9d9884d8b643@suse.cz>
From: Vlastimil Babka
In-Reply-To:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
On 4/10/25 21:51, Suren Baghdasaryan wrote:
>> +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
>> +{
>> +	struct slub_percpu_sheaves *pcs;
>> +
>> +	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>> +
>> +	/* The cpu is not executing anymore so we don't need pcs->lock */
>> +	sheaf_flush_unused(s, pcs->main);
>
> You are flushing pcs->main but sheaf_flush_unused() will account that
> in SHEAF_FLUSH_OTHER instead of SHEAF_FLUSH_MAIN. Perhaps
> sheaf_flush_unused() needs a parameter to indicate which counter to be
> incremented.

Hm, it's technically true, but it's a cpu hotremove operation, so it's
rather rare and not performance sensitive, and one could argue it's not
a main sheaf (of an active cpu) anymore. So I went with a comment rather
than complicating the code.

>> +	if (pcs->spare) {
>> +		sheaf_flush_unused(s, pcs->spare);
>> +		free_empty_sheaf(s, pcs->spare);
>> +		pcs->spare = NULL;
>> +	}
>> +}
>> +
>> +static void pcs_destroy(struct kmem_cache *s)
>> +{
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		struct slub_percpu_sheaves *pcs;
>> +
>> +		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>> +
>> +		/* can happen when unwinding failed create */
>> +		if (!pcs->main)
>> +			continue;
>> +
>> +		WARN_ON(pcs->spare);
>> +
>> +		if (!WARN_ON(pcs->main->size)) {
>
> If pcs->main->size > 0, shouldn't we flush pcs->main?
> I understand it's not a normal situation but I think something like
> this would be more correct:
>
> if (WARN_ON(pcs->main->size))
>     sheaf_flush_unused(s, pcs->main);
> free_empty_sheaf(s, pcs->main);
> pcs->main = NULL;

Initially I agreed, but then I realized we only reach this code when we
have already found all slabs to contain only free objects. So if we then
find some objects in sheaves, something must have gone terribly wrong,
and trying to flush could only make things worse. I've added a comment
instead.

>> +			free_empty_sheaf(s, pcs->main);
>> +			pcs->main = NULL;
>> +		}
>> +	}
>> +
>> +	free_percpu(s->cpu_sheaves);
>> +	s->cpu_sheaves = NULL;
>> +}
>> +
>> +static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
>> +{
>> +	struct slab_sheaf *empty = NULL;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&barn->lock, flags);
>> +
>> +	if (barn->nr_empty) {
>> +		empty = list_first_entry(&barn->sheaves_empty,
>> +					 struct slab_sheaf, barn_list);
>> +		list_del(&empty->barn_list);
>> +		barn->nr_empty--;
>> +	}
>> +
>> +	spin_unlock_irqrestore(&barn->lock, flags);
>> +
>> +	return empty;
>> +}
>> +
>> +static int barn_put_empty_sheaf(struct node_barn *barn,
>> +				struct slab_sheaf *sheaf, bool ignore_limit)
>
> This ignore_limit in barn_put_empty_sheaf()/barn_put_full_sheaf() is
> sticking out and really used only inside rcu_free_sheaf() in the next
> patch. Every time I saw the return value of these functions ignored I
> had to remind myself that they pass ignore_limit=true, so the function
> can't really fail. Maybe we could get rid of this flag and do trimming
> of barn lists inside rcu_free_sheaf() separately in one go? IIUC

Good suggestion; I didn't realize the barn_put_*() functions are only
used in the fallback-due-to-a-race-or-cpu-migration paths, so I did as
you suggested.
> because we ignore the limits in all other places, at the time of
> rcu_free_sheaf() we might end up with a barn having many more sheaves
> than the limit allows for, so trimming in bulk might be even more
> productive.

There are many caches where no rcu freeing happens, so making
rcu_free_sheaf() a bulk trimming point could be insufficient. I'd just
leave it for now and hope that ignoring limits only due to races won't
cause barns to grow more than further allocation/freeing could
self-correct for.

>> +{
>> +	unsigned long flags;
>> +	int ret = 0;
>> +
>> +	spin_lock_irqsave(&barn->lock, flags);
>> +
>> +	if (!ignore_limit && barn->nr_empty >= MAX_EMPTY_SHEAVES) {
>> +		ret = -E2BIG;
>> +	} else {
>> +		list_add(&sheaf->barn_list, &barn->sheaves_empty);
>> +		barn->nr_empty++;
>> +	}
>> +
>> +	spin_unlock_irqrestore(&barn->lock, flags);
>> +	return ret;
>> +}
>> +
>> +static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
>> +			       bool ignore_limit)
>
> Can this function be called for partially populated sheaves or only
> full ones? I think rcu_free_sheaf() in the next patch might end up
> calling it for a partially populated sheaf.

It can, but the code handles not-really-full sheaves fine, and it's
easier to allow them than to try to be strict. There should just be a
bit lower efficiency in debug scenarios, and that's fine. Adding a
comment.

>> +{
>> +	unsigned long flags;
>> +	int ret = 0;
>> +
>> +	spin_lock_irqsave(&barn->lock, flags);
>> +
>> +	if (!ignore_limit && barn->nr_full >= MAX_FULL_SHEAVES) {
>> +		ret = -E2BIG;
>> +	} else {
>> +		list_add(&sheaf->barn_list, &barn->sheaves_full);
>> +		barn->nr_full++;
>> +	}
>> +
>> +	spin_unlock_irqrestore(&barn->lock, flags);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * If a full sheaf is available, return it and put the supplied empty one to
>> + * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
>> + * change.
>
> I'm unclear why we ignore the limit here but not in
> barn_replace_full_sheaf(). Maybe because full sheaves consume more
> memory?

Yes.

> But then why do we mostly pass ignore_limit=true when invoking
> barn_put_full_sheaf()?

Not anymore with your suggestion.

> Explanation of this limit logic needs to be
> clarified.
>
>> +static __fastpath_inline
>> +unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>> +{
>> +	struct slub_percpu_sheaves *pcs;
>> +	struct slab_sheaf *main;
>> +	unsigned int allocated = 0;
>> +	unsigned int batch;
>> +
>> +next_batch:
>> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
>> +		return allocated;
>> +
>> +	pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +	if (unlikely(pcs->main->size == 0)) {
>
> The above condition is unlikely only for the first batch. I think it's
> actually guaranteed on all consecutive batches once you do "goto
> next_batch", right?

Hm yes, but perhaps bulk allocs larger than the batch size are rare?

>> +static __fastpath_inline
>> +bool free_to_pcs(struct kmem_cache *s, void *object)
>> +{
>> +	struct slub_percpu_sheaves *pcs;
>> +
>> +restart:
>> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
>> +		return false;
>> +
>> +	pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
>> +
>> +		struct slab_sheaf *empty;
>> +
>> +		if (!pcs->spare) {
>> +			empty = barn_get_empty_sheaf(pcs->barn);
>> +			if (empty) {
>> +				pcs->spare = pcs->main;
>> +				pcs->main = empty;
>> +				goto do_free;
>> +			}
>> +			goto alloc_empty;
>> +		}
>> +
>> +		if (pcs->spare->size < s->sheaf_capacity) {
>> +			stat(s, SHEAF_SWAP);
>> +			swap(pcs->main, pcs->spare);
>> +			goto do_free;
>> +		}
>> +
>> +		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
>
> This function reads easier now but if barn_replace_full_sheaf() could
> ignore the MAX_FULL_SHEAVES and barn list trimming could be done
> later, it would simplify this function even further.
It could increase barn lock contention to put full sheaves there over
the limit and then flush them afterwards, so I think it's better not to
do that, especially with "slab: determine barn status racily outside of
lock".

>
> nit: I think the section below would be more readable if structured
> with fail handling blocks at the end. Something like this:
>
> next_batch:
>     if (!localtry_trylock(&s->cpu_sheaves->lock))
>         goto fallback;
>
>     pcs = this_cpu_ptr(s->cpu_sheaves);
>
>     if (likely(pcs->main->size < s->sheaf_capacity))
>         goto do_free;

Nice suggestion, did that. Thanks.