From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 50FD0C282EC for ; Mon, 10 Mar 2025 17:31:21 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E68CE280011; Mon, 10 Mar 2025 13:31:18 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E1960280004; Mon, 10 Mar 2025 13:31:18 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CBEB1280011; Mon, 10 Mar 2025 13:31:18 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id A71CC280004 for ; Mon, 10 Mar 2025 13:31:18 -0400 (EDT) Received: from smtpin09.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 0D5F081336 for ; Mon, 10 Mar 2025 17:31:20 +0000 (UTC) X-FDA: 83206332720.09.F69D136 Received: from out-174.mta1.migadu.com (out-174.mta1.migadu.com [95.215.58.174]) by imf03.hostedemail.com (Postfix) with ESMTP id 6821120022 for ; Mon, 10 Mar 2025 17:31:17 +0000 (UTC) Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b="g/zEmscr"; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf03.hostedemail.com: domain of yosry.ahmed@linux.dev designates 95.215.58.174 as permitted sender) smtp.mailfrom=yosry.ahmed@linux.dev ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1741627878; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=9AKzdTafnP5muRvcqyXS2cPqfeLVBByaoT/d3amflSw=; b=xoj2JKEpwDTx6hPVZ9qFwOLuUNKsAnGCbDEZiv8Y0eed64X2xte/iB15TNIUIE5y4AXtQo lcnqtSsBTmY0g+kMwW8RcB2wXUVRhl6eOKSdeEfIXSR1YnY9fp40pngj1LZUiWZkgUlG9F zKlEwg5k3WA/o0r1k2pbKN408F3mtoA= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1741627878; a=rsa-sha256; cv=none; b=G6+JdQtfdb8pBP3C+mvoiBUn3O1Xo8KQpQWKrEpHrf2KMMVu4nJkC+VAaswc1i5kHWLzwv 4j/eeg7ngdwXPexrGqKjN3MxoJIwms1bBUTYCD7tMoh4yTvbA5rU+S/3ac/fZ6DV5BWnMy a+ZMqt/6gViR4bUelsnxlwRqItr8MlM= ARC-Authentication-Results: i=1; imf03.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b="g/zEmscr"; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf03.hostedemail.com: domain of yosry.ahmed@linux.dev designates 95.215.58.174 as permitted sender) smtp.mailfrom=yosry.ahmed@linux.dev Date: Mon, 10 Mar 2025 17:31:02 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1741627874; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=9AKzdTafnP5muRvcqyXS2cPqfeLVBByaoT/d3amflSw=; b=g/zEmscrHJL62ZFqgpFnuqAI9kvWT61cfapyzX53zX6kHG8OVkvakrd95jEfjly4hW6Esl TajY/BRHAhXAKHdQn26Va00uNSowqAzhmH2UD6+YtJbZcu/JdZhcwzIEmZJtWEBz544CUi aBAVwwjDZHOQOjK7FcJsLkcX6ctTveU= X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yosry Ahmed To: "Sridhar, Kanchana P" Cc: "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "hannes@cmpxchg.org" , "nphamcs@gmail.com" , "chengming.zhou@linux.dev" , "usamaarif642@gmail.com" , "ryan.roberts@arm.com" , "21cnbao@gmail.com" <21cnbao@gmail.com>, "ying.huang@linux.alibaba.com" , "akpm@linux-foundation.org" , "linux-crypto@vger.kernel.org" , "herbert@gondor.apana.org.au" , "davem@davemloft.net" , "clabbe@baylibre.com" , "ardb@kernel.org" , "ebiggers@google.com" , "surenb@google.com" , "Accardi, Kristen C" , "Feghali, Wajdi K" , "Gopal, Vinodh" Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage. Message-ID: References: <20250303084724.6490-1-kanchana.p.sridhar@intel.com> <20250303084724.6490-13-kanchana.p.sridhar@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Migadu-Flow: FLOW_OUT X-Rspam-User: X-Rspamd-Queue-Id: 6821120022 X-Rspamd-Server: rspam08 X-Stat-Signature: e4hbtkg5fyb98u8n5cerqq58oxdpd5r8 X-HE-Tag: 1741627877-453759 X-HE-Meta: U2FsdGVkX1/bypwy6IFudHMErzWentH4YD//UZqn1OKWevh78ojOTBGfJncEy52V3FxhU14V0Yg6F3WuWBIJj5SHC8nPkcWviEwSVR7FeKltSm6Li53R8Y37KDt8AQwe1LgxLf5WSClRIKPxlaR7Y3PfMFr6aIF3lt09ilP9Chzx9d6ABYR1rG3E5sW4pQCU8wshfEnpAFGLuNJJXy8xAZZLoePpu+S3Rd0EKa1Nd68TCWemPUY5G5yLTyYgkr0LTab06Rgk036xgNWCIdb34aPXOQcOE279F7X6Ce5pjNJ1miEdX7AYOky9QXjWkwyDgFpPC7ElKdB8q5c/SGHFT1UIxqtnZpA6VbJe7IEFFnzgYjlbDpFCojpBwfTl3E1v3X2yY36BP/NOyGY5Syhdx8vzSeb+6H9T3MMb0ocGI0bbY1zMb7FGDEbCKmLrWvYYIx8So1RKfq8HX55Aw61qYELN/uJivEHe1TrZZNX8lUppn/GE70msCMcuQFf6igl0tKARtVRa0Nwt3lzynp2kFibXPd2kXBMJ+K+CJLI+RASJJLx4TaQGvTeAecj89X8+OseTkM0yWzRodbmHVUpRVuZLQF68PoKYiBFvIZsq0w05LkzXdTu7jynMwDf0Zh+HlW/KbaIymtoXLA2N9SEHWmKbGW4aLlXtnwT9R2Bw3DyBLohXIuRbm2IfD5IJs3uehkI0+mu8bw2pMhyKwKZTLfqZJYD6rfpgAqvopg4IIgNStbgOimdyjiFTKUNIni/ssT2Fy399u3+d6qXcnJ5PUF9+UgfPtziEz3c4u8Y1WInQJEc0GG2jqhDUojR5vXZ+115zKjPHZ3HtwTVWPPqMiqAqHyabNfcS9EyjR45G8rzKKPmhxbAPVKQ8p03BV35UtMJTIBgLjzZh2G0ozRmDYzY9YS8Gh5TBD0DF6d0qgxPLVrYmIDQIKhmHGhIidb3C1Macmi7k7KfX3xZ4SSd puCpJvTA kSP7rLpORO3G6M5jFErZ3u4zEbnfCajBw8OExBPFOIs2hbLc5KuvVWv0gz1Qft6IDbpixNWqfvyQvxKvR4bnpGH5+A3qXzQTxgk46pViE9qNlHxe43pbWQ+VPC7yJd2mV9zlXlzVGPdyp8+OKXMfy4okj+CXJM2SnjWM6Dl9vK0DDzZDS1mWUQ9ryodWnMrMYmlnHUz9rEGiBV6gIRtb9M4JF+G/DjzjmUEfLexeUnSgkHquSeBeAs4WMaTunD+LDLp2UE2JWFbQ1RsRJeblmZr2HHOumxxi/EMQNyLkkNMOTGn4JhbCew5o+zw== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote: > [..] > > > > > u8 *buffer; > > > > > + u8 nr_reqs; > > > > > + struct crypto_wait wait; > > > > > struct mutex mutex; > > > > > bool is_sleepable; > > > > > + bool __online; > > > > > > > > I don't believe we need this. > > > > > > > > If we are not freeing resources during CPU offlining, then we do not > > > > need a CPU offline callback and acomp_ctx->__online serves no purpose. > > > > > > > > The whole point of synchronizing between offlining and > > > > compress/decompress operations is to avoid UAF. If offlining does not > > > > free resources, then we can hold the mutex directly in the > > > > compress/decompress path and drop the hotunplug callback completely. > > > > > > > > I also believe nr_reqs can be dropped from this patch, as it seems like > > > > it's only used know when to set __online. > > > > > > All great points! In fact, that was the original solution I had implemented > > > (not having an offline callback). But then, I spent some time understanding > > > the v6.13 hotfix for synchronizing freeing of resources, and this comment > > > in zswap_cpu_comp_prepare(): > > > > > > /* > > > * Only hold the mutex after completing allocations, otherwise we > > may > > > * recurse into zswap through reclaim and attempt to hold the mutex > > > * again resulting in a deadlock. > > > */ > > > > > > Hence, I figured the constraint of "recurse into zswap through reclaim" was > > > something to comprehend in the simplification (even though I had a tough > > > time imagining how this could happen). > > > > The constraint here is about zswap_cpu_comp_prepare() holding the mutex, > > making an allocation which internally triggers reclaim, then recursing > > into zswap and trying to hold the same mutex again causing a deadlock. > > > > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin > > with, the constraint naturally goes away. > > Actually, if it is possible for the allocations in zswap_cpu_comp_prepare() > to trigger reclaim, then I believe we need some way for reclaim to know if > the acomp_ctx resources are available. Hence, this seems like a potential > for deadlock regardless of the mutex. I took a closer look and I believe my hotfix was actually unnecessary. I sent it out in response to a syzbot report, but upon closer look it seems like it was not an actual problem. Sorry if my patch confused you. Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The comment above says: * PREPARE: The callbacks are invoked on a control CPU before the * hotplugged CPU is started up or after the hotplugged CPU has died. So even if we go into reclaim during zswap_cpu_comp_prepare(), it will never be on the CPU that we are allocating resources for. The other case where zswap_cpu_comp_prepare() could race with compression/decompression is when a pool is being created. In this case, reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the same CPU AFAICT. However, because the pool is still under creation, it will not be used (i.e. zswap_pool_current_get() won't find it). So I think we don't need to worry about zswap_cpu_comp_prepare() racing with compression or decompression for the same pool and CPU. > > I verified that all the zswap_cpu_comp_prepare() allocations are done with > GFP_KERNEL, which implicitly allows direct reclaim. So this appears to be a > risk for deadlock between zswap_compress() and zswap_cpu_comp_prepare() > in general, i.e., aside from this patchset. > > I can think of the following options to resolve this, and would welcome > other suggestions: > > 1) Less intrusive: acomp_ctx_get_cpu_lock() should get the mutex, check > if acomp_ctx->__online is true, and if so, return the mutex. If > acomp_ctx->__online is false, then it returns NULL. In other words, we > don't have the for loop. > - This will cause recursions into direct reclaim from zswap_cpu_comp_prepare() > to fail, cpuhotplug to fail. However, there is no deadlock. > - zswap_compress() will need to detect NULL returned by > acomp_ctx_get_cpu_lock(), and return an error. > - zswap_decompress() will need a BUG_ON(!acomp_ctx) after calling > acomp_ctx_get_cpu_lock(). > - We won't be migrated to a different CPU because we hold the mutex, hence > zswap_cpu_comp_dead() will wait on the mutex. > > 2) More intrusive: We would need to use a gfp_t that prevents direct reclaim > and kswapd, i.e., something similar to GFP_TRANSHUGE_LIGHT in gfp_types.h, > but for non-THP allocations. If we decide to adopt this approach, we would > need changes in include/crypto/acompress.h, crypto/api.c, and crypto/acompress.c > to allow crypto_create_tfm_node() to call crypto_alloc_tfmmem() with this > new gfp_t, in lieu of GFP_KERNEL. > > > > > > > > > Hence, I added the "bool __online" because zswap_cpu_comp_prepare() > > > does not acquire the mutex lock while allocating resources. We have > > already > > > initialized the mutex, so in theory, it is possible for compress/decompress > > > to acquire the mutex lock. The __online acts as a way to indicate whether > > > compress/decompress can proceed reliably to use the resources. > > > > For compress/decompress to acquire the mutex they need to run on that > > CPU, and I don't think that's possible before onlining completes, so > > zswap_cpu_comp_prepare() must have already completed before > > compress/decompress can use that CPU IIUC. > > If we can make this assumption, that would be great! However, I am not > totally sure because of the GFP_KERNEL allocations in > zswap_cpu_comp_prepare(). As I mentioned above, when zswap_cpu_comp_prepare() is run we are in one of two situations: - The pool is under creation, so we cannot race with stores/loads from that same pool. - The CPU is being onlined, in which case zswap_cpu_comp_prepare() is called from a control CPU before tasks start running on the CPU being onlined. Please correct me if I am wrong. [..] > > > > > @@ -285,13 +403,21 @@ static struct zswap_pool > > > > *zswap_pool_create(char *type, char *compressor) > > > > > goto error; > > > > > } > > > > > > > > > > - for_each_possible_cpu(cpu) > > > > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); > > > > > + for_each_possible_cpu(cpu) { > > > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > > > > >acomp_ctx, cpu); > > > > > + > > > > > + acomp_ctx->acomp = NULL; > > > > > + acomp_ctx->req = NULL; > > > > > + acomp_ctx->buffer = NULL; > > > > > + acomp_ctx->__online = false; > > > > > + acomp_ctx->nr_reqs = 0; > > > > > > > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them > > > > right away? > > > > > > Yes, I figured this is needed for two reasons: > > > > > > 1) For the error handling in zswap_cpu_comp_prepare() and calls into > > > zswap_cpu_comp_dealloc() to be handled by the common procedure > > > "acomp_ctx_dealloc()" unambiguously. > > > > This makes sense. When you move the refactoring to create > > acomp_ctx_dealloc() to a separate patch, please include this change in > > it and call it out explicitly in the commit message. > > Sure. > > > > > > 2) The second scenario I thought of that would need this, is let's say > > > the zswap compressor is switched immediately after setting the > > > compressor. Some cores have executed the onlining code and > > > some haven't. Because there are no pool refs held, > > > zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured > > > it would help to initialize these acomp_ctx members before the > > > hand-off to "cpuhp_state_add_instance()" in zswap_pool_create(). > > > > I believe cpuhp_state_add_instance() calls the onlining function > > synchronously on all present CPUs, so I don't think it's possible to end > > up in a state where the pool is being destroyed and some CPU executed > > zswap_cpu_comp_prepare() while others haven't. > > I looked at the cpuhotplug code some more. The startup callback is > invoked sequentially for_each_present_cpu(). If an error occurs for any > one of them, it calls the teardown callback only on the earlier cores that > have already finished running the startup callback. However, > zswap_cpu_comp_dealloc() will be called for all cores, even the ones > for which the startup callback was not run. Hence, I believe the > zero initialization is useful, albeit using alloc_percpu_gfp(__GFP_ZERO) > to allocate the acomp_ctx. Yeah this is point (1) above IIUC, and I agree about zero initialization for that. > > > > > That being said, this made me think of a different problem. If pool > > destruction races with CPU onlining, there could be a race between > > zswap_cpu_comp_prepare() allocating resources and > > zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them. > > > > I believe we must always call cpuhp_state_remove_instance() *before* > > freeing the resources to prevent this race from happening. This needs to > > be documented with a comment. > > Yes, this race condition is possible, thanks for catching this! The problem with > calling cpuhp_state_remove_instance() before freeing the resources is that > cpuhp_state_add_instance() and cpuhp_state_remove_instance() both > acquire a "mutex_lock(&cpuhp_state_mutex);" at the beginning; and hence > are serialized. > > For the reasons motivating why acomp_ctx->__online is set to false in > zswap_cpu_comp_dead(), I cannot call cpuhp_state_remove_instance() > before calling acomp_ctx_dealloc() because the latter could wait until > acomp_ctx->__online to be true before deleting the resources. I will > think about this some more. I believe this problem goes away with acomp_ctx->__online going away, right? > > Another possibility is to not rely on cpuhotplug in zswap, and instead > manage the per-cpu acomp_ctx resource allocation entirely in > zswap_pool_create(), and deletion entirely in zswap_pool_destroy(), > along with the necessary error handling. Let me think about this some > more as well. > > > > > Let me know if I missed something. > > > > > > > > Please let me know if these are valid considerations. > > > > > > > > > > > If it is in fact needed we should probably just use __GFP_ZERO. > > > > > > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of > > "alloc_percpu()" > > > for the acomp_ctx? > > > > Yeah if we need to initialize all/most fields to 0 let's use > > alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO. > > Sounds good. > > > > > [..] > > > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx > > > > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > > > > > > > > > > for (;;) { > > > > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > > > > > - mutex_lock(&acomp_ctx->mutex); > > > > > - if (likely(acomp_ctx->req)) > > > > > - return acomp_ctx; > > > > > /* > > > > > - * It is possible that we were migrated to a different CPU > > > > after > > > > > - * getting the per-CPU ctx but before the mutex was > > > > acquired. If > > > > > - * the old CPU got offlined, zswap_cpu_comp_dead() could > > > > have > > > > > - * already freed ctx->req (among other things) and set it to > > > > > - * NULL. Just try again on the new CPU that we ended up on. > > > > > + * If the CPU onlining code successfully allocates acomp_ctx > > > > resources, > > > > > + * it sets acomp_ctx->__online to true. Until this happens, we > > > > have > > > > > + * two options: > > > > > + * > > > > > + * 1. Return NULL and fail all stores on this CPU. > > > > > + * 2. Retry, until onlining has finished allocating resources. > > > > > + * > > > > > + * In theory, option 1 could be more appropriate, because it > > > > > + * allows the calling procedure to decide how it wants to > > > > handle > > > > > + * reclaim racing with CPU hotplug. For instance, it might be > > > > Ok > > > > > + * for compress to return an error for the backing swap device > > > > > + * to store the folio. Decompress could wait until we get a > > > > > + * valid and locked mutex after onlining has completed. For > > > > now, > > > > > + * we go with option 2 because adding a do-while in > > > > > + * zswap_decompress() adds latency for software > > > > compressors. > > > > > + * > > > > > + * Once initialized, the resources will be de-allocated only > > > > > + * when the pool is destroyed. The acomp_ctx will hold on to > > > > the > > > > > + * resources through CPU offlining/onlining at any time until > > > > > + * the pool is destroyed. > > > > > + * > > > > > + * This prevents races/deadlocks between reclaim and CPU > > > > acomp_ctx > > > > > + * resource allocation that are a dependency for reclaim. > > > > > + * It further simplifies the interaction with CPU onlining and > > > > > + * offlining: > > > > > + * > > > > > + * - CPU onlining does not take the mutex. It only allocates > > > > > + * resources and sets __online to true. > > > > > + * - CPU offlining acquires the mutex before setting > > > > > + * __online to false. If reclaim has acquired the mutex, > > > > > + * offlining will have to wait for reclaim to complete before > > > > > + * hotunplug can proceed. Further, hotplug merely sets > > > > > + * __online to false. It does not delete the acomp_ctx > > > > > + * resources. > > > > > + * > > > > > + * Option 1 is better than potentially not exiting the earlier > > > > > + * for (;;) loop because the system is running low on memory > > > > > + * and/or CPUs are getting offlined for whatever reason. At > > > > > + * least failing this store will prevent data loss by failing > > > > > + * zswap_store(), and saving the data in the backing swap > > > > device. > > > > > */ > > > > > > > > I believe we can dropped. I don't think we can have any store/load > > > > operations on a CPU before it's fully onlined, and we should always have > > > > a reference on the pool here, so the resources cannot go away. > > > > > > > > So unless I missed something we can drop this completely now and just > > > > hold the mutex directly in the load/store paths. > > > > > > Based on the above explanations, please let me know if it is a good idea > > > to keep the __online, or if you think further simplification is possible. > > > > I still think it's not needed. Let me know if I missed anything. > > Let me think some more about whether it is feasible to not have cpuhotplug > manage the acomp_ctx resource allocation, and instead have this be done > through the pool creation/deletion routines. I don't think this is necessary, see my comments above.