From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 26 Feb 2025 02:40:35 +0000
From: Yosry Ahmed
To: Nhat Pham
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, chengming.zhou@linux.dev,
    linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] zswap: do not crash the kernel on decompression failure
References: <20250225213200.729056-1-nphamcs@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20250225213200.729056-1-nphamcs@gmail.com>

On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> Currently, we crash the kernel when a decompression failure occurs in
> zswap (either because of memory corruption, or a bug in the compression
> algorithm). This is overkill. We should only SIGBUS the unfortunate
> process asking for the zswap entry on zswap load, and skip the corrupted
> entry in zswap writeback.

It's probably worth adding more details here. For example, how the
SIGBUS is delivered (it's not super clear in the code), and the
distinction between handling decompression failures for loads and
writeback.

>
> See [1] for a recent upstream discussion about this.
>
> [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/
>
> Suggested-by: Matthew Wilcox
> Suggested-by: Yosry Ahmed

It's Yosry Ahmed now :P

> Signed-off-by: Nhat Pham
> ---
>  mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 58 insertions(+), 27 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f6316b66fb23..31d4397eed61 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
>  static u64 zswap_reject_compress_fail;
>  /* Compressed page was too big for the allocator to (optimally) store */
>  static u64 zswap_reject_compress_poor;
> +/* Load and writeback failed due to decompression failure */

Nit: Load or writeback?
> +static u64 zswap_reject_decompress_fail;
>  /* Store failed because underlying allocator could not get memory */
>  static u64 zswap_reject_alloc_fail;
>  /* Store failed because the entry metadata could not be allocated (rare) */
> @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>          return comp_ret == 0 && alloc_ret == 0;
>  }
>
> -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  {
>          struct zpool *zpool = entry->pool->zpool;
>          struct scatterlist input, output;
>          struct crypto_acomp_ctx *acomp_ctx;
> +        bool ret = true;
>          u8 *src;
>
>          acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>          sg_init_table(&output, 1);
>          sg_set_folio(&output, folio, PAGE_SIZE, 0);
>          acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> -        BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> -        BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> +        if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
> +            acomp_ctx->req->dlen != PAGE_SIZE) {
> +                ret = false;
> +                zswap_reject_decompress_fail++;
> +                pr_alert_ratelimited(
> +                        "decompression failed on zswap entry with offset %08lx\n",
> +                        entry->swpentry.val);
> +        }

This can probably be prettier if we save the return value of
crypto_wait_req() in a variable, and then do the check at the end of
the function:

zswap_decompress()
{
        mutex_lock();
        ...
        ret = crypto_wait_req(..);
        ...
        mutex_unlock();
        ...
        if (ret || acomp_ctx->req->dlen != PAGE_SIZE) {
                /* SHIT */
                return false;
        }
        return true;
}

>          mutex_unlock(&acomp_ctx->mutex);
>
>          if (src != acomp_ctx->buffer)
>                  zpool_unmap_handle(zpool, entry->handle);
> +        return ret;
>  }
>
>  /*********************************
> @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>          struct writeback_control wbc = {
>                  .sync_mode = WB_SYNC_NONE,
>          };
> +        int ret = 0;
>
>          /* try to allocate swap cache folio */
>          mpol = get_task_policy(current);
> @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>           * and freed when invalidated by the concurrent shrinker anyway.
>           */
>          if (!folio_was_allocated) {
> -                folio_put(folio);
> -                return -EEXIST;
> +                ret = -EEXIST;
> +                goto put_folio;
>          }
>
>          /*
> @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>           * be dereferenced.
>           */
>          tree = swap_zswap_tree(swpentry);
> -        if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
> -                delete_from_swap_cache(folio);
> -                folio_unlock(folio);
> -                folio_put(folio);
> -                return -ENOMEM;
> +        if (entry != xa_load(tree, offset)) {
> +                ret = -ENOMEM;
> +                goto fail;
>          }
>
> -        zswap_decompress(entry, folio);
> +        if (!zswap_decompress(entry, folio)) {
> +                ret = -EIO;
> +                goto fail;
> +        }
> +
> +        xa_erase(tree, offset);
>
>          count_vm_event(ZSWPWB);
>          if (entry->objcg)
> @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>
>          /* start writeback */
>          __swap_writepage(folio, &wbc);
> -        folio_put(folio);
> +        goto put_folio;
>
> -        return 0;
> +fail:
> +        delete_from_swap_cache(folio);
> +        folio_unlock(folio);
> +put_folio:
> +        folio_put(folio);
> +        return ret;
>  }
>
>  /*********************************
> @@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio)
>          if (WARN_ON_ONCE(folio_test_large(folio)))
>                  return true;
>
> +        /*
> +         * We cannot invalidate the zswap entry before decompressing it. If
> +         * decompression fails, we must keep the entry in the tree so that
> +         * a future read by another process on the same swap entry will also
> +         * have to go through zswap. Otherwise, we risk silently reading
> +         * corrupted data for the other process.
> +         */
> +        entry = xa_load(tree, offset);

We are doing load + erase here and in the writeback path now, so two
xarray walks instead of one. How does this affect performance?

We had a similar discussion about the possibility of doing a lockless
xas_load() followed by a conditional xas_erase() for zswap_invalidate():

https://lore.kernel.org/lkml/20241018192525.95862-1-ryncsn@gmail.com/

Unfortunately it seems like we can't trivially do that unless we keep
the tree locked, which we probably don't want to do throughout
decompression.

How crazy would it be to remove the entry from the tree and re-add it
if decompression fails? Does swapcache_prepare() provide sufficient
protection for us to do this without anyone else looking at this entry
(seems like it)? See the rough sketch at the end of this mail.

Anyway, this is all moot if the second walk is not noticeable from a
perf perspective.
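For reference, the remove-and-re-add idea above could look roughly like
the following in the load path. This is a completely untested sketch,
assuming swapcache_prepare() really is enough to keep other readers
away from the slot while the entry is temporarily out of the tree;
tree/offset/entry/folio are the locals already in zswap_load(),
everything else is made up:

        entry = xa_erase(tree, offset);
        if (!entry)
                return false;

        if (zswap_decompress(entry, folio)) {
                /* Success: entry is already gone from the tree, free it as usual. */
        } else {
                /*
                 * Decompression failed: put the entry back so that any future
                 * load of this slot still goes through zswap instead of
                 * silently reading stale data from the backing device, then
                 * fail the load the same way this patch already does.
                 */
                WARN_ON_ONCE(xa_err(xa_store(tree, offset, entry, GFP_KERNEL)));
        }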