Date: Wed, 26 Feb 2025 02:45:41 +0000
From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: Johannes Weiner
Cc: Nhat Pham, akpm@linux-foundation.org, chengming.zhou@linux.dev,
	linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] zswap: do not crash the kernel on decompression failure
In-Reply-To: <20250226005149.GA1500140@cmpxchg.org>
References: <20250225213200.729056-1-nphamcs@gmail.com>
	<20250226005149.GA1500140@cmpxchg.org>

On Tue, Feb 25, 2025 at 07:51:49PM -0500, Johannes Weiner wrote:
> On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > Currently, we crash the kernel when a decompression failure occurs in
> > zswap (either because of memory corruption, or a bug in the compression
> > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > process asking for the zswap entry on zswap load, and skip the corrupted
> > entry in zswap writeback.
> >
> > See [1] for a recent upstream discussion about this.
> >
> > [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/
> >
> > Suggested-by: Matthew Wilcox
> > Suggested-by: Yosry Ahmed
> > Signed-off-by: Nhat Pham
> > ---
> >  mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 58 insertions(+), 27 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index f6316b66fb23..31d4397eed61 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
> >  static u64 zswap_reject_compress_fail;
> >  /* Compressed page was too big for the allocator to (optimally) store */
> >  static u64 zswap_reject_compress_poor;
> > +/* Load and writeback failed due to decompression failure */
> > +static u64 zswap_reject_decompress_fail;
> 
> "reject" refers to "rejected store", so the name doesn't quite make
> sense. zswap_decompress_fail?
> 
> >  /* Store failed because underlying allocator could not get memory */
> >  static u64 zswap_reject_alloc_fail;
> >  /* Store failed because the entry metadata could not be allocated (rare) */
> > @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >  	return comp_ret == 0 && alloc_ret == 0;
> >  }
> >
> > -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> > +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >  {
> >  	struct zpool *zpool = entry->pool->zpool;
> >  	struct scatterlist input, output;
> >  	struct crypto_acomp_ctx *acomp_ctx;
> > +	bool ret = true;
> >  	u8 *src;
> >
> >  	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >  	sg_init_table(&output, 1);
> >  	sg_set_folio(&output, folio, PAGE_SIZE, 0);
> >  	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> > -	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> > -	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> > +	if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
> > +	    acomp_ctx->req->dlen != PAGE_SIZE) {
> > +		ret = false;
> > +		zswap_reject_decompress_fail++;
> > +		pr_alert_ratelimited(
> > +			"decompression failed on zswap entry with offset %08lx\n",
> > +			entry->swpentry.val);
> 
> Since this is a pretty dramatic failure scenario, IMO it would be
> useful to dump as much info as possible.
> 
> The exact return value of crypto_wait_req() could be useful,
> entry->length and req->dlen too.
> 
> entry->pool->tfm_name just to make absolute sure there is no
> confusion, as the compressor can be switched for new stores.
> 
> Maybe swp_type() and swp_offset() of entry->swpentry? Those could be
> easy markers to see if the entry was corrupted for example.
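
Agreed, dumping more context would make these reports much more useful.
Something along these lines is what I would imagine (completely untested
sketch; the 'err' local and the exact message format are just for
illustration, and I am using the renamed counter):

	int err = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
				  &acomp_ctx->wait);

	if (err || acomp_ctx->req->dlen != PAGE_SIZE) {
		ret = false;
		zswap_decompress_fail++;
		pr_alert_ratelimited("decompression failed with %d (compressor %s, swp type %u, swp offset %lu, compressed len %u, decompressed len %u)\n",
				     err, entry->pool->tfm_name,
				     swp_type(entry->swpentry),
				     swp_offset(entry->swpentry),
				     entry->length,
				     acomp_ctx->req->dlen);
	}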
> > +	}
> >  	mutex_unlock(&acomp_ctx->mutex);
> >
> >  	if (src != acomp_ctx->buffer)
> >  		zpool_unmap_handle(zpool, entry->handle);
> > +	return ret;
> >  }
> >
> >  /*********************************
> > @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  	struct writeback_control wbc = {
> >  		.sync_mode = WB_SYNC_NONE,
> >  	};
> > +	int ret = 0;
> >
> >  	/* try to allocate swap cache folio */
> >  	mpol = get_task_policy(current);
> > @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  	 * and freed when invalidated by the concurrent shrinker anyway.
> >  	 */
> >  	if (!folio_was_allocated) {
> > -		folio_put(folio);
> > -		return -EEXIST;
> > +		ret = -EEXIST;
> > +		goto put_folio;
> >  	}
> >
> >  	/*
> > @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  	 * be dereferenced.
> >  	 */
> >  	tree = swap_zswap_tree(swpentry);
> > -	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
> > -		delete_from_swap_cache(folio);
> > -		folio_unlock(folio);
> > -		folio_put(folio);
> > -		return -ENOMEM;
> > +	if (entry != xa_load(tree, offset)) {
> > +		ret = -ENOMEM;
> > +		goto fail;
> >  	}
> >
> > -	zswap_decompress(entry, folio);
> > +	if (!zswap_decompress(entry, folio)) {
> > +		ret = -EIO;
> > +		goto fail;
> > +	}
> > +
> > +	xa_erase(tree, offset);
> >
> >  	count_vm_event(ZSWPWB);
> >  	if (entry->objcg)
> > @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >
> >  	/* start writeback */
> >  	__swap_writepage(folio, &wbc);
> > -	folio_put(folio);
> > +	goto put_folio;
> >
> > -	return 0;
> > +fail:
> > +	delete_from_swap_cache(folio);
> > +	folio_unlock(folio);
> > +put_folio:
> > +	folio_put(folio);
> > +	return ret;
> >  }
> 
> Nice, yeah it's time for factoring out the error unwinding. If you
> write it like this, you can save a jump in the main sequence:
> 
> 	__swap_writepage(folio, &wbc);
> 	ret = 0;
> put:
> 	folio_put(folio);
> 	return ret;
> delete_unlock:

(I like how you sneaked the label rename in here, I didn't like 'fail'
either :P)

> 	delete_from_swap_cache(folio);
> 	folio_unlock(folio);
> 	goto put;

I would go even further and avoid gotos completely (and make it super
clear what gets executed in the normal path vs the failure path):

	__swap_writepage(folio, &wbc);
	folio_put(folio);

	if (ret) {
		delete_from_swap_cache(folio);
		folio_unlock(folio);
	}
	return ret;

> >
> >  /*********************************
> > @@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio)
> >  	if (WARN_ON_ONCE(folio_test_large(folio)))
> >  		return true;
> >
> > +	/*
> > +	 * We cannot invalidate the zswap entry before decompressing it. If
> > +	 * decompression fails, we must keep the entry in the tree so that
> > +	 * a future read by another process on the same swap entry will also
> > +	 * have to go through zswap. Otherwise, we risk silently reading
> > +	 * corrupted data for the other process.
> > +	 */
> > +	entry = xa_load(tree, offset);
> > +	if (!entry)
> > +		return false;
> 
> The explanation in the comment makes sense in the context of this
> change. But fresh eyes reading this function and having never seen
> that this *used to* open with xa_erase() will be confused. It answers
> questions the reader doesn't have at this point - it's just a function
> called zswap_load() beginning with an xa_load(), so what?
> 
> At first I was going to suggest moving it down to the swapcache
> branch. But honestly after reading *that* comment again, in the new
> sequence, I don't think the question will arise there either. It's
> pretty self-evident that the whole "we can invalidate when reading
> into the swapcache" does not apply if the read failed.

+1

> > +	/*
> > +	 * If decompression fails, we return true to notify the caller that the
> > +	 * folio's data were in zswap, but do not mark the folio as up-to-date.
> > +	 * This will effectively SIGBUS the calling process.
> > +	 */
> 
> It would be good to put a lampshade on this weirdness that the return
> value has nothing to do with success or not. This wasn't as important
> a distinction, but with actual decompression failures I think it is.
+1

> Something like this?
> 
> 	if (!zswap_decompress(entry, folio)) {
> 		/*
> 		 * The zswap_load() return value doesn't indicate success or
> 		 * failure, but whether zswap owns the swapped out contents.
> 		 * This MUST return true here, otherwise swap_readpage() will
> 		 * read garbage from the backend.
> 		 *
> 		 * Success is signaled by marking the folio uptodate.
> 		 */

We use the same trick in the folio_test_large() branch, so maybe this
should be moved to above the function definition. Then we can perhaps
refer to it in places where we return true without setting uptodate for
added clarity if needed (rough sketch of what I mean at the end of this
email).

> 		return true;
> 	}
> 
> 	folio_mark_uptodate(folio);

> > +	if (!zswap_decompress(entry, folio))
> > +		return true;
> > +
> > +	count_vm_event(ZSWPIN);
> > +	if (entry->objcg)
> > +		count_objcg_events(entry->objcg, ZSWPIN, 1);
> > +
> >  	/*
> >  	 * When reading into the swapcache, invalidate our entry. The
> >  	 * swapcache can be the authoritative owner of the page and
> > @@ -1612,21 +1654,8 @@ bool zswap_load(struct folio *folio)
> >  	 * files, which reads into a private page and may free it if
> >  	 * the fault fails. We remain the primary owner of the entry.)
> >  	 */
> > -	if (swapcache)
> > -		entry = xa_erase(tree, offset);
> > -	else
> > -		entry = xa_load(tree, offset);
> > -
> > -	if (!entry)
> > -		return false;
> > -
> > -	zswap_decompress(entry, folio);
> > -
> > -	count_vm_event(ZSWPIN);
> > -	if (entry->objcg)
> > -		count_objcg_events(entry->objcg, ZSWPIN, 1);
> > -
> >  	if (swapcache) {
> > +		xa_erase(tree, offset);
> >  		zswap_entry_free(entry);
> >  		folio_mark_dirty(folio);
> >  	}
>
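
As for moving the comment above the function definition, here is roughly
the kind of thing I have in mind (untested sketch, wording completely up
for grabs):

/*
 * Returns whether zswap owns the swapped out contents of the folio, *not*
 * whether the load was successful. Success is signaled by marking the
 * folio uptodate. Returning true without marking the folio uptodate means
 * the data was in zswap but could not be read; the caller then skips the
 * swap backend, and the faulting process effectively gets a SIGBUS instead
 * of silently reading garbage.
 */
bool zswap_load(struct folio *folio)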