Date: Tue, 25 Feb 2025 19:51:49 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: Nhat Pham <nphamcs@gmail.com>
Cc: akpm@linux-foundation.org, yosryahmed@google.com, chengming.zhou@linux.dev, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] zswap: do not crash the kernel on decompression failure
Message-ID: <20250226005149.GA1500140@cmpxchg.org>
References: <20250225213200.729056-1-nphamcs@gmail.com>
In-Reply-To: <20250225213200.729056-1-nphamcs@gmail.com>

On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> Currently, we crash the kernel when a decompression failure occurs in
> zswap (either because of memory corruption, or a bug in the compression
> algorithm). This is overkill. We should only SIGBUS the unfortunate
> process asking for the zswap entry on zswap load, and skip the corrupted
> entry in zswap writeback.
> 
> See [1] for a recent upstream discussion about this.
> 
> [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/
> 
> Suggested-by: Matthew Wilcox
> Suggested-by: Yosry Ahmed
> Signed-off-by: Nhat Pham
> ---
>  mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 58 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f6316b66fb23..31d4397eed61 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
>  static u64 zswap_reject_compress_fail;
>  /* Compressed page was too big for the allocator to (optimally) store */
>  static u64 zswap_reject_compress_poor;
> +/* Load and writeback failed due to decompression failure */
> +static u64 zswap_reject_decompress_fail;

"reject" refers to "rejected store", so the name doesn't quite make
sense. zswap_decompress_fail?
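
(Not part of the patch - just to illustrate the suggestion, a minimal
sketch of the rename, keeping the counter's placement and comment:

        /* Load and writeback failed due to decompression failure */
        static u64 zswap_decompress_fail;

with the increment site in zswap_decompress() becoming
zswap_decompress_fail++. If the counter is also exposed through zswap's
debugfs directory like the existing reject_* counters, the entry name
would presumably drop the "reject_" prefix for the same reason.)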
>  /* Store failed because underlying allocator could not get memory */
>  static u64 zswap_reject_alloc_fail;
>  /* Store failed because the entry metadata could not be allocated (rare) */
> @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	return comp_ret == 0 && alloc_ret == 0;
>  }
>  
> -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  {
>  	struct zpool *zpool = entry->pool->zpool;
>  	struct scatterlist input, output;
>  	struct crypto_acomp_ctx *acomp_ctx;
> +	bool ret = true;
>  	u8 *src;
>  
>  	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  	sg_init_table(&output, 1);
>  	sg_set_folio(&output, folio, PAGE_SIZE, 0);
>  	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> -	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> -	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> +	if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
> +			acomp_ctx->req->dlen != PAGE_SIZE) {
> +		ret = false;
> +		zswap_reject_decompress_fail++;
> +		pr_alert_ratelimited(
> +			"decompression failed on zswap entry with offset %08lx\n",
> +			entry->swpentry.val);

Since this is a pretty dramatic failure scenario, IMO it would be
useful to dump as much info as possible. The exact return value of
crypto_wait_req() could be useful, entry->length and req->dlen too.

entry->pool->tfm_name just to make absolute sure there is no
confusion, as the compressor can be switched for new stores.

Maybe swp_type() and swp_offset() of entry->swpentry? Those could be
easy markers to see if the entry was corrupted for example.
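
(Again just for illustration, a rough sketch of what dumping that
extra information could look like. It assumes the zswap_decompress_fail
rename above, and "decomp_ret" and "dlen" are hypothetical locals - an
int and an unsigned int - that would need to be declared at the top of
zswap_decompress():

        decomp_ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
                                     &acomp_ctx->wait);
        dlen = acomp_ctx->req->dlen;
        if (decomp_ret || dlen != PAGE_SIZE) {
                ret = false;
                zswap_decompress_fail++;
                /* swap location, compressor and sizes to help spot corruption */
                pr_alert_ratelimited("Decompression error from zswap (%u:%lu %s %u->%u)\n",
                                     swp_type(entry->swpentry),
                                     swp_offset(entry->swpentry),
                                     entry->pool->tfm_name,
                                     entry->length, dlen);
        }

The exact message format is of course up for debate.)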
> +	}
>  	mutex_unlock(&acomp_ctx->mutex);
>  
>  	if (src != acomp_ctx->buffer)
>  		zpool_unmap_handle(zpool, entry->handle);
> +	return ret;
>  }
>  
>  /*********************************
> @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	struct writeback_control wbc = {
>  		.sync_mode = WB_SYNC_NONE,
>  	};
> +	int ret = 0;
>  
>  	/* try to allocate swap cache folio */
>  	mpol = get_task_policy(current);
> @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	 * and freed when invalidated by the concurrent shrinker anyway.
>  	 */
>  	if (!folio_was_allocated) {
> -		folio_put(folio);
> -		return -EEXIST;
> +		ret = -EEXIST;
> +		goto put_folio;
>  	}
>  
>  	/*
> @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	 * be dereferenced.
>  	 */
>  	tree = swap_zswap_tree(swpentry);
> -	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
> -		delete_from_swap_cache(folio);
> -		folio_unlock(folio);
> -		folio_put(folio);
> -		return -ENOMEM;
> +	if (entry != xa_load(tree, offset)) {
> +		ret = -ENOMEM;
> +		goto fail;
>  	}
>  
> -	zswap_decompress(entry, folio);
> +	if (!zswap_decompress(entry, folio)) {
> +		ret = -EIO;
> +		goto fail;
> +	}
> +
> +	xa_erase(tree, offset);
>  
>  	count_vm_event(ZSWPWB);
>  	if (entry->objcg)
> @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  
>  	/* start writeback */
>  	__swap_writepage(folio, &wbc);
> -	folio_put(folio);
> +	goto put_folio;
>  
> -	return 0;
> +fail:
> +	delete_from_swap_cache(folio);
> +	folio_unlock(folio);
> +put_folio:
> +	folio_put(folio);
> +	return ret;
>  }

Nice, yeah it's time for factoring out the error unwinding.

If you write it like this, you can save a jump in the main sequence:

	__swap_writepage(folio, &wbc);
	ret = 0;
put:
	folio_put(folio);
	return ret;

delete_unlock:
	delete_from_swap_cache(folio);
	folio_unlock(folio);
	goto put;

> 
>  /*********************************
> @@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio)
>  	if (WARN_ON_ONCE(folio_test_large(folio)))
>  		return true;
>  
> +	/*
> +	 * We cannot invalidate the zswap entry before decompressing it. If
> +	 * decompression fails, we must keep the entry in the tree so that
> +	 * a future read by another process on the same swap entry will also
> +	 * have to go through zswap. Otherwise, we risk silently reading
> +	 * corrupted data for the other process.
> +	 */
> +	entry = xa_load(tree, offset);
> +	if (!entry)
> +		return false;

The explanation in the comment makes sense in the context of this
change. But fresh eyes reading this function and having never seen
that this *used to* open with xa_erase() will be confused. It answers
questions the reader doesn't have at this point - it's just a function
called zswap_load() beginning with an xa_load(), so what?

At first I was going to suggest moving it down to the swapcache
branch. But honestly after reading *that* comment again, in the new
sequence, I don't think the question will arise there either. It's
pretty self-evident that the whole "we can invalidate when reading
into the swapcache" does not apply if the read failed.

> +	/*
> +	 * If decompression fails, we return true to notify the caller that the
> +	 * folio's data were in zswap, but do not mark the folio as up-to-date.
> +	 * This will effectively SIGBUS the calling process.
> +	 */

It would be good to put a lampshade on this weirdness that the return
value has nothing to do with success or not. This wasn't as important
a distinction, but with actual decompression failures I think it is.

Something like this?

	if (!zswap_decompress(entry, folio)) {
		/*
		 * The zswap_load() return value doesn't indicate success or
		 * failure, but whether zswap owns the swapped out contents.
		 * This MUST return true here, otherwise swap_readpage() will
		 * read garbage from the backend.
		 *
		 * Success is signaled by marking the folio uptodate.
		 */
		return true;
	}

	folio_mark_uptodate(folio);

> +	if (!zswap_decompress(entry, folio))
> +		return true;
> +
> +	count_vm_event(ZSWPIN);
> +	if (entry->objcg)
> +		count_objcg_events(entry->objcg, ZSWPIN, 1);
> +
>  	/*
>  	 * When reading into the swapcache, invalidate our entry. The
>  	 * swapcache can be the authoritative owner of the page and
> @@ -1612,21 +1654,8 @@ bool zswap_load(struct folio *folio)
>  	 * files, which reads into a private page and may free it if
>  	 * the fault fails. We remain the primary owner of the entry.)
>  	 */
> -	if (swapcache)
> -		entry = xa_erase(tree, offset);
> -	else
> -		entry = xa_load(tree, offset);
> -
> -	if (!entry)
> -		return false;
> -
> -	zswap_decompress(entry, folio);
> -
> -	count_vm_event(ZSWPIN);
> -	if (entry->objcg)
> -		count_objcg_events(entry->objcg, ZSWPIN, 1);
> -
>  	if (swapcache) {
> +		xa_erase(tree, offset);
>  		zswap_entry_free(entry);
>  		folio_mark_dirty(folio);
>  	}