From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id BD8AEC282CD for ; Mon, 3 Mar 2025 21:21:37 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E52C26B0093; Mon, 3 Mar 2025 16:21:36 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id E033B6B0095; Mon, 3 Mar 2025 16:21:36 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CCBA06B0096; Mon, 3 Mar 2025 16:21:36 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id AFA766B0093 for ; Mon, 3 Mar 2025 16:21:36 -0500 (EST) Received: from smtpin20.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 488AFC103A for ; Mon, 3 Mar 2025 21:21:36 +0000 (UTC) X-FDA: 83181511392.20.3ACAEB3 Received: from out-180.mta0.migadu.com (out-180.mta0.migadu.com [91.218.175.180]) by imf01.hostedemail.com (Postfix) with ESMTP id EC3CB40005 for ; Mon, 3 Mar 2025 21:21:33 +0000 (UTC) Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=QrCDFm39; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf01.hostedemail.com: domain of yosry.ahmed@linux.dev designates 91.218.175.180 as permitted sender) smtp.mailfrom=yosry.ahmed@linux.dev ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1741036894; a=rsa-sha256; cv=none; b=kxNPGHrOJmhjCQIjv1+DULm/IY0/HO1FaqT2/+tXEaWvf8wpF6s3MwTm+4BisJEXJSphcm RtVe5YldcARKaN248LDngfvXcIiewUb/sxPK4aTSGWcTiItxpGPr3Z1M+2IrWWGiM4Ev7z TfNpeNjt3DwtWubW5ujRERg89baTLMo= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=QrCDFm39; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf01.hostedemail.com: domain of yosry.ahmed@linux.dev designates 91.218.175.180 as permitted sender) smtp.mailfrom=yosry.ahmed@linux.dev ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1741036894; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=kL4tjzMzQo5uwMm6YTN0FCnZCyssi5BQcb9r960YpQM=; b=xDCSGPhGQt17eN/5bcdX9a+DtOU7NZhwli1Ac0NtxmwgT2WZWQS9VJEgzV06yhpsD/1VLn iuqkhf7cUIe/gj1gT0wCxVdgm27oGPrPlC2/bK35N7VfbFhRM9g3ipwAoDuNU41ObLsNQN /EWfxqAhPOcSE5ywWa/AoZxPV9O2Izk= Date: Mon, 3 Mar 2025 21:21:27 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1741036891; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=kL4tjzMzQo5uwMm6YTN0FCnZCyssi5BQcb9r960YpQM=; b=QrCDFm39CXH/h6DfmdEqATr6J1otV52h8hTQ+i3zgc+71cFLdP0gtJLzho/ndfwRJhDQ+r n2MLTJxLk9F/xVzHyO75zhyjlz8FMduaRGENeTdfBZuPB1Dt19yyJ1gvapYh81IE23VKju ++Zyy+mTS73/a0gs5N+no8NgHbKZqB0= X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yosry Ahmed To: Nhat Pham Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, chengming.zhou@linux.dev, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH v3] page_io: zswap: do not crash the kernel on decompression failure Message-ID: References: <20250303200627.2102890-1-nphamcs@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20250303200627.2102890-1-nphamcs@gmail.com> X-Migadu-Flow: FLOW_OUT X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: EC3CB40005 X-Stat-Signature: xjw7fb4pwqri1b346q9nx8dkj1441dde X-Rspam-User: X-HE-Tag: 1741036893-130531 X-HE-Meta: U2FsdGVkX18FSC9rPgtX2kdX3T5An5Iq1ABjfEjH4LCnccTHCquPh23ZTV2r84ceezwHui6boaEKj96wjx7Q3wRl8IvNsKIcXglWlhG2n2fRKhxkf8/pDnbovT8xeYzJAnJKzJeY+92jWW7wuaEocy1swaUXYlBXarMcuEDKC/Yt4ZN68DbzIjZHUMQ01yV/2XgY8WTf1D7NOmYuf1KcdIKkwfpGnK32SRrx7sYaX/NmzHxxRFrCYk90dEeyKI2tdraDeTzs+TGuEyTomqq7LduclbNMX44WiGGa/cg5ppNR8SnYYRmQV1OmNtUIZs86Edl1Pt0B56q/ICUcs8H/fdLUO/e1wysSj7iYnxEIyvKuYb6p7JmK9Fa3qLGM2I+dLoRbGhUGC06wZFVjMhnHknYd1/n6U3GeQo5qVHwJaobEYgdF50nyKD8+JripaH1UB5dOyWXljKWliEDIAhOmf45Im0EWCALaC22r8LK1w8v36caY2t9NLelGDfiVAUb98Hjm4oex4zKGbj3FhKJ7pFz/2DU+Mk4yi4Sn/wsUFghWeD3TuhjdKjht6D2oUIbsWeu7Zh81fTzrCB3gVYkqJVIwIgLrlov8ApX48j/eyKGthpAdRnjJwv+IvDdOpojHkXTj7teWmjycu92nuBGFwDY8K+OMxAHlFy+Eo1hMGUsbw8li1L1O72hYkypzTgjgm818gWbN//RqrnYelXlZde94NNCz6ir7c+0kLeLGNH89C+fgX8kAF0TeGpHsNmFn7L3AWeXFyAFV6G6I2uFfD8yZvG2CVD2Srguj9N4if2RaEm8ZjMYLIV1kjz0ygczAaFY6D3d0ctt0ffI+VnCsYXrhzx7daQD7MoGumpddJQrvfj8cpyWFJqViyzfEFiArAJGeOG5THmV5uuMkNLwnZjUzXJ6ieclgvwo/UtP3hH63eNPBQTN2tAZNShqNfPKqo0Yv7WdJOiw28iw8Bmp rZMHZdXD i0s3rGSASzMuw46nKPNAFLmXQqgAiX3x0z6W4WZWp5FTMDfms3pljEXWoFydynaP95nprCDvfkNBLzlpMgDLAq+xMOC7Cm0b6eET1ldgO3vkCYfg/suWZiPCGLLBFzKeGhU16edF9Qg1PTQCn+L+B1tmomvwcTj2WgViMZk7gp5mjJbMoMJIHInD6X71KTX5ckF7+JxC8eORKo2oDmJuZuTlFrzq7xwug/mFVk5ShuSdvkIGV4S7Hv6LlfA80Lq58KU8A3S1dWnYJakqWPxe34iQsPOkewTtm5GepIUbirL+vhanQd4Dgmh27HhF24gXTaSNH96MEWlJcWJhdWWFRnlRnc9OPxeAahcCaA1C4c6386n+bmiA/EyNd85J8EvBDnnz81mwyQLbks/YbqTB9oGw/z33F33KhtLiDgsPxW2C/eeg= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Mar 03, 2025 at 12:06:27PM -0800, Nhat Pham wrote: > Currently, we crash the kernel when a decompression failure occurs in > zswap (either because of memory corruption, or a bug in the compression > algorithm). This is overkill. We should only SIGBUS the unfortunate > process asking for the zswap entry on zswap load, and skip the corrupted > entry in zswap writeback. > > See [1] for a recent upstream discussion about this. > > The zswap writeback case is relatively straightforward to fix. For the > zswap_load() case, we reorganize the swap read paths, having > swap_read_folio_zeromap() and zswap_load() return specific error codes: > > * 0 indicates the backend owns the swapped out content, which was > successfully loaded into the page. > * -ENOENT indicates the backend does not own the swapped out content. > * -EINVAL and -EIO indicate the backend own the swapped out content, but > the content was not successfully loaded into the page for some > reasons. The folio will not be marked up-to-date, which will > eventually cause the process requesting the page to SIGBUS (see the > handling of not-up-to-date folio in do_swap_page() in mm/memory.c). > > zswap decompression failures on the zswap load path are treated as an > -EIO error, as described above, and will no longer crash the kernel. > > As a side effect, we require one extra zswap tree traversal in the load > and writeback paths. Quick benchmarking on a kernel build test shows no > performance difference: > > With the new scheme: > real: mean: 125.1s, stdev: 0.12s > user: mean: 3265.23s, stdev: 9.62s > sys: mean: 2156.41s, stdev: 13.98s > > The old scheme: > real: mean: 125.78s, stdev: 0.45s > user: mean: 3287.18s, stdev: 5.95s > sys: mean: 2177.08s, stdev: 26.52s > > [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/ > > Suggested-by: Matthew Wilcox > Suggested-by: Yosry Ahmed > Suggested-by: Johannes Weiner > Signed-off-by: Nhat Pham > --- > include/linux/zswap.h | 4 +- > mm/page_io.c | 35 ++++++++---- > mm/zswap.c | 130 ++++++++++++++++++++++++++++++------------ > 3 files changed, 120 insertions(+), 49 deletions(-) > > diff --git a/include/linux/zswap.h b/include/linux/zswap.h > index d961ead91bf1..9468cb3e0878 100644 > --- a/include/linux/zswap.h > +++ b/include/linux/zswap.h > @@ -26,7 +26,7 @@ struct zswap_lruvec_state { > > unsigned long zswap_total_pages(void); > bool zswap_store(struct folio *folio); > -bool zswap_load(struct folio *folio); > +int zswap_load(struct folio *folio); > void zswap_invalidate(swp_entry_t swp); > int zswap_swapon(int type, unsigned long nr_pages); > void zswap_swapoff(int type); > @@ -46,7 +46,7 @@ static inline bool zswap_store(struct folio *folio) > > static inline bool zswap_load(struct folio *folio) int? > { > - return false; > + return -ENOENT; > } > > static inline void zswap_invalidate(swp_entry_t swp) {} > diff --git a/mm/page_io.c b/mm/page_io.c > index 9b983de351f9..8a44faec3f92 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -511,7 +511,21 @@ static void sio_read_complete(struct kiocb *iocb, long ret) > mempool_free(sio, sio_pool); > } > > -static bool swap_read_folio_zeromap(struct folio *folio) > +/* > + * Return: one of the following error codes: nit: "Returns 0 on success, or one of the following errors on failure:" Also might as well make this a full kerneldoc? > + * > + * 0: the folio is zero-filled (and was populated as such and marked > + * up-to-date and unlocked). > + * > + * -ENOENT: the folio was not zero-filled. > + * > + * -EINVAL: some of the subpages in the folio are zero-filled, but not all of > + * them. This is an error because we don't currently support a large folio > + * that is partially in the zeromap. The folio is unlocked, but NOT marked > + * up-to-date, so that an IO error is emitted (e.g. do_swap_page() will > + * sigbus). > + */ > +static int swap_read_folio_zeromap(struct folio *folio) > { > int nr_pages = folio_nr_pages(folio); > struct obj_cgroup *objcg; > @@ -523,11 +537,13 @@ static bool swap_read_folio_zeromap(struct folio *folio) > * that an IO error is emitted (e.g. do_swap_page() will sigbus). > */ This comment says "Return true", so it needs to be updated. We probably don't need to explain the return value anymore since it's documented above. > if (WARN_ON_ONCE(swap_zeromap_batch(folio->swap, nr_pages, > - &is_zeromap) != nr_pages)) > - return true; > + &is_zeromap) != nr_pages)) { > + folio_unlock(folio); > + return -EINVAL; > + } > > if (!is_zeromap) > - return false; > + return -ENOENT; > > objcg = get_obj_cgroup_from_folio(folio); > count_vm_events(SWPIN_ZERO, nr_pages); > @@ -538,7 +554,8 @@ static bool swap_read_folio_zeromap(struct folio *folio) > > folio_zero_range(folio, 0, folio_size(folio)); > folio_mark_uptodate(folio); > - return true; > + folio_unlock(folio); > + return 0; > } > > static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug) > @@ -635,13 +652,11 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug) > } > delayacct_swapin_start(); > > - if (swap_read_folio_zeromap(folio)) { > - folio_unlock(folio); > + if (swap_read_folio_zeromap(folio) != -ENOENT) > goto finish; I would split the zeromap change into a separate patch, but it's probably fine either way. > - } else if (zswap_load(folio)) { > - folio_unlock(folio); > + > + if (zswap_load(folio) != -ENOENT) > goto finish; > - } > > /* We have to read from slower devices. Increase zswap protection. */ > zswap_folio_swapin(folio); > diff --git a/mm/zswap.c b/mm/zswap.c > index 6dbf31bd2218..b67481defc6c 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail; > static u64 zswap_reject_compress_fail; > /* Compressed page was too big for the allocator to (optimally) store */ > static u64 zswap_reject_compress_poor; > +/* Load or writeback failed due to decompression failure */ > +static u64 zswap_decompress_fail; > /* Store failed because underlying allocator could not get memory */ > static u64 zswap_reject_alloc_fail; > /* Store failed because the entry metadata could not be allocated (rare) */ > @@ -996,11 +998,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry, > return comp_ret == 0 && alloc_ret == 0; > } > > -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio) > { > struct zpool *zpool = entry->pool->zpool; > struct scatterlist input, output; > struct crypto_acomp_ctx *acomp_ctx; > + int decomp_ret, dlen; > u8 *src; > > acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool); > @@ -1025,12 +1028,31 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > sg_init_table(&output, 1); > sg_set_folio(&output, folio, PAGE_SIZE, 0); > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE); > - BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait)); > - BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE); > + decomp_ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); > + dlen = acomp_ctx->req->dlen; > > if (src != acomp_ctx->buffer) > zpool_unmap_handle(zpool, entry->handle); > acomp_ctx_put_unlock(acomp_ctx); > + > + if (decomp_ret || dlen != PAGE_SIZE) { > + zswap_decompress_fail++; > + pr_alert_ratelimited( > + "decompression failed with returned value %d on zswap entry with " nit: Decompression* I am also wondering how this looks like in dmesg? Is the line too long to be read? Should we add some line breaks (e.g. like warn_sysctl_write()), we could probably also put this in a helper to keep this function visually easy to follow. > + "swap entry value %08lx, swap type %d, and swap offset %lu. " > + "compression algorithm is %s. compressed size is %u bytes, and " > + "decompressed size is %u bytes.\n", > + decomp_ret, > + entry->swpentry.val, > + swp_type(entry->swpentry), > + swp_offset(entry->swpentry), > + entry->pool->tfm_name, > + entry->length, > + acomp_ctx->req->dlen); We should probably be using 'dlen' here since we're outside the lock. > + > + return false; > + } > + return true; > } > > /********************************* [..] > @@ -1620,7 +1651,29 @@ bool zswap_store(struct folio *folio) > return ret; > } > > -bool zswap_load(struct folio *folio) > +/** > + * zswap_load() - load a page from zswap > + * @folio: folio to load > + * > + * Return: one of the following error codes: nit: Returns 0 on success, or .. > + * > + * 0: if the swapped out content was in zswap and was successfully loaded. > + * The folio is unlocked and marked up-to-date. > + * > + * -EIO: if the swapped out content was in zswap, but could not be loaded > + * into the page (for e.g, because there was a memory corruption, or a > + * decompression bug). The folio is unlocked, but NOT marked up-to-date, > + * so that an IO error is emitted (e.g. do_swap_page() will SIGBUS). > + * > + * -EINVAL: if the swapped out content was in zswap, but the page belongs > + * to a large folio, which is not supported by zswap. The folio is unlocked, > + * but NOT marked up-to-date, so that an IO error is emitted (e.g. > + * do_swap_page() will SIGBUS). > + * > + * -ENOENT: if the swapped out content was not in zswap. The folio remains > + * locked on return. > + */ > +int zswap_load(struct folio *folio) > { > swp_entry_t swp = folio->swap; > pgoff_t offset = swp_offset(swp); [..]