From: Yosry Ahmed <yosryahmed@google.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chengming Zhou,
	Baolin Wang, Chris Li, Ryan Roberts, David Hildenbrand,
	Matthew Wilcox, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Mon, 10 Jun 2024 14:11:36 -0700
Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios
References: <20240608023654.3513385-1-yosryahmed@google.com>

On Mon, Jun 10, 2024 at 2:00 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Jun 11, 2024 at 4:12 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Jun 10, 2024 at 1:06 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, Jun 11, 2024 at 1:42 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Fri, Jun 7, 2024 at 9:13 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sat, Jun 8, 2024 at 10:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > Zswap does not support storing or loading large folios. Until
> > > > > > proper support is added, attempts to load large folios from zswap
> > > > > > are a bug.
> > > > > >
> > > > > > For example, if a swapin fault observes that contiguous PTEs are
> > > > > > pointing to contiguous swap entries and tries to swap them in as a
> > > > > > large folio, swap_read_folio() will pass a large folio to
> > > > > > zswap_load(), but zswap_load() will only effectively load the first
> > > > > > page in the folio. If the first page is not in zswap, the folio
> > > > > > will be read from disk, even though other pages may be in zswap.
> > > > > >
> > > > > > In both cases, this will lead to silent data corruption. Proper
> > > > > > support needs to be added before large folio swapins and zswap can
> > > > > > work together.
> > > > > >
> > > > > > Looking at callers of swap_read_folio(), it seems like the folios
> > > > > > are allocated either in __read_swap_cache_async() or in
> > > > > > do_swap_page() in the SWP_SYNCHRONOUS_IO path, both of which
> > > > > > allocate order-0 folios, so everything is fine for now.
> > > > > >
> > > > > > However, there is ongoing work to add support for large folio
> > > > > > swapins [1]. To make sure new development does not break zswap (or
> > > > > > get broken by zswap), add minimal handling of incorrect loads of
> > > > > > large folios to zswap.
> > > > > >
> > > > > > First, move the call to folio_mark_uptodate() inside zswap_load().
> > > > > >
> > > > > > If a large folio load is attempted, and any page in that folio is
> > > > > > in zswap, return 'true' without calling folio_mark_uptodate(). This
> > > > > > will prevent the folio from being read from disk, and will emit an
> > > > > > IO error because the folio is not uptodate (e.g. do_swap_page()
> > > > > > will return VM_FAULT_SIGBUS). It may not be a reliable recovery in
> > > > > > all cases, but it is better than nothing.
> > > > > >
> > > > > > This was tested by hacking the allocation in
> > > > > > __read_swap_cache_async() to use order 2 and __GFP_COMP.
> > > > > >
> > > > > > In the future, to handle this correctly, the swapin code should:
> > > > > > (a) Fall back to order-0 swapins if zswap was ever used on the
> > > > > >     machine, because compressed pages remain in zswap after it is
> > > > > >     disabled.
> > > > > > (b) Add proper support for swapping in large folios from zswap
> > > > > >     (fully or partially).
> > > > > >
> > > > > > Probably start with (a), then follow up with (b).
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/
> > > > > >
> > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > > > > ---
> > > > > >
> > > > > > v1: https://lore.kernel.org/lkml/20240606184818.1566920-1-yosryahmed@google.com/
> > > > > >
> > > > > > v1 -> v2:
> > > > > > - Instead of using VM_BUG_ON(), use WARN_ON_ONCE() and add some
> > > > > >   recovery handling (David Hildenbrand).
> > > > > >
> > > > > > ---
> > > > > >  mm/page_io.c |  1 -
> > > > > >  mm/zswap.c   | 22 +++++++++++++++++++++-
> > > > > >  2 files changed, 21 insertions(+), 2 deletions(-)
> > > > > > diff --git a/mm/page_io.c b/mm/page_io.c
> > > > > > index f1a9cfab6e748..8f441dd8e109f 100644
> > > > > > --- a/mm/page_io.c
> > > > > > +++ b/mm/page_io.c
> > > > > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
> > > > > >         delayacct_swapin_start();
> > > > > >
> > > > > >         if (zswap_load(folio)) {
> > > > > > -               folio_mark_uptodate(folio);
> > > > > >                 folio_unlock(folio);
> > > > > >         } else if (data_race(sis->flags & SWP_FS_OPS)) {
> > > > > >                 swap_read_folio_fs(folio, plug);
> > > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > > index b9b35ef86d9be..ebb878d3e7865 100644
> > > > > > --- a/mm/zswap.c
> > > > > > +++ b/mm/zswap.c
> > > > > > @@ -1557,6 +1557,26 @@ bool zswap_load(struct folio *folio)
> > > > > >
> > > > > >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > > > >
> > > > > > +       /*
> > > > > > +        * Large folios should not be swapped in while zswap is being used, as
> > > > > > +        * they are not properly handled. Zswap does not properly load large
> > > > > > +        * folios, and a large folio may only be partially in zswap.
> > > > > > +        *
> > > > > > +        * If any of the subpages are in zswap, reading from disk would result
> > > > > > +        * in data corruption, so return true without marking the folio uptodate
> > > > > > +        * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > > > > > +        *
> > > > > > +        * Otherwise, return false and read the folio from disk.
> > > > > > +        */
> > > > > > +       if (folio_test_large(folio)) {
> > > > > > +               if (xa_find(tree, &offset,
> > > > > > +                           offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
> > > > > > +                       WARN_ON_ONCE(1);
> > > > > > +                       return true;
> > > > > > +               }
> > > > > > +               return false;
> > > > >
> > > > > IMHO, this appears to be over-designed. Personally, I would opt to
> > > > > use
> > > > >
> > > > > if (folio_test_large(folio))
> > > > >         return true;
> > > >
> > > > I am sure you mean "return false" here. Always returning true means we
> > > > will never read a large folio from either zswap or disk, whether it's
> > > > in zswap or not. Basically guaranteeing data corruption for large
> > > > folio swapin, even if zswap is disabled :)
> > > > >
> > > > > Before we address large folio support in zswap, it's essential
> > > > > not to let them coexist. Expecting valid data by lunchtime is
> > > > > not advisable.
> > > >
> > > > The goal here is to enable development for large folio swapin without
> > > > breaking zswap or being blocked on adding support in zswap. If we
> > > > always return false for large folios, as you suggest, then even if the
> > > > folio is in zswap (or parts of it), we will go read it from disk. This
> > > > will result in silent data corruption.
> > > >
> > > > As you mentioned before, you spent a week debugging problems with your
> > > > large folio swapin series because of a zswap problem, and even after
> > > > that, the zswap_is_enabled() check you had is not enough to prevent
> > > > problems as I mentioned before (if zswap was enabled earlier). So we
> > > > need stronger checks to make sure we don't break things when we
> > > > support large folio swapin.
> > > >
> > > > Since we can't just check if zswap is enabled or not, we need to
> > > > rather check if the folio (or any part of it) is in zswap or not.
> > > > We can only WARN in that case, but delivering the error to userspace
> > > > is a couple of extra lines of code (not setting uptodate), and will
> > > > make the problem much easier to notice.
> > > >
> > > > I am not sure I understand what you mean. The alternative is to
> > > > introduce a config option (perhaps internal) for large folio swapin,
> > > > and make this depend on !CONFIG_ZSWAP, or make zswap refuse to get
> > > > enabled if large folio swapin is enabled (through config or boot
> > > > option). This is until proper handling is added, of course.
> > >
> > > Hi Yosry,
> > > My point is that anybody attempting to do large folio swap-in should
> > > either
> > > 1. always use small folios if zswap has ever been enabled, before or now,
> > > or
> > > 2. address the large folio swapin issues in zswap.
> > >
> > > There is no 3rd way which you are providing.
> > >
> > > It is over-designed to give users true or false based on whether the
> > > data is in zswap, as there is always a chance data could be in zswap.
> > > So before approach 2 is done, we should always WARN_ON large folios and
> > > report data corruption.
> >
> > We can't always WARN_ON for large folios, as this will fire even if
> > zswap was never enabled. The alternative is tracking whether zswap was
> > ever enabled, and checking that instead of checking if any part of the
> > folio is in zswap.
> >
> > Basically replacing xa_find(..) with zswap_was_enabled(..) or something.
>
> My point is that the mm core should always fall back:
>
> if (zswap_was_or_is_enabled())
>         goto fallback;
>
> until zswap fixes the issue. This is the only way to enable large folio
> swap-in development before we fix zswap.

I agree with this, I just want an extra fallback in zswap itself in case
something was missed during large folio swapin development (which can
evidently happen).
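
For reference, a minimal sketch of that combination, assuming a hypothetical
zswap_was_or_is_enabled() helper (the name is taken from the snippet above;
no such helper exists in the tree yet):

/*
 * Sketch only: in the mm core, before attempting a large folio swapin,
 * skip large folios entirely if zswap is, or ever was, enabled.
 * Compressed pages can remain in zswap after it is disabled, so checking
 * "is enabled" alone is not enough. zswap_was_or_is_enabled() is the
 * hypothetical helper discussed in this thread.
 */
if (zswap_was_or_is_enabled())
        goto fallback;  /* allocate an order-0 folio instead */

The zswap-side check in this patch is intentionally redundant with that
mm-core fallback: it only exists to turn a missed fallback into a loud IO
error (folio not marked uptodate) rather than silent data corruption.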