From: Yosry Ahmed <yosryahmed@google.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chengming Zhou,
	Baolin Wang, Chris Li, Ryan Roberts, David Hildenbrand,
	Matthew Wilcox, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Mon, 10 Jun 2024 14:11:36 -0700
Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios
References: <20240608023654.3513385-1-yosryahmed@google.com>

On Mon, Jun 10, 2024 at 2:00 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Jun 11, 2024 at 4:12 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Jun 10, 2024 at 1:06 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, Jun 11, 2024 at 1:42 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Fri, Jun 7, 2024 at 9:13 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sat, Jun 8, 2024 at 10:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > Zswap does not support storing or loading large folios. Until
> > > > > > proper support is added, attempts to load large folios from zswap
> > > > > > are a bug.
> > > > > >
> > > > > > For example, if a swapin fault observes that contiguous PTEs are
> > > > > > pointing to contiguous swap entries and tries to swap them in as a
> > > > > > large folio, swap_read_folio() will pass a large folio to
> > > > > > zswap_load(), but zswap_load() will only effectively load the first
> > > > > > page in the folio. If the first page is not in zswap, the folio
> > > > > > will be read from disk, even though other pages may be in zswap.
> > > > > >
> > > > > > In both cases, this will lead to silent data corruption. Proper
> > > > > > support needs to be added before large folio swapins and zswap can
> > > > > > work together.
> > > > > >
> > > > > > Looking at callers of swap_read_folio(), it seems like the folios
> > > > > > are allocated either in __read_swap_cache_async() or in
> > > > > > do_swap_page() in the SWP_SYNCHRONOUS_IO path, both of which
> > > > > > allocate order-0 folios, so everything is fine for now.
> > > > > >
> > > > > > However, there is ongoing work to add support for large folio
> > > > > > swapins [1]. To make sure new development does not break zswap (or
> > > > > > get broken by zswap), add minimal handling of incorrect loads of
> > > > > > large folios to zswap.
> > > > > >
> > > > > > First, move the call to folio_mark_uptodate() inside zswap_load().
> > > > > >
> > > > > > If a large folio load is attempted, and any page in that folio is
> > > > > > in zswap, return 'true' without calling folio_mark_uptodate(). This
> > > > > > will prevent the folio from being read from disk, and will emit an
> > > > > > IO error because the folio is not uptodate (e.g. do_swap_page()
> > > > > > will return VM_FAULT_SIGBUS). It may not be a reliable recovery in
> > > > > > all cases, but it is better than nothing.
> > > > > >
> > > > > > This was tested by hacking the allocation in
> > > > > > __read_swap_cache_async() to use order 2 and __GFP_COMP.
> > > > > >
> > > > > > In the future, to handle this correctly, the swapin code should:
> > > > > > (a) Fall back to order-0 swapins if zswap was ever used on the
> > > > > >     machine, because compressed pages remain in zswap after it is
> > > > > >     disabled.
> > > > > > (b) Add proper support for swapping in large folios from zswap
> > > > > >     (fully or partially).
> > > > > >
> > > > > > Probably start with (a), then follow up with (b).
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/
> > > > > >
> > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > > > > ---
> > > > > >
> > > > > > v1: https://lore.kernel.org/lkml/20240606184818.1566920-1-yosryahmed@google.com/
> > > > > >
> > > > > > v1 -> v2:
> > > > > > - Instead of using VM_BUG_ON(), use WARN_ON_ONCE() and add some
> > > > > >   recovery handling (David Hildenbrand).
> > > > > >
> > > > > > ---
> > > > > >  mm/page_io.c |  1 -
> > > > > >  mm/zswap.c   | 22 +++++++++++++++++++++-
> > > > > >  2 files changed, 21 insertions(+), 2 deletions(-)
> > > > > > diff --git a/mm/page_io.c b/mm/page_io.c
> > > > > > index f1a9cfab6e748..8f441dd8e109f 100644
> > > > > > --- a/mm/page_io.c
> > > > > > +++ b/mm/page_io.c
> > > > > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
> > > > > >         delayacct_swapin_start();
> > > > > >
> > > > > >         if (zswap_load(folio)) {
> > > > > > -               folio_mark_uptodate(folio);
> > > > > >                 folio_unlock(folio);
> > > > > >         } else if (data_race(sis->flags & SWP_FS_OPS)) {
> > > > > >                 swap_read_folio_fs(folio, plug);
> > > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > > index b9b35ef86d9be..ebb878d3e7865 100644
> > > > > > --- a/mm/zswap.c
> > > > > > +++ b/mm/zswap.c
> > > > > > @@ -1557,6 +1557,26 @@ bool zswap_load(struct folio *folio)
> > > > > >
> > > > > >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > > > >
> > > > > > +       /*
> > > > > > +        * Large folios should not be swapped in while zswap is being used, as
> > > > > > +        * they are not properly handled. Zswap does not properly load large
> > > > > > +        * folios, and a large folio may only be partially in zswap.
> > > > > > +        *
> > > > > > +        * If any of the subpages are in zswap, reading from disk would result
> > > > > > +        * in data corruption, so return true without marking the folio uptodate
> > > > > > +        * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > > > > > +        *
> > > > > > +        * Otherwise, return false and read the folio from disk.
> > > > > > +        */
> > > > > > +       if (folio_test_large(folio)) {
> > > > > > +               if (xa_find(tree, &offset,
> > > > > > +                           offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
> > > > > > +                       WARN_ON_ONCE(1);
> > > > > > +                       return true;
> > > > > > +               }
> > > > > > +               return false;
> > > > >
> > > > > IMHO, this appears to be over-designed. Personally, I would opt to
> > > > > use
> > > > >
> > > > > if (folio_test_large(folio))
> > > > >         return true;
> > > >
> > > > I am sure you mean "return false" here. Always returning true means we
> > > > will never read a large folio from either zswap or disk, whether it's
> > > > in zswap or not. Basically guaranteeing data corruption for large
> > > > folio swapin, even if zswap is disabled :)
> > > > >
> > > > > Before we address large folio support in zswap, it's essential
> > > > > not to let them coexist. Expecting valid data by lunchtime is
> > > > > not advisable.
> > > >
> > > > The goal here is to enable development for large folio swapin without
> > > > breaking zswap or being blocked on adding support in zswap. If we
> > > > always return false for large folios, as you suggest, then even if the
> > > > folio is in zswap (or parts of it), we will go read it from disk. This
> > > > will result in silent data corruption.
> > > >
> > > > As you mentioned before, you spent a week debugging problems with your
> > > > large folio swapin series because of a zswap problem, and even after
> > > > that, the zswap_is_enabled() check you had is not enough to prevent
> > > > problems as I mentioned before (if zswap was enabled earlier). So we
> > > > need stronger checks to make sure we don't break things when we
> > > > support large folio swapin.
> > > >
> > > > Since we can't just check if zswap is enabled or not, we need to
> > > > rather check if the folio (or any part of it) is in zswap or not.
> > > > We can only WARN in that case, but delivering the error to userspace
> > > > is a couple of extra lines of code (not setting uptodate), and will
> > > > make the problem much easier to notice.
> > > >
> > > > I am not sure I understand what you mean. The alternative is to
> > > > introduce a config option (perhaps internal) for large folio swapin,
> > > > and make this depend on !CONFIG_ZSWAP, or make zswap refuse to get
> > > > enabled if large folio swapin is enabled (through config or boot
> > > > option). This is until proper handling is added, of course.
> > >
> > > Hi Yosry,
> > > My point is that anybody attempting to do large folio swap-in should
> > > either
> > > 1. always use small folios if zswap has ever been enabled, before or now,
> > > or
> > > 2. address the large folio swapin issues in zswap.
> > >
> > > There is no 3rd way which you are providing.
> > >
> > > It is over-designed to give users true or false based on whether the
> > > data is in zswap, as there is always a chance data could be in zswap.
> > > So before approach 2 is done, we should always WARN_ON large folios and
> > > report data corruption.
> >
> > We can't always WARN_ON for large folios, as this will fire even if
> > zswap was never enabled. The alternative is tracking whether zswap was
> > ever enabled, and checking that instead of checking if any part of the
> > folio is in zswap.
> >
> > Basically replacing xa_find(..) with zswap_was_enabled(..) or something.
>
> My point is that the mm core should always fall back:
>
> if (zswap_was_or_is_enabled())
>         goto fallback;
>
> until zswap fixes the issue. This is the only way to enable large folio
> swap-in development before we fix zswap.

I agree with this, I just want an extra fallback in zswap itself in case
something was missed during large folio swapin development (which can
evidently happen).
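
For reference, a minimal sketch of that combination, assuming a hypothetical
zswap_was_or_is_enabled() helper (the name is taken from the snippet above;
no such helper exists in the tree yet):

/*
 * Sketch only: in the mm core, before attempting a large folio swapin,
 * skip large folios entirely if zswap is, or ever was, enabled.
 * Compressed pages can remain in zswap after it is disabled, so checking
 * "is enabled" alone is not enough. zswap_was_or_is_enabled() is the
 * hypothetical helper discussed in this thread.
 */
if (zswap_was_or_is_enabled())
        goto fallback;  /* allocate an order-0 folio instead */

The zswap-side check in this patch is intentionally redundant with that
mm-core fallback: it only exists to turn a missed fallback into a loud IO
error (folio not marked uptodate) rather than silent data corruption.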