From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id BC028C27C55
	for <linux-mm@archiver.kernel.org>; Mon, 10 Jun 2024 21:00:19 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 486EF6B009B; Mon, 10 Jun 2024 17:00:19 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 436FD6B009C; Mon, 10 Jun 2024 17:00:19 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 2D72C6B009D; Mon, 10 Jun 2024 17:00:19 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11])
	by kanga.kvack.org (Postfix) with ESMTP id 1161D6B009B
	for <linux-mm@kvack.org>; Mon, 10 Jun 2024 17:00:19 -0400 (EDT)
Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay08.hostedemail.com (Postfix) with ESMTP id 8FE70141171
	for <linux-mm@kvack.org>; Mon, 10 Jun 2024 21:00:18 +0000 (UTC)
X-FDA: 82216196916.27.7D16F22
Received: from mail-ua1-f52.google.com (mail-ua1-f52.google.com [209.85.222.52])
	by imf19.hostedemail.com (Postfix) with ESMTP id A1FAB1A000C
	for <linux-mm@kvack.org>; Mon, 10 Jun 2024 21:00:16 +0000 (UTC)
Authentication-Results: imf19.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=SRYQ2VC7;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf19.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.52 as permitted sender) smtp.mailfrom=21cnbao@gmail.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1718053216;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=jtglSl5nJSLjnhryyMKeeZ/HJbTMOfyFyynMlPClQCc=;
	b=o7wg6tanOueGviSKT4RKU0kzTgo0fQhNPNoMOKlRPcyDWl6KaaIwMH3Y/FfW10XcQ/xk/K
	JP1VlFReU5Q5FbUmdckYb6zLEoAKwxFtfwspazabuuY15L3lZbFlQwVywmdOZV+ghsdv+B
	Itk7AE7PqzwwjJZnlCgfz+NcjjFQUuw=
ARC-Authentication-Results: i=1;
	imf19.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=SRYQ2VC7;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf19.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.52 as permitted sender) smtp.mailfrom=21cnbao@gmail.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1718053216; a=rsa-sha256;
	cv=none;
	b=bBUaHAWDxpCJlmXbN4eLd4zUwNZ7+DDIgU1jNJm0ywN03kviT3PnOkHWJSUfXkNCwlQpdr
	g5r/bqm+Mms4awbRMNwZKI9IL/rTDDhPw8dyWH649id2gNcm7lAEwwUP/50pLFBQHGgmHB
	AH61f6Os54CvjWZdV0TPqAiu5WeJEKk=
Received: by mail-ua1-f52.google.com with SMTP id a1e0cc1a2514c-80b8e5cab7bso644068241.0
        for <linux-mm@kvack.org>; Mon, 10 Jun 2024 14:00:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1718053216; x=1718658016; darn=kvack.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=jtglSl5nJSLjnhryyMKeeZ/HJbTMOfyFyynMlPClQCc=;
        b=SRYQ2VC73jA3f2IDZtRFTefNe3yXZSJSlvyiowSsBMfXXzTbRv8UZrDbN6DWaxt04T
         iUWQeZus24tdY97s7PhkCrBTXl//WUEh3DJtbBzq28o8u6I6DL44qqHoOLul73NPcCmB
         ajFuVeo/9eLVYB9mHNS6Ixm3fVw4ZTPv5woNNHUGgx6xp4vOOM4OAL8a553u2puhjUWh
         IpvraKczqeJ7WRo16c1gUWoq+laMbjxZ4DKpp17vvxhtcVB5ce36XKM6vj991iYkjyVh
         8VOjA9ag1w9rrwfXuRGZ7umKtddMffBMLYbqSHYSgc5+emMYKyGrNVTeKfHnhhTmz8Sz
         1deg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1718053216; x=1718658016;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=jtglSl5nJSLjnhryyMKeeZ/HJbTMOfyFyynMlPClQCc=;
        b=F+SD6RtEKX9p0x8OeOCnQO9ucI3qrlMSNjQDDhUlRfBLkYN+qy+KV1PUsF47maMBdB
         94WFVk8d+j8koC9l3DjMMXanSTwgMnsMLZKFCR+MsOhfNyRaynYOrkISHUIi5kQEHOhi
         z+RnENd4NAwjB64WcK0nF+0zYdhBwPblm42EU5j1fsMYCar82JxffP/Yq6dZZh7Hd70H
         +cwRRmQCjl++Sz2vd30q/+3d+tfYN/zPZiE+Df43xxWKjx26DCJFALQTydnu/TyqQ4CQ
         YxWNFsi85tOGFpBaUzDdd96KrUTgpLis8osyw5pqwgKxruIcMqV6xF8jrWNhveBzjFeQ
         mV4Q==
X-Forwarded-Encrypted: i=1; AJvYcCUeNzcov/VWcelI5XYswb+BvlgMmk1VHsMzC8oYwfl4BSo/5TSIerJYrD8b0ckI40Z5RZ2XlC9NEiEFmZ/3+gqpvRo=
X-Gm-Message-State: AOJu0YwjSPCpiJTBDp92l9eryVydOuDbYecVk33jqXM17EiA5nn1pNyL
	hqsjrKTgTyByQkg83+NjryfGyuFyecCswpf3yMEzarO19IbKSbVM0Q1xjgsw4ZfwhnbKck4LyN7
	KGkZwQz2snG5EZupo3PbTAxYjTFk=
X-Google-Smtp-Source: AGHT+IFH91cYPqMQB2e5MjF84hEcwy9PEyTQdAbTPdEmgy+aTCyASRVIIvuAoOCusr9Da2e6/clC8wkqB0yx3OleiMk=
X-Received: by 2002:a05:6102:1277:b0:48c:329c:b3dc with SMTP id
 ada2fe7eead31-48c329cb660mr9229115137.11.1718053214959; Mon, 10 Jun 2024
 14:00:14 -0700 (PDT)
MIME-Version: 1.0
References: <20240608023654.3513385-1-yosryahmed@google.com>
 <CAGsJ_4yVmDKtcKKAdjPOkOMWB+=ZT5TrrWz188xJaTV4EpM1Bw@mail.gmail.com>
 <CAJD7tkaHLVcjUgiUFfBK+ztCqxBTRfyVbSVH9vytK=5JYOw+Sw@mail.gmail.com>
 <CAGsJ_4w-magFysq4uLBm46AzHLD+r=v6pJphwmQn+OFvECHjrA@mail.gmail.com> <CAJD7tkYdq533Z7nubjFT5jQYuS4oq2u15RAz2oGHGxYSk5Oicg@mail.gmail.com>
In-Reply-To: <CAJD7tkYdq533Z7nubjFT5jQYuS4oq2u15RAz2oGHGxYSk5Oicg@mail.gmail.com>
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 11 Jun 2024 05:00:03 +0800
Message-ID: <CAGsJ_4zNxC5u088RRnKeM18skEJvwTd22mB_FWSA67K3S-CKPw@mail.gmail.com>
Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, Johannes Weiner <hannes@cmpxchg.org>, 
	Nhat Pham <nphamcs@gmail.com>, Chengming Zhou <chengming.zhou@linux.dev>, 
	Baolin Wang <baolin.wang@linux.alibaba.com>, Chris Li <chrisl@kernel.org>, 
	Ryan Roberts <ryan.roberts@arm.com>, David Hildenbrand <david@redhat.com>, 
	Matthew Wilcox <willy@infradead.org>, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Rspamd-Queue-Id: A1FAB1A000C
X-Stat-Signature: 6ozhunpiaty8zk15jsfpkaepw4dywzjr
X-Rspam-User: 
X-Rspamd-Server: rspam11
X-HE-Tag: 1718053216-800755
X-HE-Meta: U2FsdGVkX1/Np8EpubBKGuQ7sCUKPm+q+Kcdg1DTbGkm9WfX+X4nQaTcoY0/A3ZMN0xhf3rol931F3n7ZDTr61qjboB2yHwLH64nPplDZ6lHXV7VH+cYwLaNx1E4vl0TsljtOCZSwY6YXH0YcNX4t9fhoTQ/BSQGK5fjY+9OnB85RSkJC627JW8LbSPCMJfhjqwzC4M2wyMpyjvOAAseQiw6ChJd6d40YaF+6+jPbakwiw2JZM21yZuNLYDlJA8zgI36QV3qY5PNjo6fpOA38lcbHMyANAtI2fpHHWhoG25hm1tes3QlAMEWLYpi/+GEFEEPJq5G7dYr/4tp8/rwGr2FVPRcXHMvuu7EjwflZ/5jIpLofxve/EX3eRKAYI9qKLpSJr0YvAMHrH19ofpqS2cTQkBdC76zMDyaP7C77j7he48rbYveSkRpm4fBRtkK9UdrfKLCEDnEg7eJP/eXZfxPTIbpXSnyZQW2bbBxUuV+BLTi98rVx6Eq0nqlDypcE4JqtiNAzTXsPXcAkGr3b1U1k8ziSD1RQkgr/pqOlhRz/vopRDCnyCqtcnq8aruQeIIffY198V7/0HWLKbfTyT8i0OMmTAO7wzzJYy/w3GTjBvtuBhCZ6ZDkgacHC2/nTx/Dc3WxaRocskYGh6HmqH1HuJ7xbp383DvppOMMxTpgtBtrWMrbq2kC0BkUgaZi5H0Oyq5jJoVfbB4DgZDHggpNgXO3ZFqKIca7d+SmtA4PNVgdoBo+Dfk88JD/WYHlbPt0CsxhPdrLThHE1jUq8Z+5sAc1ixOUhW3uXUrL6xvdCz2DwYXrNk7NIxYdkj9fwI36CQiqBd3SDklhMbavyowcRtYonRmC/vD95MxZ8OJCZdilMELhtvzNwwwpqKTvpoKnPnQVH07Hd4cZFi9j1ZO0oq2Am1hUWy02sR2tr0uI10xbbnr7YhTPU+5RxjzGiju8QPB4PG8BAftHdPj
 UcKfX0jn
 HYpvTEhO628Ktj5eCZtsXVYBKyj/PYNZzbNi3n2MAWmuokfreRO0c3gQq+Swr0oFQCH8JCHKnvI7uuuhz90Es6T4uFfcMhAM5ZXHr9XBYiE/MmHe4TdE+VS5DYmSpr/NLjekxd45+7sTFQYdG7oCI2hrSpvalIkucInipRGPXoQ7craWD9tSJUNwzoaAjUcfpWEkq6VOmrgfA4DkNs/VpK49RvWs926YgiDqPu073yABdEr79pJs1npxVS/XtoEOT1T8xzz5sB52ZC2WJNkUAbvSXpJF5teXDCily+GOIX5yqFYlc27jEY/jX/MBj4eT2hCwIrvZcQyRvlyA=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Tue, Jun 11, 2024 at 4:12=E2=80=AFAM Yosry Ahmed <yosryahmed@google.com>=
 wrote:
>
> On Mon, Jun 10, 2024 at 1:06=E2=80=AFPM Barry Song <21cnbao@gmail.com> wr=
ote:
> >
> > On Tue, Jun 11, 2024 at 1:42=E2=80=AFAM Yosry Ahmed <yosryahmed@google.=
com> wrote:
> > >
> > > On Fri, Jun 7, 2024 at 9:13=E2=80=AFPM Barry Song <21cnbao@gmail.com>=
 wrote:
> > > >
> > > > On Sat, Jun 8, 2024 at 10:37=E2=80=AFAM Yosry Ahmed <yosryahmed@goo=
gle.com> wrote:
> > > > >
> > > > > Zswap does not support storing or loading large folios. Until pro=
per
> > > > > support is added, attempts to load large folios from zswap are a =
bug.
> > > > >
> > > > > For example, if a swapin fault observes that contiguous PTEs are
> > > > > pointing to contiguous swap entries and tries to swap them in as =
a large
> > > > > folio, swap_read_folio() will pass in a large folio to zswap_load=
(), but
> > > > > zswap_load() will only effectively load the first page in the fol=
io. If
> > > > > the first page is not in zswap, the folio will be read from disk,=
 even
> > > > > though other pages may be in zswap.
> > > > >
> > > > > In both cases, this will lead to silent data corruption. Proper s=
upport
> > > > > needs to be added before large folio swapins and zswap can work
> > > > > together.
> > > > >
> > > > > Looking at callers of swap_read_folio(), it seems like they are e=
ither
> > > > > allocated from __read_swap_cache_async() or do_swap_page() in the
> > > > > SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, s=
o
> > > > > everything is fine for now.
> > > > >
> > > > > However, there is ongoing work to add to support large folio swap=
ins
> > > > > [1]. To make sure new development does not break zswap (or get br=
oken by
> > > > > zswap), add minimal handling of incorrect loads of large folios t=
o
> > > > > zswap.
> > > > >
> > > > > First, move the call folio_mark_uptodate() inside zswap_load().
> > > > >
> > > > > If a large folio load is attempted, and any page in that folio is=
 in
> > > > > zswap, return 'true' without calling folio_mark_uptodate(). This =
will
> > > > > prevent the folio from being read from disk, and will emit an IO =
error
> > > > > because the folio is not uptodate (e.g. do_swap_fault() will retu=
rn
> > > > > VM_FAULT_SIGBUS). It may not be reliable recovery in all cases, b=
ut it
> > > > > is better than nothing.
> > > > >
> > > > > This was tested by hacking the allocation in __read_swap_cache_as=
ync()
> > > > > to use order 2 and __GFP_COMP.
> > > > >
> > > > > In the future, to handle this correctly, the swapin code should:
> > > > > (a) Fallback to order-0 swapins if zswap was ever used on the mac=
hine,
> > > > > because compressed pages remain in zswap after it is disabled.
> > > > > (b) Add proper support to swapin large folios from zswap (fully o=
r
> > > > > partially).
> > > > >
> > > > > Probably start with (a) then followup with (b).
> > > > >
> > > > > [1]https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnb=
ao@gmail.com/
> > > > >
> > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > > > ---
> > > > >
> > > > > v1: https://lore.kernel.org/lkml/20240606184818.1566920-1-yosryah=
med@google.com/
> > > > >
> > > > > v1 -> v2:
> > > > > - Instead of using VM_BUG_ON() use WARN_ON_ONCE() and add some re=
covery
> > > > >   handling (David Hildenbrand).
> > > > >
> > > > > ---
> > > > >  mm/page_io.c |  1 -
> > > > >  mm/zswap.c   | 22 +++++++++++++++++++++-
> > > > >  2 files changed, 21 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/mm/page_io.c b/mm/page_io.c
> > > > > index f1a9cfab6e748..8f441dd8e109f 100644
> > > > > --- a/mm/page_io.c
> > > > > +++ b/mm/page_io.c
> > > > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, str=
uct swap_iocb **plug)
> > > > >         delayacct_swapin_start();
> > > > >
> > > > >         if (zswap_load(folio)) {
> > > > > -               folio_mark_uptodate(folio);
> > > > >                 folio_unlock(folio);
> > > > >         } else if (data_race(sis->flags & SWP_FS_OPS)) {
> > > > >                 swap_read_folio_fs(folio, plug);
> > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > index b9b35ef86d9be..ebb878d3e7865 100644
> > > > > --- a/mm/zswap.c
> > > > > +++ b/mm/zswap.c
> > > > > @@ -1557,6 +1557,26 @@ bool zswap_load(struct folio *folio)
> > > > >
> > > > >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > > >
> > > > > +       /*
> > > > > +        * Large folios should not be swapped in while zswap is b=
eing used, as
> > > > > +        * they are not properly handled. Zswap does not properly=
 load large
> > > > > +        * folios, and a large folio may only be partially in zsw=
ap.
> > > > > +        *
> > > > > +        * If any of the subpages are in zswap, reading from disk=
 would result
> > > > > +        * in data corruption, so return true without marking the=
 folio uptodate
> > > > > +        * so that an IO error is emitted (e.g. do_swap_page() wi=
ll sigfault).
> > > > > +        *
> > > > > +        * Otherwise, return false and read the folio from disk.
> > > > > +        */
> > > > > +       if (folio_test_large(folio)) {
> > > > > +               if (xa_find(tree, &offset,
> > > > > +                           offset + folio_nr_pages(folio) - 1, X=
A_PRESENT)) {
> > > > > +                       WARN_ON_ONCE(1);
> > > > > +                       return true;
> > > > > +               }
> > > > > +               return false;
> > > >
> > > > IMHO, this appears to be over-designed. Personally, I would opt to
> > > > use
> > > >
> > > >  if (folio_test_large(folio))
> > > >                return true;
> > >
> > > I am sure you mean "return false" here. Always returning true means w=
e
> > > will never read a large folio from either zswap or disk, whether it's
> > > in zswap or not. Basically guaranteeing corrupting data for large
> > > folio swapin, even if zswap is disabled :)
> > >
> > > >
> > > > Before we address large folio support in zswap, it=E2=80=99s essent=
ial
> > > > not to let them coexist. Expecting valid data by lunchtime is
> > > > not advisable.
> > >
> > > The goal here is to enable development for large folio swapin without
> > > breaking zswap or being blocked on adding support in zswap. If we
> > > always return false for large folios, as you suggest, then even if th=
e
> > > folio is in zswap (or parts of it), we will go read it from disk. Thi=
s
> > > will result in silent data corruption.
> > >
> > > As you mentioned before, you spent a week debugging problems with you=
r
> > > large folio swapin series because of a zswap problem, and even after
> > > then, the zswap_is_enabled() check you had is not enough to prevent
> > > problems as I mentioned before (if zswap was enabled before). So we
> > > need stronger checks to make sure we don't break things when we
> > > support large folio swapin.
> > >
> > > Since we can't just check if zswap is enabled or not, we need to
> > > rather check if the folio (or any part of it) is in zswap or not. We
> > > can only WARN in that case, but delivering the error to userspace is =
a
> > > couple of extra lines of code (not set uptodate), and will make the
> > > problem much easier to notice.
> > >
> > > I am not sure I understand what you mean. The alternative is to
> > > introduce a config option (perhaps internal) for large folio swapin,
> > > and make this depend on !CONFIG_ZSWAP, or make zswap refuse to get
> > > enabled if large folio swapin is enabled (through config or boot
> > > option). This is until proper handling is added, of course.
> >
> > Hi Yosry,
> > My point is that anybody attempts to do large folios swap-in should
> > either
> > 1. always use small folios if zswap has been once enabled before or now
> > or
> > 2. address the large folios swapin issues in zswap
> >
> > there is no 3rd way which you are providing.
> >
> > it is over-designed to give users true or false based on if data is zsw=
ap
> > as there is always a chance data could be in zswap. so before approach
> > 2 is done, we should always WARN_ON large folios and report data
> > corruption.
>
> We can't always WARN_ON for large folios, as this will fire even if
> zswap was never enabled. The alternative is tracking whether zswap was
> ever enabled, and checking that instead of checking if any part of the
> folio is in zswap.
>
> Basically replacing xa_find(..) with zswap_was_enabled(..) or something.

My point is that mm core should always fallback

if (zswap_was_or_is_enabled())
     goto fallback;

till zswap fixes the issue. This is the only way to enable large folios swa=
p-in
development before we fix zswap.

>
> What I don't like about this is that we will report data corruption
> even in cases where data is not really corrupted and it exists on
> disk. For example, if zswap is globally enabled but disabled in a
> cgroup, there shouldn't be any corruption swapping in large folios.
>
> That being said, I don't feel strongly, as long as we either check
> that part of the folio is in zswap or that zswap was ever enabled (or
> maybe check if a page was ever stored, just in case zswap was enabled
> and immediately disabled).
>
> Johannes, Nhat, any opinions on which way to handle this?

Thanks
Barry