References: <20240606184818.1566920-1-yosryahmed@google.com>
 <84d78362-e75c-40c8-b6c2-56d5d5292aa7@redhat.com>
 <7507d075-9f4d-4a9b-836c-1fbb2fbd2257@redhat.com>
In-Reply-To: <7507d075-9f4d-4a9b-836c-1fbb2fbd2257@redhat.com>
From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 6 Jun 2024 14:36:13 -0700
Subject: Re: [PATCH] mm: zswap: add VM_BUG_ON() if large folio swapin is attempted
To: David Hildenbrand
Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chengming Zhou,
 Baolin Wang, Barry Song <21cnbao@gmail.com>, Chris Li, Ryan Roberts,
 Matthew Wilcox, linux-mm@kvack.org, linux-kernel@vger.kernel.org
On Thu, Jun 6, 2024 at 2:17 PM David Hildenbrand wrote:
>
> On 06.06.24 22:31, Yosry Ahmed wrote:
> > On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand wrote:
> >>
> >> On 06.06.24 20:48, Yosry Ahmed wrote:
> >>> With ongoing work to support large folio swapin, it is important to make
> >>> sure we do not pass large folios to zswap_load() without implementing
> >>> proper support.
> >>>
> >>> For example, if a swapin fault observes that contiguous PTEs are
> >>> pointing to contiguous swap entries and tries to swap them in as a large
> >>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> >>> zswap_load() will only effectively load the first page in the folio. If
> >>> the first page is not in zswap, the folio will be read from disk, even
> >>> though other pages may be in zswap.
> >>>
> >>> In both cases, this will lead to silent data corruption.
> >>>
> >>> Proper large folio swapin support needs to go into zswap before zswap
> >>> can be enabled in a system that supports large folio swapin.
> >>>
> >>> Looking at callers of swap_read_folio(), it seems like they are either
> >>> allocated from __read_swap_cache_async() or do_swap_page() in the
> >>> SWP_SYNCHRONOUS_IO path. Both of these allocate order-0 folios, so we
> >>> are fine for now.
> >>>
> >>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> >>> the order of those allocations without proper handling of zswap.
> >>>
> >>> Alternatively, swap_read_folio() (or its callers) can be updated to have
> >>> a fallback mechanism that splits large folios or reads subpages
> >>> separately. Similar logic may be needed anyway in case part of a large
> >>> folio is already in the swapcache and the rest of it is swapped out.
> >>>
> >>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >>> ---
> >>>
> >>> Sorry for the long CC list, I just found myself repeatedly looking at
> >>> new series that add swap support for mTHPs / large folios, making sure
> >>> they do not break with zswap or make incorrect assumptions. This debug
> >>> check should give us some peace of mind. Hopefully this patch will also
> >>> raise awareness among people who are working on this.
> >>>
> >>> ---
> >>>  mm/zswap.c | 3 +++
> >>>  1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>> index b9b35ef86d9be..6007252429bb2 100644
> >>> --- a/mm/zswap.c
> >>> +++ b/mm/zswap.c
> >>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> >>>          if (!entry)
> >>>                  return false;
> >>>
> >>> +        /* Zswap loads do not handle large folio swapins correctly yet */
> >>> +        VM_BUG_ON(folio_test_large(folio));
> >>> +
> >>
> >> There is no way we could have a WARN_ON_ONCE() and recover, right?
> >
> > Not without making more fundamental changes to the surrounding swap
> > code. Currently zswap_load() returns either true (folio was loaded
> > from zswap) or false (folio is not in zswap).
> >
> > To handle this correctly, zswap_load() would need to tell
> > swap_read_folio() which subpages are in zswap and have been loaded,
> > and then swap_read_folio() would need to read the remaining subpages
> > from disk. This of course assumes that the caller of swap_read_folio()
> > made sure that the entire folio is swapped out and protected against
> > races with other swapins.
> >
> > Also, because swap_read_folio() cannot split the folio itself, other
> > swap_read_folio_*() functions that are called from it would need to be
> > updated to handle swapping in tail subpages, which may be questionable
> > in its own right.
> >
> > An alternative would be that zswap_load() (or a separate interface)
> > could tell swap_read_folio() that the folio is partially in zswap,
> > then we can just bail and tell the caller that it cannot read the
> > large folio and that it should be split.
> >
> > There may be other options as well, but the bottom line is that it is
> > possible, but probably not something that we want to do right now.
> >
> > A stronger protection method would be to introduce a config option or
> > boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> > depend on it being disabled, or have zswap check it at boot and refuse
> > to be enabled if it is on.
>
> Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
>
> I was wondering if we could WARN_ON_ONCE() and make the swap code detect
> this like a read error from disk.
>
> I think do_swap_page() detects that by checking if the folio is not
> uptodate:
>
>         if (unlikely(!folio_test_uptodate(folio))) {
>                 ret = VM_FAULT_SIGBUS;
>                 goto out_nomap;
>         }
>
> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> system (but the app would crash either way, there is no way around it).

It seems like most paths will handle this correctly just if the folio is
not uptodate. Both swapin_readahead() and the direct call to
swap_read_folio() in do_swap_page() should work correctly in this case.
The shmem swapin path seems like it will return -EIO, which in the fault
path will also SIGBUS, and in the file read/write path I assume it will
be handled correctly.

However, looking at the swapoff paths, it seems like we don't really
check uptodate. For example, shmem_unuse_swap_entries() will just throw
-EIO away. Maybe it is handled at a higher level by the fact that the
number of swap entries will not drop to zero, so swapoff will not
complete? :)

Anyway, I believe it may be possible to just not set uptodate, but I am
not very sure how reliable it would be. It may be better than nothing
anyway, I guess?