From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B2948C27C52 for ; Thu, 6 Jun 2024 21:54:07 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3C1836B00B1; Thu, 6 Jun 2024 17:54:07 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 3712F6B00B2; Thu, 6 Jun 2024 17:54:07 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 211D96B00B8; Thu, 6 Jun 2024 17:54:07 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id EF0D16B00B1 for ; Thu, 6 Jun 2024 17:54:06 -0400 (EDT) Received: from smtpin18.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id A2861A3579 for ; Thu, 6 Jun 2024 21:54:06 +0000 (UTC) X-FDA: 82201817292.18.DA3FB20 Received: from mail-vk1-f181.google.com (mail-vk1-f181.google.com [209.85.221.181]) by imf22.hostedemail.com (Postfix) with ESMTP id D4C0CC0008 for ; Thu, 6 Jun 2024 21:54:04 +0000 (UTC) Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=FWjU+vLJ; spf=pass (imf22.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.221.181 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1717710844; a=rsa-sha256; cv=none; b=OKNtAz8f1sFjKYS3DalFh2TfAuT2f6/H4Bpto8TG30K/0QBf2N481H4qVzR1Rib3Cl+k3w 3Eo/immmdPAqUziU3rmYZjFYS7eP8tXxvgQJnnGR+XWSRfrvEnIL28tMj4rfgC83TINpon CvPcVcMYKkR+tBA8/BcagJ8HIpJRUUI= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=FWjU+vLJ; spf=pass (imf22.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.221.181 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1717710844; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=II39ZY5jweWWSGgsfFhanVbwdOXHJBCd7qN4MOjB7/8=; b=KuBUC/pxNyM8CMui0R4ndH5UvluI63pr81MBOTheRQtNdBbPcQwuzopBZrCsE5nXoCp00G rssLxOc1/dey+FZIuUUFlxzRrT1QFCf6znZZVJJSTnDZnTQN0BHyBAeBG/N16DKkwCdx67 t0S7VUO3vXCw0NRxAXCJYz4bSgkcQUE= Received: by mail-vk1-f181.google.com with SMTP id 71dfb90a1353d-4eb1a5c26a5so452463e0c.1 for ; Thu, 06 Jun 2024 14:54:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1717710844; x=1718315644; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=II39ZY5jweWWSGgsfFhanVbwdOXHJBCd7qN4MOjB7/8=; b=FWjU+vLJz0Gw0zrbzSx86TlI8fjr/Tbq9YonUlUPqI/vQN6oUFABBVDBFp9t5H/CkR yMAD2zxpaxlXSm3uXhINqI+1sJbLhVdUMlZx1w7xgyvfi3sniq1GlbEmDHMRgYTY3yrI Y0aFwmtXV1H8Bv3b+TvRdQJ5S+1cFn+XalUWE0aQFJSZjMaDnco8jiJjMNLDirqVhyCV Zdao/gNl9/o4/blHTR4FCH7JJbOOHMDGYHgwaPmOjHvrkAoYiaRcz4GoY8VIJbIfrWfh QrEqXymoBeXyOfg5Qgr8u8GKVI2WPoy/oyqcjoItGIyT99uOKcnUdKxn1FFWELtDNH3R 83Yg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1717710844; x=1718315644; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=II39ZY5jweWWSGgsfFhanVbwdOXHJBCd7qN4MOjB7/8=; b=OaMD5UDKQqwKI8daKki+c1HWAUSabNYYMOmhgHTzg0wrVrtshLyT2oG8F9eveS0A65 GyOCbPa7nqeobNo+Go++wwuEMAfOf5l7AezXXzyw65gaPnW5S40dIZsojvSK/aObe/VR 9Gsld+ld4gXfkV/ugSH94nttbgRSFtakJ7SeCHr24VojXu5JL+C0gdT0g8P4+VzYHMaL GeqLY40yePNXe/GMvo9d/A0nLntmJqMgoGUQ9YmfQTCgcP4B/lTLAQsu6zs6x8gWEWcV mQVSDdlBJmT0IhlJqjgqprMzC979i7ozXqeLu4Kt7R2vft0xFtM845hm6DS7pcOSHLhE 8WLw== X-Forwarded-Encrypted: i=1; AJvYcCUp1PnkiBeif4QN41C2RDZ0Dt+YxchIwLIIgjeN1Dy3Ub4c0sFwiSOTY0ufMHLqhD344QTWS58FK08Pps+EoUlqyxU= X-Gm-Message-State: AOJu0Yz2fLHbMrqManyBTC6VR972e4Gjm3RUrodtG/rQf26DDCnzR4Ue G41Y5Zioo2suERmqieqluuApki9k/1kTXKtZmutK08Fk8lywCQNsemftv9E2aRcW9zyeUJC5nSG VcmfJO24hbs38n1OtvGyZRm7tp38= X-Google-Smtp-Source: AGHT+IGe+gE0qF0VtKx5PyAKfT45DB4DKutiKwM8AAstt5UAWcjYj08CKwbbaiDu9kSKQ6vp2N9e9d2gM3LVyfJgiHo= X-Received: by 2002:ac5:cb05:0:b0:4d8:7a5e:392f with SMTP id 71dfb90a1353d-4eb562b9f79mr914221e0c.12.1717710843623; Thu, 06 Jun 2024 14:54:03 -0700 (PDT) MIME-Version: 1.0 References: <20240606184818.1566920-1-yosryahmed@google.com> <84d78362-e75c-40c8-b6c2-56d5d5292aa7@redhat.com> <7507d075-9f4d-4a9b-836c-1fbb2fbd2257@redhat.com> In-Reply-To: From: Barry Song <21cnbao@gmail.com> Date: Fri, 7 Jun 2024 09:53:52 +1200 Message-ID: Subject: Re: [PATCH] mm: zswap: add VM_BUG_ON() if large folio swapin is attempted To: Yosry Ahmed Cc: David Hildenbrand , Andrew Morton , Johannes Weiner , Nhat Pham , Chengming Zhou , Baolin Wang , Chris Li , Ryan Roberts , Matthew Wilcox , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: D4C0CC0008 X-Stat-Signature: a1e6e19ghot6z87yc3zohhc3s77mtsuy X-HE-Tag: 1717710844-827928 X-HE-Meta: U2FsdGVkX1+lL8nHMCyKRRx04rdg8Hxl8uIsO+ecDYQYBgOHkfbQ3TGSYVgaBKdQ5GV47blYO60Dps1eFLELNg7VCLe2PkTLvbMNBBkKFtUEA3RdLKZfcwoczfjeQlyq5b2U30dZujTGNo1G56e36JkOr413la1L4G0yTFvJShmMzrEoeJQgS50PrtUMDvxLRaAS/u+e0zUdROVIrqg17oi2mnlPuqNJUE4nLyNWRrhLuuDCjkuxrrpRI5m2KfSmuY2tjA5pwzhdDZPJp9AhHePhc/yVQbdzkybBs/+yhdl9yw06xKhxnSUX9z1rDwaWJUKFJITNIEpMfb8RATZXBILQDgWERhFfsOlb4VYllw/nquW+n2LYPrgEGP5u35KTGXsemb731+VWD2HM06v4Ox7PriNDuqf/uUPnR/0QUYKKvYzj1jzFu2LWDrUUU4JDfmDy/fe4JvrymUNpjphTit/rTPEkKR5nU512jEQoqTjfAp8N/IPGzlHq2H5/kdPQjAYSpVyh9rcFInxgetxxyykCOoT69fkz32+w6hThwWglhm5yXLbd+dx6dBfCfgwLT24zC5ISGJz/90mo539yqm7mBvPbePp4WHkz8slgAm1+jpzJo07uwaOP9cdjFJNVN3nXFfllmuM5fipnXxFmcGmwEY2uRpRccU7iDZWcXCATsGU+RD1m/9vzpohVFPWIvrvTQTAIhl0c6eQRqruVwC8u6qtFKaa9JwuumH0nGcrraU5msacQ09DKFfn8AyZjwrYKBX9L2aBflcl/E6stydJM0l53fEIB3JtmSVWStQNjJpNU3/Ad2PGrHOUD2Q6zJU+nBtvC2Pj8jYPwaEBCdTLQR5Sf/Nyqw1S3KLWUHpJ1Z1Fyh2bz3jRMjmaPHyKzEEnsHbHR5JYnEaFRqo3/mPdsbZzikCUtob2KPmh5LewVzLd8jupgEtQxEPwaBXKHYvqoJNXYpf/y/58Wvog 615WqWDj lEIk2io64w9giv0VelotF74/iMxcu9L8ktL/7ZkEORX4O7jb83x3mMQDSMQ9n/Ky4tu9YGN8GLQZRzleKxyyBTZbLxuSpZu2y4syjLc0wf0wXNmcmSQjF1tnJkc0Xmfl+eG5UmzPQyNJHkqdjAZgks5aYTdCAHIe4EuGpkwWTGFq/8/n1vQEljTFQL8WQxFiKHP+FEQP2STnZMT0h27MmqrpfwY7WAAbtCxfLXTEUVpk7vt2AYRatQE+VEIN4fGK0Aml1TPCIieXiDLg4E1QtQXPVme9O/ZYE6xnqg1E5gAOuoqFC5sctYJme6/Whj/qIaKXu1DEecgH1WpoD/xZfeNHiQSLDCDc73lCQXmuFTydeEAMe/wFfg4DVyTNadEXKrdgvGoV3Xx5pZlSPNLQg+yPUk1TEWxdBxGHtcBCqYOp/DZQb2k1fcqkBrPY+j0AyipEepBbqHMl/oPWbNO+AulBim5ZxBGbSLB31 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Jun 7, 2024 at 9:37=E2=80=AFAM Yosry Ahmed = wrote: > > On Thu, Jun 6, 2024 at 2:30=E2=80=AFPM Barry Song <21cnbao@gmail.com> wro= te: > > > > On Fri, Jun 7, 2024 at 9:17=E2=80=AFAM David Hildenbrand wrote: > > > > > > On 06.06.24 22:31, Yosry Ahmed wrote: > > > > On Thu, Jun 6, 2024 at 1:22=E2=80=AFPM David Hildenbrand wrote: > > > >> > > > >> On 06.06.24 20:48, Yosry Ahmed wrote: > > > >>> With ongoing work to support large folio swapin, it is important = to make > > > >>> sure we do not pass large folios to zswap_load() without implemen= ting > > > >>> proper support. > > > >>> > > > >>> For example, if a swapin fault observes that contiguous PTEs are > > > >>> pointing to contiguous swap entries and tries to swap them in as = a large > > > >>> folio, swap_read_folio() will pass in a large folio to zswap_load= (), but > > > >>> zswap_load() will only effectively load the first page in the fol= io. If > > > >>> the first page is not in zswap, the folio will be read from disk,= even > > > >>> though other pages may be in zswap. > > > >>> > > > >>> In both cases, this will lead to silent data corruption. > > > >>> > > > >>> Proper large folio swapin support needs to go into zswap before z= swap > > > >>> can be enabled in a system that supports large folio swapin. > > > >>> > > > >>> Looking at callers of swap_read_folio(), it seems like they are e= ither > > > >>> allocated from __read_swap_cache_async() or do_swap_page() in the > > > >>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, s= o we > > > >>> are fine for now. > > > >>> > > > >>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect cha= nges in > > > >>> the order of those allocations without proper handling of zswap. > > > >>> > > > >>> Alternatively, swap_read_folio() (or its callers) can be updated = to have > > > >>> a fallback mechanism that splits large folios or reads subpages > > > >>> separately. Similar logic may be needed anyway in case part of a = large > > > >>> folio is already in the swapcache and the rest of it is swapped o= ut. > > > >>> > > > >>> Signed-off-by: Yosry Ahmed > > > >>> --- > > > >>> > > > >>> Sorry for the long CC list, I just found myself repeatedly lookin= g at > > > >>> new series that add swap support for mTHPs / large folios, making= sure > > > >>> they do not break with zswap or make incorrect assumptions. This = debug > > > >>> check should give us some peace of mind. Hopefully this patch wil= l also > > > >>> raise awareness among people who are working on this. > > > >>> > > > >>> --- > > > >>> mm/zswap.c | 3 +++ > > > >>> 1 file changed, 3 insertions(+) > > > >>> > > > >>> diff --git a/mm/zswap.c b/mm/zswap.c > > > >>> index b9b35ef86d9be..6007252429bb2 100644 > > > >>> --- a/mm/zswap.c > > > >>> +++ b/mm/zswap.c > > > >>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio) > > > >>> if (!entry) > > > >>> return false; > > > >>> > > > >>> + /* Zswap loads do not handle large folio swapins correctly = yet */ > > > >>> + VM_BUG_ON(folio_test_large(folio)); > > > >>> + > > > >> > > > >> There is no way we could have a WARN_ON_ONCE() and recover, right? > > > > > > > > Not without making more fundamental changes to the surrounding swap > > > > code. Currently zswap_load() returns either true (folio was loaded > > > > from zswap) or false (folio is not in zswap). > > > > > > > > To handle this correctly zswap_load() would need to tell > > > > swap_read_folio() which subpages are in zswap and have been loaded, > > > > and then swap_read_folio() would need to read the remaining subpage= s > > > > from disk. This of course assumes that the caller of swap_read_foli= o() > > > > made sure that the entire folio is swapped out and protected agains= t > > > > races with other swapins. > > > > > > > > Also, because swap_read_folio() cannot split the folio itself, othe= r > > > > swap_read_folio_*() functions that are called from it should be > > > > updated to handle swapping in tail subpages, which may be questiona= ble > > > > in its own right. > > > > > > > > An alternative would be that zswap_load() (or a separate interface) > > > > could tell swap_read_folio() that the folio is partially in zswap, > > > > then we can just bail and tell the caller that it cannot read the > > > > large folio and that it should be split. > > > > > > > > There may be other options as well, but the bottom line is that it = is > > > > possible, but probably not something that we want to do right now. > > > > > > > > A stronger protection method would be to introduce a config option = or > > > > boot parameter for large folio swapin, and then make CONFIG_ZSWAP > > > > depend on it being disabled, or have zswap check it at boot and ref= use > > > > to be enabled if it is on. > > > > > > Right, sounds like the VM_BUG_ON() really is not that easily avoidabl= e. > > > > > > I was wondering, if we could WARN_ON_ONCE and make the swap code dete= ct > > > this like a read-error from disk. > > > > > > I think do_swap_page() detects that by checking if the folio is not > > > uptodate: > > > > > > if (unlikely(!folio_test_uptodate(folio))) { > > > ret =3D VM_FAULT_SIGBUS; > > > goto out_nomap; > > > } > > > > > > So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the > > > system (but the app would crash either way, there is no way around it= ). > > > > > > > I'd rather fallback to small folios swapin instead crashing apps till w= e fix > > the large folio swapin in zswap :-) > > I think David is referring to catching the buggy cases that do not > properly fallback to small folios with zswap, not as an alternative to > the fallback. This is at least what I had in mind with the patch. Cool. Thanks for the clarification. I'm fine with keeping the fallback, whether it's the current VM_BUG_ON or David's recommended SIGBUS. Currently, mainline doesn't support large folios swap-in, so I see your patch as a helpful warning for those attempting large folio swap-in to avoid making mistakes like loading large folios from zswap. In fact, I spent a week trying to figure out why my app was crashing before I added 'if (zswap_is_enabled()) goto fallback'. If your patch had come earlier, it would have saved me that week of work :-) To me, a direct crash seems like a more straightforward way to prompt people to fallback when attempting large folio swap-ins. So, I am slightly in favor of your current patch. Thanks Barry