From: Barry Song <21cnbao@gmail.com>
Date: Tue, 11 Jun 2024 11:34:41 +1200
Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chengming Zhou, Baolin Wang,
 Chris Li, Ryan Roberts, David Hildenbrand, Matthew Wilcox,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240608023654.3513385-1-yosryahmed@google.com>

On Tue, Jun 11, 2024 at 9:12 AM Yosry Ahmed wrote:
>
> On Mon, Jun 10, 2024 at 2:00 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Jun 11, 2024 at 4:12 AM Yosry Ahmed wrote:
> > >
> > > On Mon, Jun 10, 2024 at 1:06 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Tue, Jun 11, 2024 at 1:42 AM Yosry Ahmed wrote:
> > > > >
> > > > > On Fri, Jun 7, 2024 at 9:13 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Sat, Jun 8, 2024 at 10:37 AM Yosry Ahmed wrote:
> > > > > > >
> > > > > > > Zswap does not support storing or loading large folios. Until proper
> > > > > > > support is added, attempts to load large folios from zswap are a bug.
> > > > > > >
> > > > > > > For example, if a swapin fault observes that contiguous PTEs are
> > > > > > > pointing to contiguous swap entries and tries to swap them in as a large
> > > > > > > folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > > > > > > zswap_load() will only effectively load the first page in the folio. If
> > > > > > > the first page is not in zswap, the folio will be read from disk, even
> > > > > > > though other pages may be in zswap.
> > > > > > >
> > > > > > > In both cases, this will lead to silent data corruption. Proper support
> > > > > > > needs to be added before large folio swapins and zswap can work
> > > > > > > together.
> > > > > > >
> > > > > > > Looking at callers of swap_read_folio(), it seems like they are either
> > > > > > > allocated from __read_swap_cache_async() or do_swap_page() in the
> > > > > > > SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so
> > > > > > > everything is fine for now.
> > > > > > >
> > > > > > > However, there is ongoing work to add to support large folio swapins
> > > > > > > [1]. To make sure new development does not break zswap (or get broken by
> > > > > > > zswap), add minimal handling of incorrect loads of large folios to
> > > > > > > zswap.
> > > > > > >
> > > > > > > First, move the call folio_mark_uptodate() inside zswap_load().
> > > > > > >
> > > > > > > If a large folio load is attempted, and any page in that folio is in
> > > > > > > zswap, return 'true' without calling folio_mark_uptodate(). This will
> > > > > > > prevent the folio from being read from disk, and will emit an IO error
> > > > > > > because the folio is not uptodate (e.g. do_swap_fault() will return
> > > > > > > VM_FAULT_SIGBUS). It may not be reliable recovery in all cases, but it
> > > > > > > is better than nothing.
> > > > > > >
> > > > > > > This was tested by hacking the allocation in __read_swap_cache_async()
> > > > > > > to use order 2 and __GFP_COMP.
> > > > > > >
> > > > > > > In the future, to handle this correctly, the swapin code should:
> > > > > > > (a) Fallback to order-0 swapins if zswap was ever used on the machine,
> > > > > > > because compressed pages remain in zswap after it is disabled.
> > > > > > > (b) Add proper support to swapin large folios from zswap (fully or
> > > > > > > partially).
> > > > > > >
> > > > > > > Probably start with (a) then followup with (b).
> > > > > > >
> > > > > > > [1]https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/
> > > > > > >
> > > > > > > Signed-off-by: Yosry Ahmed
> > > > > > > ---
> > > > > > >
> > > > > > > v1: https://lore.kernel.org/lkml/20240606184818.1566920-1-yosryahmed@google.com/
> > > > > > >
> > > > > > > v1 -> v2:
> > > > > > > - Instead of using VM_BUG_ON() use WARN_ON_ONCE() and add some recovery
> > > > > > >   handling (David Hildenbrand).
> > > > > > >
> > > > > > > ---
> > > > > > >  mm/page_io.c |  1 -
> > > > > > >  mm/zswap.c   | 22 +++++++++++++++++++++-
> > > > > > >  2 files changed, 21 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/mm/page_io.c b/mm/page_io.c
> > > > > > > index f1a9cfab6e748..8f441dd8e109f 100644
> > > > > > > --- a/mm/page_io.c
> > > > > > > +++ b/mm/page_io.c
> > > > > > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
> > > > > > >         delayacct_swapin_start();
> > > > > > >
> > > > > > >         if (zswap_load(folio)) {
> > > > > > > -               folio_mark_uptodate(folio);
> > > > > > >                 folio_unlock(folio);
> > > > > > >         } else if (data_race(sis->flags & SWP_FS_OPS)) {
> > > > > > >                 swap_read_folio_fs(folio, plug);
> > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > > > index b9b35ef86d9be..ebb878d3e7865 100644
> > > > > > > --- a/mm/zswap.c
> > > > > > > +++ b/mm/zswap.c
> > > > > > > @@ -1557,6 +1557,26 @@ bool zswap_load(struct folio *folio)
> > > > > > >
> > > > > > >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > > > > >
> > > > > > > +       /*
> > > > > > > +        * Large folios should not be swapped in while zswap is being used, as
> > > > > > > +        * they are not properly handled. Zswap does not properly load large
> > > > > > > +        * folios, and a large folio may only be partially in zswap.
> > > > > > > +        *
> > > > > > > +        * If any of the subpages are in zswap, reading from disk would result
> > > > > > > +        * in data corruption, so return true without marking the folio uptodate
> > > > > > > +        * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > > > > > > +        *
> > > > > > > +        * Otherwise, return false and read the folio from disk.
> > > > > > > +        */
> > > > > > > +       if (folio_test_large(folio)) {
> > > > > > > +               if (xa_find(tree, &offset,
> > > > > > > +                           offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
> > > > > > > +                       WARN_ON_ONCE(1);
> > > > > > > +                       return true;
> > > > > > > +               }
> > > > > > > +               return false;
> > > > > >
> > > > > > IMHO, this appears to be over-designed. Personally, I would opt to
> > > > > > use
> > > > > >
> > > > > >  if (folio_test_large(folio))
> > > > > >                return true;
> > > > >
> > > > > I am sure you mean "return false" here. Always returning true means we
> > > > > will never read a large folio from either zswap or disk, whether it's
> > > > > in zswap or not. Basically guaranteeing corrupting data for large
> > > > > folio swapin, even if zswap is disabled :)
> > > > >
> > > > > >
> > > > > > Before we address large folio support in zswap, it’s essential
> > > > > > not to let them coexist. Expecting valid data by lunchtime is
> > > > > > not advisable.
> > > > >
> > > > > The goal here is to enable development for large folio swapin without
> > > > > breaking zswap or being blocked on adding support in zswap. If we
> > > > > always return false for large folios, as you suggest, then even if the
> > > > > folio is in zswap (or parts of it), we will go read it from disk. This
> > > > > will result in silent data corruption.
> > > > >
> > > > > As you mentioned before, you spent a week debugging problems with your
> > > > > large folio swapin series because of a zswap problem, and even after
> > > > > then, the zswap_is_enabled() check you had is not enough to prevent
> > > > > problems as I mentioned before (if zswap was enabled before). So we
> > > > > need stronger checks to make sure we don't break things when we
> > > > > support large folio swapin.
> > > > >
> > > > > Since we can't just check if zswap is enabled or not, we need to
> > > > > rather check if the folio (or any part of it) is in zswap or not. We
> > > > > can only WARN in that case, but delivering the error to userspace is a
> > > > > couple of extra lines of code (not set uptodate), and will make the
> > > > > problem much easier to notice.
> > > > >
> > > > > I am not sure I understand what you mean. The alternative is to
> > > > > introduce a config option (perhaps internal) for large folio swapin,
> > > > > and make this depend on !CONFIG_ZSWAP, or make zswap refuse to get
> > > > > enabled if large folio swapin is enabled (through config or boot
> > > > > option). This is until proper handling is added, of course.
> > > >
> > > > Hi Yosry,
> > > > My point is that anybody attempts to do large folios swap-in should
> > > > either
> > > > 1. always use small folios if zswap has been once enabled before or now
> > > > or
> > > > 2. address the large folios swapin issues in zswap
> > > >
> > > > there is no 3rd way which you are providing.
> > > >
> > > > it is over-designed to give users true or false based on if data is zswap
> > > > as there is always a chance data could be in zswap. so before approach
> > > > 2 is done, we should always WARN_ON large folios and report data
> > > > corruption.
> > >
> > > We can't always WARN_ON for large folios, as this will fire even if
> > > zswap was never enabled. The alternative is tracking whether zswap was
> > > ever enabled, and checking that instead of checking if any part of the
> > > folio is in zswap.
> > >
> > > Basically replacing xa_find(..) with zswap_was_enabled(..) or something.
> >
> > My point is that mm core should always fallback
> >
> > if (zswap_was_or_is_enabled())
> >      goto fallback;
> >
> > till zswap fixes the issue. This is the only way to enable large folios swap-in
> > development before we fix zswap.
>
> I agree with this, I just want an extra fallback in zswap itself in
> case something was missed during large folio swapin development (which
> can evidently happen).

Yes. Then I feel we only need to WARN_ON the case where mm-core fails to
fall back. I mean, only WARN_ON is_zswap_ever_enabled() && large folio;
there is no need to do more.

Before zswap brings up large folio support, mm-core will need
is_zswap_ever_enabled() to do the fallback.

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 2a85b941db97..035e51ed89c4 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -36,6 +36,7 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
 void zswap_lruvec_state_init(struct lruvec *lruvec);
 void zswap_folio_swapin(struct folio *folio);
 bool is_zswap_enabled(void);
+bool is_zswap_ever_enabled(void);
 #else

 struct zswap_lruvec_state {};
@@ -65,6 +66,10 @@ static inline bool is_zswap_enabled(void)
 	return false;
 }

+static inline bool is_zswap_ever_enabled(void)
+{
+	return false;
+}
 #endif

 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/zswap.c b/mm/zswap.c
index b9b35ef86d9b..bf2da5d37e47 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -86,6 +86,9 @@ static int zswap_setup(void);
 static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON);
 static int zswap_enabled_param_set(const char *,
 				   const struct kernel_param *);
+
+static bool zswap_ever_enabled;
+
 static const struct kernel_param_ops zswap_enabled_param_ops = {
 	.set =		zswap_enabled_param_set,
 	.get =		param_get_bool,
@@ -136,6 +139,11 @@ bool is_zswap_enabled(void)
 	return zswap_enabled;
 }

+bool is_zswap_ever_enabled(void)
+{
+	return zswap_enabled || zswap_ever_enabled;
+}
+
 /*********************************
 * data structures
 **********************************/
@@ -1734,6 +1742,7 @@ static int zswap_setup(void)
 		pr_info("loaded using pool %s/%s\n", pool->tfm_name,
 			zpool_get_type(pool->zpools[0]));
 		list_add(&pool->list, &zswap_pools);
+		zswap_ever_enabled = true;
 		zswap_has_pool = true;
 	} else {
 		pr_err("pool creation failed\n");

Thanks
Barry
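
To make the two-layer policy discussed above concrete, here is a rough
userspace sketch in C. It is only a model under stated assumptions, not
kernel code: struct zswap_model, model_set_enabled(), model_ever_enabled(),
model_swapin_order() and model_load() are all hypothetical names invented
for illustration. It also assumes the "ever enabled" latch is set whenever
zswap is turned on, whereas the diff above sets it in zswap_setup() once a
pool exists.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel symbols. */
struct zswap_model {
	bool enabled;		/* current runtime state */
	bool ever_enabled;	/* sticky latch: set on enable, never cleared */
};

enum load_result {
	LOAD_FROM_DISK,		/* zswap declines; caller reads the swap device */
	LOAD_FROM_ZSWAP,	/* order-0 folio served from zswap */
	LOAD_IO_ERROR,		/* handled, but folio is not marked uptodate */
};

/* Disabling zswap leaves the latch set: old entries may still be stored. */
static void model_set_enabled(struct zswap_model *z, bool on)
{
	z->enabled = on;
	if (on)
		z->ever_enabled = true;
}

/* Mirrors the shape of is_zswap_ever_enabled() in the diff above. */
static bool model_ever_enabled(const struct zswap_model *z)
{
	return z->enabled || z->ever_enabled;
}

/* mm-core side: fall back to order-0 swapin if zswap was ever in use. */
static unsigned int model_swapin_order(const struct zswap_model *z,
				       unsigned int wanted_order)
{
	return model_ever_enabled(z) ? 0 : wanted_order;
}

/*
 * zswap side: safety net for the case where mm-core forgot to fall back.
 * A large folio arriving after zswap was ever enabled is reported as an
 * IO error (where the kernel would also warn) instead of being read
 * from disk.
 */
static enum load_result model_load(const struct zswap_model *z,
				   unsigned int order, bool entry_in_zswap)
{
	if (order > 0)
		return model_ever_enabled(z) ? LOAD_IO_ERROR : LOAD_FROM_DISK;
	return entry_in_zswap ? LOAD_FROM_ZSWAP : LOAD_FROM_DISK;
}
```

In this model a caller would ask model_swapin_order() before allocating, so
model_load() should only ever see order 0 once zswap has been used; the
LOAD_IO_ERROR branch is reached only if that fallback was missed.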