From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id BC028C27C55 for ; Mon, 10 Jun 2024 21:00:19 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 486EF6B009B; Mon, 10 Jun 2024 17:00:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 436FD6B009C; Mon, 10 Jun 2024 17:00:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2D72C6B009D; Mon, 10 Jun 2024 17:00:19 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 1161D6B009B for ; Mon, 10 Jun 2024 17:00:19 -0400 (EDT) Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 8FE70141171 for ; Mon, 10 Jun 2024 21:00:18 +0000 (UTC) X-FDA: 82216196916.27.7D16F22 Received: from mail-ua1-f52.google.com (mail-ua1-f52.google.com [209.85.222.52]) by imf19.hostedemail.com (Postfix) with ESMTP id A1FAB1A000C for ; Mon, 10 Jun 2024 21:00:16 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=SRYQ2VC7; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf19.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.52 as permitted sender) smtp.mailfrom=21cnbao@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1718053216; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=jtglSl5nJSLjnhryyMKeeZ/HJbTMOfyFyynMlPClQCc=; b=o7wg6tanOueGviSKT4RKU0kzTgo0fQhNPNoMOKlRPcyDWl6KaaIwMH3Y/FfW10XcQ/xk/K JP1VlFReU5Q5FbUmdckYb6zLEoAKwxFtfwspazabuuY15L3lZbFlQwVywmdOZV+ghsdv+B Itk7AE7PqzwwjJZnlCgfz+NcjjFQUuw= ARC-Authentication-Results: i=1; imf19.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=SRYQ2VC7; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf19.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.52 as permitted sender) smtp.mailfrom=21cnbao@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1718053216; a=rsa-sha256; cv=none; b=bBUaHAWDxpCJlmXbN4eLd4zUwNZ7+DDIgU1jNJm0ywN03kviT3PnOkHWJSUfXkNCwlQpdr g5r/bqm+Mms4awbRMNwZKI9IL/rTDDhPw8dyWH649id2gNcm7lAEwwUP/50pLFBQHGgmHB AH61f6Os54CvjWZdV0TPqAiu5WeJEKk= Received: by mail-ua1-f52.google.com with SMTP id a1e0cc1a2514c-80b8e5cab7bso644068241.0 for ; Mon, 10 Jun 2024 14:00:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1718053216; x=1718658016; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=jtglSl5nJSLjnhryyMKeeZ/HJbTMOfyFyynMlPClQCc=; b=SRYQ2VC73jA3f2IDZtRFTefNe3yXZSJSlvyiowSsBMfXXzTbRv8UZrDbN6DWaxt04T iUWQeZus24tdY97s7PhkCrBTXl//WUEh3DJtbBzq28o8u6I6DL44qqHoOLul73NPcCmB ajFuVeo/9eLVYB9mHNS6Ixm3fVw4ZTPv5woNNHUGgx6xp4vOOM4OAL8a553u2puhjUWh IpvraKczqeJ7WRo16c1gUWoq+laMbjxZ4DKpp17vvxhtcVB5ce36XKM6vj991iYkjyVh 8VOjA9ag1w9rrwfXuRGZ7umKtddMffBMLYbqSHYSgc5+emMYKyGrNVTeKfHnhhTmz8Sz 1deg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1718053216; x=1718658016; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=jtglSl5nJSLjnhryyMKeeZ/HJbTMOfyFyynMlPClQCc=; b=F+SD6RtEKX9p0x8OeOCnQO9ucI3qrlMSNjQDDhUlRfBLkYN+qy+KV1PUsF47maMBdB 94WFVk8d+j8koC9l3DjMMXanSTwgMnsMLZKFCR+MsOhfNyRaynYOrkISHUIi5kQEHOhi z+RnENd4NAwjB64WcK0nF+0zYdhBwPblm42EU5j1fsMYCar82JxffP/Yq6dZZh7Hd70H +cwRRmQCjl++Sz2vd30q/+3d+tfYN/zPZiE+Df43xxWKjx26DCJFALQTydnu/TyqQ4CQ YxWNFsi85tOGFpBaUzDdd96KrUTgpLis8osyw5pqwgKxruIcMqV6xF8jrWNhveBzjFeQ mV4Q== X-Forwarded-Encrypted: i=1; AJvYcCUeNzcov/VWcelI5XYswb+BvlgMmk1VHsMzC8oYwfl4BSo/5TSIerJYrD8b0ckI40Z5RZ2XlC9NEiEFmZ/3+gqpvRo= X-Gm-Message-State: AOJu0YwjSPCpiJTBDp92l9eryVydOuDbYecVk33jqXM17EiA5nn1pNyL hqsjrKTgTyByQkg83+NjryfGyuFyecCswpf3yMEzarO19IbKSbVM0Q1xjgsw4ZfwhnbKck4LyN7 KGkZwQz2snG5EZupo3PbTAxYjTFk= X-Google-Smtp-Source: AGHT+IFH91cYPqMQB2e5MjF84hEcwy9PEyTQdAbTPdEmgy+aTCyASRVIIvuAoOCusr9Da2e6/clC8wkqB0yx3OleiMk= X-Received: by 2002:a05:6102:1277:b0:48c:329c:b3dc with SMTP id ada2fe7eead31-48c329cb660mr9229115137.11.1718053214959; Mon, 10 Jun 2024 14:00:14 -0700 (PDT) MIME-Version: 1.0 References: <20240608023654.3513385-1-yosryahmed@google.com> In-Reply-To: From: Barry Song <21cnbao@gmail.com> Date: Tue, 11 Jun 2024 05:00:03 +0800 Message-ID: Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios To: Yosry Ahmed Cc: Andrew Morton , Johannes Weiner , Nhat Pham , Chengming Zhou , Baolin Wang , Chris Li , Ryan Roberts , David Hildenbrand , Matthew Wilcox , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: A1FAB1A000C X-Stat-Signature: 6ozhunpiaty8zk15jsfpkaepw4dywzjr X-Rspam-User: X-Rspamd-Server: rspam11 X-HE-Tag: 1718053216-800755 X-HE-Meta: U2FsdGVkX1/Np8EpubBKGuQ7sCUKPm+q+Kcdg1DTbGkm9WfX+X4nQaTcoY0/A3ZMN0xhf3rol931F3n7ZDTr61qjboB2yHwLH64nPplDZ6lHXV7VH+cYwLaNx1E4vl0TsljtOCZSwY6YXH0YcNX4t9fhoTQ/BSQGK5fjY+9OnB85RSkJC627JW8LbSPCMJfhjqwzC4M2wyMpyjvOAAseQiw6ChJd6d40YaF+6+jPbakwiw2JZM21yZuNLYDlJA8zgI36QV3qY5PNjo6fpOA38lcbHMyANAtI2fpHHWhoG25hm1tes3QlAMEWLYpi/+GEFEEPJq5G7dYr/4tp8/rwGr2FVPRcXHMvuu7EjwflZ/5jIpLofxve/EX3eRKAYI9qKLpSJr0YvAMHrH19ofpqS2cTQkBdC76zMDyaP7C77j7he48rbYveSkRpm4fBRtkK9UdrfKLCEDnEg7eJP/eXZfxPTIbpXSnyZQW2bbBxUuV+BLTi98rVx6Eq0nqlDypcE4JqtiNAzTXsPXcAkGr3b1U1k8ziSD1RQkgr/pqOlhRz/vopRDCnyCqtcnq8aruQeIIffY198V7/0HWLKbfTyT8i0OMmTAO7wzzJYy/w3GTjBvtuBhCZ6ZDkgacHC2/nTx/Dc3WxaRocskYGh6HmqH1HuJ7xbp383DvppOMMxTpgtBtrWMrbq2kC0BkUgaZi5H0Oyq5jJoVfbB4DgZDHggpNgXO3ZFqKIca7d+SmtA4PNVgdoBo+Dfk88JD/WYHlbPt0CsxhPdrLThHE1jUq8Z+5sAc1ixOUhW3uXUrL6xvdCz2DwYXrNk7NIxYdkj9fwI36CQiqBd3SDklhMbavyowcRtYonRmC/vD95MxZ8OJCZdilMELhtvzNwwwpqKTvpoKnPnQVH07Hd4cZFi9j1ZO0oq2Am1hUWy02sR2tr0uI10xbbnr7YhTPU+5RxjzGiju8QPB4PG8BAftHdPj UcKfX0jn HYpvTEhO628Ktj5eCZtsXVYBKyj/PYNZzbNi3n2MAWmuokfreRO0c3gQq+Swr0oFQCH8JCHKnvI7uuuhz90Es6T4uFfcMhAM5ZXHr9XBYiE/MmHe4TdE+VS5DYmSpr/NLjekxd45+7sTFQYdG7oCI2hrSpvalIkucInipRGPXoQ7craWD9tSJUNwzoaAjUcfpWEkq6VOmrgfA4DkNs/VpK49RvWs926YgiDqPu073yABdEr79pJs1npxVS/XtoEOT1T8xzz5sB52ZC2WJNkUAbvSXpJF5teXDCily+GOIX5yqFYlc27jEY/jX/MBj4eT2hCwIrvZcQyRvlyA= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Jun 11, 2024 at 4:12=E2=80=AFAM Yosry Ahmed = wrote: > > On Mon, Jun 10, 2024 at 1:06=E2=80=AFPM Barry Song <21cnbao@gmail.com> wr= ote: > > > > On Tue, Jun 11, 2024 at 1:42=E2=80=AFAM Yosry Ahmed wrote: > > > > > > On Fri, Jun 7, 2024 at 9:13=E2=80=AFPM Barry Song <21cnbao@gmail.com>= wrote: > > > > > > > > On Sat, Jun 8, 2024 at 10:37=E2=80=AFAM Yosry Ahmed wrote: > > > > > > > > > > Zswap does not support storing or loading large folios. Until pro= per > > > > > support is added, attempts to load large folios from zswap are a = bug. > > > > > > > > > > For example, if a swapin fault observes that contiguous PTEs are > > > > > pointing to contiguous swap entries and tries to swap them in as = a large > > > > > folio, swap_read_folio() will pass in a large folio to zswap_load= (), but > > > > > zswap_load() will only effectively load the first page in the fol= io. If > > > > > the first page is not in zswap, the folio will be read from disk,= even > > > > > though other pages may be in zswap. > > > > > > > > > > In both cases, this will lead to silent data corruption. Proper s= upport > > > > > needs to be added before large folio swapins and zswap can work > > > > > together. > > > > > > > > > > Looking at callers of swap_read_folio(), it seems like they are e= ither > > > > > allocated from __read_swap_cache_async() or do_swap_page() in the > > > > > SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, s= o > > > > > everything is fine for now. > > > > > > > > > > However, there is ongoing work to add to support large folio swap= ins > > > > > [1]. To make sure new development does not break zswap (or get br= oken by > > > > > zswap), add minimal handling of incorrect loads of large folios t= o > > > > > zswap. > > > > > > > > > > First, move the call folio_mark_uptodate() inside zswap_load(). > > > > > > > > > > If a large folio load is attempted, and any page in that folio is= in > > > > > zswap, return 'true' without calling folio_mark_uptodate(). This = will > > > > > prevent the folio from being read from disk, and will emit an IO = error > > > > > because the folio is not uptodate (e.g. do_swap_fault() will retu= rn > > > > > VM_FAULT_SIGBUS). It may not be reliable recovery in all cases, b= ut it > > > > > is better than nothing. > > > > > > > > > > This was tested by hacking the allocation in __read_swap_cache_as= ync() > > > > > to use order 2 and __GFP_COMP. > > > > > > > > > > In the future, to handle this correctly, the swapin code should: > > > > > (a) Fallback to order-0 swapins if zswap was ever used on the mac= hine, > > > > > because compressed pages remain in zswap after it is disabled. > > > > > (b) Add proper support to swapin large folios from zswap (fully o= r > > > > > partially). > > > > > > > > > > Probably start with (a) then followup with (b). > > > > > > > > > > [1]https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnb= ao@gmail.com/ > > > > > > > > > > Signed-off-by: Yosry Ahmed > > > > > --- > > > > > > > > > > v1: https://lore.kernel.org/lkml/20240606184818.1566920-1-yosryah= med@google.com/ > > > > > > > > > > v1 -> v2: > > > > > - Instead of using VM_BUG_ON() use WARN_ON_ONCE() and add some re= covery > > > > > handling (David Hildenbrand). > > > > > > > > > > --- > > > > > mm/page_io.c | 1 - > > > > > mm/zswap.c | 22 +++++++++++++++++++++- > > > > > 2 files changed, 21 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/mm/page_io.c b/mm/page_io.c > > > > > index f1a9cfab6e748..8f441dd8e109f 100644 > > > > > --- a/mm/page_io.c > > > > > +++ b/mm/page_io.c > > > > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, str= uct swap_iocb **plug) > > > > > delayacct_swapin_start(); > > > > > > > > > > if (zswap_load(folio)) { > > > > > - folio_mark_uptodate(folio); > > > > > folio_unlock(folio); > > > > > } else if (data_race(sis->flags & SWP_FS_OPS)) { > > > > > swap_read_folio_fs(folio, plug); > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > > index b9b35ef86d9be..ebb878d3e7865 100644 > > > > > --- a/mm/zswap.c > > > > > +++ b/mm/zswap.c > > > > > @@ -1557,6 +1557,26 @@ bool zswap_load(struct folio *folio) > > > > > > > > > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > > > > > > > > > + /* > > > > > + * Large folios should not be swapped in while zswap is b= eing used, as > > > > > + * they are not properly handled. Zswap does not properly= load large > > > > > + * folios, and a large folio may only be partially in zsw= ap. > > > > > + * > > > > > + * If any of the subpages are in zswap, reading from disk= would result > > > > > + * in data corruption, so return true without marking the= folio uptodate > > > > > + * so that an IO error is emitted (e.g. do_swap_page() wi= ll sigfault). > > > > > + * > > > > > + * Otherwise, return false and read the folio from disk. > > > > > + */ > > > > > + if (folio_test_large(folio)) { > > > > > + if (xa_find(tree, &offset, > > > > > + offset + folio_nr_pages(folio) - 1, X= A_PRESENT)) { > > > > > + WARN_ON_ONCE(1); > > > > > + return true; > > > > > + } > > > > > + return false; > > > > > > > > IMHO, this appears to be over-designed. Personally, I would opt to > > > > use > > > > > > > > if (folio_test_large(folio)) > > > > return true; > > > > > > I am sure you mean "return false" here. Always returning true means w= e > > > will never read a large folio from either zswap or disk, whether it's > > > in zswap or not. Basically guaranteeing corrupting data for large > > > folio swapin, even if zswap is disabled :) > > > > > > > > > > > Before we address large folio support in zswap, it=E2=80=99s essent= ial > > > > not to let them coexist. Expecting valid data by lunchtime is > > > > not advisable. > > > > > > The goal here is to enable development for large folio swapin without > > > breaking zswap or being blocked on adding support in zswap. If we > > > always return false for large folios, as you suggest, then even if th= e > > > folio is in zswap (or parts of it), we will go read it from disk. Thi= s > > > will result in silent data corruption. > > > > > > As you mentioned before, you spent a week debugging problems with you= r > > > large folio swapin series because of a zswap problem, and even after > > > then, the zswap_is_enabled() check you had is not enough to prevent > > > problems as I mentioned before (if zswap was enabled before). So we > > > need stronger checks to make sure we don't break things when we > > > support large folio swapin. > > > > > > Since we can't just check if zswap is enabled or not, we need to > > > rather check if the folio (or any part of it) is in zswap or not. We > > > can only WARN in that case, but delivering the error to userspace is = a > > > couple of extra lines of code (not set uptodate), and will make the > > > problem much easier to notice. > > > > > > I am not sure I understand what you mean. The alternative is to > > > introduce a config option (perhaps internal) for large folio swapin, > > > and make this depend on !CONFIG_ZSWAP, or make zswap refuse to get > > > enabled if large folio swapin is enabled (through config or boot > > > option). This is until proper handling is added, of course. > > > > Hi Yosry, > > My point is that anybody attempts to do large folios swap-in should > > either > > 1. always use small folios if zswap has been once enabled before or now > > or > > 2. address the large folios swapin issues in zswap > > > > there is no 3rd way which you are providing. > > > > it is over-designed to give users true or false based on if data is zsw= ap > > as there is always a chance data could be in zswap. so before approach > > 2 is done, we should always WARN_ON large folios and report data > > corruption. > > We can't always WARN_ON for large folios, as this will fire even if > zswap was never enabled. The alternative is tracking whether zswap was > ever enabled, and checking that instead of checking if any part of the > folio is in zswap. > > Basically replacing xa_find(..) with zswap_was_enabled(..) or something. My point is that mm core should always fallback if (zswap_was_or_is_enabled()) goto fallback; till zswap fixes the issue. This is the only way to enable large folios swa= p-in development before we fix zswap. > > What I don't like about this is that we will report data corruption > even in cases where data is not really corrupted and it exists on > disk. For example, if zswap is globally enabled but disabled in a > cgroup, there shouldn't be any corruption swapping in large folios. > > That being said, I don't feel strongly, as long as we either check > that part of the folio is in zswap or that zswap was ever enabled (or > maybe check if a page was ever stored, just in case zswap was enabled > and immediately disabled). > > Johannes, Nhat, any opinions on which way to handle this? Thanks Barry