From: Barry Song <21cnbao@gmail.com>
Date: Sat, 8 Jun 2024 09:20:07 +1200
Subject: Re: [PATCH] mm: zswap: add VM_BUG_ON() if large folio swapin is attempted
To: Yosry Ahmed
Cc: David Hildenbrand, Andrew Morton, Johannes Weiner, Nhat Pham,
	Chengming Zhou, Baolin Wang, Chris Li, Ryan Roberts, Matthew Wilcox,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240606184818.1566920-1-yosryahmed@google.com>
	<84d78362-e75c-40c8-b6c2-56d5d5292aa7@redhat.com>
	<7507d075-9f4d-4a9b-836c-1fbb2fbd2257@redhat.com>
	<9374758d-9f81-4e4f-8405-1f972234173e@redhat.com>

On Sat, Jun 8, 2024 at 5:43 AM Yosry Ahmed wrote:
>
> On Fri, Jun 7, 2024 at 12:11 AM David Hildenbrand wrote:
> >
> > On 06.06.24 23:53, Barry Song wrote:
> > > On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed wrote:
> > >>
> > >> On Thu, Jun 6, 2024 at 2:30 PM Barry Song <21cnbao@gmail.com> wrote:
> > >>>
> > >>> On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand wrote:
> > >>>>
> > >>>> On 06.06.24 22:31, Yosry Ahmed wrote:
> > >>>>> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand wrote:
> > >>>>>>
> > >>>>>> On 06.06.24 20:48, Yosry Ahmed wrote:
> > >>>>>>> With ongoing work to support large folio swapin, it is important to make
> > >>>>>>> sure we do not pass large folios to zswap_load() without implementing
> > >>>>>>> proper support.
> > >>>>>>>
> > >>>>>>> For example, if a swapin fault observes that contiguous PTEs are
> > >>>>>>> pointing to contiguous swap entries and tries to swap them in as a large
> > >>>>>>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > >>>>>>> zswap_load() will only effectively load the first page in the folio. If
> > >>>>>>> the first page is not in zswap, the folio will be read from disk, even
> > >>>>>>> though other pages may be in zswap.
> > >>>>>>>
> > >>>>>>> In both cases, this will lead to silent data corruption.
> > >>>>>>>
> > >>>>>>> Proper large folio swapin support needs to go into zswap before zswap
> > >>>>>>> can be enabled in a system that supports large folio swapin.
> > >>>>>>>
> > >>>>>>> Looking at callers of swap_read_folio(), it seems like they are either
> > >>>>>>> allocated from __read_swap_cache_async() or do_swap_page() in the
> > >>>>>>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> > >>>>>>> are fine for now.
> > >>>>>>>
> > >>>>>>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> > >>>>>>> the order of those allocations without proper handling of zswap.
> > >>>>>>>
> > >>>>>>> Alternatively, swap_read_folio() (or its callers) can be updated to have
> > >>>>>>> a fallback mechanism that splits large folios or reads subpages
> > >>>>>>> separately. Similar logic may be needed anyway in case part of a large
> > >>>>>>> folio is already in the swapcache and the rest of it is swapped out.
> > >>>>>>>
> > >>>>>>> Signed-off-by: Yosry Ahmed
> > >>>>>>> ---
> > >>>>>>>
> > >>>>>>> Sorry for the long CC list, I just found myself repeatedly looking at
> > >>>>>>> new series that add swap support for mTHPs / large folios, making sure
> > >>>>>>> they do not break with zswap or make incorrect assumptions. This debug
> > >>>>>>> check should give us some peace of mind. Hopefully this patch will also
> > >>>>>>> raise awareness among people who are working on this.
> > >>>>>>>
> > >>>>>>> ---
> > >>>>>>>  mm/zswap.c | 3 +++
> > >>>>>>>  1 file changed, 3 insertions(+)
> > >>>>>>>
> > >>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
> > >>>>>>> index b9b35ef86d9be..6007252429bb2 100644
> > >>>>>>> --- a/mm/zswap.c
> > >>>>>>> +++ b/mm/zswap.c
> > >>>>>>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> > >>>>>>>        if (!entry)
> > >>>>>>>                return false;
> > >>>>>>>
> > >>>>>>> +      /* Zswap loads do not handle large folio swapins correctly yet */
> > >>>>>>> +      VM_BUG_ON(folio_test_large(folio));
> > >>>>>>> +
> > >>>>>>
> > >>>>>> There is no way we could have a WARN_ON_ONCE() and recover, right?
> > >>>>>
> > >>>>> Not without making more fundamental changes to the surrounding swap
> > >>>>> code. Currently zswap_load() returns either true (folio was loaded
> > >>>>> from zswap) or false (folio is not in zswap).
> > >>>>>
> > >>>>> To handle this correctly zswap_load() would need to tell
> > >>>>> swap_read_folio() which subpages are in zswap and have been loaded,
> > >>>>> and then swap_read_folio() would need to read the remaining subpages
> > >>>>> from disk. This of course assumes that the caller of swap_read_folio()
> > >>>>> made sure that the entire folio is swapped out and protected against
> > >>>>> races with other swapins.
> > >>>>>
> > >>>>> Also, because swap_read_folio() cannot split the folio itself, other
> > >>>>> swap_read_folio_*() functions that are called from it should be
> > >>>>> updated to handle swapping in tail subpages, which may be questionable
> > >>>>> in its own right.
> > >>>>>
> > >>>>> An alternative would be that zswap_load() (or a separate interface)
> > >>>>> could tell swap_read_folio() that the folio is partially in zswap,
> > >>>>> then we can just bail and tell the caller that it cannot read the
> > >>>>> large folio and that it should be split.
> > >>>>>
> > >>>>> There may be other options as well, but the bottom line is that it is
> > >>>>> possible, but probably not something that we want to do right now.
> > >>>>>
> > >>>>> A stronger protection method would be to introduce a config option or
> > >>>>> boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> > >>>>> depend on it being disabled, or have zswap check it at boot and refuse
> > >>>>> to be enabled if it is on.
> > >>>>
> > >>>> Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
> > >>>>
> > >>>> I was wondering, if we could WARN_ON_ONCE and make the swap code detect
> > >>>> this like a read-error from disk.
> > >>>>
> > >>>> I think do_swap_page() detects that by checking if the folio is not
> > >>>> uptodate:
> > >>>>
> > >>>>          if (unlikely(!folio_test_uptodate(folio))) {
> > >>>>                  ret = VM_FAULT_SIGBUS;
> > >>>>                  goto out_nomap;
> > >>>>          }
> > >>>>
> > >>>> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> > >>>> system (but the app would crash either way, there is no way around it).
> > >>>>
> > >>>
> > >>> I'd rather fallback to small folios swapin instead crashing apps till we fix
> > >>> the large folio swapin in zswap :-)
> > >>
> > >> I think David is referring to catching the buggy cases that do not
> > >> properly fallback to small folios with zswap, not as an alternative to
> > >> the fallback. This is at least what I had in mind with the patch.
> > >
> > > Cool. Thanks for the clarification. I'm fine with keeping the fallback,
> > > whether it's the current VM_BUG_ON or David's recommended
> > > SIGBUS.
> > >
> > > Currently, mainline doesn't support large folios swap-in, so I see
> > > your patch as a helpful warning for those attempting large folio
> > > swap-in to avoid making mistakes like loading large folios from
> > > zswap.
> > >
> > > In fact, I spent a week trying to figure out why my app was crashing
> > > before I added 'if (zswap_is_enabled()) goto fallback'. If your patch
> > > had come earlier, it would have saved me that week of work :-)
> > >
> > > To me, a direct crash seems like a more straightforward way to
> > > prompt people to fallback when attempting large folio swap-ins.
> > > So, I am slightly in favor of your current patch.
> >
> > BUG_ON and friends are frowned-upon, in particular in scenarios where we
> > can recover or that are so unexpected that they can be found during
> > early testing.
> >
> > I have no strong opinion on this one, but likely a VM_WARN_ON would also
> > be sufficient to find such issues early during testing. No need to crash
> > the machine.
>
> I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after
> some digging I found your patches to checkpatch and Linus clearly
> stating that it isn't.
>
> How about something like the following (untested), it is the minimal
> recovery we can do but should work for a lot of cases, and does
> nothing beyond a warning if we can swapin the large folio from disk:
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index f1a9cfab6e748..8f441dd8e109f 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>         delayacct_swapin_start();
>
>         if (zswap_load(folio)) {
> -               folio_mark_uptodate(folio);
>                 folio_unlock(folio);
>         } else if (data_race(sis->flags & SWP_FS_OPS)) {
>                 swap_read_folio_fs(folio, plug);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6007252429bb2..cc04db6bb217e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
> +       /*
> +        * Large folios should not be swapped in while zswap is being used, as
> +        * they are not properly handled.
> +        *
> +        * If any of the subpages are in zswap, reading from disk would result
> +        * in data corruption, so return true without marking the folio uptodate
> +        * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> +        *
> +        * Otherwise, return false and read the folio from disk.
> +        */
> +       if (WARN_ON_ONCE(folio_test_large(folio))) {
> +               if (xa_find(tree, &offset, offset + folio_nr_pages(folio) - 1, 0))
> +                       return true;
> +               return false;
> +       }
> +
>         /*
>          * When reading into the swapcache, invalidate our entry. The
>          * swapcache can be the authoritative owner of the page and
> @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
>                 zswap_entry_free(entry);
>                 folio_mark_dirty(folio);
>         }
> -
> +       folio_mark_uptodate(folio);
>         return true;
>  }
>
> One problem is that even if zswap was never enabled, the warning will
> be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> static key if zswap was "ever" enabled.
>
> Barry, I suspect your is_zswap_enabled() check is deficient for
> similar reasons, zswap could have been enabled before then became
> disabled.

I don't understand this. If zswap was enabled before but is disabled
when I am loading data, will I get corrupted data because zswap was
once enabled? If not, it doesn't seem important.

Thanks
Barry
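
As a rough illustration of the "ever enabled" idea mentioned above, a one-way
static key could be flipped the first time zswap is switched on and consulted
by swapin code afterwards. This is only a sketch: the key name, the
zswap_mark_ever_enabled() helper, and where it gets hooked in are assumptions
for illustration, not the actual kernel implementation.

#include <linux/jump_label.h>

/* set the first time zswap is switched on; intentionally never cleared */
static DEFINE_STATIC_KEY_FALSE(zswap_ever_enabled_key);

/* hypothetical hook: call from wherever zswap first becomes enabled */
static void zswap_mark_ever_enabled(void)
{
        static_branch_enable(&zswap_ever_enabled_key);
}

/* large folio swapin paths would test this instead of the current state */
bool zswap_never_enabled(void)
{
        return !static_branch_unlikely(&zswap_ever_enabled_key);
}

Entries written while zswap was enabled stay in the zswap pool until they are
loaded or written back, so a check against only the current enablement state
cannot rule out leftovers from an earlier enable/disable cycle, which is the
gap being pointed out above.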