From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 886FEC27C52 for ; Thu, 6 Jun 2024 21:38:00 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 258D36B009B; Thu, 6 Jun 2024 17:38:00 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 2074C6B00BB; Thu, 6 Jun 2024 17:38:00 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0F67C6B009C; Thu, 6 Jun 2024 17:38:00 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id E6CC26B00BB for ; Thu, 6 Jun 2024 17:37:59 -0400 (EDT) Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 6C3E91407D3 for ; Thu, 6 Jun 2024 21:37:59 +0000 (UTC) X-FDA: 82201776678.14.77249B1 Received: from mail-ed1-f41.google.com (mail-ed1-f41.google.com [209.85.208.41]) by imf17.hostedemail.com (Postfix) with ESMTP id 8582A40007 for ; Thu, 6 Jun 2024 21:37:57 +0000 (UTC) Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=Qp06Kwh1; spf=pass (imf17.hostedemail.com: domain of yosryahmed@google.com designates 209.85.208.41 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1717709877; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=sxVCgkdsWxZe9cdiBSCygSUxoUzaQUcqMVcfSn0jQS0=; b=bpmDTWptNflheBevOoGJXuX061ExfDTwn4oxhmYgBpva2E9Rek5jt6gRLJ+T+sBbadUooe VrZhzx6AnkT+g6xFZ6IvJUhH+ZSZMRd6I9p+/zEGzzAxUtWRiK5S2s3EYlRERamLtYIi7M EMG3xY3YRBSVayL3s9QbQPSIRCVS92A= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=Qp06Kwh1; spf=pass (imf17.hostedemail.com: domain of yosryahmed@google.com designates 209.85.208.41 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1717709877; a=rsa-sha256; cv=none; b=5sCPGskJsNR2EWGm4WUiX4YvlL7v+lfSdRoo06n3W1NJqnF9WzRlFd6LI7caKNiXkdJghG j3OyETx3wz2zl5rMWZ7usvOiaqW8A6QiczPtoqINvcwwFNjpifpEmqC0uYGGQSn99d+2cr zHSaSLzypdvBUs2h/jVb/iIy2y1ke0w= Received: by mail-ed1-f41.google.com with SMTP id 4fb4d7f45d1cf-57aa64c6bbfso1509951a12.3 for ; Thu, 06 Jun 2024 14:37:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1717709876; x=1718314676; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=sxVCgkdsWxZe9cdiBSCygSUxoUzaQUcqMVcfSn0jQS0=; b=Qp06Kwh1f3iUfMe09CihSQTaPXLnt8qmjujhjpsKCMx1EK8AsnbGUd91b/cOnZuo8c Dq88yhgYaPF9PTJBphhjtpVTdYfjGetxoj0r0mp+aYAS7l4iKPk3WD9cIY70buTW0orp n3tj2iFuHNX776t9S8DOWJ5dpDECVik1dBKvxMT/+9nR84242q+QqmNUbTlekKe6TJoG +ToyElK3M58DmrYS5biQJrDvZF6GKA66/yxLl3B+TMFoYtc34dzUtAXNDU03EUXyJdsH zOfjY9GhyFXTz23EmUfjFyiHhpjtcDHhn/xrlCg6FEZrZWxow7jBuqRn0AzNYYxk8Sma Y4Xg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1717709876; x=1718314676; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=sxVCgkdsWxZe9cdiBSCygSUxoUzaQUcqMVcfSn0jQS0=; b=QOOO2QDKY29igVi+P8IHwKFJ7PKOfNO0H5xLKQExoycM0ZXSNzmahcjPCbu/93Lnjg V8xe+DOdq+KIm1Arse5q0AetxWx52dQjYh+9aOo4FQIH33d2vMwyIau7zSHxuejT4kqP FjtqDCKUDvYEqZ8IUnJotNj4gSUFD+27G+dYQMdfNy5zznuy2MAiRShMv9BMk5Oad0W0 KYIWwmb/BHE989ofwz9fB83hYN7o70esYwZRmhj0w4plS3pDPBj+XjVkkVgd0+weExyN zAqRWyor9KxrhAu1Opa1dmyKMQwf+wn08ceE2+M+YBm0ZiH1NC6x1dV4G/znQ4ItvZOn 1ivg== X-Forwarded-Encrypted: i=1; AJvYcCUMaPldVSHt3lMl8fa5cn0qIsw1T0hXA53ZyEbsWa6S7n1fQcQyRPBe9t3a+dWvQa4uB70MpUZ4Mgh3kk7gNM52rnM= X-Gm-Message-State: AOJu0YxsPdM++crEcvjzm5plN8TDrjOyKwqKHxZUadbKHURSmLIwGVQW IHN+bBvedo8HhAlNc0/Xs80E5rOBHeB1pqlpI2RByOQj7M+P8JHO94BZtSOAfRQMi5fidrbd2yJ v0c+s0Xm+bgXoPT1BqH8KtYnw0Zj96DpEJrpn X-Google-Smtp-Source: AGHT+IFz21LxWp7vj7WlCkGN3SL4ptgwrC5kesIuuV/yxsfzNrANxMI5zU81PuX1StwHQU6dhfApOuiL5BX6T9jLKRY= X-Received: by 2002:a17:906:f590:b0:a62:2b65:1dd5 with SMTP id a640c23a62f3a-a6cdaa0f67dmr73830066b.58.1717709875765; Thu, 06 Jun 2024 14:37:55 -0700 (PDT) MIME-Version: 1.0 References: <20240606184818.1566920-1-yosryahmed@google.com> <84d78362-e75c-40c8-b6c2-56d5d5292aa7@redhat.com> <7507d075-9f4d-4a9b-836c-1fbb2fbd2257@redhat.com> In-Reply-To: From: Yosry Ahmed Date: Thu, 6 Jun 2024 14:37:19 -0700 Message-ID: Subject: Re: [PATCH] mm: zswap: add VM_BUG_ON() if large folio swapin is attempted To: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand , Andrew Morton , Johannes Weiner , Nhat Pham , Chengming Zhou , Baolin Wang , Chris Li , Ryan Roberts , Matthew Wilcox , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: 6trsxq9ih9ihc9c9ohoauu9ze76f77ix X-Rspamd-Queue-Id: 8582A40007 X-Rspam-User: X-Rspamd-Server: rspam10 X-HE-Tag: 1717709877-661075 X-HE-Meta: U2FsdGVkX19BG45/jtCtDEmIMOzPwfKKOOlOvkEgGMV5J2/TcWIPZblEEP3+15gfPw9r7l0GlNRJd0TZ0GfeE7jfnFj/A/3/ju1N15+/t2oQoPuEQ1HWkzh9xQSudW2+2Pl7re4+VADKeFbnilR4g0dSRCB0o/4zG0a9nzcJV8qf08NGTdUOZnhl6V4/1Sz1mhkagTuEiHlMxC7VZd2pDB/r1MpebO5pILbTx3l9Q5Pwmjqdc+Oi85tPAjkqiaZUlJkUZkiQwLH9bNIqY43jFii/f6QvB9T4Ckf5LLK4YQPRF+lHgIKfbMFUJReQm9+ff5Td4TRUBxGxCiOyBEWyZ8ixMBlTeunuF1Y87lcKBJpL3gKJHnc9zViJI2i6GLKpOu5Ol7+MU38Lws+Ldy6FdAeXvabatXMEl7qtuZn36Yl+dHkJfy62A4bg6xPPcL9gOron3j+NZo4+4BO88wjNFSbA8mgNdQVuPKgw/abaauzHUx91fHxtQuWRNsejYVTtAlWCmAix0nqpTCRAdZKEkfeNt+O3gY+4HSKEfIljjnoB3ym8Rr2jcZVhHbrr4NxpRVw1Qdn0yIqP9Ef6IkHuf8BciwQpP2TJILGXR59ZLOqR1k8JqqEahUT3Z65LPuqZTDImz49tz8JsZKwcI5D9GSA2Nrd4ken+V98jRlBba0YpJm8oGznt/JIfWG81J/d+LTSPJfegnKJBIDEtLWwq+ycgFjRJ6g3JBPCVtvYY6ldXa3sRxoy4wfE73sQ7l6oLCM/wT72cKaHabADYGWZHhkvBJFJY7n7nxiCa00gW8WucjB+svERHwVWfu20BVbgtXkqr3v777okiAQe7KntlbWjXJXnTorsmLw4xyfHsxKuAPTKZeSae7Aiif0WFOZKMIAdkCMsTOI9JvUzY3c22GyFJZKPlSL6tEGrQLg0mdcCHylBW47BNhEW3PUlrRT2qf/HjI9YVtJ5rVSTofsh TKooO46/ YC3XXubs+rmbQGTY2p6RxRbbYTW++tJZ6Gnip5UkAR4J03er0wP0PwgN4N0ljDAfQzbW0OKxlFA6NcElMk6mUswG+xB1zEMo0Host61n3GwW9GceMB/SqdSGfala6/FLTtgcrEtK4WtoCcqxWYJQrENMtJNd4jPTnr711u72UfGj6tbuMQB1ZGXLNc7nkyoKHPUYGnK3f9IqobY+r/PT5PhP2aWKsWe42zR5dedTtekduDJycnrTpqpRYMy7a7vAdCtiJXz+bZGTfPtLtds05+7mf49fhjwAHWmRgEnaOGPmm8PCcyw1UC3Zb6Vn1T36etmPtTImW1iThCaIMscMyMeYv5Unphr59wAMi X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Jun 6, 2024 at 2:30=E2=80=AFPM Barry Song <21cnbao@gmail.com> wrote= : > > On Fri, Jun 7, 2024 at 9:17=E2=80=AFAM David Hildenbrand wrote: > > > > On 06.06.24 22:31, Yosry Ahmed wrote: > > > On Thu, Jun 6, 2024 at 1:22=E2=80=AFPM David Hildenbrand wrote: > > >> > > >> On 06.06.24 20:48, Yosry Ahmed wrote: > > >>> With ongoing work to support large folio swapin, it is important to= make > > >>> sure we do not pass large folios to zswap_load() without implementi= ng > > >>> proper support. > > >>> > > >>> For example, if a swapin fault observes that contiguous PTEs are > > >>> pointing to contiguous swap entries and tries to swap them in as a = large > > >>> folio, swap_read_folio() will pass in a large folio to zswap_load()= , but > > >>> zswap_load() will only effectively load the first page in the folio= . If > > >>> the first page is not in zswap, the folio will be read from disk, e= ven > > >>> though other pages may be in zswap. > > >>> > > >>> In both cases, this will lead to silent data corruption. > > >>> > > >>> Proper large folio swapin support needs to go into zswap before zsw= ap > > >>> can be enabled in a system that supports large folio swapin. > > >>> > > >>> Looking at callers of swap_read_folio(), it seems like they are eit= her > > >>> allocated from __read_swap_cache_async() or do_swap_page() in the > > >>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so = we > > >>> are fine for now. > > >>> > > >>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect chang= es in > > >>> the order of those allocations without proper handling of zswap. > > >>> > > >>> Alternatively, swap_read_folio() (or its callers) can be updated to= have > > >>> a fallback mechanism that splits large folios or reads subpages > > >>> separately. Similar logic may be needed anyway in case part of a la= rge > > >>> folio is already in the swapcache and the rest of it is swapped out= . > > >>> > > >>> Signed-off-by: Yosry Ahmed > > >>> --- > > >>> > > >>> Sorry for the long CC list, I just found myself repeatedly looking = at > > >>> new series that add swap support for mTHPs / large folios, making s= ure > > >>> they do not break with zswap or make incorrect assumptions. This de= bug > > >>> check should give us some peace of mind. Hopefully this patch will = also > > >>> raise awareness among people who are working on this. > > >>> > > >>> --- > > >>> mm/zswap.c | 3 +++ > > >>> 1 file changed, 3 insertions(+) > > >>> > > >>> diff --git a/mm/zswap.c b/mm/zswap.c > > >>> index b9b35ef86d9be..6007252429bb2 100644 > > >>> --- a/mm/zswap.c > > >>> +++ b/mm/zswap.c > > >>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio) > > >>> if (!entry) > > >>> return false; > > >>> > > >>> + /* Zswap loads do not handle large folio swapins correctly ye= t */ > > >>> + VM_BUG_ON(folio_test_large(folio)); > > >>> + > > >> > > >> There is no way we could have a WARN_ON_ONCE() and recover, right? > > > > > > Not without making more fundamental changes to the surrounding swap > > > code. Currently zswap_load() returns either true (folio was loaded > > > from zswap) or false (folio is not in zswap). > > > > > > To handle this correctly zswap_load() would need to tell > > > swap_read_folio() which subpages are in zswap and have been loaded, > > > and then swap_read_folio() would need to read the remaining subpages > > > from disk. This of course assumes that the caller of swap_read_folio(= ) > > > made sure that the entire folio is swapped out and protected against > > > races with other swapins. > > > > > > Also, because swap_read_folio() cannot split the folio itself, other > > > swap_read_folio_*() functions that are called from it should be > > > updated to handle swapping in tail subpages, which may be questionabl= e > > > in its own right. > > > > > > An alternative would be that zswap_load() (or a separate interface) > > > could tell swap_read_folio() that the folio is partially in zswap, > > > then we can just bail and tell the caller that it cannot read the > > > large folio and that it should be split. > > > > > > There may be other options as well, but the bottom line is that it is > > > possible, but probably not something that we want to do right now. > > > > > > A stronger protection method would be to introduce a config option or > > > boot parameter for large folio swapin, and then make CONFIG_ZSWAP > > > depend on it being disabled, or have zswap check it at boot and refus= e > > > to be enabled if it is on. > > > > Right, sounds like the VM_BUG_ON() really is not that easily avoidable. > > > > I was wondering, if we could WARN_ON_ONCE and make the swap code detect > > this like a read-error from disk. > > > > I think do_swap_page() detects that by checking if the folio is not > > uptodate: > > > > if (unlikely(!folio_test_uptodate(folio))) { > > ret =3D VM_FAULT_SIGBUS; > > goto out_nomap; > > } > > > > So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the > > system (but the app would crash either way, there is no way around it). > > > > I'd rather fallback to small folios swapin instead crashing apps till we = fix > the large folio swapin in zswap :-) I think David is referring to catching the buggy cases that do not properly fallback to small folios with zswap, not as an alternative to the fallback. This is at least what I had in mind with the patch.