From: Barry Song <21cnbao@gmail.com>
Date: Fri, 7 Jun 2024 08:54:56 +1200
Subject: Re: [PATCH] mm: zswap: add VM_BUG_ON() if large folio swapin is attempted
To: Yosry Ahmed
Cc: David Hildenbrand, Andrew Morton, Johannes Weiner, Nhat Pham, Chengming Zhou, Baolin Wang, Chris Li, Ryan Roberts, Matthew Wilcox, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240606184818.1566920-1-yosryahmed@google.com> <84d78362-e75c-40c8-b6c2-56d5d5292aa7@redhat.com>
On Fri, Jun 7, 2024 at 8:32 AM Yosry Ahmed wrote:
>
> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand wrote:
> >
> > On 06.06.24 20:48, Yosry Ahmed wrote:
> > > With ongoing work to support large folio swapin, it is important to
> > > make sure we do
> > > not pass large folios to zswap_load() without implementing proper
> > > support.
> > >
> > > For example, if a swapin fault observes that contiguous PTEs are
> > > pointing to contiguous swap entries and tries to swap them in as a
> > > large folio, swap_read_folio() will pass in a large folio to
> > > zswap_load(), but zswap_load() will only effectively load the first
> > > page in the folio. If the first page is not in zswap, the folio will
> > > be read from disk, even though other pages may be in zswap.
> > >
> > > In both cases, this will lead to silent data corruption.
> > >
> > > Proper large folio swapin support needs to go into zswap before zswap
> > > can be enabled in a system that supports large folio swapin.
> > >
> > > Looking at callers of swap_read_folio(), it seems like they are either
> > > allocated from __read_swap_cache_async() or do_swap_page() in the
> > > SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> > > are fine for now.
> > >
> > > Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes
> > > in the order of those allocations without proper handling of zswap.
> > >
> > > Alternatively, swap_read_folio() (or its callers) can be updated to
> > > have a fallback mechanism that splits large folios or reads subpages
> > > separately. Similar logic may be needed anyway in case part of a
> > > large folio is already in the swapcache and the rest of it is swapped
> > > out.
> > >
> > > Signed-off-by: Yosry Ahmed

Acked-by: Barry Song

This has been observed by me[1]; that's why you can find the below code
in my swapin patch:

+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+	...
+	/*
+	 * a large folio being swapped-in could be partially in
+	 * zswap and partially in swap devices, zswap doesn't
+	 * support large folios yet, we might get corrupted
+	 * zero-filled data by reading all subpages from swap
+	 * devices while some of them are actually in zswap
+	 */
+	if (is_zswap_enabled())
+		goto fallback;

[1] https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/

> > > ---
> > >
> > > Sorry for the long CC list, I just found myself repeatedly looking at
> > > new series that add swap support for mTHPs / large folios, making sure
> > > they do not break with zswap or make incorrect assumptions. This debug
> > > check should give us some peace of mind. Hopefully this patch will also
> > > raise awareness among people who are working on this.
> > >
> > > ---
> > >  mm/zswap.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index b9b35ef86d9be..6007252429bb2 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> > >       if (!entry)
> > >               return false;
> > >
> > > +     /* Zswap loads do not handle large folio swapins correctly yet */
> > > +     VM_BUG_ON(folio_test_large(folio));
> > > +
> >
> > There is no way we could have a WARN_ON_ONCE() and recover, right?
>
> Not without making more fundamental changes to the surrounding swap
> code. Currently zswap_load() returns either true (folio was loaded
> from zswap) or false (folio is not in zswap).
>
> To handle this correctly zswap_load() would need to tell
> swap_read_folio() which subpages are in zswap and have been loaded,
> and then swap_read_folio() would need to read the remaining subpages
> from disk. This of course assumes that the caller of swap_read_folio()
> made sure that the entire folio is swapped out and protected against
> races with other swapins.
>
> Also, because swap_read_folio() cannot split the folio itself, other
> swap_read_folio_*() functions that are called from it should be
> updated to handle swapping in tail subpages, which may be questionable
> in its own right.
>
> An alternative would be that zswap_load() (or a separate interface)
> could tell swap_read_folio() that the folio is partially in zswap,
> then we can just bail and tell the caller that it cannot read the
> large folio and that it should be split.
>
> There may be other options as well, but the bottom line is that it is
> possible, but probably not something that we want to do right now.
>
> A stronger protection method would be to introduce a config option or
> boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> depend on it being disabled, or have zswap check it at boot and refuse
> to be enabled if it is on.

Thanks
Barry