From mboxrd@z Thu Jan  1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 11 Jun 2024 04:05:50 +0800
Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chengming Zhou, Baolin Wang,
 Chris Li, Ryan Roberts, David Hildenbrand, Matthew Wilcox,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240608023654.3513385-1-yosryahmed@google.com>
Content-Type: text/plain; charset="UTF-8"
On Tue, Jun 11, 2024 at 1:42 AM Yosry Ahmed wrote:
>
> On Fri, Jun 7, 2024 at 9:13 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Jun 8, 2024 at 10:37 AM Yosry Ahmed wrote:
> > >
> > > Zswap does not support storing or loading large folios. Until proper
> > > support is added, attempts to load large folios from zswap are a bug.
> > >
> > > For example, if a swapin fault observes that contiguous PTEs are
> > > pointing to contiguous swap entries and tries to swap them in as a large
> > > folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > > zswap_load() will only effectively load the first page in the folio. If
> > > the first page is not in zswap, the folio will be read from disk, even
> > > though other pages may be in zswap.
> > >
> > > In both cases, this will lead to silent data corruption. Proper support
> > > needs to be added before large folio swapins and zswap can work
> > > together.
> > >
> > > Looking at callers of swap_read_folio(), it seems like they are either
> > > allocated from __read_swap_cache_async() or do_swap_page() in the
> > > SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so
> > > everything is fine for now.
> > >
> > > However, there is ongoing work to add to support large folio swapins
> > > [1]. To make sure new development does not break zswap (or get broken by
> > > zswap), add minimal handling of incorrect loads of large folios to
> > > zswap.
> > >
> > > First, move the call folio_mark_uptodate() inside zswap_load().
> > >
> > > If a large folio load is attempted, and any page in that folio is in
> > > zswap, return 'true' without calling folio_mark_uptodate(). This will
> > > prevent the folio from being read from disk, and will emit an IO error
> > > because the folio is not uptodate (e.g. do_swap_fault() will return
> > > VM_FAULT_SIGBUS). It may not be reliable recovery in all cases, but it
> > > is better than nothing.
> > >
> > > This was tested by hacking the allocation in __read_swap_cache_async()
> > > to use order 2 and __GFP_COMP.
> > >
> > > In the future, to handle this correctly, the swapin code should:
> > > (a) Fallback to order-0 swapins if zswap was ever used on the machine,
> > > because compressed pages remain in zswap after it is disabled.
> > > (b) Add proper support to swapin large folios from zswap (fully or
> > > partially).
> > >
> > > Probably start with (a) then followup with (b).
> > >
> > > [1] https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/
> > >
> > > Signed-off-by: Yosry Ahmed
> > > ---
> > >
> > > v1: https://lore.kernel.org/lkml/20240606184818.1566920-1-yosryahmed@google.com/
> > >
> > > v1 -> v2:
> > > - Instead of using VM_BUG_ON() use WARN_ON_ONCE() and add some recovery
> > >   handling (David Hildenbrand).
> > >
> > > ---
> > >  mm/page_io.c |  1 -
> > >  mm/zswap.c   | 22 +++++++++++++++++++++-
> > >  2 files changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/page_io.c b/mm/page_io.c
> > > index f1a9cfab6e748..8f441dd8e109f 100644
> > > --- a/mm/page_io.c
> > > +++ b/mm/page_io.c
> > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
> > >         delayacct_swapin_start();
> > >
> > >         if (zswap_load(folio)) {
> > > -               folio_mark_uptodate(folio);
> > >                 folio_unlock(folio);
> > >         } else if (data_race(sis->flags & SWP_FS_OPS)) {
> > >                 swap_read_folio_fs(folio, plug);
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index b9b35ef86d9be..ebb878d3e7865 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1557,6 +1557,26 @@ bool zswap_load(struct folio *folio)
> > >
> > >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > >
> > > +       /*
> > > +        * Large folios should not be swapped in while zswap is being used, as
> > > +        * they are not properly handled. Zswap does not properly load large
> > > +        * folios, and a large folio may only be partially in zswap.
> > > +        *
> > > +        * If any of the subpages are in zswap, reading from disk would result
> > > +        * in data corruption, so return true without marking the folio uptodate
> > > +        * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > > +        *
> > > +        * Otherwise, return false and read the folio from disk.
> > > +        */
> > > +       if (folio_test_large(folio)) {
> > > +               if (xa_find(tree, &offset,
> > > +                           offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
> > > +                       WARN_ON_ONCE(1);
> > > +                       return true;
> > > +               }
> > > +               return false;
> >
> > IMHO, this appears to be over-designed. Personally, I would opt to
> > use
> >
> > if (folio_test_large(folio))
> >         return true;
>
> I am sure you mean "return false" here. Always returning true means we
> will never read a large folio from either zswap or disk, whether it's
> in zswap or not. Basically guaranteeing corrupting data for large
> folio swapin, even if zswap is disabled :)
>
> > Before we address large folio support in zswap, it's essential
> > not to let them coexist. Expecting valid data by lunchtime is
> > not advisable.
>
> The goal here is to enable development for large folio swapin without
> breaking zswap or being blocked on adding support in zswap. If we
> always return false for large folios, as you suggest, then even if the
> folio is in zswap (or parts of it), we will go read it from disk. This
> will result in silent data corruption.
>
> As you mentioned before, you spent a week debugging problems with your
> large folio swapin series because of a zswap problem, and even after
> then, the zswap_is_enabled() check you had is not enough to prevent
> problems as I mentioned before (if zswap was enabled before). So we
> need stronger checks to make sure we don't break things when we
> support large folio swapin.
>
> Since we can't just check if zswap is enabled or not, we need to
> rather check if the folio (or any part of it) is in zswap or not. We
> can only WARN in that case, but delivering the error to userspace is a
> couple of extra lines of code (not set uptodate), and will make the
> problem much easier to notice.
>
> I am not sure I understand what you mean.
> The alternative is to
> introduce a config option (perhaps internal) for large folio swapin,
> and make this depend on !CONFIG_ZSWAP, or make zswap refuse to get
> enabled if large folio swapin is enabled (through config or boot
> option). This is until proper handling is added, of course.

Hi Yosry,

My point is that anybody attempting to do large folio swap-in should
either

1. always use small folios if zswap has ever been enabled, before or
   now, or
2. address the large folio swap-in issues in zswap.

There is no third way, which is what you are providing. It is
over-designed to give users true or false based on whether the data is
in zswap, as there is always a chance the data could be in zswap. So
before approach 2 is done, we should always WARN_ON large folios and
report data corruption.

> > > +       }
> > > +
> > >         /*
> > >          * When reading into the swapcache, invalidate our entry. The
> > >          * swapcache can be the authoritative owner of the page and
> > > @@ -1590,7 +1610,7 @@ bool zswap_load(struct folio *folio)
> > >                 zswap_entry_free(entry);
> > >                 folio_mark_dirty(folio);
> > >         }
> > > -
> > > +       folio_mark_uptodate(folio);
> > >         return true;
> > >  }
> > >
> > > --
> > > 2.45.2.505.gda0bf45e8d-goog
> > >
> >
> > Thanks
> > Barry

Thanks
Barry
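[Editorial aside: the return-value contract the two sides are debating can be
sketched in plain userspace C. This is a hypothetical illustration only — the
names (zswap_load_sketch, zswap_present) are invented, and the real kernel
code checks the zswap xarray with xa_find() and operates on struct folio.]

```c
/*
 * Sketch of the zswap_load() contract from the patch, under the
 * assumption that a bool array stands in for the zswap tree.
 */
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 16

/* Simulated zswap tree: slot i is set if that swap entry is in zswap. */
static bool zswap_present[NR_SLOTS];

/*
 * Return true  -> zswap "claimed" the load; the caller must not read disk.
 * Return false -> nothing relevant is in zswap; a disk read is safe.
 * *uptodate is set only when valid data was actually produced, so a
 * claimed-but-not-uptodate folio surfaces as an IO error instead of a
 * silent read of a stale disk copy (the key point of the patch).
 */
static bool zswap_load_sketch(size_t offset, size_t nr_pages, bool *uptodate)
{
    *uptodate = false;

    if (nr_pages > 1) {
        /* Large folio: if ANY subpage is in zswap, the disk copy is stale. */
        for (size_t i = offset; i < offset + nr_pages; i++)
            if (zswap_present[i])
                return true;   /* claim it, leave !uptodate -> IO error */
        return false;          /* fully absent from zswap: disk read is safe */
    }

    if (!zswap_present[offset])
        return false;          /* order-0 miss: fall back to disk */

    /* A real implementation would decompress the page here. */
    *uptodate = true;
    return true;
}
```

Barry's alternative (unconditionally refusing large folios) maps to returning
true with !*uptodate whenever nr_pages > 1; the patch's version additionally
lets a large folio that is fully absent from zswap proceed to disk.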