From: Yosry Ahmed <yosryahmed@google.com>
Date: Mon, 21 Oct 2024 14:34:03 -0700
Subject: Re: [RFC 3/4] mm/zswap: add support for large folio zswapin
To: Usama Arif <usamaarif642@gmail.com>
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, david@redhat.com, willy@infradead.org,
	kanchana.p.sridhar@intel.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
	ryan.roberts@arm.com, ying.huang@intel.com, riel@surriel.com,
	shakeel.butt@linux.dev, kernel-team@meta.com, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org
In-Reply-To: <6d036c4d-ec2e-4562-98a1-6668948086b5@gmail.com>
References: <20241018105026.2521366-1-usamaarif642@gmail.com> <20241018105026.2521366-4-usamaarif642@gmail.com> <6d036c4d-ec2e-4562-98a1-6668948086b5@gmail.com>
On Mon, Oct 21, 2024 at 1:57 PM Usama Arif wrote:
>
>
>
> On 21/10/2024 21:28, Barry Song wrote:
> > On Tue, Oct 22, 2024 at 1:21 AM Usama Arif wrote:
> >>
> >>
> >> On 21/10/2024 11:55, Barry Song wrote:
> >>> On Mon, Oct 21, 2024 at 11:44 PM Usama Arif wrote:
> >>>>
> >>>>
> >>>> On 21/10/2024 06:49, Barry Song wrote:
> >>>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif wrote:
> >>>>>>
> >>>>>> At the time of folio allocation, alloc_swap_folio() checks if the entire
> >>>>>> folio is in zswap to determine the folio order.
> >>>>>> During swap_read_folio(), zswap_load() will check if the entire folio
> >>>>>> is in zswap, and if it is, it will iterate through the pages in the
> >>>>>> folio and decompress them.
> >>>>>> This will mean the benefits of large folios (fewer page faults, batched
> >>>>>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
> >>>>>> and amd)) are not lost at swap out when using zswap.
> >>>>>> This patch does not add support for hybrid backends (i.e. folios
> >>>>>> partly present in swap and partly in zswap).
> >>>>>>
> >>>>>> Signed-off-by: Usama Arif
> >>>>>> ---
> >>>>>>  mm/memory.c | 13 +++-------
> >>>>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------------
> >>>>>>  2 files changed, 34 insertions(+), 47 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>> index 49d243131169..75f7b9f5fb32 100644
> >>>>>> --- a/mm/memory.c
> >>>>>> +++ b/mm/memory.c
> >>>>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> >>>>>>
> >>>>>>         /*
> >>>>>>          * swap_read_folio() can't handle the case a large folio is hybridly
> >>>>>> -        * from different backends. And they are likely corner cases. Similar
> >>>>>> -        * things might be added once zswap support large folios.
> >>>>>> +        * from different backends. And they are likely corner cases.
> >>>>>>          */
> >>>>>>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
> >>>>>>                 return false;
> >>>>>>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
> >>>>>>                 return false;
> >>>>>> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
> >>>>>> +               return false;

Hmm, if the entire folio is not in zswap, this will prevent the large
folio swapin, right?

Also, I think this is racy, see the comments below and in patch 1.

> >>>>>>
> >>>>>>         return true;
> >>>>>>  }
> >>>>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>>>>>         if (unlikely(userfaultfd_armed(vma)))
> >>>>>>                 goto fallback;
> >>>>>>
> >>>>>> -       /*
> >>>>>> -        * A large swapped out folio could be partially or fully in zswap. We
> >>>>>> -        * lack handling for such cases, so fallback to swapping in order-0
> >>>>>> -        * folio.
> >>>>>> -        */
> >>>>>> -       if (!zswap_never_enabled())
> >>>>>> -               goto fallback;
> >>>>>> -
> >>>>>>         entry = pte_to_swp_entry(vmf->orig_pte);
> >>>>>>         /*
> >>>>>>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>>>>> index 9cc91ae31116..a5aa86c24060 100644
> >>>>>> --- a/mm/zswap.c
> >>>>>> +++ b/mm/zswap.c
> >>>>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
> >>>>>>
> >>>>>>  bool zswap_load(struct folio *folio)
> >>>>>>  {
> >>>>>> +       int nr_pages = folio_nr_pages(folio);
> >>>>>>         swp_entry_t swp = folio->swap;
> >>>>>> +       unsigned int type = swp_type(swp);
> >>>>>>         pgoff_t offset = swp_offset(swp);
> >>>>>>         bool swapcache = folio_test_swapcache(folio);
> >>>>>> -       struct xarray *tree = swap_zswap_tree(swp);
> >>>>>> +       struct xarray *tree;
> >>>>>>         struct zswap_entry *entry;
> >>>>>> +       int i;
> >>>>>>
> >>>>>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >>>>>>
> >>>>>>         if (zswap_never_enabled())
> >>>>>>                 return false;
> >>>>>>
> >>>>>> -       /*
> >>>>>> -        * Large folios should not be swapped in while zswap is being used, as
> >>>>>> -        * they are not properly handled. Zswap does not properly load large
> >>>>>> -        * folios, and a large folio may only be partially in zswap.
> >>>>>> -        *
> >>>>>> -        * Return true without marking the folio uptodate so that an IO error is
> >>>>>> -        * emitted (e.g. do_swap_page() will sigbus).
> >>>>>> -        */
> >>>>>> -       if (WARN_ON_ONCE(folio_test_large(folio)))
> >>>>>> -               return true;
> >>>>>> -
> >>>>>> -       /*
> >>>>>> -        * When reading into the swapcache, invalidate our entry. The
> >>>>>> -        * swapcache can be the authoritative owner of the page and
> >>>>>> -        * its mappings, and the pressure that results from having two
> >>>>>> -        * in-memory copies outweighs any benefits of caching the
> >>>>>> -        * compression work.
> >>>>>> -        *
> >>>>>> -        * (Most swapins go through the swapcache. The notable
> >>>>>> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> >>>>>> -        * files, which reads into a private page and may free it if
> >>>>>> -        * the fault fails. We remain the primary owner of the entry.)
> >>>>>> -        */
> >>>>>> -       if (swapcache)
> >>>>>> -               entry = xa_erase(tree, offset);
> >>>>>> -       else
> >>>>>> -               entry = xa_load(tree, offset);
> >>>>>> -
> >>>>>> -       if (!entry)
> >>>>>> +       if (!zswap_present_test(folio->swap, nr_pages))
> >>>>>>                 return false;
> >>>>>
> >>>>> Hi Usama,
> >>>>>
> >>>>> Is there any chance that zswap_present_test() returns true
> >>>>> in do_swap_page() but false in zswap_load()? If that's
> >>>>> possible, could we be missing something? For example,
> >>>>> could it be that zswap has been partially released (with
> >>>>> part of it still present) during an mTHP swap-in?

As I mentioned in patch 1, we need to document when the result of
zswap_present_test() is stable and we cannot race with other stores,
exclusive loads, writeback, or invalidation.

> >>>>>
> >>>>> If this happens with an mTHP, my understanding is that
> >>>>> we shouldn't proceed with reading corrupted data from the
> >>>>> disk backend.
> >>>>>
> >>>>
> >>>> If it's not swapcache, the zswap entry is not deleted, so I think
> >>>> it should be ok?

Can we race with things like writeback and other exclusive loads
because swapcache_prepare() is not called yet?

> >>>>
> >>>> We can check over here if the entire folio is in zswap,
> >>>> and if not, return true without marking the folio uptodate
> >>>> to give an error.
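
For reference, a rough sketch of that error path (entire_folio_in_zswap()
is a made-up placeholder for whatever check we settle on, not an existing
helper):

        /* Hypothetical check; name and implementation TBD. */
        if (!entire_folio_in_zswap(folio)) {
                /*
                 * Returning true without folio_mark_uptodate() makes
                 * swap_read_folio() treat this as an IO error, so
                 * do_swap_page() will SIGBUS rather than read stale
                 * data from the disk backend.
                 */
                return true;
        }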
> >>>
> >>> We have swapcache_prepare() called in do_swap_page(), which should
> >>> have protected these entries from being partially freed by other processes
> >>> (for example, if someone falls back to small folios for the same address).
> >>> Therefore, I believe that zswap_present_test() cannot be false for mTHP in
> >>> the current case where only synchronous I/O is supported.
> >>>
> >>> the below might help detect the bug?
> >>>
> >>> if (!zswap_present_test(folio->swap, nr_pages)) {
> >>>         if (WARN_ON_ONCE(nr_pages > 1))
> >>>                 return true;
> >>>         return false;
> >>> }
> >>>
> >>
> >> I think this isn't correct. If nr_pages > 1 and the entire folio is not
> >> in zswap, it should still return false. So we would need to check the
> >> whole folio if we want to warn. But I think if we are sure the code is
> >> ok, it is an unnecessary check.
> >
> > my point is that zswap_present_test() can't differentiate between:
> > 1. the *whole* folio is not in zswap
> > 2. the folio is *partially* not in zswap
> >
> > in case 2, returning false is wrong.
> >
>
> Agreed!
>
> > And when nr_pages > 1, we have already confirmed earlier in
> > do_swap_page() that zswap_present_test() is true. At this point,
> > it must always be true; if it's false, it indicates a bug.
> >
>
> Yes, agreed! I was thinking from just the zswap_load() perspective,
> irrespective of who calls it.
> If someone adds large folio support to swapin_readahead(), then I think
> the above warn might be an issue.
>
> But just with this patch series, doing what you suggested is correct. I
> will add it in the next revision. We can deal with it once swapin with
> swap count > 1 starts supporting large folios.

I think I don't follow this part of the conversation properly, but it
seems like we want to catch the case where we end up in zswap_load()
and only part of the folio is in zswap. Can we use something like the
approach we used for swap_zeromap_batch()?
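
Something like the following, as an untested sketch: zswap_present_batch()
is a made-up name, modeled on swap_zeromap_batch(). It returns how many
consecutive entries are in zswap, so the caller can distinguish "none"
from "partial":

        /*
         * Sketch only: count how many of the @max_nr consecutive swap
         * entries starting at @entry have a zswap entry. 0 means none
         * are in zswap, @max_nr means the whole range is, and anything
         * in between means the folio is only partially in zswap.
         *
         * Assumes the range does not cross a zswap tree boundary,
         * which should hold since large folios are naturally aligned
         * and smaller than SWAP_ADDRESS_SPACE_SIZE.
         */
        static int zswap_present_batch(swp_entry_t entry, int max_nr)
        {
                struct xarray *tree = swap_zswap_tree(entry);
                pgoff_t offset = swp_offset(entry);
                int i;

                for (i = 0; i < max_nr; i++) {
                        if (!xa_load(tree, offset + i))
                                break;
                }
                return i;
        }

Then zswap_load() could open with:

        int nr_present = zswap_present_batch(folio->swap, nr_pages);

        if (!nr_present)
                return false;
        /* Partially in zswap: fail the read instead of loading garbage. */
        if (WARN_ON_ONCE(nr_present != nr_pages))
                return true;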