From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 5 Sep 2024 20:19:17 +0100
From: Usama Arif <usamaarif642@gmail.com>
Subject: Re: [PATCH v4 1/2] mm: store zero pages to be swapped out in a bitmap
To: Barry Song <21cnbao@gmail.com>
Cc: Yosry Ahmed, akpm@linux-foundation.org, chengming.zhou@linux.dev,
 david@redhat.com, hannes@cmpxchg.org, hughd@google.com, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, nphamcs@gmail.com,
 shakeel.butt@linux.dev, willy@infradead.org, ying.huang@intel.com,
 hanchuanhua@oppo.com
References: <20240612124750.2220726-2-usamaarif642@gmail.com>
 <20240904055522.2376-1-21cnbao@gmail.com>
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 05/09/2024 12:00, Barry Song wrote:
> On Thu, Sep 5, 2024 at 10:53 PM Usama Arif wrote:
>>
>> On 05/09/2024 11:33, Barry Song wrote:
>>> On Thu, Sep 5, 2024 at 10:10 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> On Thu, Sep 5, 2024 at 8:49 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>>
>>>>> On Thu, Sep 5, 2024 at 7:55 PM Yosry Ahmed wrote:
>>>>>>
>>>>>> On Thu, Sep 5, 2024 at 12:03 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>>>>
>>>>>>> On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed wrote:
>>>>>>>>
>>>>>>>> [..]
>>>>>>>>>> I understand the point of doing this to unblock the synchronous large
>>>>>>>>>> folio swapin support work, but at some point we're gonna have to
>>>>>>>>>> actually handle the cases where a large folio being swapped in is
>>>>>>>>>> partially in the swap cache, zswap, the zeromap, etc.
>>>>>>>>>>
>>>>>>>>>> All these cases will need similar-ish handling, and I suspect we won't
>>>>>>>>>> just skip swapping in large folios in all these cases.
>>>>>>>>>
>>>>>>>>> I agree that this is definitely the goal. `swap_read_folio()` should be a
>>>>>>>>> dependable API that always returns reliable data, regardless of whether
>>>>>>>>> `zeromap` or `zswap` is involved. Despite these issues, mTHP swap-in shouldn't
>>>>>>>>> be held back. Significant efforts are underway to support large folios in
>>>>>>>>> `zswap`, and progress is being made. Not to mention we've already allowed
>>>>>>>>> `zeromap` to proceed, even though it doesn't support large folios.
>>>>>>>>>
>>>>>>>>> It's genuinely unfair to let the lack of mTHP support in `zeromap` and
>>>>>>>>> `zswap` hold swap-in hostage.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Yosry,
>>>>>>>
>>>>>>>> Well, two points here:
>>>>>>>>
>>>>>>>> 1. I did not say that we should block the synchronous mTHP swapin work
>>>>>>>> for this :) I said the next item on the TODO list for mTHP swapin
>>>>>>>> support should be handling these cases.
>>>>>>>
>>>>>>> Thanks for your clarification!
>>>>>>>
>>>>>>>> 2. I think two things are getting conflated here. Zswap needs to
>>>>>>>> support mTHP swapin. Zeromap already supports mTHPs AFAICT. What is
>>>>>>>> truly missing, and is outside the scope of zswap/zeromap, is being able
>>>>>>>> to support hybrid mTHP swapin.
>>>>>>>>
>>>>>>>> When swapping in an mTHP, the swapped entries can be on disk, in the
>>>>>>>> swapcache, in zswap, or in the zeromap. Even if all these things
>>>>>>>> support mTHPs individually, we essentially need support to form an
>>>>>>>> mTHP from swap entries in different backends. That's what I meant.
>>>>>>>> Actually, if we have that, we may not really need mTHP swapin support
>>>>>>>> in zswap, because we can just form the large folio in the swap layer
>>>>>>>> from multiple zswap entries.
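
(To make the "hybrid" case above concrete, a minimal standalone sketch
follows; it is illustration only, not code from this series. The BACKEND_*
labels and swapin_is_hybrid() are hypothetical: in the kernel the per-entry
backend would come from the zeromap bitmap, the zswap tree and swapcache
lookups, not from an array.)

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical labels for where a swapped-out entry currently lives. */
enum backend { BACKEND_DISK, BACKEND_ZSWAP, BACKEND_ZEROMAP, BACKEND_SWAPCACHE };

/*
 * A swapin of nr contiguous entries is "hybrid" when the entries do not
 * all live in the same backend; forming one large folio would then need
 * reads from several backends plus a merge step.
 */
static bool swapin_is_hybrid(const enum backend *entries, unsigned int nr)
{
        unsigned int i;

        for (i = 1; i < nr; i++)
                if (entries[i] != entries[0])
                        return true;
        return false;
}

int main(void)
{
        /* entry 2 is zero-filled, the rest are in zswap: a hybrid range */
        enum backend run[4] = { BACKEND_ZSWAP, BACKEND_ZSWAP,
                                BACKEND_ZEROMAP, BACKEND_ZSWAP };

        printf("hybrid: %d\n", swapin_is_hybrid(run, 4)); /* prints 1 */
        return 0;
}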
>>>>>>>>
>>>>>>> After further consideration, I've actually started to disagree with the idea
>>>>>>> of supporting hybrid swapin (forming an mTHP from swap entries in different
>>>>>>> backends). My reasoning is as follows:
>>>>>>
>>>>>> I do not have any data about this, so you could very well be right
>>>>>> here. Handling hybrid swapin could be simply falling back to the
>>>>>> smallest order we can swapin from a single backend. We can at least
>>>>>> start with this, and collect data about how many mTHP swapins fall back
>>>>>> due to hybrid backends. This way we only take on the complexity if
>>>>>> needed.
>>>>>>
>>>>>> I did imagine though that it's possible for two virtually contiguous
>>>>>> folios to be swapped out to contiguous swap entries and end up in
>>>>>> different media (e.g. if only one of them is zero-filled). I am not
>>>>>> sure how rare it would be in practice.
>>>>>>
>>>>>>>
>>>>>>> 1. The scenario where an mTHP is partially zeromap, partially zswap, etc.
>>>>>>> would be an extremely rare case, as long as we're swapping out the mTHP as
>>>>>>> a whole and all the modules are handling it accordingly. It's highly
>>>>>>> unlikely to form this mix of zeromap, zswap, and swapcache unless the
>>>>>>> contiguous VMA virtual addresses happen to get some small folios with
>>>>>>> aligned and contiguous swap slots. Even then, they would need to be
>>>>>>> partially zeromap and partially non-zeromap, zswap, etc.
>>>>>>
>>>>>> As I mentioned, we can start simple and collect data for this. If it's
>>>>>> rare and we don't need to handle it, that's good.
>>>>>>
>>>>>>>
>>>>>>> As you mentioned, zeromap handles the mTHP as a whole during swap-out,
>>>>>>> marking all subpages of the entire mTHP as zeromap rather than just
>>>>>>> a subset of them.
>>>>>>>
>>>>>>> And swap-in can also entirely map a large folio found in the swapcache,
>>>>>>> based on our previous patchset which is already in mainline:
>>>>>>> "mm: swap: entirely map large folios found in swapcache"
>>>>>>> https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@gmail.com/
>>>>>>>
>>>>>>> It seems the only thing we're missing is zswap support for mTHP.
>>>>>>
>>>>>> It is still possible for two virtually contiguous folios to be swapped
>>>>>> out to contiguous swap entries. It is also possible that a large folio
>>>>>> is swapped out as a whole, then only a part of it is swapped in later
>>>>>> due to memory pressure. If that part is later reclaimed again and gets
>>>>>> added to the swapcache, we can run into the hybrid swapin situation.
>>>>>> There may be other scenarios as well, I did not think this through.
>>>>>>
>>>>>>>
>>>>>>> 2. Implementing hybrid swap-in would be extremely tricky and could disrupt
>>>>>>> several software layers. I can share some pseudo code below:
>>>>>>
>>>>>> Yeah, it definitely would be complex, so we need proper justification for it.
>>>>>>
>>>>>>>
>>>>>>> swap_read_folio()
>>>>>>> {
>>>>>>>         if (zeromap_full)
>>>>>>>                 folio_read_from_zeromap()
>>>>>>>         else if (zswap_map_full)
>>>>>>>                 folio_read_from_zswap()
>>>>>>>         else {
>>>>>>>                 folio_read_from_swapfile()
>>>>>>>                 if (zeromap_partial)
>>>>>>>                         folio_read_from_zeromap_fixup() /* fill zero for partially zeromap subpages */
>>>>>>>                 if (zswap_partial)
>>>>>>>                         folio_read_from_zswap_fixup() /* zswap_load for partially zswap-mapped subpages */
>>>>>>>
>>>>>>>                 folio_mark_uptodate()
>>>>>>>                 folio_unlock()
>>>>>>>         }
>>>>>>> }
>>>>>>>
>>>>>>> We'd also need to modify folio_read_from_swapfile() to skip
>>>>>>> folio_mark_uptodate() and folio_unlock() after completing the BIO.
>>>>>>> This approach seems to entirely disrupt the software layers.
>>>>>>>
>>>>>>> This could also lead to unnecessary IO operations for subpages that
>>>>>>> require fixup. Since such cases are quite rare, I believe the added
>>>>>>> complexity isn't worth it.
>>>>>>>
>>>>>>> My point is that we should simply check that all PTEs have consistent zeromap,
>>>>>>> zswap, and swapcache statuses before proceeding, otherwise fall back to the next
>>>>>>> lower order if needed. This approach improves performance and avoids complex
>>>>>>> corner cases.
>>>>>>
>>>>>> Agree that we should start with that, although we should probably
>>>>>> fall back to the largest order we can swapin from a single backend,
>>>>>> rather than the next lower order.
>>>>>>
>>>>>>>
>>>>>>> So once zswap mTHP is there, I would also expect an API similar to
>>>>>>> swap_zeromap_entries_check(), for example:
>>>>>>> zswap_entries_check(entry, nr), which can return whether we have
>>>>>>> full, none, or partial zswap, to replace the existing
>>>>>>> zswap_never_enabled().
>>>>>>
>>>>>> I think a better API would be similar to what Usama had. Basically
>>>>>> take in (entry, nr) and return how much of it is in zswap starting at
>>>>>> entry, so that we can decide the swapin order.
>>>>>>
>>>>>> Maybe we can adjust your proposed swap_zeromap_entries_check() as well
>>>>>> to do that? Basically return the number of swap entries in the zeromap
>>>>>> starting at 'entry'. If 'entry' itself is not in the zeromap we return
>>>>>> 0 naturally. That would be a small adjustment/fix over what Usama had,
>>>>>> but implementing it with bitmap operations like you did would be
>>>>>> better.
>>>>>
>>>>> I assume you mean the below:
>>>>>
>>>>> /*
>>>>>  * Return the number of contiguous zeromap entries starting from entry
>>>>>  */
>>>>> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry, int nr)
>>>>> {
>>>>>         struct swap_info_struct *sis = swp_swap_info(entry);
>>>>>         unsigned long start = swp_offset(entry);
>>>>>         unsigned long end = start + nr;
>>>>>         unsigned long idx;
>>>>>
>>>>>         idx = find_next_bit(sis->zeromap, end, start);
>>>>>         if (idx != start)
>>>>>                 return 0;
>>>>>
>>>>>         return find_next_zero_bit(sis->zeromap, end, start) - idx;
>>>>> }
>>>>>
>>>>> If yes, I really like this idea.
>>>>>
>>>>> It seems much better than using an enum, which would require adding a new
>>>>> data structure :-) Additionally, returning the number allows callers to
>>>>> fall back to the largest possible order, rather than trying next lower
>>>>> orders sequentially.
>>>>
>>>> No, returning 0 after only checking the first entry would still reintroduce
>>>> the current bug, where the start entry is zeromap but other entries might not be.
>>>> We need another value to indicate whether the entries are consistent
>>>> if we want to avoid the enum:
>>>>
>>>> /*
>>>>  * Return the number of contiguous zeromap entries starting from entry;
>>>>  * if all entries have a consistent zeromap status, *consistent will be
>>>>  * true; otherwise, false.
>>>>  */
>>>> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry,
>>>>                 int nr, bool *consistent)
>>>> {
>>>>         struct swap_info_struct *sis = swp_swap_info(entry);
>>>>         unsigned long start = swp_offset(entry);
>>>>         unsigned long end = start + nr;
>>>>         unsigned long s_idx, c_idx;
>>>>
>>>>         s_idx = find_next_bit(sis->zeromap, end, start);
>>>>         if (s_idx == end) {
>>>>                 *consistent = true;
>>>>                 return 0;
>>>>         }
>>>>
>>>>         c_idx = find_next_zero_bit(sis->zeromap, end, start);
>>>>         if (c_idx == end) {
>>>>                 *consistent = true;
>>>>                 return nr;
>>>>         }
>>>>
>>>>         *consistent = false;
>>>>         if (s_idx != start)
>>>>                 return 0;
>>>>         return c_idx - s_idx;
>>>> }
>>>>
>>>> I can actually swap the places of the "consistent" flag and the returned
>>>> number if that looks better.
>>>
>>> I'd rather make it simpler by:
>>>
>>> /*
>>>  * Check if all entries have a consistent zeromap status: return true if
>>>  * all entries are zeromap or all are non-zeromap, else return false.
>>>  */
>>> static inline bool swap_zeromap_entries_check(swp_entry_t entry, int nr)
>>> {
>>>         struct swap_info_struct *sis = swp_swap_info(entry);
>>>         unsigned long start = swp_offset(entry);
>>>         unsigned long end = start + *nr;
>>>
>> I guess you meant end = start + nr here?
>
> right.
>
>>
>>>         if (find_next_bit(sis->zeromap, end, start) == end)
>>>                 return true;
>>>         if (find_next_zero_bit(sis->zeromap, end, start) == end)
>>>                 return true;
>>>
>> So if the zeromap is all false, this still returns true. We can't use this
>> function in swap_read_folio_zeromap, to check at the time of swapin if all
>> were zeros, right?
>
> We can, my point is that swap_read_folio_zeromap() is the only function
> that actually needs the real value of the zeromap; the others only care
> about consistency. So if we can avoid introducing a new enum across
> modules, we avoid it :-)
>
> static bool swap_read_folio_zeromap(struct folio *folio)
> {
>         struct swap_info_struct *sis = swp_swap_info(folio->swap);
>         unsigned int nr_pages = folio_nr_pages(folio);
>         swp_entry_t entry = folio->swap;
>
>         /*
>          * Swapping in a large folio that is partially in the zeromap is not
>          * currently handled. Return true without marking the folio uptodate so
>          * that an IO error is emitted (e.g. do_swap_page() will sigbus).
>          */
>         if (WARN_ON_ONCE(!swap_zeromap_entries_check(entry, nr_pages)))
>                 return true;
>
>         if (!test_bit(swp_offset(entry), sis->zeromap))
>                 return false;
>

LGTM with this swap_read_folio_zeromap. Thanks!

>         folio_zero_range(folio, 0, folio_size(folio));
>         folio_mark_uptodate(folio);
>         return true;
> }
>
> mm/memory.c only needs true or false, it doesn't care about the real value.
>
>>
>>>
>>>         return false;
>>> }
>>>
>>> mm/page_io.c can combine this with reading the zeromap of the first entry
>>> to decide if it will read the folio from the zeromap; mm/memory.c only
>>> needs the bool to fall back to the largest possible order.
>>>
>>> static inline unsigned long thp_swap_suitable_orders(...)
>>> {
>>>         int order, nr;
>>>
>>>         order = highest_order(orders);
>>>
>>>         while (orders) {
>>>                 nr = 1 << order;
>>>                 if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr &&
>>>                     swap_zeromap_entries_check(entry, nr))
>>>                         break;
>>>                 order = next_order(&orders, order);
>>>         }
>>>
>>>         return orders;
>>> }
>>>
>>>>
>>>>>
>>>>> Hi Usama,
>>>>> what is your take on this?
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Though I am not sure how cheaply zswap can implement it,
>>>>>>> swap_zeromap_entries_check() could be two simple bit operations:
>>>>>>>
>>>>>>> +static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t entry, int nr)
>>>>>>> +{
>>>>>>> +        struct swap_info_struct *sis = swp_swap_info(entry);
>>>>>>> +        unsigned long start = swp_offset(entry);
>>>>>>> +        unsigned long end = start + nr;
>>>>>>> +
>>>>>>> +        if (find_next_bit(sis->zeromap, end, start) == end)
>>>>>>> +                return SWAP_ZEROMAP_NON;
>>>>>>> +        if (find_next_zero_bit(sis->zeromap, end, start) == end)
>>>>>>> +                return SWAP_ZEROMAP_FULL;
>>>>>>> +
>>>>>>> +        return SWAP_ZEROMAP_PARTIAL;
>>>>>>> +}
>>>>>>>
>>>>>>> 3. swapcache is different from zeromap and zswap. Swapcache indicates
>>>>>>> that the memory is still available and should be re-mapped rather than
>>>>>>> allocating a new folio. Our previous patchset has implemented a full
>>>>>>> re-map of an mTHP in do_swap_page(), as mentioned in 1.
>>>>>>>
>>>>>>> For the same reason as point 1, partial swapcache is a rare edge case.
>>>>>>> Not re-mapping it and instead allocating a new folio would add
>>>>>>> significant complexity.
>>>>>>>
>>>>>>>>>
>>>>>>>>> Nonetheless, `zeromap` and `zswap` are distinct cases. With `zeromap`, we
>>>>>>>>> permit almost all mTHP swap-ins, except for those rare situations where
>>>>>>>>> small folios that were swapped out happen to have contiguous and aligned
>>>>>>>>> swap slots.
>>>>>>>>>
>>>>>>>>> swapcache is another quite different story; since our user scenarios begin
>>>>>>>>> from the simplest sync IO on mobile phones, we don't much care about
>>>>>>>>> swapcache.
>>>>>>>>
>>>>>>>> Right. The reason I bring this up is, as I mentioned above, there is a
>>>>>>>> common problem of forming large folios from different sources, which
>>>>>>>> includes the swap cache. The fact that synchronous swapin does not use
>>>>>>>> the swapcache was a happy coincidence for you, as you can add mTHP
>>>>>>>> swapin support without handling this case yet ;)
>>>>>>>
>>>>>>> As I mentioned above, I'd really rather filter out those corner cases
>>>>>>> than support them, not just for the current situation, to unlock the
>>>>>>> swap-in series :-)
>>>>>>
>>>>>> If they are indeed corner cases, then I definitely agree.
>>>>>
>>>
>
> Thanks
> Barry
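
For reference, below is a self-contained userspace sketch of the consistency
check converged on above: a large swapin of nr entries is only allowed when
the zeromap bits covering the range are uniform (all set or all clear). The
find_next_bit()/find_next_zero_bit() helpers here are simplified stand-ins
for the kernel's bitmap primitives, and the sample bitmap is made up for
illustration:

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for the kernel's find_next_bit(); assumes 64-bit longs. */
static unsigned long find_next_bit(const unsigned long *map,
                                   unsigned long end, unsigned long start)
{
        unsigned long i;

        for (i = start; i < end; i++)
                if (map[i / 64] & (1UL << (i % 64)))
                        return i;
        return end;
}

/* Simplified stand-in for the kernel's find_next_zero_bit(). */
static unsigned long find_next_zero_bit(const unsigned long *map,
                                        unsigned long end, unsigned long start)
{
        unsigned long i;

        for (i = start; i < end; i++)
                if (!(map[i / 64] & (1UL << (i % 64))))
                        return i;
        return end;
}

/* All-set or all-clear over [start, start + nr) => consistent. */
static bool zeromap_entries_consistent(const unsigned long *zeromap,
                                       unsigned long start, unsigned long nr)
{
        unsigned long end = start + nr;

        if (find_next_bit(zeromap, end, start) == end)
                return true;    /* no entry in the range is zero-filled */
        if (find_next_zero_bit(zeromap, end, start) == end)
                return true;    /* every entry in the range is zero-filled */
        return false;           /* partial: fall back to a smaller order */
}

int main(void)
{
        unsigned long zeromap[1] = { 0xFUL };   /* entries 0-3 are zero-filled */

        printf("0..3: %d\n", zeromap_entries_consistent(zeromap, 0, 4)); /* 1 */
        printf("0..7: %d\n", zeromap_entries_consistent(zeromap, 0, 8)); /* 0 */
        printf("4..7: %d\n", zeromap_entries_consistent(zeromap, 4, 4)); /* 1 */
        return 0;
}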