From: Barry Song <21cnbao@gmail.com>
Date: Thu, 5 Sep 2024 23:00:39 +1200
Subject: Re: [PATCH v4 1/2] mm: store zero pages to be swapped out in a bitmap
To: Usama Arif
Cc: Yosry Ahmed, akpm@linux-foundation.org, chengming.zhou@linux.dev,
	david@redhat.com, hannes@cmpxchg.org, hughd@google.com,
	kernel-team@meta.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	nphamcs@gmail.com, shakeel.butt@linux.dev, willy@infradead.org,
	ying.huang@intel.com, hanchuanhua@oppo.com
References: <20240612124750.2220726-2-usamaarif642@gmail.com>
	<20240904055522.2376-1-21cnbao@gmail.com>

On Thu, Sep 5, 2024 at 10:53 PM Usama Arif wrote:
>
>
>
> On 05/09/2024 11:33, Barry Song wrote:
> > On Thu, Sep 5, 2024 at 10:10 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Thu, Sep 5, 2024 at 8:49 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>
> >>> On Thu, Sep 5, 2024 at 7:55 PM Yosry Ahmed wrote:
> >>>>
> >>>> On Thu, Sep 5, 2024 at 12:03 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>>>
> >>>>> On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed wrote:
> >>>>>>
> >>>>>> [..]
> >>>>>>>> I understand the point of doing this to unblock the synchronous large
> >>>>>>>> folio swapin support work, but at some point we're gonna have to
> >>>>>>>> actually handle the cases where a large folio being swapped in is
> >>>>>>>> partially in the swap cache, zswap, the zeromap, etc.
> >>>>>>>>
> >>>>>>>> All these cases will need similar-ish handling, and I suspect we won't
> >>>>>>>> just skip swapping in large folios in all these cases.
> >>>>>>>
> >>>>>>> I agree that this is definitely the goal. `swap_read_folio()` should be a
> >>>>>>> dependable API that always returns reliable data, regardless of whether
> >>>>>>> `zeromap` or `zswap` is involved. Despite these issues, mTHP swap-in shouldn't
> >>>>>>> be held back. Significant efforts are underway to support large folios in
> >>>>>>> `zswap`, and progress is being made. Not to mention we've already allowed
> >>>>>>> `zeromap` to proceed, even though it doesn't support large folios.
> >>>>>>>
> >>>>>>> It's genuinely unfair to let the lack of mTHP support in `zeromap` and
> >>>>>>> `zswap` hold swap-in hostage.
> >>>>>>
> >>>>>
> >>>>> Hi Yosry,
> >>>>>
> >>>>>> Well, two points here:
> >>>>>>
> >>>>>> 1. I did not say that we should block the synchronous mTHP swapin work
> >>>>>> for this :) I said the next item on the TODO list for mTHP swapin
> >>>>>> support should be handling these cases.
> >>>>>
> >>>>> Thanks for your clarification!
> >>>>>
> >>>>>>
> >>>>>> 2. I think two things are getting conflated here. Zswap needs to
> >>>>>> support mTHP swapin*. Zeromap already supports mTHPs AFAICT. What is
> >>>>>> truly missing, and is outside the scope of zswap/zeromap, is being able
> >>>>>> to support hybrid mTHP swapin.
> >>>>>>
> >>>>>> When swapping in an mTHP, the swapped entries can be on disk, in the
> >>>>>> swapcache, in zswap, or in the zeromap. Even if all these things
> >>>>>> support mTHPs individually, we essentially need support to form an
> >>>>>> mTHP from swap entries in different backends. That's what I meant.
> >>>>>> Actually if we have that, we may not really need mTHP swapin support
> >>>>>> in zswap, because we can just form the large folio in the swap layer
> >>>>>> from multiple zswap entries.
> >>>>>>
> >>>>>
> >>>>> After further consideration, I've actually started to disagree with the idea
> >>>>> of supporting hybrid swapin (forming an mTHP from swap entries in different
> >>>>> backends). My reasoning is as follows:
> >>>>
> >>>> I do not have any data about this, so you could very well be right
> >>>> here. Handling hybrid swapin could be simply falling back to the
> >>>> smallest order we can swapin from a single backend. We can at least
> >>>> start with this, and collect data about how many mTHP swapins fall back
> >>>> due to hybrid backends. This way we only take the complexity if
> >>>> needed.
> >>>>
> >>>> I did imagine though that it's possible for two virtually contiguous
> >>>> folios to be swapped out to contiguous swap entries and end up in
> >>>> different media (e.g. if only one of them is zero-filled). I am not
> >>>> sure how rare it would be in practice.
> >>>>
> >>>>>
> >>>>> 1. The scenario where an mTHP is partially zeromap, partially zswap, etc.,
> >>>>> would be an extremely rare case, as long as we're swapping out the mTHP as
> >>>>> a whole and all the modules are handling it accordingly. It's highly
> >>>>> unlikely to form this mix of zeromap, zswap, and swapcache unless
> >>>>> contiguous virtual addresses in a VMA happen to get small folios with
> >>>>> aligned and contiguous swap slots. Even then, they would need to be
> >>>>> partially zeromap and partially non-zeromap, zswap, etc.
> >>>>
> >>>> As I mentioned, we can start simple and collect data for this. If it's
> >>>> rare and we don't need to handle it, that's good.
> >>>>
> >>>>>
> >>>>> As you mentioned, zeromap handles mTHP as a whole during swap-out,
> >>>>> marking all subpages of the entire mTHP as zeromap rather than just
> >>>>> a subset of them.
> >>>>>
> >>>>> And swap-in can also entirely map a large folio found in the swapcache,
> >>>>> based on our previous patchset, which is in mainline:
> >>>>> "mm: swap: entirely map large folios found in swapcache"
> >>>>> https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@gmail.com/
> >>>>>
> >>>>> It seems the only thing we're missing is zswap support for mTHP.
> >>>>
> >>>> It is still possible for two virtually contiguous folios to be swapped
> >>>> out to contiguous swap entries. It is also possible that a large folio
> >>>> is swapped out as a whole, then only a part of it is swapped in later
> >>>> due to memory pressure. If that part is later reclaimed again and gets
> >>>> added to the swapcache, we can run into the hybrid swapin situation.
> >>>> There may be other scenarios as well, I did not think this through.
> >>>>
> >>>>>
> >>>>> 2. Implementing hybrid swap-in would be extremely tricky and could disrupt
> >>>>> several software layers. I can share some pseudo code below:
> >>>>
> >>>> Yeah it definitely would be complex, so we need proper justification for it.
> >>>>
> >>>>>
> >>>>> swap_read_folio()
> >>>>> {
> >>>>>         if (zeromap_full)
> >>>>>                 folio_read_from_zeromap()
> >>>>>         else if (zswap_map_full)
> >>>>>                 folio_read_from_zswap()
> >>>>>         else {
> >>>>>                 folio_read_from_swapfile()
> >>>>>                 if (zeromap_partial)
> >>>>>                         folio_read_from_zeromap_fixup() /* fill zero
> >>>>>                                 for partially zeromap subpages */
> >>>>>                 if (zswap_partial)
> >>>>>                         folio_read_from_zswap_fixup() /* zswap_load
> >>>>>                                 for partially zswap-mapped subpages */
> >>>>>
> >>>>>                 folio_mark_uptodate()
> >>>>>                 folio_unlock()
> >>>>>         }
> >>>>> }
> >>>>>
> >>>>> We'd also need to modify folio_read_from_swapfile() to skip
> >>>>> folio_mark_uptodate() and folio_unlock() after completing the BIO.
> >>>>> This approach seems to entirely disrupt the software layers.
> >>>>>
> >>>>> This could also lead to unnecessary IO operations for subpages that
> >>>>> require fixup. Since such cases are quite rare, I believe the added
> >>>>> complexity isn't worth it.
> >>>>>
> >>>>> My point is that we should simply check that all PTEs have consistent zeromap,
> >>>>> zswap, and swapcache statuses before proceeding, otherwise fall back to the
> >>>>> next lower order if needed. This approach improves performance and avoids
> >>>>> complex corner cases.
> >>>>
> >>>> Agree that we should start with that, although we should probably
> >>>> fall back to the largest order we can swapin from a single backend,
> >>>> rather than the next lower order.
> >>>>
> >>>>>
> >>>>> So once zswap mTHP is there, I would also expect an API similar to
> >>>>> swap_zeromap_entries_check(), for example:
> >>>>> zswap_entries_check(entry, nr), which can return whether we have
> >>>>> full, no, or partial zswap, to replace the existing
> >>>>> zswap_never_enabled().
> >>>>
> >>>> I think a better API would be similar to what Usama had. Basically
> >>>> take in (entry, nr) and return how much of it is in zswap starting at
> >>>> entry, so that we can decide the swapin order.
> >>>>
> >>>> Maybe we can adjust your proposed swap_zeromap_entries_check() as well
> >>>> to do that? Basically return the number of swap entries in the zeromap
> >>>> starting at 'entry'. If 'entry' itself is not in the zeromap we return
> >>>> 0 naturally. That would be a small adjustment/fix over what Usama had,
> >>>> but implementing it with bitmap operations like you did would be
> >>>> better.
> >>>
> >>> I assume you mean the below:
> >>>
> >>> /*
> >>>  * Return the number of contiguous zeromap entries starting at entry
> >>>  */
> >>> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry, int nr)
> >>> {
> >>>         struct swap_info_struct *sis = swp_swap_info(entry);
> >>>         unsigned long start = swp_offset(entry);
> >>>         unsigned long end = start + nr;
> >>>         unsigned long idx;
> >>>
> >>>         idx = find_next_bit(sis->zeromap, end, start);
> >>>         if (idx != start)
> >>>                 return 0;
> >>>
> >>>         return find_next_zero_bit(sis->zeromap, end, start) - idx;
> >>> }
> >>>
> >>> If yes, I really like this idea.
> >>>
> >>> It seems much better than using an enum, which would require adding a new
> >>> data structure :-) Additionally, returning the number allows callers to
> >>> fall back to the largest possible order, rather than trying the next
> >>> lower orders sequentially.
> >>
> >> No, returning 0 after only checking the first entry would still reintroduce
> >> the current bug, where the start entry is zeromap but other entries
> >> might not be.
> >> We need another value to indicate whether the entries are consistent
> >> if we want to avoid the enum:
> >>
> >> /*
> >>  * Return the number of contiguous zeromap entries starting at entry;
> >>  * if all entries have a consistent zeromap status, *consistent will be
> >>  * true, otherwise false.
> >>  */
> >> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry,
> >>                 int nr, bool *consistent)
> >> {
> >>         struct swap_info_struct *sis = swp_swap_info(entry);
> >>         unsigned long start = swp_offset(entry);
> >>         unsigned long end = start + nr;
> >>         unsigned long s_idx, c_idx;
> >>
> >>         s_idx = find_next_bit(sis->zeromap, end, start);
> >>         if (s_idx == end) {
> >>                 *consistent = true;
> >>                 return 0;
> >>         }
> >>
> >>         c_idx = find_next_zero_bit(sis->zeromap, end, start);
> >>         if (c_idx == end) {
> >>                 *consistent = true;
> >>                 return nr;
> >>         }
> >>
> >>         *consistent = false;
> >>         if (s_idx == start)
> >>                 return 0;
> >>         return c_idx - s_idx;
> >> }
> >>
> >> I can actually switch the places of the "consistent" flag and the
> >> returned number if that looks better.
> >
> > I'd rather make it simpler by:
> >
> > /*
> >  * Check if all entries have a consistent zeromap status: return true if
> >  * all entries are zeromap or all are non-zeromap, else return false.
> >  */
> > static inline bool swap_zeromap_entries_check(swp_entry_t entry, int nr)
> > {
> >         struct swap_info_struct *sis = swp_swap_info(entry);
> >         unsigned long start = swp_offset(entry);
> >         unsigned long end = start + *nr;
>
> I guess you meant end = start + nr here?

right.

> >         if (find_next_bit(sis->zeromap, end, start) == end)
> >                 return true;
> >         if (find_next_zero_bit(sis->zeromap, end, start) == end)
> >                 return true;
>
> So if the zeromap is all false, this still returns true. We can't use this
> function in swap_read_folio_zeromap to check at swapin time whether all
> entries were zeros, right?

We can; my point is that swap_read_folio_zeromap() is the only function
that actually needs the real value of the zeromap, while the others only
care about consistency. So if we can avoid introducing a new enum across
modules, we avoid it :-)

static bool swap_read_folio_zeromap(struct folio *folio)
{
        struct swap_info_struct *sis = swp_swap_info(folio->swap);
        unsigned int nr_pages = folio_nr_pages(folio);
        swp_entry_t entry = folio->swap;

        /*
         * Swapping in a large folio that is partially in the zeromap is not
         * currently handled. Return true without marking the folio uptodate so
         * that an IO error is emitted (e.g. do_swap_page() will sigbus).
         */
        if (WARN_ON_ONCE(!swap_zeromap_entries_check(entry, nr_pages)))
                return true;

        if (!test_bit(swp_offset(entry), sis->zeromap))
                return false;

        folio_zero_range(folio, 0, folio_size(folio));
        folio_mark_uptodate(folio);
        return true;
}

mm/memory.c only needs true or false; it doesn't care about the real value.

>
> >         return false;
> > }
> >
> > mm/page_io.c can combine this with reading the zeromap of the first entry
> > to decide if it will read the folio from the zeromap; mm/memory.c only
> > needs the bool to fall back to the largest possible order.
> >
> > static inline unsigned long thp_swap_suitable_orders(...)
> > {
> >         int order, nr;
> >
> >         order = highest_order(orders);
> >
> >         while (orders) {
> >                 nr = 1 << order;
> >                 if ((addr >> PAGE_SHIFT) % nr == swp_offset(entry) % nr &&
> >                     swap_zeromap_entries_check(entry, nr))
> >                         break;
> >                 order = next_order(&orders, order);
> >         }
> >
> >         return orders;
> > }
> >
> >>
> >>>
> >>> Hi Usama,
> >>> what is your take on this?
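
BTW, to make the bitmap semantics concrete for anyone skimming: the check
above relies only on find_next_bit()/find_next_zero_bit() returning 'end'
when no set (resp. clear) bit exists in [start, end). Below is a minimal
userspace model of the same logic, with naive byte-per-bit stand-ins for
the kernel helpers; all the toy_* names are illustrative only, not kernel
code:

#include <stdbool.h>
#include <stdio.h>

/* toy find_next_bit(): first set bit in [start, end), or end if none */
static unsigned long toy_find_next_bit(const unsigned char *map,
				       unsigned long end, unsigned long start)
{
	while (start < end && !map[start])
		start++;
	return start;
}

/* toy find_next_zero_bit(): first clear bit in [start, end), or end if none */
static unsigned long toy_find_next_zero_bit(const unsigned char *map,
					    unsigned long end, unsigned long start)
{
	while (start < end && map[start])
		start++;
	return start;
}

/* true if [start, start + nr) is entirely zeromap or entirely non-zeromap */
static bool toy_zeromap_entries_check(const unsigned char *map,
				      unsigned long start, unsigned long nr)
{
	unsigned long end = start + nr;

	if (toy_find_next_bit(map, end, start) == end)
		return true;	/* no bit set: consistently non-zeromap */
	if (toy_find_next_zero_bit(map, end, start) == end)
		return true;	/* no bit clear: consistently zeromap */
	return false;		/* mixed: caller falls back to a lower order */
}

int main(void)
{
	unsigned char none[4]  = { 0, 0, 0, 0 };
	unsigned char full[4]  = { 1, 1, 1, 1 };
	unsigned char mixed[4] = { 1, 0, 1, 1 };

	/* prints "1 1 0": only the mixed range is inconsistent */
	printf("%d %d %d\n",
	       toy_zeromap_entries_check(none, 0, 4),
	       toy_zeromap_entries_check(full, 0, 4),
	       toy_zeromap_entries_check(mixed, 0, 4));
	return 0;
}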
> >>>
> >>>>
> >>>>> Though I am not sure how cheaply zswap can implement it,
> >>>>> swap_zeromap_entries_check() could be two simple bit operations:
> >>>>>
> >>>>> +static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t
> >>>>> entry, int nr)
> >>>>> +{
> >>>>> +       struct swap_info_struct *sis = swp_swap_info(entry);
> >>>>> +       unsigned long start = swp_offset(entry);
> >>>>> +       unsigned long end = start + nr;
> >>>>> +
> >>>>> +       if (find_next_bit(sis->zeromap, end, start) == end)
> >>>>> +               return SWAP_ZEROMAP_NON;
> >>>>> +       if (find_next_zero_bit(sis->zeromap, end, start) == end)
> >>>>> +               return SWAP_ZEROMAP_FULL;
> >>>>> +
> >>>>> +       return SWAP_ZEROMAP_PARTIAL;
> >>>>> +}
> >>>>>
> >>>>> 3. swapcache is different from zeromap and zswap. Swapcache indicates
> >>>>> that the memory is still available and should be re-mapped rather than
> >>>>> allocating a new folio. Our previous patchset has implemented a full
> >>>>> re-map of an mTHP in do_swap_page(), as mentioned in 1.
> >>>>>
> >>>>> For the same reason as point 1, partial swapcache is a rare edge case.
> >>>>> Not re-mapping it and instead allocating a new folio would add
> >>>>> significant complexity.
> >>>>>
> >>>>>>>
> >>>>>>> Nonetheless, `zeromap` and `zswap` are distinct cases. With `zeromap`, we
> >>>>>>> permit almost all mTHP swap-ins, except for those rare situations where
> >>>>>>> small folios that were swapped out happen to have contiguous and aligned
> >>>>>>> swap slots.
> >>>>>>>
> >>>>>>> swapcache is another quite different story; since our user scenarios begin
> >>>>>>> from the simplest sync io on mobile phones, we don't quite care about
> >>>>>>> swapcache.
> >>>>>>
> >>>>>> Right. The reason I bring this up is, as I mentioned above, there is a
> >>>>>> common problem of forming large folios from different sources, which
> >>>>>> includes the swap cache. The fact that synchronous swapin does not use
> >>>>>> the swapcache was a happy coincidence for you, as you can add support for
> >>>>>> mTHP swapins without handling this case yet ;)
> >>>>>
> >>>>> As I mentioned above, I'd really rather filter out those corner cases
> >>>>> than support them, not just for the current situation of unlocking the
> >>>>> swap-in series :-)
> >>>>
> >>>> If they are indeed corner cases, then I definitely agree.
> >>>
>
Thanks
Barry
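
P.S. a quick userspace model of the order-fallback policy sketched above,
for anyone who wants to see the selection end to end. The byte-per-bit toy
bitmap and the toy_* helper names are hypothetical, and the alignment test
is a simplified stand-in for the (addr >> PAGE_SHIFT) % nr check in
thp_swap_suitable_orders():

#include <stdbool.h>
#include <stdio.h>

/* true if all entries in [start, start + nr) share one zeromap status */
static bool toy_consistent(const unsigned char *map,
			   unsigned long start, unsigned long nr)
{
	unsigned long i;

	for (i = start + 1; i < start + nr; i++)
		if (map[i] != map[start])
			return false;
	return true;
}

/*
 * Starting from the highest order, pick the largest naturally aligned
 * run of swap entries with a consistent zeromap status; fall back to
 * order 0 (a single page) if nothing larger qualifies.
 */
static int toy_pick_order(const unsigned char *map, unsigned long offset,
			  int max_order)
{
	int order;

	for (order = max_order; order > 0; order--) {
		unsigned long nr = 1UL << order;

		if (offset % nr == 0 && toy_consistent(map, offset, nr))
			return order;
	}
	return 0;
}

int main(void)
{
	/* entries 0-5 are zero-filled, 6-7 are not */
	unsigned char map[8] = { 1, 1, 1, 1, 1, 1, 0, 0 };

	printf("order at 0: %d\n", toy_pick_order(map, 0, 3)); /* 2 */
	printf("order at 4: %d\n", toy_pick_order(map, 4, 3)); /* 1 */
	return 0;
}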