From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20240612124750.2220726-2-usamaarif642@gmail.com>
 <20240904055522.2376-1-21cnbao@gmail.com> <CAJD7tkYNn51b3wQbNnJoy8TMVA+r+ookuZzNEEpYWwKiZPVRdg@mail.gmail.com>
 <CAGsJ_4w2k=704mgtQu97y5Qpidc-x+ZBmBXCytkzdcasfAaG0w@mail.gmail.com>
 <CAJD7tkYqk_raVy07cw9cz=RWo=6BpJc0Ax84MkXLRqCjYvYpeA@mail.gmail.com>
 <CAGsJ_4w4woc6st+nPqH7PnhczhQZ7j90wupgX28UrajobqHLnw@mail.gmail.com>
 <CAJD7tkY+wXUwmgZUfVqSXpXL_CxRO-4eKGCPunfJaTDGhNO=Kw@mail.gmail.com>
 <CAGsJ_4zP_tA4z-n=3MTPorNnmANdSJTg4jSx0-atHS1vdd2jmg@mail.gmail.com> <CAJD7tkZ7ZhGz5J5O=PEkoyN9WeSjXOLMqnASFc8T3Vpv5uiSRQ@mail.gmail.com>
In-Reply-To: <CAJD7tkZ7ZhGz5J5O=PEkoyN9WeSjXOLMqnASFc8T3Vpv5uiSRQ@mail.gmail.com>
From: Barry Song <21cnbao@gmail.com>
Date: Thu, 5 Sep 2024 20:49:31 +1200
Message-ID: <CAGsJ_4x0y+RtghmFifm_pR-=P_t5hNW5qjvw-oF+-T_amuVuzQ@mail.gmail.com>
Subject: Re: [PATCH v4 1/2] mm: store zero pages to be swapped out in a bitmap
To: Yosry Ahmed <yosryahmed@google.com>
Cc: usamaarif642@gmail.com, akpm@linux-foundation.org, 
	chengming.zhou@linux.dev, david@redhat.com, hannes@cmpxchg.org, 
	hughd@google.com, kernel-team@meta.com, linux-kernel@vger.kernel.org, 
	linux-mm@kvack.org, nphamcs@gmail.com, shakeel.butt@linux.dev, 
	willy@infradead.org, ying.huang@intel.com, hanchuanhua@oppo.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, Sep 5, 2024 at 7:55 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Sep 5, 2024 at 12:03 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > [..]
> > > > > > I understand the point of doing this to unblock the synchronous large
> > > > > > folio swapin support work, but at some point we're gonna have to
> > > > > > actually handle the cases where a large folio being swapped in is
> > > > > > partially in the swap cache, zswap, the zeromap, etc.
> > > > > >
> > > > > > All these cases will need similar-ish handling, and I suspect we won't
> > > > > > just skip swapping in large folios in all these cases.
> > > > >
> > > > > I agree that this is definitely the goal. `swap_read_folio()` should be a
> > > > > dependable API that always returns reliable data, regardless of whether
> > > > > `zeromap` or `zswap` is involved. Despite these issues, mTHP swap-in shouldn't
> > > > > be held back. Significant efforts are underway to support large folios in
> > > > > `zswap`, and progress is being made. Not to mention we've already allowed
> > > > > `zeromap` to proceed, even though it doesn't support large folios.
> > > > >
> > > > > It's genuinely unfair to let the lack of mTHP support in `zeromap` and
> > > > > `zswap` hold swap-in hostage.
> > >
> >
> > Hi Yosry,
> >
> > > Well, two points here:
> > >
> > > 1. I did not say that we should block the synchronous mTHP swapin work
> > > for this :) I said the next item on the TODO list for mTHP swapin
> > > support should be handling these cases.
> >
> > Thanks for your clarification!
> >
> > >
> > > 2. I think two things are getting conflated here. Zswap needs to
> > > support mTHP swapin*. Zeromap already supports mTHPs AFAICT. What is
> > > truly missing, and is outside the scope of zswap/zeromap, is being able to
> > > support hybrid mTHP swapin.
> > >
> > > When swapping in an mTHP, the swapped entries can be on disk, in the
> > > swapcache, in zswap, or in the zeromap. Even if all these things
> > > support mTHPs individually, we essentially need support to form an
> > > mTHP from swap entries in different backends. That's what I meant.
> > > Actually if we have that, we may not really need mTHP swapin support
> > > in zswap, because we can just form the large folio in the swap layer
> > > from multiple zswap entries.
> > >
> >
> > After further consideration, I've actually started to disagree with the idea
> > of supporting hybrid swapin (forming an mTHP from swap entries in different
> > backends). My reasoning is as follows:
>
> I do not have any data about this, so you could very well be right
> here. Handling hybrid swapin could be simply falling back to the
> smallest order we can swapin from a single backend. We can at least
> start with this, and collect data about how many mTHP swapins fall back
> due to hybrid backends. This way we only take the complexity if
> needed.
>
> I did imagine though that it's possible for two virtually contiguous
> folios to be swapped out to contiguous swap entries and end up in
> different media (e.g. if only one of them is zero-filled). I am not
> sure how rare it would be in practice.
>
> >
> > 1. The scenario where an mTHP is partially zeromap, partially zswap, etc.,
> > would be an extremely rare case, as long as we're swapping out the mTHP as
> > a whole and all the modules are handling it accordingly. It's highly
> > unlikely to form this mix of zeromap, zswap, and swapcache unless the
> > contiguous VMA virtual address happens to get some small folios with
> > aligned and contiguous swap slots. Even then, they would need to be
> > partially zeromap and partially non-zeromap, zswap, etc.
>
> As I mentioned, we can start simple and collect data for this. If it's
> rare and we don't need to handle it, that's good.
>
> >
> > As you mentioned, zeromap handles mTHP as a whole during swapping
> > out, marking all subpages of the entire mTHP as zeromap rather than just
> > a subset of them.
> >
> > And swap-in can also entirely map a swapcache which is a large folio based
> > on our previous patchset which has been in mainline:
> > "mm: swap: entirely map large folios found in swapcache"
> > https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@gmail.com/
> >
> > It seems the only thing we're missing is zswap support for mTHP.
>
> It is still possible for two virtually contiguous folios to be swapped
> out to contiguous swap entries. It is also possible that a large folio
> is swapped out as a whole, then only a part of it is swapped in later
> due to memory pressure. If that part is later reclaimed again and gets
> added to the swapcache, we can run into the hybrid swapin situation.
> There may be other scenarios as well, I did not think this through.
>
> >
> > 2. Implementing hybrid swap-in would be extremely tricky and could disrupt
> > several software layers. I can share some pseudo code below:
>
> Yeah it definitely would be complex, so we need proper justification for it.
>
> >
> > swap_read_folio()
> > {
> >        if (zeromap_full)
> >                folio_read_from_zeromap()
> >        else if (zswap_map_full)
> >                folio_read_from_zswap()
> >        else {
> >                folio_read_from_swapfile()
> >                if (zeromap_partial)
> >                        /* fill zero for partially zeromap subpages */
> >                        folio_read_from_zeromap_fixup()
> >                if (zswap_partial)
> >                        /* zswap_load for partially zswap-mapped subpages */
> >                        folio_read_from_zswap_fixup()
> >
> >                folio_mark_uptodate()
> >                folio_unlock()
> >        }
> > }
> >
> > We'd also need to modify folio_read_from_swapfile() to skip
> > folio_mark_uptodate() and folio_unlock() after completing the BIO.
> > This approach seems to entirely disrupt the software layers.
> >
> > This could also lead to unnecessary IO operations for subpages that
> > require fixup. Since such cases are quite rare, I believe the added
> > complexity isn't worth it.
> >
> > My point is that we should simply check that all PTEs have consistent
> > zeromap, zswap, and swapcache statuses before proceeding, otherwise fall
> > back to the next lower order if needed. This approach improves performance
> > and avoids complex corner cases.
>
> Agree that we should start with that, although we should probably
> fall back to the largest order we can swapin from a single backend,
> rather than the next lower order.
>
> >
> > So once zswap mTHP is there, I would also expect an API similar to
> > swap_zeromap_entries_check(), for example zswap_entries_check(entry, nr),
> > which can return whether we have full, no, or partial zswap, to replace
> > the existing zswap_never_enabled().
>
> I think a better API would be similar to what Usama had. Basically
> take in (entry, nr) and return how much of it is in zswap starting at
> entry, so that we can decide the swapin order.
>
> Maybe we can adjust your proposed swap_zeromap_entries_check() as well
> to do that? Basically return the number of swap entries in the zeromap
> starting at 'entry'. If 'entry' itself is not in the zeromap we return
> 0 naturally. That would be a small adjustment/fix over what Usama had,
> but implementing it with bitmap operations like you did would be
> better.

I assume you mean the following:

/*
 * Return the number of contiguous zeromap entries starting from entry.
 */
static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry,
                                                      int nr)
{
        struct swap_info_struct *sis = swp_swap_info(entry);
        unsigned long start = swp_offset(entry);
        unsigned long end = start + nr;
        unsigned long idx;

        /* If the first entry is not in the zeromap, the run is empty. */
        idx = find_next_bit(sis->zeromap, end, start);
        if (idx != start)
                return 0;

        /* The run ends at the first clear bit, or at end if there is none. */
        return find_next_zero_bit(sis->zeromap, end, start) - idx;
}

If yes, I really like this idea.

It seems much better than using an enum, which would require adding a new
data structure :-) Additionally, returning the number allows callers to
fall back to the largest possible order, rather than trying next lower
orders sequentially.
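
To make the fallback idea concrete, here is a rough user-space sketch
(not kernel code) of how a caller might turn the returned count into a
swap-in order. The bool array, NR_SLOTS, zeromap_run() and
largest_order() are made-up names for illustration only, and the sketch
ignores the swap-slot alignment a real caller would have to respect.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-device zeromap bitmap. */
#define NR_SLOTS 64
static bool zeromap[NR_SLOTS];

/* Count contiguous zeromap entries in [start, start + nr). */
static unsigned int zeromap_run(unsigned long start, unsigned int nr)
{
        unsigned int n = 0;

        while (n < nr && start + n < NR_SLOTS && zeromap[start + n])
                n++;
        return n;
}

/*
 * Largest order such that (1 << order) entries fit in a run of
 * 'count' entries. Returns -1 when count is 0.
 */
static int largest_order(unsigned int count)
{
        int order = -1;

        while (count) {
                order++;
                count >>= 1;
        }
        return order;
}
```

With slots 16..20 marked as zero-filled, zeromap_run(16, 16) gives 5 and
largest_order(5) gives 2, so the caller could go straight to an order-2
swap-in instead of walking down from order 4 one step at a time.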

Hi Usama,
what is your take on this?

>
> >
> > Though I am not sure how cheaply zswap can implement it,
> > swap_zeromap_entries_check() could be two simple bit operations:
> >
> > +static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t entry,
> > +                                                        int nr)
> > +{
> > +       struct swap_info_struct *sis = swp_swap_info(entry);
> > +       unsigned long start = swp_offset(entry);
> > +       unsigned long end = start + nr;
> > +
> > +       if (find_next_bit(sis->zeromap, end, start) == end)
> > +               return SWAP_ZEROMAP_NON;
> > +       if (find_next_zero_bit(sis->zeromap, end, start) == end)
> > +               return SWAP_ZEROMAP_FULL;
> > +
> > +       return SWAP_ZEROMAP_PARTIAL;
> > +}
> >
> > 3. swapcache is different from zeromap and zswap. Swapcache indicates
> > that the memory is still available and should be re-mapped rather than
> > allocating a new folio. Our previous patchset has implemented a full
> > re-map of an mTHP in do_swap_page() as mentioned in 1.
> >
> > For the same reason as point 1, partial swapcache is a rare edge case.
> > Not re-mapping it and instead allocating a new folio would add
> > significant complexity.
> >
> > > >
> > > > Nonetheless, `zeromap` and `zswap` are distinct cases. With `zeromap`, we
> > > > permit almost all mTHP swap-ins, except for those rare situations where
> > > > small folios that were swapped out happen to have contiguous and aligned
> > > > swap slots.
> > > >
> > > > swapcache is another quite different story, since our user scenarios begin from
> > > > the simplest sync io on mobile phones, we don't quite care about swapcache.
> > >
> > > Right. The reason I bring this up is as I mentioned above, there is a
> > > common problem of forming large folios from different sources, which
> > > includes the swap cache. The fact that synchronous swapin does not use
> > > the swapcache was a happy coincidence for you, as you can add support
> > > for mTHP swapins without handling this case yet ;)
> >
> > As I mentioned above, I'd really rather filter out those corner cases
> > than support them, not just for the current situation to unlock the
> > swap-in series :-)
>
> If they are indeed corner cases, then I definitely agree.

Thanks
Barry