From: Barry Song <21cnbao@gmail.com>
Date: Thu, 7 Mar 2024 10:29:36 +1300
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Chris Li
Cc: lsf-pc@lists.linux-foundation.org, linux-mm, ryan.roberts@arm.com,
 David Hildenbrand, Chuanhua Han
On Thu, Mar 7, 2024 at 6:56 AM Chris Li wrote:
>
> On Tue, Mar 5, 2024 at 10:05 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Mar 6, 2024 at 4:00 PM Chris Li wrote:
> > >
> > > On Tue, Mar 5, 2024 at 5:15 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > Another limitation I would like to address is that swap_writepage can
> > > > > only write out IO in one contiguous chunk, and is not able to perform
> > > > > non-contiguous IO. When the swapfile is close to full, it is likely
> > > > > that the unused entries will be spread across different locations. It
> > > > > would be nice to be able to read and write large folios using
> > > > > discontiguous disk IO locations.
> > > >
> > > > I don't think it will be too difficult for swap_writepage to write
> > > > out a large folio which has discontiguous swap offsets. Taking
> > > > zRAM as an example, as long as the bio is organized correctly,
> > > > zram should be able to write a large folio out one subpage at a
> > > > time for all its subpages.
> > >
> > > Yes.
> > >
> > > > static void zram_bio_write(struct zram *zram, struct bio *bio)
> > > > {
> > > >         unsigned long start_time = bio_start_io_acct(bio);
> > > >         struct bvec_iter iter = bio->bi_iter;
> > > >
> > > >         do {
> > > >                 u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
> > > >                 u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
> > > >                                 SECTOR_SHIFT;
> > > >                 struct bio_vec bv = bio_iter_iovec(bio, iter);
> > > >
> > > >                 bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset);
> > > >
> > > >                 if (zram_bvec_write(zram, &bv, index, offset, bio) < 0) {
> > > >                         atomic64_inc(&zram->stats.failed_writes);
> > > >                         bio->bi_status = BLK_STS_IOERR;
> > > >                         break;
> > > >                 }
> > > >
> > > >                 zram_slot_lock(zram, index);
> > > >                 zram_accessed(zram, index);
> > > >                 zram_slot_unlock(zram, index);
> > > >
> > > >                 bio_advance_iter_single(bio, &iter, bv.bv_len);
> > > >         } while (iter.bi_size);
> > > >
> > > >         bio_end_io_acct(bio, start_time);
> > > >         bio_endio(bio);
> > > > }
> > > >
> > > > Right now, add_to_swap() lacks a way to record a discontiguous
> > > > offset for each subpage; alternatively, we have folio->swap.
> > > >
> > > > I wonder if we can somehow make it page granularity: each
> > > > subpage could have its own offset, somewhat like a page->swap;
> > > > then in swap_writepage() we could build a bio with multiple
> > > > discontiguous I/O indices. We would then allow add_to_swap() to
> > > > get nr_pages different swap offsets and fill them into each subpage.
> > >
> > > The key is where to store the subpage offset.
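The add_to_swap() idea above can be modelled in user space. The sketch below is purely illustrative (swap_map[], alloc_discontig_slots() and the slot array are hypothetical names, not kernel code): a large folio is handed nr slots that need not be contiguous, which is exactly the per-subpage offset array being discussed.

```c
#include <stddef.h>

#define NR_SWAP_SLOTS 64

static unsigned char swap_map[NR_SWAP_SLOTS];	/* 0 = free, 1 = in use */

/*
 * Fill slots[0..nr-1] with free swap offsets, contiguous or not,
 * marking them used. Returns the number of slots actually allocated.
 */
static size_t alloc_discontig_slots(size_t nr, size_t *slots)
{
	size_t found = 0;

	for (size_t off = 0; off < NR_SWAP_SLOTS && found < nr; off++) {
		if (!swap_map[off]) {
			swap_map[off] = 1;
			slots[found++] = off;
		}
	}
	return found;
}
```

With slots 0 and 2 already taken, a 3-slot request would come back as offsets 1, 3 and 4 — a discontiguous batch the bio would then have to describe.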
> > > It can't be stored in
> > > the tail page's page->swap, because some tail pages' page structs are
> > > just mappings of the head page's page struct. I am afraid this mapping
> > > relationship has to be stored in the swap back end. That is the idea:
> > > have the swap backend keep track of an array of the subpages' swap
> > > locations. This array is looked up by the head swap offset.
> >
> > I assume "some tail page's page struct are just mapping of the head
> > page's page struct" is only true of hugeTLB at or above PMD size
> > (for example 2MB) at the moment? More widely, mTHP smaller than
> > PMD size will still have all tail page structs?
>
> That is the HVO (HugeTLB Vmemmap Optimization) for huge pages. Yes, I
> consider using the tail page struct to store the swap entry a step back
> from the folio. The folio is about all these 4K pages having the same
> properties so that they can look like one big page. If we move to the
> memdesc world, those tail pages will not exist in any form. It is doable
> in some situations; I am just not sure it aligns with our future goal.
>
> > "Having the swap backend keep track of an array of subpages' swap
> > locations" means we will save this metadata in the swapfile? Will we
> > have more I/O, especially when a large folio's mapping area is
> > partially unmapped, for example by MADV_DONTNEED, even after
> > the large folio is swapped out, so that we have to update the
> > metadata? Right now, we only need to change the PTE entries
> > and swap_map[] in the same case. Do we have some way to keep
> > that data in memory instead?
>
> I actually consider keeping those arrays in memory, indexed by an xarray
> and looked up by the head swap entry offset.
>
> > > > But will this be a step back for folio?
> > >
> > > I think this should be separate from the folio; it lives in the swap
> > > backend. From the folio's point of view, it is just writing out a folio.
> > > The swap back end knows how to write out into subpage locations. From
> > > the folio's point of view.
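The in-memory mapping Chris describes (a per-subpage offset array looked up by the head swap offset; in the kernel this would be an xarray) could look roughly like this user-space model. The table, the struct and the function names are all invented for illustration:

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 128

/* One large swap entry: the head offset maps to per-subpage offsets. */
struct subpage_map {
	size_t nr;		/* number of subpages in the large folio */
	size_t *offsets;	/* per-subpage offsets, possibly discontiguous */
};

/* Stand-in for the kernel xarray, indexed by the head swap offset. */
static struct subpage_map *table[TABLE_SIZE];

static int store_subpage_map(size_t head, const size_t *offsets, size_t nr)
{
	struct subpage_map *m;

	if (head >= TABLE_SIZE)
		return -1;
	m = malloc(sizeof(*m));
	if (!m)
		return -1;
	m->nr = nr;
	m->offsets = malloc(nr * sizeof(*offsets));
	if (!m->offsets) {
		free(m);
		return -1;
	}
	memcpy(m->offsets, offsets, nr * sizeof(*offsets));
	table[head] = m;
	return 0;
}

static struct subpage_map *lookup_subpage_map(size_t head)
{
	return head < TABLE_SIZE ? table[head] : NULL;
}
```

This also makes the MADV_DONTNEED concern above concrete: a partial unmap would only need to touch this in-memory array, not the swapfile itself.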
> > > It is just one swap page write.
> > >
> > > > > Some possible ideas for the fragmentation issue:
> > > > >
> > > > > a) A buddy allocator for swap entries, similar to the buddy
> > > > > allocator for memory. We can use a buddy allocator system for swap
> > > > > entries to keep low-order swap entries from fragmenting high-order
> > > > > swap entries too much. It should greatly reduce the fragmentation
> > > > > caused by allocating and freeing swap entries of different sizes.
> > > > > However, the buddy allocator has its own limits as well. Unlike
> > > > > system memory, we cannot move and compact swap entries: there is
> > > > > no rmap for a swap entry, so it is much harder to move a swap
> > > > > entry to another disk location. A buddy allocator for swap will
> > > > > therefore help, but not solve all the fragmentation issues.
> > > >
> > > > I agree buddy will help. Meanwhile, we might need something similar
> > > > to the MOVABLE and UNMOVABLE migratetypes. For example, try to
> > > > gather swap allocations for small folios together and don't let
> > > > them spread throughout the whole swapfile.
> > > > We might be able to dynamically classify swap clusters as being for
> > > > small folios or for large folios, and keep small folios from
> > > > spreading across all clusters.
> > >
> > > This really depends on the swap entry allocation and free cycle.
> > > Consider the extreme case where all swap entries have been allocated,
> > > and then some of the 4K entries are freed at discontiguous locations.
> > > A buddy allocator or cluster allocator is not going to save you from
> > > ending up with fragmented swap entries. That is why I think we still
> > > need b).
> >
> > I agree. I believe that classifying clusters has the potential to
> > alleviate fragmentation to some degree, while it cannot resolve it
> > completely. We can to some extent prevent the spread of small swap
> > allocations.
>
> Yes, as I stated earlier, it will help but not solve it completely.
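Idea (a), the buddy allocator for swap entries, can be sketched minimally. This toy version (fixed-size free lists, no free/merge path, invented names throughout, so only a sketch of the splitting behaviour, not a usable allocator) shows how a small request splits a high-order block while keeping the upper halves available for later large allocations:

```c
#include <stddef.h>

#define MAX_ORDER 4		/* blocks of 1..16 slots */
#define NR_BLOCKS 32		/* capacity of each free list */

static size_t free_list[MAX_ORDER + 1][NR_BLOCKS];
static size_t free_cnt[MAX_ORDER + 1];

/* Seed the allocator: the whole area starts as max-order blocks. */
static void buddy_init(size_t total_slots)
{
	for (size_t off = 0; off + (1u << MAX_ORDER) <= total_slots;
	     off += 1u << MAX_ORDER)
		free_list[MAX_ORDER][free_cnt[MAX_ORDER]++] = off;
}

/* Returns the starting offset of a 2^order block, or (size_t)-1. */
static size_t buddy_alloc(unsigned order)
{
	unsigned o = order;

	/* find the smallest order with a free block */
	while (o <= MAX_ORDER && free_cnt[o] == 0)
		o++;
	if (o > MAX_ORDER)
		return (size_t)-1;

	size_t off = free_list[o][--free_cnt[o]];

	/* split down to the requested order, keeping the upper halves free */
	while (o > order) {
		o--;
		free_list[o][free_cnt[o]++] = off + (1u << o);
	}
	return off;
}
```

As Chris notes, once entries are freed at scattered 4K locations no allocator of this shape can re-pack them, since a swap entry cannot be migrated the way a movable page can.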
>
> > > > > b) Large swap entries. Take a file as an example: a file on the
> > > > > file system can be written to discontiguous disk locations, and
> > > > > the file system is responsible for tracking how to map a file
> > > > > offset to a disk location. A large swap entry can have a similar
> > > > > indirection array mapping out the disk locations of the different
> > > > > subpages within a folio. This allows a large folio to be written
> > > > > out to discontiguous swap entries in the swap file. The array will
> > > > > need to be stored somewhere as part of the overhead. When
> > > > > allocating swap entries for the folio, we can allocate a batch of
> > > > > smaller 4K swap entries into an array, and use this array to
> > > > > read/write the large folio. There will be a lot of plumbing work
> > > > > to get it to work.
> > > >
> > > > We already have the page struct; I wonder if we can record the
> > > > offset there, if this is not a step back for folio.
> > >
> > > Not for the tail pages, because some of the tail pages' "struct page"
> > > are just remappings of the head page's "struct page".
> > >
> > > > On the other hand, while
> > > > swapping in, we can also allow large folios to be swapped in from
> > > > discontiguous places, and those offsets are actually also in the
> > > > PTE entries.
> > >
> > > These discontiguous subpage locations need to be stored outside of
> > > the folio. Keep in mind that you can have more than one PTE in
> > > different processes, and those PTEs in different processes might not
> > > agree with each other. BTW, shmem stores the swap entry in the page
> > > cache, not in PTEs.
> >
> > I don't quite understand what you mean by "Those PTE on different
> > processes might not agree with each other". Can we have a concrete
> > example?
>
> Process A allocates memory backed by a large folio, then A forks
> process B. Both A and B swap out the large folio. Then B uses madvise()
> to zap some PTEs of the large folio (maybe zapping before the
> swap-out), while A did not change the large folio at all.
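The fork example makes the swap-in side concrete: each process only consults its own PTEs, so it could scan them and refault the longest leading run of contiguous swap offsets as one unit, falling back to small folios for the rest. A hedged user-space sketch of that check (0 stands in for pte_none; the function name and representation are assumptions, not kernel code):

```c
#include <stddef.h>

/*
 * entries[i] == 0 means pte_none (zapped, nothing to swap in there).
 * Return how many leading PTEs hold contiguous swap offsets, i.e. the
 * largest chunk that could be refaulted as one large folio.
 */
static size_t contiguous_swapin_run(const size_t *entries, size_t nr)
{
	size_t run;

	if (nr == 0 || entries[0] == 0)
		return 0;
	run = 1;
	while (run < nr && entries[run] == entries[run - 1] + 1)
		run++;
	return run;
}
```

For B's partially zapped view, such a scan would naturally produce small swap-ins for the surviving pieces, while A, whose PTEs are untouched, could still refault the whole folio at once.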
That behavior seems quite normal. Since B's Page Table Entries (PTEs) are set
to pte_none, there's no need to swap in those parts. This behavior is
occurring today, but we already know the full situation based on the values
of the PTEs. As long as we correctly fill in the offsets in the PTEs,
whether they are contiguous or not, we have a method to swap them in.

Once pageout() has been done and the folio is destroyed, it seems the
metadata you mentioned becomes totally useless.

>
> > I assume this is also true for small folios, but it won't be a problem
> > as the process which is doing the swap-in only cares about its own
> > PTE entries?
>
> It will be a challenge if we maintain a large swap entry with its
> internal array mapping to different swap device offsets. You get
> different partial mappings of the same large folio. That is a problem
> we need to solve; I don't have all the answers yet.
>

I don't quite understand why it has to be a large folio for process B; we
may instead swap in small folios for those parts which are still mapped.

> Chris
>
> >
> > > > I feel we have "page" to record the offset before pageout() is done,
> > > > and we have PTE entries to record the offset after pageout() is
> > > > done.
> > > >
> > > > But still, (a) is needed, as we really hope large folios can be put
> > > > at contiguous offsets. With this, we might have other benefits,
> > > > like saving the whole compressed large folio as one object rather
> > > > than nr_pages objects in zsmalloc, and decompressing them together
> > > > while swapping in (a patchset is coming in a couple of days for
> > > > this). When a large folio is put in nr_pages different places, we
> > > > can hardly do this in zsmalloc. But at least we can still swap out
> > > > large folios without splitting, and swap in large folios even
> > > > though we read them back from nr_pages different objects.
> > >
> > > Exactly.
> > >
> > > Chris

Thanks
Barry