From: Barry Song <21cnbao@gmail.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: usamaarif642@gmail.com, akpm@linux-foundation.org,
	 chengming.zhou@linux.dev, david@redhat.com, hannes@cmpxchg.org,
	 hughd@google.com, kernel-team@meta.com,
	linux-kernel@vger.kernel.org,  linux-mm@kvack.org,
	nphamcs@gmail.com, shakeel.butt@linux.dev,  willy@infradead.org,
	ying.huang@intel.com, hanchuanhua@oppo.com
Subject: Re: [PATCH v4 1/2] mm: store zero pages to be swapped out in a bitmap
Date: Thu, 5 Sep 2024 22:33:41 +1200
Message-ID: <CAGsJ_4yBFpyA4Znfgr7V=eoHAnhuLPDTqaVOre9waTKZ+R3R9A@mail.gmail.com>
In-Reply-To: <CAGsJ_4zB7za72xL94-1Oc+M2M1RtxftVYUAUk=1yngUoK65stw@mail.gmail.com>

On Thu, Sep 5, 2024 at 10:10 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Sep 5, 2024 at 8:49 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, Sep 5, 2024 at 7:55 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Thu, Sep 5, 2024 at 12:03 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > [..]
> > > > > > > I understand the point of doing this to unblock the synchronous large
> > > > > > > folio swapin support work, but at some point we're gonna have to
> > > > > > > actually handle the cases where a large folio being swapped in is
> > > > > > > partially in the swap cache, zswap, the zeromap, etc.
> > > > > > >
> > > > > > > All these cases will need similar-ish handling, and I suspect we won't
> > > > > > > just skip swapping in large folios in all these cases.
> > > > > >
> > > > > > I agree that this is definitely the goal. `swap_read_folio()` should be a
> > > > > > dependable API that always returns reliable data, regardless of whether
> > > > > > `zeromap` or `zswap` is involved. Despite these issues, mTHP swap-in shouldn't
> > > > > > be held back. Significant efforts are underway to support large folios in
> > > > > > `zswap`, and progress is being made. Not to mention we've already allowed
> > > > > > `zeromap` to proceed, even though it doesn't support large folios.
> > > > > >
> > > > > > It's genuinely unfair to let the lack of mTHP support in `zeromap` and
> > > > > > `zswap` hold swap-in hostage.
> > > > >
> > > >
> > > > Hi Yosry,
> > > >
> > > > > Well, two points here:
> > > > >
> > > > > 1. I did not say that we should block the synchronous mTHP swapin work
> > > > > for this :) I said the next item on the TODO list for mTHP swapin
> > > > > support should be handling these cases.
> > > >
> > > > Thanks for your clarification!
> > > >
> > > > >
> > > > > 2. I think two things are getting conflated here. Zswap needs to
> > > > > > support mTHP swapin*. Zeromap already supports mTHPs AFAICT. What is
> > > > > > truly missing, and is outside the scope of zswap/zeromap, is being able to
> > > > > > support hybrid mTHP swapin.
> > > > >
> > > > > When swapping in an mTHP, the swapped entries can be on disk, in the
> > > > > swapcache, in zswap, or in the zeromap. Even if all these things
> > > > > support mTHPs individually, we essentially need support to form an
> > > > > mTHP from swap entries in different backends. That's what I meant.
> > > > > Actually if we have that, we may not really need mTHP swapin support
> > > > > in zswap, because we can just form the large folio in the swap layer
> > > > > from multiple zswap entries.
> > > > >
> > > >
> > > > After further consideration, I've actually started to disagree with the idea
> > > > of supporting hybrid swapin (forming an mTHP from swap entries in different
> > > > backends). My reasoning is as follows:
> > >
> > > I do not have any data about this, so you could very well be right
> > > here. Handling hybrid swapin could be simply falling back to the
> > > smallest order we can swapin from a single backend. We can at least
> > > start with this, and collect data about how many mTHP swapins fallback
> > > due to hybrid backends. This way we only take the complexity if
> > > needed.
> > >
> > > I did imagine though that it's possible for two virtually contiguous
> > > folios to be swapped out to contiguous swap entries and end up in
> > > different media (e.g. if only one of them is zero-filled). I am not
> > > sure how rare it would be in practice.
> > >
> > > >
> > > > 1. The scenario where an mTHP is partially zeromap, partially zswap, etc.,
> > > > would be an extremely rare case, as long as we're swapping out the mTHP as
> > > > a whole and all the modules are handling it accordingly. It's highly
> > > > unlikely to form this mix of zeromap, zswap, and swapcache unless the
> > > > contiguous VMA virtual address happens to get some small folios with
> > > > aligned and contiguous swap slots. Even then, they would need to be
> > > > partially zeromap and partially non-zeromap, zswap, etc.
> > >
> > > As I mentioned, we can start simple and collect data for this. If it's
> > > rare and we don't need to handle it, that's good.
> > >
> > > >
> > > > As you mentioned, zeromap handles mTHP as a whole during swapping
> > > > out, marking all subpages of the entire mTHP as zeromap rather than just
> > > > a subset of them.
> > > >
> > > > And swap-in can also entirely map a swapcache which is a large folio based
> > > > on our previous patchset which has been in mainline:
> > > > "mm: swap: entirely map large folios found in swapcache"
> > > > https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@gmail.com/
> > > >
> > > > It seems the only thing we're missing is zswap support for mTHP.
> > >
> > > It is still possible for two virtually contiguous folios to be swapped
> > > out to contiguous swap entries. It is also possible that a large folio
> > > is swapped out as a whole, then only a part of it is swapped in later
> > > due to memory pressure. If that part is later reclaimed again and gets
> > > added to the swapcache, we can run into the hybrid swapin situation.
> > > There may be other scenarios as well, I did not think this through.
> > >
> > > >
> > > > 2. Implementing hybrid swap-in would be extremely tricky and could disrupt
> > > > several software layers. I can share some pseudo code below:
> > >
> > > Yeah it definitely would be complex, so we need proper justification for it.
> > >
> > > >
> > > > swap_read_folio()
> > > > {
> > > >        if (zeromap_full)
> > > >                folio_read_from_zeromap()
> > > >        else if (zswap_map_full)
> > > >               folio_read_from_zswap()
> > > > >        else {
> > > > >               folio_read_from_swapfile()
> > > > >               if (zeromap_partial)
> > > > >                        /* fill zero for partially zeromap subpages */
> > > > >                        folio_read_from_zeromap_fixup()
> > > > >               if (zswap_partial)
> > > > >                        /* zswap_load for partially zswap-mapped subpages */
> > > > >                        folio_read_from_zswap_fixup()
> > > > >
> > > > >               folio_mark_uptodate()
> > > > >               folio_unlock()
> > > > >        }
> > > > > }
> > > >
> > > > > We'd also need to modify folio_read_from_swapfile() to skip
> > > > > folio_mark_uptodate() and folio_unlock() after completing the BIO.
> > > > > This approach seems to entirely disrupt the software layers.
> > > > >
> > > > > This could also lead to unnecessary IO operations for subpages that
> > > > > require fixup. Since such cases are quite rare, I believe the added
> > > > > complexity isn't worth it.
> > > >
> > > > My point is that we should simply check that all PTEs have consistent zeromap,
> > > > zswap, and swapcache statuses before proceeding, otherwise fall back to the next
> > > > lower order if needed. This approach improves performance and avoids complex
> > > > corner cases.
> > >
> > > Agree that we should start with that, although we should probably
> > > fallback to the largest order we can swapin from a single backend,
> > > rather than the next lower order.
> > >
> > > >
> > > > > So once zswap mTHP is there, I would also expect an API similar to
> > > > > swap_zeromap_entries_check(), for example zswap_entries_check(entry, nr),
> > > > > which can return whether the range is fully, not at all, or partially
> > > > > in zswap, to replace the existing zswap_never_enabled().
> > >
> > > I think a better API would be similar to what Usama had. Basically
> > > take in (entry, nr) and return how much of it is in zswap starting at
> > > entry, so that we can decide the swapin order.
> > >
> > > Maybe we can adjust your proposed swap_zeromap_entries_check() as well
> > > to do that? Basically return the number of swap entries in the zeromap
> > > starting at 'entry'. If 'entry' itself is not in the zeromap we return
> > > 0 naturally. That would be a small adjustment/fix over what Usama had,
> > > but implementing it with bitmap operations like you did would be
> > > better.
> >
> > I assume you mean the below:
> >
> > /*
> >  * Return the number of contiguous zeromap entries started from entry
> >  */
> > static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry, int nr)
> > {
> >         struct swap_info_struct *sis = swp_swap_info(entry);
> >         unsigned long start = swp_offset(entry);
> >         unsigned long end = start + nr;
> >         unsigned long idx;
> >
> >         idx = find_next_bit(sis->zeromap, end, start);
> >         if (idx != start)
> >                 return 0;
> >
> >         return find_next_zero_bit(sis->zeromap, end, start) - idx;
> > }
> >
> > If yes, I really like this idea.
> >
> > It seems much better than using an enum, which would require adding a new
> > data structure :-) Additionally, returning the number allows callers
> > to fall back to the largest possible order, rather than trying next
> > lower orders sequentially.
>
> No, returning 0 after only checking first entry would still reintroduce
> the current bug, where the start entry is zeromap but other entries
> might not be. We need another value to indicate whether the entries
> are consistent if we want to avoid the enum:
>
> /*
>  * Return the number of contiguous zeromap entries started from entry;
>  * If all entries have consistent zeromap, *consistent will be true;
>  * otherwise, false;
>  */
> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry,
>                 int nr, bool *consistent)
> {
>         struct swap_info_struct *sis = swp_swap_info(entry);
>         unsigned long start = swp_offset(entry);
>         unsigned long end = start + nr;
>         unsigned long s_idx, c_idx;
>
>         s_idx = find_next_bit(sis->zeromap, end, start);
>         if (s_idx == end) {
>                 *consistent = true;
>                 return 0;
>         }
>
>         c_idx = find_next_zero_bit(sis->zeromap, end, start);
>         if (c_idx == end) {
>                 *consistent = true;
>                 return nr;
>         }
>
>         *consistent = false;
>         if (s_idx == start)
>                 return 0;
>         return c_idx - s_idx;
> }
>
> I can actually switch the places of the "consistent" and returned
> number if that looks better.

I'd rather make it simpler by:

/*
 * Check if all entries have consistent zeromap status, return true if
 * all entries are zeromap or non-zeromap, else return false;
 */
static inline bool swap_zeromap_entries_check(swp_entry_t entry, int nr)
{
        struct swap_info_struct *sis = swp_swap_info(entry);
        unsigned long start = swp_offset(entry);
        unsigned long end = start + nr;

        if (find_next_bit(sis->zeromap, end, start) == end)
                return true;
        if (find_next_zero_bit(sis->zeromap, end, start) == end)
                return true;

        return false;
}
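
To make the intended semantics concrete, here is a minimal userspace
sketch of the same check. It is only a model: model_test_bit() and
model_zeromap_entries_check() are hypothetical names, the bitmap is a
plain array instead of sis->zeromap, and a simple loop stands in for
find_next_bit()/find_next_zero_bit().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Userspace stand-in for the kernel's test_bit(); hypothetical helper. */
static bool model_test_bit(const unsigned long *map, size_t idx)
{
        return (map[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1UL;
}

/*
 * Model of the proposed check: return true if bits [start, start + nr)
 * are all set or all clear, false if the range is mixed.
 */
static bool model_zeromap_entries_check(const unsigned long *map,
                                        size_t start, size_t nr)
{
        bool first = model_test_bit(map, start);
        size_t i;

        for (i = 1; i < nr; i++)
                if (model_test_bit(map, start + i) != first)
                        return false;

        return true;
}
```

With bits 4..7 set, the ranges [0, 4) and [4, 8) are consistent, while
[2, 6) and [0, 8) are mixed and would be rejected.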

mm/page_io.c can combine this with reading the zeromap bit of the first
entry to decide whether to read the folio from the zeromap; mm/memory.c
only needs the bool to fall back to the largest possible order.

static inline unsigned long thp_swap_suitable_orders(...)
{
        int order, nr;

        order = highest_order(orders);

        while (orders) {
                nr = 1 << order;
                if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr &&
                    swap_zeromap_entries_check(entry, nr))
                        break;
                order = next_order(&orders, order);
        }

        return orders;
}
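
The order fallback can also be modelled in userspace. This is only a
sketch under assumptions: model_pick_swapin_order() and the other
model_* helpers are hypothetical, MODEL_MAX_ORDER is an arbitrary cap,
"orders" is a bitmask of candidate orders, and the zeromap range is
assumed to be checked from the swap offset aligned down to the order
boundary.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_MAX_ORDER 9       /* arbitrary cap for this sketch */

/* Userspace stand-in for the kernel's test_bit(); hypothetical helper. */
static bool model_test_bit(const unsigned long *map, size_t idx)
{
        return (map[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1UL;
}

/* True if bits [start, start + nr) are all set or all clear. */
static bool model_range_consistent(const unsigned long *map,
                                   size_t start, size_t nr)
{
        bool first = model_test_bit(map, start);
        size_t i;

        for (i = 1; i < nr; i++)
                if (model_test_bit(map, start + i) != first)
                        return false;
        return true;
}

/*
 * Walk the candidate orders from largest to smallest and return the
 * first one where the page number (addr >> PAGE_SHIFT) and the swap
 * offset agree modulo nr and the aligned zeromap range is consistent.
 * Falls back to order 0 if nothing larger qualifies.
 */
static int model_pick_swapin_order(const unsigned long *map,
                                   unsigned long orders,
                                   unsigned long vpn, size_t offset)
{
        int order;

        for (order = MODEL_MAX_ORDER; order > 0; order--) {
                size_t nr = (size_t)1 << order;

                if (!(orders & (1UL << order)))
                        continue;
                if (vpn % nr != offset % nr)
                        continue;
                if (model_range_consistent(map, offset & ~(nr - 1), nr))
                        return order;
        }

        return 0;
}
```

Once an order is chosen, the reader side only has to test the first bit
of the aligned range to know whether the whole folio comes from the
zeromap.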

>
> >
> > Hi Usama,
> > what is your take on this?
> >
> > >
> > > >
> > > > > Though I am not sure how cheaply zswap can implement it,
> > > > > swap_zeromap_entries_check() could be two simple bit operations:
> > > > >
> > > > > +static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t entry, int nr)
> > > > +{
> > > > +       struct swap_info_struct *sis = swp_swap_info(entry);
> > > > +       unsigned long start = swp_offset(entry);
> > > > +       unsigned long end = start + nr;
> > > > +
> > > > +       if (find_next_bit(sis->zeromap, end, start) == end)
> > > > +               return SWAP_ZEROMAP_NON;
> > > > +       if (find_next_zero_bit(sis->zeromap, end, start) == end)
> > > > +               return SWAP_ZEROMAP_FULL;
> > > > +
> > > > +       return SWAP_ZEROMAP_PARTIAL;
> > > > +}
> > > >
> > > > > 3. swapcache is different from zeromap and zswap. Swapcache indicates
> > > > > that the memory is still available and should be re-mapped rather than
> > > > > allocating a new folio. Our previous patchset has implemented a full
> > > > > re-map of an mTHP in do_swap_page() as mentioned in 1.
> > > > >
> > > > > For the same reason as point 1, partial swapcache is a rare edge case.
> > > > > Not re-mapping it and instead allocating a new folio would add
> > > > > significant complexity.
> > > >
> > > > > >
> > > > > > Nonetheless, `zeromap` and `zswap` are distinct cases. With `zeromap`, we
> > > > > > permit almost all mTHP swap-ins, except for those rare situations where
> > > > > > small folios that were swapped out happen to have contiguous and aligned
> > > > > > swap slots.
> > > > > >
> > > > > > swapcache is another quite different story, since our user scenarios begin from
> > > > > > the simplest sync io on mobile phones, we don't quite care about swapcache.
> > > > >
> > > > > Right. The reason I bring this up is as I mentioned above, there is a
> > > > > common problem of forming large folios from different sources, which
> > > > > includes the swap cache. The fact that synchronous swapin does not use
> > > > > > the swapcache was a happy coincidence for you, as you can add support
> > > > > > for mTHP swapins without handling this case yet ;)
> > > >
> > > > > As I mentioned above, I'd really rather filter out those corner cases
> > > > > than support them, not just for the current situation to unlock
> > > > > swap-in series :-)
> > >
> > > If they are indeed corner cases, then I definitely agree.
> >

Thanks
 Barry

