From: Usama Arif <usamaarif642@gmail.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>,
akpm@linux-foundation.org, chengming.zhou@linux.dev,
david@redhat.com, hannes@cmpxchg.org, hughd@google.com,
kernel-team@meta.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, nphamcs@gmail.com, shakeel.butt@linux.dev,
willy@infradead.org, ying.huang@intel.com, hanchuanhua@oppo.com
Subject: Re: [PATCH v4 1/2] mm: store zero pages to be swapped out in a bitmap
Date: Thu, 5 Sep 2024 11:50:16 +0100
Message-ID: <076d2577-61d0-42a6-a95f-11326684a2f2@gmail.com>
In-Reply-To: <CAGsJ_4yVB4wcqp5cDpnj6XSuqtb4G=DRiMBssbhpTcVSwaQd6A@mail.gmail.com>
On 05/09/2024 11:42, Barry Song wrote:
> On Thu, Sep 5, 2024 at 10:37 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 05/09/2024 11:10, Barry Song wrote:
>>> On Thu, Sep 5, 2024 at 8:49 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> On Thu, Sep 5, 2024 at 7:55 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>>
>>>>> On Thu, Sep 5, 2024 at 12:03 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>>>
>>>>>> On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>>>>
>>>>>>> [..]
>>>>>>>>> I understand the point of doing this to unblock the synchronous large
>>>>>>>>> folio swapin support work, but at some point we're gonna have to
>>>>>>>>> actually handle the cases where a large folio being swapped in is
>>>>>>>>> partially in the swap cache, zswap, the zeromap, etc.
>>>>>>>>>
>>>>>>>>> All these cases will need similar-ish handling, and I suspect we won't
>>>>>>>>> just skip swapping in large folios in all these cases.
>>>>>>>>
>>>>>>>> I agree that this is definitely the goal. `swap_read_folio()` should be a
>>>>>>>> dependable API that always returns reliable data, regardless of whether
>>>>>>>> `zeromap` or `zswap` is involved. Despite these issues, mTHP swap-in shouldn't
>>>>>>>> be held back. Significant efforts are underway to support large folios in
>>>>>>>> `zswap`, and progress is being made. Not to mention we've already allowed
>>>>>>>> `zeromap` to proceed, even though it doesn't support large folios.
>>>>>>>>
>>>>>>>> It's genuinely unfair to let the lack of mTHP support in `zeromap` and
>>>>>>>> `zswap` hold swap-in hostage.
>>>>>>>
>>>>>>
>>>>>> Hi Yosry,
>>>>>>
>>>>>>> Well, two points here:
>>>>>>>
>>>>>>> 1. I did not say that we should block the synchronous mTHP swapin work
>>>>>>> for this :) I said the next item on the TODO list for mTHP swapin
>>>>>>> support should be handling these cases.
>>>>>>
>>>>>> Thanks for your clarification!
>>>>>>
>>>>>>>
>>>>>>> 2. I think two things are getting conflated here. Zswap needs to
>>>>>>> support mTHP swapin*. Zeromap already supports mTHPs AFAICT. What is
>>>>>>> truly missing, and is outside the scope of zswap/zeromap, is being able to
>>>>>>> support hybrid mTHP swapin.
>>>>>>>
>>>>>>> When swapping in an mTHP, the swapped entries can be on disk, in the
>>>>>>> swapcache, in zswap, or in the zeromap. Even if all these things
>>>>>>> support mTHPs individually, we essentially need support to form an
>>>>>>> mTHP from swap entries in different backends. That's what I meant.
>>>>>>> Actually if we have that, we may not really need mTHP swapin support
>>>>>>> in zswap, because we can just form the large folio in the swap layer
>>>>>>> from multiple zswap entries.
>>>>>>>
>>>>>>
>>>>>> After further consideration, I've actually started to disagree with the idea
>>>>>> of supporting hybrid swapin (forming an mTHP from swap entries in different
>>>>>> backends). My reasoning is as follows:
>>>>>
>>>>> I do not have any data about this, so you could very well be right
>>>>> here. Handling hybrid swapin could be simply falling back to the
>>>>> smallest order we can swapin from a single backend. We can at least
>>>>> start with this, and collect data about how many mTHP swapins fallback
>>>>> due to hybrid backends. This way we only take the complexity if
>>>>> needed.
>>>>>
>>>>> I did imagine though that it's possible for two virtually contiguous
>>>>> folios to be swapped out to contiguous swap entries and end up in
>>>>> different media (e.g. if only one of them is zero-filled). I am not
>>>>> sure how rare it would be in practice.
>>>>>
>>>>>>
>>>>>> 1. The scenario where an mTHP is partially zeromap, partially zswap, etc.,
>>>>>> would be an extremely rare case, as long as we're swapping out the mTHP as
>>>>>> a whole and all the modules are handling it accordingly. It's highly
>>>>>> unlikely to form this mix of zeromap, zswap, and swapcache unless the
>>>>>> contiguous VMA virtual address happens to get some small folios with
>>>>>> aligned and contiguous swap slots. Even then, they would need to be
>>>>>> partially zeromap and partially non-zeromap, zswap, etc.
>>>>>
>>>>> As I mentioned, we can start simple and collect data for this. If it's
>>>>> rare and we don't need to handle it, that's good.
>>>>>
>>>>>>
>>>>>> As you mentioned, zeromap handles mTHP as a whole during swapping
>>>>>> out, marking all subpages of the entire mTHP as zeromap rather than just
>>>>>> a subset of them.
>>>>>>
>>>>>> And swap-in can also entirely map a swapcache which is a large folio based
>>>>>> on our previous patchset which has been in mainline:
>>>>>> "mm: swap: entirely map large folios found in swapcache"
>>>>>> https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@gmail.com/
>>>>>>
>>>>>> It seems the only thing we're missing is zswap support for mTHP.
>>>>>
>>>>> It is still possible for two virtually contiguous folios to be swapped
>>>>> out to contiguous swap entries. It is also possible that a large folio
>>>>> is swapped out as a whole, then only a part of it is swapped in later
>>>>> due to memory pressure. If that part is later reclaimed again and gets
>>>>> added to the swapcache, we can run into the hybrid swapin situation.
>>>>> There may be other scenarios as well, I did not think this through.
>>>>>
>>>>>>
>>>>>> 2. Implementing hybrid swap-in would be extremely tricky and could disrupt
>>>>>> several software layers. I can share some pseudo code below:
>>>>>
>>>>> Yeah it definitely would be complex, so we need proper justification for it.
>>>>>
>>>>>>
>>>>>> swap_read_folio()
>>>>>> {
>>>>>>         if (zeromap_full)
>>>>>>                 folio_read_from_zeromap()
>>>>>>         else if (zswap_map_full)
>>>>>>                 folio_read_from_zswap()
>>>>>>         else {
>>>>>>                 folio_read_from_swapfile()
>>>>>>                 /* fill zero for partially zeromap subpages */
>>>>>>                 if (zeromap_partial)
>>>>>>                         folio_read_from_zeromap_fixup()
>>>>>>                 /* zswap_load for partially zswap-mapped subpages */
>>>>>>                 if (zswap_partial)
>>>>>>                         folio_read_from_zswap_fixup()
>>>>>>
>>>>>>                 folio_mark_uptodate()
>>>>>>                 folio_unlock()
>>>>>>         }
>>>>>> }
>>>>>>
>>>>>> We'd also need to modify folio_read_from_swapfile() to skip
>>>>>> folio_mark_uptodate() and folio_unlock() after completing the BIO.
>>>>>> This approach seems to entirely disrupt the software layers.
>>>>>>
>>>>>> This could also lead to unnecessary IO operations for subpages that
>>>>>> require fixup. Since such cases are quite rare, I believe the added
>>>>>> complexity isn't worth it.
>>>>>>
>>>>>> My point is that we should simply check that all PTEs have consistent zeromap,
>>>>>> zswap, and swapcache statuses before proceeding, otherwise fall back to the next
>>>>>> lower order if needed. This approach improves performance and avoids complex
>>>>>> corner cases.
>>>>>
>>>>> Agree that we should start with that, although we should probably
>>>>> fallback to the largest order we can swapin from a single backend,
>>>>> rather than the next lower order.
>>>>>
>>>>>>
>>>>>> So once zswap mTHP is there, I would also expect an API similar to
>>>>>> swap_zeromap_entries_check(), for example
>>>>>> zswap_entries_check(entry, nr), which can return whether we have
>>>>>> full, none, or partial zswap, to replace the existing
>>>>>> zswap_never_enabled().
>>>>>
>>>>> I think a better API would be similar to what Usama had. Basically
>>>>> take in (entry, nr) and return how much of it is in zswap starting at
>>>>> entry, so that we can decide the swapin order.
>>>>>
>>>>> Maybe we can adjust your proposed swap_zeromap_entries_check() as well
>>>>> to do that? Basically return the number of swap entries in the zeromap
>>>>> starting at 'entry'. If 'entry' itself is not in the zeromap we return
>>>>> 0 naturally. That would be a small adjustment/fix over what Usama had,
>>>>> but implementing it with bitmap operations like you did would be
>>>>> better.
>>>>
>>>> I assume you mean the below:
>>>>
>>>> /*
>>>>  * Return the number of contiguous zeromap entries started from entry
>>>>  */
>>>> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry, int nr)
>>>> {
>>>>         struct swap_info_struct *sis = swp_swap_info(entry);
>>>>         unsigned long start = swp_offset(entry);
>>>>         unsigned long end = start + nr;
>>>>         unsigned long idx;
>>>>
>>>>         idx = find_next_bit(sis->zeromap, end, start);
>>>>         if (idx != start)
>>>>                 return 0;
>>>>
>>>>         return find_next_zero_bit(sis->zeromap, end, start) - idx;
>>>> }
>>>>
>>>> If yes, I really like this idea.
>>>>
>>>> It seems much better than using an enum, which would require adding a new
>>>> data structure :-) Additionally, returning the number allows callers
>>>> to fall back to the largest possible order, rather than trying next
>>>> lower orders sequentially.
>>>
>>> No, returning 0 after only checking first entry would still reintroduce
>>> the current bug, where the start entry is zeromap but other entries
>>> might not be. We need another value to indicate whether the entries
>>> are consistent if we want to avoid the enum:
>>>
>>> /*
>>>  * Return the number of contiguous zeromap entries started from entry;
>>>  * If all entries have consistent zeromap, *consistent will be true;
>>>  * otherwise, false;
>>>  */
>>> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry,
>>>                                                       int nr, bool *consistent)
>>> {
>>>         struct swap_info_struct *sis = swp_swap_info(entry);
>>>         unsigned long start = swp_offset(entry);
>>>         unsigned long end = start + nr;
>>>         unsigned long s_idx, c_idx;
>>>
>>>         s_idx = find_next_bit(sis->zeromap, end, start);
>>
>> In all of the implementations you sent, you are using find_next_bit(..,end, start), but
>> I believe it should be find_next_bit(..,nr, start)?
>
> I guess no, the tricky thing is that size means the size from the first bit of
> bitmap but not from the "start" bit?
>
Ah ok, we should probably change the function prototype to use end instead of size. It's ok then if that's the case.
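
To make the size semantics concrete, here is a minimal userspace sketch, not the kernel code. model_find_next_bit() is a hypothetical stand-in I wrote to mimic the documented behaviour of find_next_bit() (it returns size when no set bit is found). The point is that the second argument is an absolute end index into the bitmap, not a count from the offset.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical userspace stand-in for find_next_bit(): return the index of
 * the first set bit in [offset, size), or size if there is none.
 * 'size' is the bitmap length in bits, i.e. an absolute end index.
 */
static unsigned long model_find_next_bit(const unsigned long *map,
                                         unsigned long size,
                                         unsigned long offset)
{
        unsigned long i;

        assert(map != NULL);
        for (i = offset; i < size; i++)
                if (map[i / MODEL_BITS_PER_LONG] &
                    (1UL << (i % MODEL_BITS_PER_LONG)))
                        return i;
        return size;
}

/* Hypothetical helper: is any bit set in [start, start + nr)? */
static int model_range_has_set_bit(const unsigned long *map,
                                   unsigned long start, unsigned long nr)
{
        unsigned long end = start + nr;

        /* Pass 'end', not 'nr': the search window is [start, end). */
        return model_find_next_bit(map, end, start) != end;
}
```

With a single set bit at index 10, model_find_next_bit(map, 12, 8) returns 10, while passing nr (4) as the size gives an empty range [8, 4) and returns 4 without scanning anything.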
>> TBH, I liked the enum implementation you had in https://lore.kernel.org/all/20240905002926.1055-1-21cnbao@gmail.com/
>> It's the easiest to review and understand, and least likely to introduce any bugs.
>> But it could be a personal preference.
>> The likelihood of having contiguous zeromap entries *that* is less than nr is very low, right?
>> If so we could go with the enum implementation?
>
> What about the bool implementation I sent in the last email? It seems the
> simplest code.
>
Looking now.
>>
>>
>>>         if (s_idx == end) {
>>>                 *consistent = true;
>>>                 return 0;
>>>         }
>>>
>>>         c_idx = find_next_zero_bit(sis->zeromap, end, start);
>>>         if (c_idx == end) {
>>>                 *consistent = true;
>>>                 return nr;
>>>         }
>>>
>>>         *consistent = false;
>>>         if (s_idx == start)
>>>                 return 0;
>>>         return c_idx - s_idx;
>>> }
>>>
>>> I can actually switch the places of the "consistent" and returned
>>> number if that looks better.
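
While looking at the bool variant, I tried modelling it in userspace to check the four cases (none, all, partial with the first entry set, partial with the first entry clear). This is only a sketch: model_find_next() and model_zeromap_entries_count() are hypothetical stand-ins that work on a plain unsigned long array rather than sis->zeromap. One deliberate difference from the draft quoted above: in the inconsistent case the sketch returns 0 when the first entry is not in the zeromap (s_idx != start), which is how I read the header comment, whereas the quoted draft tests s_idx == start at that point.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static bool model_test_bit(const unsigned long *map, unsigned long i)
{
        return (map[i / MODEL_BITS_PER_LONG] &
                (1UL << (i % MODEL_BITS_PER_LONG))) != 0;
}

/* First index in [offset, size) whose bit equals 'want', or size if none. */
static unsigned long model_find_next(const unsigned long *map,
                                     unsigned long size, unsigned long offset,
                                     bool want)
{
        unsigned long i;

        for (i = offset; i < size; i++)
                if (model_test_bit(map, i) == want)
                        return i;
        return size;
}

/*
 * Hypothetical model: number of contiguous set entries starting at 'start'.
 * *consistent is true only if [start, start + nr) is all set or all clear.
 */
static unsigned int model_zeromap_entries_count(const unsigned long *zeromap,
                                                unsigned long start,
                                                unsigned int nr,
                                                bool *consistent)
{
        unsigned long end = start + nr;
        unsigned long s_idx, c_idx;

        assert(zeromap != NULL && consistent != NULL);

        s_idx = model_find_next(zeromap, end, start, true);
        if (s_idx == end) {             /* no entry is in the zeromap */
                *consistent = true;
                return 0;
        }

        c_idx = model_find_next(zeromap, end, start, false);
        if (c_idx == end) {             /* every entry is in the zeromap */
                *consistent = true;
                return nr;
        }

        *consistent = false;
        if (s_idx != start)             /* first entry is not zero-filled */
                return 0;
        return (unsigned int)(c_idx - start);
}
```

With bits 8 and 9 set, a query for (start=8, nr=4) reports 2 entries and consistent == false, so a caller could still swap in an order-1 folio from the zeromap instead of dropping straight to order 0.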
>>>
>>>>
>>>> Hi Usama,
>>>> what is your take on this?
>>>>
>>>>>
>>>>>>
>>>>>> Though I am not sure how cheaply zswap can implement it,
>>>>>> swap_zeromap_entries_check() could be two simple bit operations:
>>>>>>
>>>>>> +static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t entry, int nr)
>>>>>> +{
>>>>>> +        struct swap_info_struct *sis = swp_swap_info(entry);
>>>>>> +        unsigned long start = swp_offset(entry);
>>>>>> +        unsigned long end = start + nr;
>>>>>> +
>>>>>> +        if (find_next_bit(sis->zeromap, end, start) == end)
>>>>>> +                return SWAP_ZEROMAP_NON;
>>>>>> +        if (find_next_zero_bit(sis->zeromap, end, start) == end)
>>>>>> +                return SWAP_ZEROMAP_FULL;
>>>>>> +
>>>>>> +        return SWAP_ZEROMAP_PARTIAL;
>>>>>> +}
>>>>>>
>>>>>> 3. swapcache is different from zeromap and zswap. Swapcache indicates
>>>>>> that the memory is still available and should be re-mapped rather than
>>>>>> allocating a new folio. Our previous patchset has implemented a full
>>>>>> re-map of an mTHP in do_swap_page() as mentioned in 1.
>>>>>>
>>>>>> For the same reason as point 1, partial swapcache is a rare edge case.
>>>>>> Not re-mapping it and instead allocating a new folio would add
>>>>>> significant complexity.
>>>>>>
>>>>>>>>
>>>>>>>> Nonetheless, `zeromap` and `zswap` are distinct cases. With `zeromap`, we
>>>>>>>> permit almost all mTHP swap-ins, except for those rare situations where
>>>>>>>> small folios that were swapped out happen to have contiguous and aligned
>>>>>>>> swap slots.
>>>>>>>>
>>>>>>>> swapcache is another quite different story, since our user scenarios begin from
>>>>>>>> the simplest sync io on mobile phones, we don't quite care about swapcache.
>>>>>>>
>>>>>>> Right. The reason I bring this up is as I mentioned above, there is a
>>>>>>> common problem of forming large folios from different sources, which
>>>>>>> includes the swap cache. The fact that synchronous swapin does not use
>>>>>>> the swapcache was a happy coincidence for you, as you can add support
>>>>>>> for mTHP swapins without handling this case yet ;)
>>>>>>
>>>>>> As I mentioned above, I'd really rather filter out those corner cases
>>>>>> than support them, not just for the current situation to unlock
>>>>>> swap-in series :-)
>>>>>
>>>>> If they are indeed corner cases, then I definitely agree.
>>>>
>
> Thanks
> Barry
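
On the earlier point about falling back to the largest order that a single backend can serve, rather than walking down one order at a time, here is a rough sketch of how a caller might turn a returned count into an order. model_largest_swapin_order() and its parameters are hypothetical, and I'm assuming the swap offset has to be naturally aligned to the folio size, as in the aligned and contiguous swap slot cases discussed above.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: pick the largest order <= max_order such that
 * 2^order entries starting at 'offset' are all covered by 'count'
 * (the number of contiguous entries from one backend starting at 'offset')
 * and 'offset' is naturally aligned to 2^order. Returns 0 when only a
 * single-page swapin is possible.
 */
static unsigned int model_largest_swapin_order(unsigned long offset,
                                               unsigned int count,
                                               unsigned int max_order)
{
        unsigned int order = max_order;

        /* Keep the shift below well defined. */
        assert(max_order < sizeof(unsigned long) * CHAR_BIT);

        while (order > 0) {
                unsigned long nr = 1UL << order;

                /* Enough same-backend entries and offset aligned to nr? */
                if (nr <= count && (offset & (nr - 1)) == 0)
                        break;
                order--;
        }
        return order;
}
```

For example, with offset 16, a count of 4 and max_order 4, this picks order 2 directly instead of trying orders 4 and 3 with separate bitmap lookups.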