From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li <chrisl@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Kairui Song <kasong@tencent.com>,
Hugh Dickins <hughd@google.com>,
Kalesh Singh <kaleshsingh@google.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Barry Song <baohua@kernel.org>
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
Date: Fri, 26 Jul 2024 14:02:19 +0800
Message-ID: <87jzh83d3o.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <CACePvbWW6YLZe=47+kfuz76J+WWGmfKHvqatGGm=RyRX=D-WeQ@mail.gmail.com> (Chris Li's message of "Thu, 25 Jul 2024 22:09:25 -0700")
Chris Li <chrisl@kernel.org> writes:
> On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >
>> > The current proposed order also improves things step by step. The
>> > only disagreement here is at which point in the patch order we
>> > introduce yet another list in addition to the nonfull one. I just
>> > feel that it does not make sense to invest in new code if that new
>> > code is going to be completely rewritten anyway in the next two
>> > patches.
>> >
>> > Unless you mean that we should not do the patch 3 big rewrite and
>> > should continue the scan_swap_map_try_ssd_cluster() way of only
>> > doing half of the allocation job, letting scan_swap_map_slots() do
>> > the complex retry on top of try_ssd(). I feel the overall code is
>> > more complex and less maintainable that way.
>>
>> I haven't looked at [3/3]; I will wait for your next version for that.
>> So, I cannot say which order is better. Please consider reviewers'
>> effort too. Small-step patches are easier to understand and review.
>
> That is exactly the reason I don't want to introduce too much new code
> depending on the scan_swap_map_slots() behavior, which will be
> abandoned in the big rewrite. Their constraints are very different. I
> want to make the big rewrite in patch 3 as small as possible and use
> incremental follow-up patches to improve it.
>
>>
>> >> > That is why I want to make this change in a patch after patch 3.
>> >> > There is also the long test cycle after the modification to make
>> >> > sure the swap code path is stable. I am not resisting a change of
>> >> > patch order; it is just that this patch can't simply be removed
>> >> > ahead of the big rewrite in patch 3.
>> >> >
>> >> >
>> >> >>
>> >> >> > Your original
>> >> >> > suggestion appears to be that you want to keep every order-N cluster
>> >> >> > on the order-N nonfull list always, unless the number of free swap
>> >> >> > entries is less than 1<<N.
>> >> >>
>> >> >> Well I think that's certainly one of the conditions for removing it. But agree
>> >> >> that if a full scan of the cluster has been performed and no swap entries have
>> >> >> been freed since the scan started then it should also be removed from the list.
>> >> >
>> >> > Yes, in a later patch, beyond patch 3, we have the almost-full
>> >> > cluster handling for clusters that have been scanned and could
>> >> > not provide an order-N entry.
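
To make that rule concrete, my understanding is roughly the following
(an illustrative pseudo-C sketch only, not code from the patch; the
function and parameter names are made up):

#include <stdbool.h>

/*
 * Illustrative sketch only.  A cluster stays on the order-N nonfull
 * list while it may still satisfy an order-N request; either condition
 * below would take it off.
 */
static bool cluster_should_leave_nonfull(unsigned int free_count,
                                         unsigned int order,
                                         bool full_scan_done,
                                         bool freed_since_scan)
{
        /* 1. Too few free entries left to ever cover one order-N unit. */
        if (free_count < (1u << order))
                return true;

        /*
         * 2. A full scan of the cluster found no usable order-N range
         *    and nothing was freed in the cluster since the scan began.
         */
        if (full_scan_done && !freed_since_scan)
                return true;

        return false;
}
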
>> >> >
>> >> >>
>> >> >> >
>> >> >> >>> And, I understand that in some situations it may
>> >> >> >>> be better to share clusters among CPUs. So my suggestion is,
>> >> >> >>>
>> >> >> >>> - Make swap_cluster_info->order more accurate; don't pretend that we
>> >> >> >>> have free swap entries of that order even after we are sure that we
>> >> >> >>> don't.
>> >> >> >>
>> >> >> >> Is this patch pretending that today? I don't think so?
>> >> >> >
>> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> >> > sure that there are no order-N free swap entries in the cluster.
>> >> >>
>> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> >> >> are for order-0 and you have no means of allocating high orders until a whole
>> >> >> cluster becomes free. That logic certainly makes sense to me, so I think it's
>> >> >> better for swap_cluster_info->order to remain static while the cluster is
>> >> >> allocated. (I only skimmed that conversation, so apologies if I got the
>> >> >> conclusion wrong!)
>> >> >
>> >> > Yes, that is the original intent: preserve the cluster's order as much as possible.
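
In other words, something like this simplified stand-in (illustration
only; the structure and function names here are made up, not the
kernel's):

/* Simplified stand-in for illustration only, not the kernel structure. */
struct cluster_stub {
        unsigned int count;     /* swap entries currently allocated */
        unsigned int order;     /* allocation order this cluster serves */
};

/* The order is set once, when a free cluster is claimed for order-N ... */
static void claim_free_cluster(struct cluster_stub *ci, unsigned int order)
{
        ci->order = order;              /* stays fixed while count > 0 */
}

/* ... and only reset after the whole cluster becomes free again. */
static void free_entries(struct cluster_stub *ci, unsigned int nr)
{
        ci->count -= nr;
        if (ci->count == 0)
                ci->order = 0;
}
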
>> >> >
>> >> >>
>> >> >> >
>> >> >> >> But I agree that a
>> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> >> >> that doesn't tell us for sure because they may not be contiguous.
>> >> >> >
>> >> >> > We can check that when freeing a swap entry, by checking the adjacent
>> >> >> > swap entries. IMHO, the performance should be acceptable.
>> >> >>
>> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
>> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> >> >> a separate change on top of what Chris is doing here. For high orders there
>> >> >> could be quite a bit of scanning required in the worst case for every page that
>> >> >> gets freed.
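
The check on free would look roughly like the sketch below (my own
assumption of what such a scan means, not proposed code; free_map is a
hypothetical per-cluster bitmap of free slots).  It also shows where
the worst-case scanning cost Ryan mentions comes from:

#include <stdbool.h>

/*
 * Illustrative sketch only.  When a slot is freed, test whether the
 * aligned 1 << (order + 1) block containing it is now entirely free;
 * if so, the cluster could be "promoted" to order + 1.
 */
static bool can_promote(const bool *free_map, unsigned int nr_slots,
                        unsigned int freed_slot, unsigned int order)
{
        unsigned int span = 1u << (order + 1);
        unsigned int start = freed_slot & ~(span - 1);  /* align down */
        unsigned int i;

        if (start + span > nr_slots)
                return false;

        /* Worst case this scans the whole aligned block on every free. */
        for (i = 0; i < span; i++)
                if (!free_map[start + i])
                        return false;

        return true;
}
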
>> >> >
>> >> > Right, I feel that is a different set of patches. Even this series is
>> >> > hard enough to review. Order promotion and demotion are heading
>> >> > towards a buddy system design. I want to point out that even a buddy
>> >> > system is not able to handle the case where the swapfile is almost
>> >> > full and the recently freed swap entries are not contiguous.
>> >> >
>> >> > We can invest in the buddy system, which doesn't handle all the
>> >> > fragmentation issues, or, as I prefer, go directly to discontiguous
>> >> > swap entries. We pay a price for the indirect mapping of swap
>> >> > entries, but it will solve the fragmentation issue 100%.
>> >>
>> >> It's good if we can solve the fragmentation issue 100%. Just need to
>> >> pay attention to the cost.
>> >
>> > By cost, do you mean the development cost or the runtime cost (memory and CPU)?
>>
>> I mean runtime cost.
>
> Thanks for the clarification. Agree that we need to pay attention to
> the runtime cost. That is a given.
>
>> >> >> >>> My question is whether it's so important to share the per-cpu cluster
>> >> >> >>> among CPUs?
>> >> >> >>
>> >> >> >> My rationale for sharing is that the preference previously has been to favour
>> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> >> >> given order if there are actually slots available just because they have been
>> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> >> >> approach.
>> >> >> >>
>> >> >> >>> I suggest to start with simple design, that is, per-CPU
>> >> >> >>> cluster will not be shared among CPUs in most cases.
>> >> >> >>
>> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> >> >> current half-and-half policy in this patch.
>> >> >> >
>> >> >> > Sounds good to me. We can start with the exclusive solution and evaluate
>> >> >> > whether the shared solution is worth it.
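
For reference, what I have in mind for the exclusive approach is
roughly the following (an illustrative sketch only; the constants and
helpers are hypothetical stand-ins, not the patch code):

#define NR_ORDERS       10      /* illustrative constant */
#define NO_CLUSTER      (-1)

/* Hypothetical stand-ins for the real list/cluster operations. */
int take_cluster_off_nonfull_list(unsigned int order);
int alloc_slot_from_cluster(int cluster, unsigned int order);

/* One held cluster per order, private to this CPU. */
struct percpu_cluster_stub {
        int cluster[NR_ORDERS];
};

static int alloc_exclusive(struct percpu_cluster_stub *pcp, unsigned int order)
{
        int slot;

        if (pcp->cluster[order] == NO_CLUSTER) {
                /*
                 * Taking the cluster off the shared nonfull list is what
                 * makes it exclusive: other CPUs can no longer see it.
                 */
                pcp->cluster[order] = take_cluster_off_nonfull_list(order);
                if (pcp->cluster[order] == NO_CLUSTER)
                        return -1;      /* nothing can serve this order */
        }

        slot = alloc_slot_from_cluster(pcp->cluster[order], order);
        if (slot < 0)
                /* Exhausted for this order; drop it, try a fresh one later. */
                pcp->cluster[order] = NO_CLUSTER;

        return slot;
}

The shared alternative would leave the held cluster on the per-order
nonfull list as well, so other CPUs could still allocate from it.
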
>> >> >>
>> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >> >
>> >> > It is not able to avoid fragmentation 100% of the time. I prefer
>> >> > discontiguous swap entries as the next step, which guarantee forward
>> >> > progress: we will not be stuck in a situation where we are unable to
>> >> > allocate swap entries due to fragmentation.
>> >>
>> >> If my understanding is correct, the implementation complexity of
>> >> order promotion/demotion isn't at the same level as that of
>> >> discontiguous swap entries.
>> >
>> > Discontiguous swap entries have higher complexity but a higher payoff
>> > as well. They can get us to a place where cluster promotion/demotion
>> > can't.
>> >
>> > I also feel that if we implement something towards a buddy system
>> > allocator for swap, we should do a proper buddy allocator
>> > implementation, with proper data structures.
>>
>> I don't think that it's easy to implement a real buddy allocator for
>> swap entries. So I avoided using the word "buddy" in my wording.
>
> Then such a mix of cluster order promotion/demotion loses some benefit
> of the buddy system, because it lacks the proper data structures to
> support buddy allocation. The buddy allocator provides more general
> migration between orders; only the limited case of cluster
> promotion/demotion is supported (by luck). We need to evaluate whether
> it is worth the additional complexity.
TBH, I believe that the complexity of order promotion/demotion is quite
low, both for development and at runtime. A real buddy allocator, on the
other hand, may increase the per-swap-entry memory footprint considerably.
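
Rough numbers, just to illustrate (back-of-the-envelope assumptions of
mine, not measurements):

/*
 * Today the allocator keeps about one byte per swap slot (swap_map[])
 * plus one small swap_cluster_info per cluster.  The page buddy
 * allocator keeps its free-list links in the existing struct page for
 * each page; swap slots have no equivalent per-slot structure with
 * spare room, so a buddy allocator for swap would need new per-slot
 * metadata, for example:
 */
struct swap_buddy_slot_stub {           /* hypothetical, for sizing only */
        unsigned char order;            /* order of the free block headed here */
        unsigned int next;              /* free-list link, as a slot index */
};
/*
 * That is roughly 5-8 bytes per slot (packed vs. padded) versus ~1 byte
 * today.  A bitmap-per-order scheme is smaller (about 2 bits per slot
 * summed over all orders) but then finding a free block of a given
 * order requires scanning again.
 */
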
--
Best Regards,
Huang, Ying