linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li <chrisl@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	 Kairui Song <kasong@tencent.com>,
	 Hugh Dickins <hughd@google.com>,
	 Kalesh Singh <kaleshsingh@google.com>,
	linux-kernel@vger.kernel.org,  linux-mm@kvack.org,
	 Barry Song <baohua@kernel.org>
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
Date: Fri, 26 Jul 2024 10:09:27 +0800	[thread overview]
Message-ID: <87v80s3nvs.fsf@yhuang6-desk2.ccr.corp.intel.com> (raw)
In-Reply-To: <CACePvbXC6SwD1mx_s_9yCZpqTXZhRKMetbCcBNPOgT-ZtLmGCA@mail.gmail.com> (Chris Li's message of "Thu, 25 Jul 2024 01:09:03 -0700")

Chris Li <chrisl@kernel.org> writes:

> On Wed, Jul 24, 2024 at 11:46 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > Hi Ryan and Ying,
>> >
>> > Sorry I was busy. I am catching up on the email now.
>> >
>> > On Wed, Jul 24, 2024 at 1:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>
>> >> On 23/07/2024 07:27, Huang, Ying wrote:
>> >> > Ryan Roberts <ryan.roberts@arm.com> writes:
>> >> >
>> >> >> On 22/07/2024 09:49, Huang, Ying wrote:
>> >> >>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >> >>>
>> >> >>>> On 22/07/2024 03:14, Huang, Ying wrote:
>> >> >>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >> >>>>>
>> >> >>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>> >> >>>>>>> Chris Li <chrisl@kernel.org> writes:
>> >> >>>>>>>
>> >> >>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>> >> >>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>> >> >>>>>
>> >> >>>>> [snip]
>> >> >>>>>
>> >> >>>>>>>>>>>> +
>> >> >>>>>>>>>>>> +     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> >> >>>>>>>>>>>> +             list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>> >> >>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> >> >>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>> >> >>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>> >> >>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> So you could have this situation:
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>> >> >>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
>> >> >>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>> >> >>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>> >> >>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>> >> >>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>> >> >>>>>>>>>> moving it into nonfull.
>> >> >>>>>>>>>
>> >> >>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>> >> >>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>> >> >>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>> >> >>>>>>>>> refactoring separated from behavioural changes.
>> >> >>>>>>>>
>> >> >>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>> >> >>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>> >> >>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>> >> >>>>>>>> allocation.
>> >> >>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>> >> >>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>> >> >>>>>>>>> results quoted on the cover letter?
>> >> >>>>>>>>
>> >> >>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>> >> >>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>> >> >>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>> >> >>>>>>>> because it is not feature complete. Currently it does not do swap
>> >> >>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>> >> >>>>>>>> remove the RFC.
>> >> >>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> >> >>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>> >> >>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>> >> >>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> >> >>>>>>>>>>> chances of multiple CPUs using the same cluster.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>> >> >>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>> >> >>>>>>>>>> entries allocated from the previous CPU.
>> >> >>>>>>>>>
>> >> >>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>> >> >>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>> >> >>>>>>>>
>> >> >>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>> >> >>>>>>>> advances to the next cluster pointer, it can cross with the other
>> >> >>>>>>>> CPU's next cluster pointer.
>> >> >>>>>>>
>> >> >>>>>>> No.  si->percpu_cluster[cpu].next will keep in the current per cpu
>> >> >>>>>>> cluster only.  If it doesn't do that, we should fix it.
>> >> >>>>>>>
>> >> >>>>>>> I agree with Ryan that we should make per cpu cluster correct.  A
>> >> >>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list.  When we
>> >> >>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>> >> >>>>>>> list if necessary.  And, we should make it correct in this patch instead
>> >> >>>>>>> of later in series.  I understand that you want to make the patch itself
>> >> >>>>>>> simple, but it's important to make code simple to be understood too.
>> >> >>>>>>> Consistent design choice will do that.
>> >> >>>>>>
>> >> >>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>> >> >>>>>
>> >> >>>>> Sorry, I misunderstood your words.
>> >> >>>>>
>> >> >>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>> >> >>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>> >> >>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>> >> >>>>>> case, in which case it should be added to the nonfull list.
>> >> >>>>>>
>> >> >>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>> >> >>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>> >> >>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>> >> >>>>>> This neither-one-policy-nor-the-other seems odd to me.
>> >> >>>>>>
>> >> >>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>> >> >>>>>> per-cpu cluster.
>> >> >>>>>
>> >> >>>>> Yes.
>> >> >>>>>
>> >> >>>>>> I was arguing to make it always shared. Perhaps the best
>> >> >>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>> >> >>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>> >> >>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>> >> >>>>>> the shared approach as part of the "big rewrite"?
>> >> >>>>>>>
>> >> >>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>> >> >>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>> >> >>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>> >> >>>>>>>>> way to help that?
>> >> >>>>>>>>
>> >> >>>>>>>> Simply moving to the end of the list can create a possible deadloop
>> >> >>>>>>>> when all clusters have been scanned and not available swap range
>> >> >>>>>>>> found.
>> >> >>>>>
>> >> >>>>> I also think that the shared approach has dead loop issue.
>> >> >>>>
>> >> >>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>> >> >>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>> >> >>>> go forever? That's surely just an implementation issue to solve? It's not a
>> >> >>>> reason to avoid the design principle; if we agree that maintaining sharability
>> >> >>>> of the cluster is preferred then the code must be written to guard against the
>> >> >>>> dead loop problem. It could be done by remembering the first cluster you
>> >> >>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>> >> >>>> to it. (I think holding the si lock will protect against concurrently freeing
>> >> >>>> the cluster so it should definitely remain in the list?).
>> >> >>>
>> >> >>> I believe that you can find some way to avoid the dead loop issue,
>> >> >>> although your suggestion may kill the performance via looping a long list
>> >> >>> of nonfull clusters.
>> >> >>
>> >> >> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>> >> >> list when made current for a cpu), that only reduces the size of the list by a
>> >> >> maximum of the number of CPUs in the system, which I suspect is pretty small
>> >> >> compared to the number of nonfull clusters.
>> >> >
>> >> > Anyway, this depends on details.  If we cannot allocate a order-N swap
>> >> > entry from the cluster, we should remove it from the nonfull list for
>> >> > order-N (This is the behavior of this patch too).
>> >
>> > Yes, Kairui implements something like that in the reclaim part of the
>> > patch series. It is after patch 3. We are heavily testing the
>> > performance and the stability of the reclaim patches. May I post the
>> > reclaim together with patch 3 for discussion. If you want we can
>> > discuss the re-order the patch in a later iteration.
>> >
>> >>
>> >> Yes that's a good point, and I conceed it is more difficult to detect that
>> >> condition if the cluster is shared. I suspect that with a bit of thinking, we
>> >> could find a way though.
>> >
>> > Kaiui has  the patch series show a good performance number that beats
>> > the current swap cache reclaim.
>> >
>> > I want to make a point regarding the patch ordering before vs after
>> > patch 3 (aka the big rewrite).
>> > Previously, the "san_swap_map_try_ssd_cluster" only did partial
>> > allocation. It does not sucessfully allocate a swap entry 100% the
>> > time.  The patch 3 makes the cluster allocation function return the
>> > swap entry 100% of the time. There are no more fallback retry loops
>> > outside of the cluster allocation function. Also the try_ssd function
>> > does not do swap cache reclaims while the cluster allocation function
>> > will need to. These two have very different constraints.
>> >
>> > There for, adding different cluster header into
>> > san_swap_map_try_ssd_cluste will be a lot of waste investment of
>> > development time in the sense that, that function will need to be
>> > rewrite any way, the end result is very different.
>>
>> I am not a big fan of implementing the final solution directly.
>> Personally, I prefer to improve step by step.
>
> The current proposed order also improves things step by step. The only
> disagreement here is which patch order we introduce yet another list
> in addition to the nonfull one. I just feel that it does not make
> sense to invest into new code if that new code is going to be
> completely rewrite anyway in the next two patches.
>
> Unless you mean is we should not do the patch 3 big rewrite and should
> continue the scan_swap_map_try_ssd_cluster() way of only doing half of
> the allocation job and let scan_swap_map_slots() do the complex retry
> on top of try_ssd(). I feel the overall code is more complex and less
> maintainable.

I haven't look at [3/3], will wait for your next version for that.  So,
I cannot say which order is better.  Please consider reviewers' effort
too.  Small step patch is easier to be understood and reviewed.

>> > That is why I want to make this change patch after patch 3. There is
>> > also the long test cycle after the modification to make sure the swap
>> > code path is stable. I am not resisting a change of patch orders, it
>> > is that patch can't directly be removed before patch 3 before the big
>> > rewrite.
>> >
>> >
>> >>
>> >> > Your original
>> >> > suggestion appears like that you want to keep all cluster with order-N
>> >> > on the nonfull list for order-N always unless the number of free swap
>> >> > entry is less than 1<<N.
>> >>
>> >> Well I think that's certainly one of the conditions for removing it. But agree
>> >> that if a full scan of the cluster has been performed and no swap entries have
>> >> been freed since the scan started then it should also be removed from the list.
>> >
>> > Yes, in the later patch of patch, beyond patch 3, we have the almost
>> > full cluster that for the cluster has been scan and not able to
>> > allocate order N entry.
>> >
>> >>
>> >> >
>> >> >>> And, I understand that in some situations it may
>> >> >>> be better to share clusters among CPUs.  So my suggestion is,
>> >> >>>
>> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >> >>>   have free swap entries with that order even after we are sure that we
>> >> >>>   haven't.
>> >> >>
>> >> >> Is this patch pretending that today? I don't think so?
>> >> >
>> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> > sure that there are no order-N free swap entry in the cluster.
>> >>
>> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> >> are for order-0 and you have no means of allocating high orders until a whole
>> >> cluster becomes free. That logic certainly makes sense to me, so think its
>> >> better for swap_cluster_info->order to remain static while the cluster is
>> >> allocated. (I only skimmed that conversation so appologies if I got the
>> >> conclusion wrong!).
>> >
>> > Yes, that is the original intent, keep the cluster order as much as possible.
>> >
>> >>
>> >> >
>> >> >> But I agree that a
>> >> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> >> that doesn't tell us for sure because they may not be contiguous.
>> >> >
>> >> > We can check that when free swap entry via checking adjacent swap
>> >> > entries.  IMHO, the performance should be acceptable.
>> >>
>> >> Would you then use the result of that scanning to "promote" a cluster's order?
>> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> >> a separate change on top of what Chris is doing here. For high orders there
>> >> could be quite a bit of scanning required in the worst case for every page that
>> >> gets freed.
>> >
>> > Right, I feel that is a different set of patches. Even this series is
>> > hard enough for review. Those order promotion and demotion is heading
>> > towards a buddy system design. I want to point out that even the buddy
>> > system is not able to handle the case that swapfile is almost full and
>> > the recently freed swap entries are not contiguous.
>> >
>> > We can invest in the buddy system, which doesn't handle all the
>> > fragmentation issues. Or I prefer to go directly to the discontiguous
>> > swap entry. We pay a price for the indirect mapping of swap entries.
>> > But it will solve the fragmentation issue 100%.
>>
>> It's good if we can solve the fragmentation issue 100%.  Just need to
>> pay attention to the cost.
>
> The cost you mean the development cost or the run time cost (memory and cpu)?

I mean runtime cost.

>>
>> >>
>> >> >
>> >> >>>
>> >> >>> My question is whether it's so important to share the per-cpu cluster
>> >> >>> among CPUs?
>> >> >>
>> >> >> My rationale for sharing is that the preference previously has been to favour
>> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> >> given order if there are actually slots available just because they have been
>> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> >> approach.
>> >> >>
>> >> >>> I suggest to start with simple design, that is, per-CPU
>> >> >>> cluster will not be shared among CPUs in most cases.
>> >> >>
>> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> >> current half-and-half policy in this patch.
>> >> >
>> >> > Sounds good to me.  We can start with exclusive solution and evaluate
>> >> > whether shared solution is good.
>> >>
>> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >
>> > It is not able to avoid fragementation 100% of the time. I prefer the
>> > discontinued swap entry as the next step, which guarantees forward
>> > progress, we will not be stuck in a situation where we are not able to
>> > allocate swap entries due to fragmentation.
>>
>> If my understanding were correct, the implementation complexity of the
>> order promotion/demotion isn't at the same level of that of discontinued
>> swap entry.
>
> Discontinued swap entry has higher complexity but higher payout as
> well. It can get us to the place where cluster promotion/demotion
> can't.
>
> I also feel that if we implement something towards a buddy system
> allocator for swap, we should do a proper buddy allocator
> implementation of data structures.

I don't think that it's easy to implement a real buddy allocator for
swap entries.  So, I avoid to use buddy in my words.

--
Best Regards,
Huang, Ying


  reply	other threads:[~2024-07-26  2:13 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-11  7:29 [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
2024-07-11  7:29 ` [PATCH v4 1/3] mm: swap: swap cluster switch to double link list Chris Li
2024-07-15 14:57   ` Ryan Roberts
2024-07-16 22:11     ` Chris Li
2024-07-18  6:26   ` Huang, Ying
2024-07-26  5:46     ` Chris Li
2024-07-11  7:29 ` [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
2024-07-15 15:40   ` Ryan Roberts
2024-07-16 22:46     ` Chris Li
2024-07-17 10:14       ` Ryan Roberts
2024-07-17 15:41         ` Chris Li
2024-07-18  7:53           ` Huang, Ying
2024-07-19 10:30             ` Ryan Roberts
2024-07-22  2:14               ` Huang, Ying
2024-07-22  7:51                 ` Ryan Roberts
2024-07-22  8:49                   ` Huang, Ying
2024-07-22  9:54                     ` Ryan Roberts
2024-07-23  6:27                       ` Huang, Ying
2024-07-24  8:33                         ` Ryan Roberts
2024-07-24 22:41                           ` Chris Li
2024-07-25  6:43                             ` Huang, Ying
2024-07-25  8:09                               ` Chris Li
2024-07-26  2:09                                 ` Huang, Ying [this message]
2024-07-26  5:09                                   ` Chris Li
2024-07-26  6:02                                     ` Huang, Ying
2024-07-26  7:15                                       ` Chris Li
2024-07-26  7:42                                         ` Huang, Ying
2024-07-25  6:53                           ` Huang, Ying
2024-07-25  8:26                             ` Chris Li
2024-07-26  2:04                               ` Huang, Ying
2024-07-26  4:50                                 ` Chris Li
2024-07-26  5:52                                   ` Huang, Ying
2024-07-26  7:10                                     ` Chris Li
2024-07-26  7:18                                       ` Huang, Ying
2024-07-26  7:26                                         ` Chris Li
2024-07-26  7:37                                           ` Huang, Ying
2024-07-11  7:29 ` [PATCH v4 3/3] RFC: mm: swap: seperate SSD allocation from scan_swap_map_slots() Chris Li
2024-07-11 10:02 ` [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order Ryan Roberts
2024-07-11 14:08   ` Chris Li
2024-07-15 14:10     ` Ryan Roberts
2024-07-15 18:14       ` Chris Li
2024-07-18  5:50 ` Huang, Ying
2024-07-26  5:51   ` Chris Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87v80s3nvs.fsf@yhuang6-desk2.ccr.corp.intel.com \
    --to=ying.huang@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=chrisl@kernel.org \
    --cc=hughd@google.com \
    --cc=kaleshsingh@google.com \
    --cc=kasong@tencent.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ryan.roberts@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox