From: Ryan Roberts <ryan.roberts@arm.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Chris Li <chrisl@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Kairui Song <kasong@tencent.com>, Hugh Dickins <hughd@google.com>,
Kalesh Singh <kaleshsingh@google.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Barry Song <baohua@kernel.org>
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
Date: Wed, 24 Jul 2024 09:33:28 +0100 [thread overview]
Message-ID: <4ec149fc-7c13-4777-bc97-58ee455a3d7e@arm.com> (raw)
In-Reply-To: <87sew0ei84.fsf@yhuang6-desk2.ccr.corp.intel.com>
On 23/07/2024 07:27, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
>> On 22/07/2024 09:49, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> On 22/07/2024 03:14, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>>>>> Chris Li <chrisl@kernel.org> writes:
>>>>>>>
>>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>>>>
>>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>>>>
>>>>>>>>>>> So you could have this situation:
>>>>>>>>>>>
>>>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>>>>
>>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>>>>
>>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>>>>>> moving it into nonfull.
>>>>>>>>>
>>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>>>>> refactoring separated from behavioural changes.
>>>>>>>>
>>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>>>>>> allocation.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>>>>>> results quoted on the cover letter?
>>>>>>>>
>>>>>>>> Yes, required because it handles the previous fallout case where try_ssd()
>>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>>>>> fixing. It is pretty stable now. The only reason I keep it as RFC is
>>>>>>>> because it is not feature complete. Currently it does not do swap
>>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>>>>> remove the RFC.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>>>>
>>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>>>>
>>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPUs
>>>>>>>>>> sharing the cluster, because the cluster already has swap
>>>>>>>>>> entries allocated by a previous CPU.
>>>>>>>>>
>>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
>>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>>>>
>>>>>>>> That can already happen with the per cpu next pointer. When one CPU
>>>>>>>> advances to its next cluster, it can cross with the other
>>>>>>>> CPU's next cluster pointer.
>>>>>>>
>>>>>>> No. si->percpu_cluster[cpu].next will stay within the current per cpu
>>>>>>> cluster only. If it doesn't do that, we should fix it.
>>>>>>>
>>>>>>> I agree with Ryan that we should make per cpu cluster correct. A
>>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
>>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>>>>>> list if necessary. And, we should make it correct in this patch instead
>>>>>>> of later in the series. I understand that you want to make the patch itself
>>>>>>> simple, but it's important to make the code easy to understand too.
>>>>>>> A consistent design choice will do that.
>>>>>>
>>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>>>>>
>>>>> Sorry, I misunderstood your words.
>>>>>
>>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>>>>>> considered exclusive to a single cpu when it's set as a per-cpu cluster, so it
>>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>>>>> case, in which case it should be added to the nonfull list.
>>>>>>
>>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>>>>>> This neither-one-policy-nor-the-other seems odd to me.
>>>>>>
>>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>>>>> per-cpu cluster.
>>>>>
>>>>> Yes.
>>>>>
>>>>>> I was arguing to make it always shared. Perhaps the best
>>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>>>>> the shared approach as part of the "big rewrite"?
>>>>>>>
>>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think we do want to
>>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>>>>> way to help that?
>>>>>>>>
>>>>>>>> Simply moving to the end of the list can create a possible dead loop
>>>>>>>> when all clusters have been scanned and no available swap range
>>>>>>>> has been found.
>>>>>
>>>>> I also think that the shared approach has a dead loop issue.
>>>>
>>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>>>> won't know when to stop dequeuing/requeuing clusters on the nonfull list and will
>>>> go forever? That's surely just an implementation issue to solve? It's not a
>>>> reason to avoid the design principle; if we agree that maintaining sharability
>>>> of the cluster is preferred then the code must be written to guard against the
>>>> dead loop problem. It could be done by remembering the first cluster you
>>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>>>> to it. (I think holding the si lock will protect against concurrently freeing
>>>> the cluster so it should definitely remain in the list?).
>>>
>>> I believe that you can find some way to avoid the dead loop issue,
>>> although your suggestion may kill performance by looping over a long list
>>> of nonfull clusters.
>>
>> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>> list when made current for a cpu), that only reduces the size of the list by a
>> maximum of the number of CPUs in the system, which I suspect is pretty small
>> compared to the number of nonfull clusters.
>
> Anyway, this depends on details. If we cannot allocate an order-N swap
> entry from the cluster, we should remove it from the nonfull list for
> order-N (This is the behavior of this patch too).
Yes that's a good point, and I concede it is more difficult to detect that
condition if the cluster is shared. I suspect that with a bit of thinking, we
could find a way though.
> Your original
> suggestion appears to be that you want to keep all clusters with order-N
> on the nonfull list for order-N always, unless the number of free swap
> entries is less than 1<<N.
Well I think that's certainly one of the conditions for removing it. But I agree
that if a full scan of the cluster has been performed and no swap entries have
been freed since the scan started, then it should also be removed from the list.
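Something like the below is what I have in mind. This is very much a sketch on
top of the structures this series adds (swap_cluster_info.flags/order/list and
si->nonfull_clusters[]); CLUSTER_FLAG_FREED_SINCE_SCAN is a made-up flag that
the free path would set whenever it frees an entry in the cluster:

static void cluster_scan_failed(struct swap_info_struct *si,
				struct swap_cluster_info *ci)
{
	/* A free raced with our scan; a later scan may still succeed. */
	if (ci->flags & CLUSTER_FLAG_FREED_SINCE_SCAN) {
		ci->flags &= ~CLUSTER_FLAG_FREED_SINCE_SCAN;
		return;
	}

	/*
	 * A full scan found no room for an order-sized allocation and
	 * nothing was freed in the meantime, so drop the cluster from
	 * the per-order nonfull list; a later free can re-add it.
	 */
	if (ci->flags & CLUSTER_FLAG_NONFULL) {
		list_del(&ci->list);
		ci->flags &= ~CLUSTER_FLAG_NONFULL;
	}
}

The caller would be scan_swap_map_try_ssd_cluster() (or whatever replaces it in
the big rewrite), with the relevant lock held, once it has walked the whole
cluster without finding a free range of the requested order.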
>
>>> And, I understand that in some situations it may
>>> be better to share clusters among CPUs. So my suggestion is,
>>>
>>> - Make swap_cluster_info->order more accurate, don't pretend that we
>>> have free swap entries with that order even after we are sure that we
>>> haven't.
>>
>> Is this patch pretending that today? I don't think so?
>
> IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> sure that there are no order-N free swap entries in the cluster.
Oh I see what you mean. I think you and Chris already discussed this? IIRC
Chris's point was that if you move that cluster to N-1, eventually all clusters
are for order-0 and you have no means of allocating high orders until a whole
cluster becomes free. That logic certainly makes sense to me, so I think it's
better for swap_cluster_info->order to remain static while the cluster is
allocated. (I only skimmed that conversation so apologies if I got the
conclusion wrong!)
>
>> But I agree that a
>> cluster should only be on the per-order nonfull list if we know there are at
>> least enough free swap entries in that cluster to cover the order. Of course
>> that doesn't tell us for sure because they may not be contiguous.
>
> We can check that when freeing a swap entry, by checking the adjacent swap
> entries. IMHO, the performance should be acceptable.
Would you then use the result of that scanning to "promote" a cluster's order?
e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
a separate change on top of what Chris is doing here. For high orders there
could be quite a bit of scanning required in the worst case for every page that
gets freed.
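For reference, the sort of check I imagine on the free path (just a sketch, not
something from this series; it assumes the cluster lock or si->lock is held so
that swap_map[] is stable):

static bool order_range_free(struct swap_info_struct *si,
			     unsigned long offset, int order)
{
	unsigned long start = ALIGN_DOWN(offset, 1UL << order);
	unsigned long i;

	/* Is the naturally aligned 1 << order range around @offset all free? */
	for (i = start; i < start + (1UL << order); i++) {
		if (si->swap_map[i])
			return false;
	}
	return true;
}

If that returns true when freeing the last used entry in the range, the cluster
could be re-listed on (or promoted to) nonfull_clusters[order]. The cost is the
loop over up to 1 << order swap_map entries on every free, which is the scanning
overhead I was worried about for the larger orders.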
>
>>>
>>> My question is whether it's so important to share the per-cpu cluster
>>> among CPUs?
>>
>> My rationale for sharing is that the preference previously has been to favour
>> efficient use of swap space; we don't want to fail a request for allocation of a
>> given order if there are actually slots available just because they have been
>> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> actually help improve allocation success, then I'm happy to take the exclusive
>> approach.
>>
>>> I suggest to start with simple design, that is, per-CPU
>>> cluster will not be shared among CPUs in most cases.
>>
>> I'm all for starting simple; I think that's what I already proposed (exclusive
>> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> current half-and-half policy in this patch.
>
> Sounds good to me. We can start with exclusive solution and evaluate
> whether shared solution is good.
Yep. And also evaluate the dynamic order inc/dec idea too...
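Just so we're talking about the same thing, the exclusive variant I'd expect in
this patch is roughly the below. Sketch only: the helper names and the
CLUSTER_FLAG_EXCLUSIVE / CLUSTER_FLAG_FREED_WHILE_EXCL flags are all
hypothetical, and the free path would set the latter instead of re-adding the
cluster to the nonfull list while the former is set:

/* Called when a cluster becomes a CPU's current per-cpu cluster. */
static void cluster_make_exclusive(struct swap_info_struct *si,
				   struct swap_cluster_info *ci)
{
	if (ci->flags & CLUSTER_FLAG_NONFULL) {
		list_del(&ci->list);
		ci->flags &= ~CLUSTER_FLAG_NONFULL;
	}
	ci->flags |= CLUSTER_FLAG_EXCLUSIVE;
}

/* Called when the CPU moves off the cluster (full, or scan failed). */
static void cluster_drop_exclusive(struct swap_info_struct *si,
				   struct swap_cluster_info *ci)
{
	ci->flags &= ~CLUSTER_FLAG_EXCLUSIVE;

	/* Entries were freed while we held it, so let others use it. */
	if ((ci->flags & CLUSTER_FLAG_FREED_WHILE_EXCL) &&
	    ci->count < SWAPFILE_CLUSTER) {
		list_add_tail(&ci->list, &si->nonfull_clusters[ci->order]);
		ci->flags |= CLUSTER_FLAG_NONFULL;
	}
	ci->flags &= ~CLUSTER_FLAG_FREED_WHILE_EXCL;
}

That keeps the cluster off the nonfull list for the whole time it is installed
as a per-cpu cluster, which I think matches what we're agreeing to here.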
>
>>>
>>> Another choice for sharing is when we run short of free swap space, we
>>> disable per-CPU cluster and allocate from the shared non-full cluster
>>> list directly.
>>>
>>>> Which actually makes me wonder; what is the mechanism that prevents the current
>>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>>>> raise the count when it's current. (If Chris has implemented that in the "big
>>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>>>
>>> Yes. We may need a flag for that.
>>>
>>>>>
>>>>>>> This is another reason that we should put the cluster in
>>>>>>> nonfull_clusters[order--] if there are no free swap entries with "order"
>>>>>>> in the cluster. It makes the design complex to keep it in
>>>>>>> nonfull_clusters[order].
>>>>>>>
>>>>>>>> We have tried many different approaches including moving to the end of
>>>>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>>>>> their swap slot cache (64 entries) from a different cluster.
>>>>>>>>
>>>>>>>>>> Those behaviors will be fine
>>>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>>>>>
>>>>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>>>>
>>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>>>>
>>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>>>>> to introduce the nonfull list.
>>>>>>>>>>
>>>>>
>>>>> [snip]
>
> --
> Best Regards,
> Huang, Ying