From: Chris Li <chrisl@kernel.org>
Date: Thu, 25 Jul 2024 01:09:03 -0700
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
To: "Huang, Ying"
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
In-Reply-To: <87plr26kg2.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org> <20240711-swap-allocator-v4-2-0295a4d4c7aa@kernel.org> <874j8nxhiq.fsf@yhuang6-desk2.ccr.corp.intel.com> <87o76qjhqs.fsf@yhuang6-desk2.ccr.corp.intel.com> <43f73463-af42-4a00-8996-5f63bdf264a3@arm.com> <87jzhdkdzv.fsf@yhuang6-desk2.ccr.corp.intel.com> <87sew0ei84.fsf@yhuang6-desk2.ccr.corp.intel.com> <4ec149fc-7c13-4777-bc97-58ee455a3d7e@arm.com> <87plr26kg2.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Wed, Jul 24, 2024 at 11:46 PM Huang, Ying wrote:
>
> Chris Li writes:
>
> > Hi Ryan and Ying,
> >
> > Sorry I was busy. I am catching up on the email now.
> >
> > On Wed, Jul 24, 2024 at 1:33 AM Ryan Roberts wrote:
> >>
> >> On 23/07/2024 07:27, Huang, Ying wrote:
> >> > Ryan Roberts writes:
> >> >
> >> >> On 22/07/2024 09:49, Huang, Ying wrote:
> >> >>> Ryan Roberts writes:
> >> >>>
> >> >>>> On 22/07/2024 03:14, Huang, Ying wrote:
> >> >>>>> Ryan Roberts writes:
> >> >>>>>
> >> >>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
> >> >>>>>>> Chris Li writes:
> >> >>>>>>>
> >> >>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
> >> >>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
> >> >>>>>
> >> >>>>> [snip]
> >> >>>>>
> >> >>>>>>>>>>>> +
> >> >>>>>>>>>>>> +	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >> >>>>>>>>>>>> +		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I find the transitions when you add and remove a cluster from the nonfull_clusters list a bit strange (if I've understood correctly): it is added to the list whenever there is at least one free swap entry, if not already on the list. But you take it off the list when assigning it as the current cluster for a cpu in scan_swap_map_try_ssd_cluster().
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> So you could have this situation:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
> >> >>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
> >> >>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
> >> >>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current cluster. So why not just put the cluster on the nonfull_clusters list at allocation time (when removed from free_list) and only remove it from the
> >> >>>>>>>>>>
> >> >>>>>>>>>> The big rewrite in patch 3 does that, taking it off the free list and moving it into nonfull.
> >> >>>>>>>>>
> >> >>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from scan_swap_map_slots()", I assumed that was just a refactoring of the code to separate the SSD and HDD code paths. Personally I'd prefer to see the refactoring separated from behavioural changes.
> >> >>>>>>>>
> >> >>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator using the cluster. Behavior change is expected. The goal is completely removing the brute-force scanning of the swap_map[] array for cluster swap allocation.
> >> >>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was deferring review. But it sounds like it is actually required to realize the test results quoted on the cover letter?
> >> >>>>>>>>
> >> >>>>>>>> Yes, required, because it handles the previous fallout case where try_ssd() failed. This big rewrite has gone through a lot of testing and bug fixing. It is pretty stable now.
> >> >>>>>>>> The only reason I keep it as RFC is because it is not feature complete. Currently it does not do swap cache reclaim. The next version will have swap cache reclaim and remove the RFC.
> >> >>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
> >> >>>>>>>>>>
> >> >>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't have room for an `order` allocation)? Then you allow "stealing" always instead of just sometimes. You would likely want to move the cluster to the end of the nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the chances of multiple CPUs using the same cluster.
> >> >>>>>>>>>>
> >> >>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPUs sharing the cluster, because the cluster already has previous swap entries allocated from the previous CPU.
> >> >>>>>>>>>
> >> >>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA could be slightly ahead of cpuB so that cpuA allocates all the free pages and
> >> >>>>>>>>
> >> >>>>>>>> That already exists with the per-cpu next pointer. When one CPU advances to the next cluster pointer, it can cross with the other CPU's next cluster pointer.
> >> >>>>>>>
> >> >>>>>>> No. si->percpu_cluster[cpu].next will stay within the current per-cpu cluster only. If it doesn't do that, we should fix it.
> >> >>>>>>>
> >> >>>>>>> I agree with Ryan that we should make the per-cpu cluster correct. A cluster used as a per-cpu cluster shouldn't be put in the nonfull list. When we scan to the end of a per-cpu cluster, we can put the cluster in the nonfull list if necessary. And we should make it correct in this patch instead of later in the series. I understand that you want to make the patch itself simple, but it's important to make the code simple to understand too. A consistent design choice will do that.
> >> >>>>>>
> >> >>>>>> I think I'm actually arguing for the opposite of what you suggest here.
> >> >>>>>
> >> >>>>> Sorry, I misunderstood your words.
> >> >>>>>
> >> >>>>>> As I see it, there are 2 possible approaches; either a cluster is always considered exclusive to a single cpu when it's set as a per-cpu cluster, so it does not appear on the nonfull list. Or a cluster is considered sharable in this case, in which case it should be added to the nonfull list.
> >> >>>>>>
> >> >>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in the nonfull list, it removes it from that list to make it exclusive. But as soon as a single swap entry is freed from that cluster it is put back on the list. This neither-one-policy-nor-the-other seems odd to me.
> >> >>>>>>
> >> >>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a per-cpu cluster.
> >> >>>>>
> >> >>>>> Yes.
> >> >>>>>
> >> >>>>>> I was arguing to make it always shared.
> >> >>>>>> Perhaps the best approach is to implement the exclusive policy in this patch (you'd need a flag to note if any pages were freed while in exclusive use, then when exclusive use completes, put it back on the nonfull list if the flag was set). Then migrate to the shared approach as part of the "big rewrite"?
> >> >>>>>>>
> >> >>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think you do want to share the cluster when you really need to, but try to avoid it if there are other options, and I think moving the cluster to the end of the list might be a way to help that?
> >> >>>>>>>>
> >> >>>>>>>> Simply moving to the end of the list can create a possible dead loop when all clusters have been scanned and no available swap range has been found.
> >> >>>>>
> >> >>>>> I also think that the shared approach has the dead loop issue.
> >> >>>>
> >> >>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code won't know when to stop dequeuing/requeuing clusters on the nonfull list and will go forever? That's surely just an implementation issue to solve? It's not a reason to avoid the design principle; if we agree that maintaining sharability of the cluster is preferred then the code must be written to guard against the dead loop problem. It could be done by remembering the first cluster you dequeued/requeued in scan_swap_map_try_ssd_cluster() and stopping when you get back to it. (I think holding the si lock will protect against concurrently freeing the cluster, so it should definitely remain in the list?)
> >> >>>
> >> >>> I believe that you can find some way to avoid the dead loop issue, although your suggestion may kill the performance via looping over a long list of nonfull clusters.
> >> >>
> >> >> I don't agree; if the clusters are considered exclusive (i.e. removed from the list when made current for a cpu), that only reduces the size of the list by a maximum of the number of CPUs in the system, which I suspect is pretty small compared to the number of nonfull clusters.
> >> >
> >> > Anyway, this depends on details. If we cannot allocate an order-N swap entry from the cluster, we should remove it from the nonfull list for order-N (this is the behavior of this patch too).
> >
> > Yes, Kairui implements something like that in the reclaim part of the patch series. It is after patch 3. We are heavily testing the performance and the stability of the reclaim patches. I may post the reclaim patches together with patch 3 for discussion. If you want, we can discuss re-ordering the patches in a later iteration.
> >
> >>
> >> Yes, that's a good point, and I concede it is more difficult to detect that condition if the cluster is shared. I suspect that with a bit of thinking we could find a way though.
> >
> > Kairui has the patch series showing good performance numbers that beat the current swap cache reclaim.
> >
> > I want to make a point regarding the patch ordering before vs after patch 3 (aka the big rewrite).
> > Previously, scan_swap_map_try_ssd_cluster() only did partial allocation. It does not successfully allocate a swap entry 100% of the time. Patch 3 makes the cluster allocation function return a swap entry 100% of the time.
> > There are no more fallback retry loops outside of the cluster allocation function. Also, the try_ssd function does not do swap cache reclaim, while the cluster allocation function will need to. These two have very different constraints.
> >
> > Therefore, adding a different cluster list head into scan_swap_map_try_ssd_cluster() would be a wasted investment of development time, in the sense that the function will need to be rewritten anyway and the end result is very different.
>
> I am not a big fan of implementing the final solution directly.
> Personally, I prefer to improve step by step.

The currently proposed order also improves things step by step. The only disagreement here is at which point in the patch order we introduce yet another list in addition to the nonfull one. I just feel that it does not make sense to invest in new code if that new code is going to be completely rewritten anyway in the next two patches.

Unless you mean we should not do the patch 3 big rewrite at all, and should continue the scan_swap_map_try_ssd_cluster() way of only doing half of the allocation job, letting scan_swap_map_slots() do the complex retry on top of try_ssd(). I feel that overall code is more complex and less maintainable.

> > That is why I want to make this change in a patch after patch 3. There is also the long test cycle after the modification to make sure the swap code path is stable. I am not resisting a change of patch order; it is just that this change can't simply land before patch 3, the big rewrite.
> >
> >>
> >> > Your original suggestion appears to be that you want to keep all clusters with order N on the nonfull list for order-N always, unless the number of free swap entries is less than 1<<N.
> >>
> >> Well, I think that's certainly one of the conditions for removing it. But I agree that if a full scan of the cluster has been performed and no swap entries have been freed since the scan started, then it should also be removed from the list.
> >
> > Yes, in a later patch, beyond patch 3, we have the almost-full cluster state for clusters that have been scanned and are not able to serve an order-N allocation.
> >
> >>
> >> >
> >> >>> And, I understand that in some situations it may be better to share clusters among CPUs. So my suggestion is,
> >> >>>
> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we have free swap entries with that order even after we are sure that we haven't.
> >> >>
> >> >> Is this patch pretending that today? I don't think so?
> >> >
> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are sure that there are no order-N free swap entries in the cluster.
> >>
> >> Oh, I see what you mean. I think you and Chris already discussed this? IIRC Chris's point was that if you move that cluster to N-1, eventually all clusters are for order-0 and you have no means of allocating high orders until a whole cluster becomes free. That logic certainly makes sense to me, so I think it's better for swap_cluster_info->order to remain static while the cluster is allocated. (I only skimmed that conversation, so apologies if I got the conclusion wrong!)
> >
> > Yes, that is the original intent: keep the cluster order as much as possible.
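
To make the bookkeeping we keep going back and forth on concrete, here is a minimal userspace sketch. The types and names below (struct cluster, nonfull[], take_nonfull()) are simplified stand-ins for swap_cluster_info, si->nonfull_clusters[] and the per-cpu cluster selection, not the code in this series: the order stays fixed for the cluster's lifetime, the cluster is added to the per-order nonfull list when one of its entries is freed, and in the exclusive variant it leaves the list while a CPU holds it as its current cluster.

/*
 * Illustrative sketch only, not the kernel code under review.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER 4

struct cluster {
	int order;		/* fixed at allocation time, never demoted */
	int count;		/* number of allocated entries             */
	bool on_nonfull;	/* mirrors CLUSTER_FLAG_NONFULL            */
	struct cluster *next;	/* simple singly linked nonfull list       */
};

static struct cluster *nonfull[MAX_ORDER + 1];

/* One swap entry in @ci has just been freed. */
static void cluster_entry_freed(struct cluster *ci)
{
	ci->count--;
	if (!ci->on_nonfull) {
		ci->next = nonfull[ci->order];
		nonfull[ci->order] = ci;
		ci->on_nonfull = true;
	}
}

/* A CPU takes a cluster as its per-cpu cluster: exclusive variant, so
 * the cluster is removed from the nonfull list until the CPU is done. */
static struct cluster *take_nonfull(int order)
{
	struct cluster *ci = nonfull[order];

	if (ci) {
		nonfull[order] = ci->next;
		ci->on_nonfull = false;
	}
	return ci;
}

int main(void)
{
	struct cluster c = { .order = 2, .count = 512 };

	cluster_entry_freed(&c);	/* frees one entry, lists the cluster */
	printf("cpu got cluster back: %d\n", take_nonfull(2) == &c);
	return 0;
}

The shared variant discussed above would simply keep the cluster on the list in take_nonfull() (perhaps rotating it to the tail), at the cost of needing the loop-termination guard Ryan describes.
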
> >
> >>
> >> >> But I agree that a cluster should only be on the per-order nonfull list if we know there are at least enough free swap entries in that cluster to cover the order. Of course, that doesn't tell us for sure, because they may not be contiguous.
> >> >
> >> > We can check that when freeing a swap entry, via checking the adjacent swap entries. IMHO, the performance should be acceptable.
> >>
> >> Would you then use the result of that scanning to "promote" a cluster's order? e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like a separate change on top of what Chris is doing here. For high orders there could be quite a bit of scanning required in the worst case for every page that gets freed.
> >
> > Right, I feel that is a different set of patches. Even this series is hard enough to review. That order promotion and demotion is heading towards a buddy system design. I want to point out that even a buddy system is not able to handle the case where the swapfile is almost full and the recently freed swap entries are not contiguous.
> >
> > We can invest in the buddy system, which doesn't handle all the fragmentation issues. Or, as I prefer, go directly to the discontiguous swap entry. We pay a price for the indirect mapping of swap entries, but it will solve the fragmentation issue 100%.
>
> It's good if we can solve the fragmentation issue 100%. Just need to
> pay attention to the cost.

By cost, do you mean the development cost or the run time cost (memory and CPU)?

> >
> >>
> >> >
> >> >>>
> >> >>> My question is whether it's so important to share the per-cpu cluster among CPUs?
> >> >>
> >> >> My rationale for sharing is that the preference previously has been to favour efficient use of swap space; we don't want to fail a request for allocation of a given order if there are actually slots available just because they have been reserved by another CPU. And I'm still asserting that it should be ~zero cost to do this. If I'm wrong about the zero cost, or in practice the sharing doesn't actually help improve allocation success, then I'm happy to take the exclusive approach.
> >> >>
> >> >>> I suggest to start with a simple design, that is, the per-CPU cluster will not be shared among CPUs in most cases.
> >> >>
> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive in this patch, then shared in the "big rewrite"). I'm just objecting to the current half-and-half policy in this patch.
> >> >
> >> > Sounds good to me. We can start with the exclusive solution and evaluate whether the shared solution is good.
> >>
> >> Yep. And also evaluate the dynamic order inc/dec idea too...
> >
> > It is not able to avoid fragmentation 100% of the time. I prefer the discontiguous swap entry as the next step, which guarantees forward progress; we will not be stuck in a situation where we are not able to allocate swap entries due to fragmentation.
>
> If my understanding is correct, the implementation complexity of the
> order promotion/demotion isn't at the same level as that of the
> discontiguous swap entry.

The discontiguous swap entry has higher complexity but a higher payout as well. It can get us to the place where cluster promotion/demotion can't.
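
For reference, the adjacent-entry check that Ying suggests and Ryan is worried about could look roughly like the sketch below. It is illustrative only; buddy_block_free() is a hypothetical helper, not code from this series, and swap_map here is a plain array standing in for the per-cluster slice of the usage counts (0 means free). It also makes the cost concern visible: every free has to scan up to 1 << (order + 1) slots before a promotion could be decided.

/*
 * Illustrative sketch of the "check adjacent swap entries on free" idea.
 */
#include <stdbool.h>
#include <stdio.h>

#define CLUSTER_SIZE 512

/*
 * Return true if the naturally aligned block of 1 << (order + 1) slots
 * containing @off is entirely free, i.e. the freed entry and its
 * neighbours could be treated as an order+1 region ("promotion").
 */
static bool buddy_block_free(const unsigned char *swap_map,
			     unsigned int off, unsigned int order)
{
	unsigned int nr = 1u << (order + 1);
	unsigned int start = off & ~(nr - 1);
	unsigned int i;

	if (start + nr > CLUSTER_SIZE)
		return false;
	for (i = 0; i < nr; i++)
		if (swap_map[start + i])
			return false;
	return true;
}

int main(void)
{
	unsigned char swap_map[CLUSTER_SIZE] = { 0 };

	swap_map[5] = 1;	/* one neighbouring slot still in use */
	printf("%d\n", buddy_block_free(swap_map, 4, 1));	/* 0: no promotion */
	swap_map[5] = 0;	/* neighbour freed as well */
	printf("%d\n", buddy_block_free(swap_map, 4, 1));	/* 1: could promote */
	return 0;
}
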
I also feel that if we implement something towards a buddy system allocator for swap, we should do a proper buddy allocator implementation of the data structures.

Chris