From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
 Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 Barry Song
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
In-Reply-To: (Chris Li's message of "Thu, 25 Jul 2024 01:26:36 -0700")
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
 <20240711-swap-allocator-v4-2-0295a4d4c7aa@kernel.org>
 <874j8nxhiq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o76qjhqs.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <43f73463-af42-4a00-8996-5f63bdf264a3@arm.com>
 <87jzhdkdzv.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sew0ei84.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <4ec149fc-7c13-4777-bc97-58ee455a3d7e@arm.com>
 <87le1q6jyo.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 26 Jul 2024 10:04:08 +0800
Message-ID: <87zfq43o4n.fsf@yhuang6-desk2.ccr.corp.intel.com>

Chris Li writes:

> On Wed, Jul 24, 2024 at 11:57 PM Huang, Ying wrote:
>>
>> Ryan Roberts writes:
>>
>> > On 23/07/2024 07:27, Huang, Ying wrote:
>> >> Ryan Roberts writes:
>> >>
>> >>> On 22/07/2024 09:49, Huang, Ying wrote:
>> >>>> Ryan Roberts writes:
>> >>>>
>> >>>>> On 22/07/2024 03:14, Huang, Ying wrote:
>> >>>>>> Ryan Roberts writes:
>> >>>>>>
>> >>>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>> >>>>>>>> Chris Li writes:
>> >>>>>>>>
>> >>>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>> >>>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>> >>>>>>
>> >>>>>> [snip]
>> >>>>>>
>> >>>>>>>>>>>>> +
>> >>>>>>>>>>>>> +        if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> >>>>>>>>>>>>> +                list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>> >>>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): it is
>> >>>>>>>>>>>> added to the list whenever there is at least one free swap entry, if not
>> >>>>>>>>>>>> already on the list. But you take it off the list when assigning it as the
>> >>>>>>>>>>>> current cluster for a cpu in scan_swap_map_try_ssd_cluster().
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> So you could have this situation:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>   - cpuA allocs cluster from free list (exclusive to that cpu)
>> >>>>>>>>>>>>   - cpuA allocs 1 swap entry from current cluster
>> >>>>>>>>>>>>   - swap entry is freed; cluster added to nonfull_clusters
>> >>>>>>>>>>>>   - cpuB "allocs" cluster from nonfull_clusters
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>> >>>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>> >>>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>> >>>>>>>>>>>
>> >>>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>> >>>>>>>>>>> moving it into nonfull.
>> >>>>>>>>>>
>> >>>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>> >>>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code
>> >>>>>>>>>> to separate the SSD and HDD code paths. Personally I'd prefer to see the
>> >>>>>>>>>> refactoring separated from behavioural changes.
>> >>>>>>>>>
>> >>>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>> >>>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>> >>>>>>>>> removing the brute force scanning of the swap_map[] array for cluster
>> >>>>>>>>> swap allocation.
>> >>>>>>>>>
>> >>>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I
>> >>>>>>>>>> was deferring review. But it sounds like it is actually required to
>> >>>>>>>>>> realize the test results quoted on the cover letter?
>> >>>>>>>>>
>> >>>>>>>>> Yes, required because it handles the previous fall-out case where
>> >>>>>>>>> try_ssd() failed. This big rewrite has gone through a lot of testing and
>> >>>>>>>>> bug fixing. It is pretty stable now. The only reason I keep it as RFC is
>> >>>>>>>>> because it is not feature complete. Currently it does not do swap
>> >>>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>> >>>>>>>>> remove the RFC.
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely
>> >>>>>>>>>>>> doesn't have room for an `order` allocation)? Then you allow "stealing"
>> >>>>>>>>>>>> always instead of just sometimes. You would likely want to move the
>> >>>>>>>>>>>> cluster to the end of the nonfull list when selecting it in
>> >>>>>>>>>>>> scan_swap_map_try_ssd_cluster() to reduce the chances of multiple CPUs
>> >>>>>>>>>>>> using the same cluster.
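
(For reference, the shared policy plus tail rotation described above
could look roughly like the untested sketch below.  Field names follow
this patch; pick_nonfull() is just an illustrative helper name, and
si->lock is assumed held.)

/* Untested sketch: under the shared policy a cluster stays on the
 * per-order nonfull list from allocation until it is full.  A CPU
 * selecting it rotates it to the tail so concurrent CPUs tend to
 * pick different clusters instead of piling onto the same one. */
static struct swap_cluster_info *pick_nonfull(struct swap_info_struct *p,
                                              int order)
{
        struct swap_cluster_info *ci;

        if (list_empty(&p->nonfull_clusters[order]))
                return NULL;
        ci = list_first_entry(&p->nonfull_clusters[order],
                              struct swap_cluster_info, list);
        list_move_tail(&ci->list, &p->nonfull_clusters[order]);
        return ci;
}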
Then you allow "steal= ing" always instead >> >>>>>>>>>>>> of just sometimes. You would likely want to move the cluste= r to the end of the >> >>>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_clu= ster() to reduce the >> >>>>>>>>>>>> chances of multiple CPUs using the same cluster. >> >>>>>>>>>>> >> >>>>>>>>>>> For nonfull clusters it is less important to avoid multiple = CPU >> >>>>>>>>>>> sharing the cluster. Because the cluster already has previou= s swap >> >>>>>>>>>>> entries allocated from the previous CPU. >> >>>>>>>>>> >> >>>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogic= al case where cpuA >> >>>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all th= e free pages and >> >>>>>>>>> >> >>>>>>>>> That happens to exist per cpu next pointer already. When the o= ther CPU >> >>>>>>>>> advances to the next cluster pointer, it can cross with the ot= her >> >>>>>>>>> CPU's next cluster pointer. >> >>>>>>>> >> >>>>>>>> No. si->percpu_cluster[cpu].next will keep in the current per = cpu >> >>>>>>>> cluster only. If it doesn't do that, we should fix it. >> >>>>>>>> >> >>>>>>>> I agree with Ryan that we should make per cpu cluster correct. = A >> >>>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. W= hen we >> >>>>>>>> scan to the end of a per cpu cluster, we can put the cluster in= nonfull >> >>>>>>>> list if necessary. And, we should make it correct in this patc= h instead >> >>>>>>>> of later in series. I understand that you want to make the pat= ch itself >> >>>>>>>> simple, but it's important to make code simple to be understood= too. >> >>>>>>>> Consistent design choice will do that. >> >>>>>>> >> >>>>>>> I think I'm actually arguing for the opposite of what you sugges= t here. >> >>>>>> >> >>>>>> Sorry, I misunderstood your words. >> >>>>>> >> >>>>>>> As I see it, there are 2 possible approaches; either a cluster i= s always >> >>>>>>> considered exclusive to a single cpu when its set as a per-cpu c= luster, so it >> >>>>>>> does not appear on the nonfull list. Or a cluster is considered = sharable in this >> >>>>>>> case, in which case it should be added to the nonfull list. >> >>>>>>> >> >>>>>>> The code at the moment sort of does both; when a cpu decides to = use a cluster in >> >>>>>>> the nonfull list, it removes it from that list to make it exclus= ive. But as soon >> >>>>>>> as a single swap entry is freed from that cluster it is put back= on the list. >> >>>>>>> This neither-one-policy-nor-the-other seems odd to me. >> >>>>>>> >> >>>>>>> I think Huang, Ying is arguing to keep it always exclusive while= installed as a >> >>>>>>> per-cpu cluster. >> >>>>>> >> >>>>>> Yes. >> >>>>>> >> >>>>>>> I was arguing to make it always shared. Perhaps the best >> >>>>>>> approach is to implement the exclusive policy in this patch (you= 'd need a flag >> >>>>>>> to note if any pages were freed while in exclusive use, then whe= n exclusive use >> >>>>>>> completes, put it back on the nonfull list if the flag was set).= Then migrate to >> >>>>>>> the shared approach as part of the "big rewrite"? >> >>>>>>>> >> >>>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I= think do want to >> >>>>>>>>>> share the cluster when you really need to, but try to avoid i= t if there are >> >>>>>>>>>> other options, and I think moving the cluster to the end of t= he list might be a >> >>>>>>>>>> way to help that? 
>> >>>>>>>>>
>> >>>>>>>>> Simply moving to the end of the list can create a possible dead loop
>> >>>>>>>>> when all clusters have been scanned and no available swap range has
>> >>>>>>>>> been found.
>> >>>>>>
>> >>>>>> I also think that the shared approach has the dead loop issue.
>> >>>>>
>> >>>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the
>> >>>>> code won't know when to stop dequeuing/requeuing clusters on the nonfull
>> >>>>> list and will go on forever? That's surely just an implementation issue to
>> >>>>> solve? It's not a reason to avoid the design principle; if we agree that
>> >>>>> maintaining sharability of the cluster is preferred then the code must be
>> >>>>> written to guard against the dead loop problem. It could be done by
>> >>>>> remembering the first cluster you dequeued/requeued in
>> >>>>> scan_swap_map_try_ssd_cluster() and stopping when you get back to it. (I
>> >>>>> think holding the si lock will protect against concurrently freeing the
>> >>>>> cluster, so it should definitely remain in the list?).
>> >>>>
>> >>>> I believe that you can find some way to avoid the dead loop issue,
>> >>>> although your suggestion may kill the performance via looping over a long
>> >>>> list of nonfull clusters.
>> >>>
>> >>> I don't agree; if the clusters are considered exclusive (i.e. removed from
>> >>> the list when made current for a cpu), that only reduces the size of the
>> >>> list by a maximum of the number of CPUs in the system, which I suspect is
>> >>> pretty small compared to the number of nonfull clusters.
>> >>
>> >> Anyway, this depends on details. If we cannot allocate an order-N swap
>> >> entry from the cluster, we should remove it from the nonfull list for
>> >> order-N (this is the behavior of this patch too).
>> >
>> > Yes, that's a good point, and I concede it is more difficult to detect that
>> > condition if the cluster is shared. I suspect that with a bit of thinking,
>> > we could find a way though.
>> >
>> >> Your original
>> >> suggestion sounds like you want to keep all clusters with order N on the
>> >> nonfull list for order N always, unless the number of free swap entries is
>> >> less than 1<<N.
>> >
>> > Well I think that's certainly one of the conditions for removing it. But I
>> > agree that if a full scan of the cluster has been performed and no swap
>> > entries have been freed since the scan started, then it should also be
>> > removed from the list.
>> >
>> >>
>> >>>> And, I understand that in some situations it may
>> >>>> be better to share clusters among CPUs. So my suggestion is,
>> >>>>
>> >>>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >>>>   have free swap entries with that order even after we are sure that we
>> >>>>   haven't.
>> >>>
>> >>> Is this patch pretending that today? I don't think so?
>> >>
>> >> IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> sure that there are no order-N free swap entries in the cluster.
>> >
>> > Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> > Chris's point was that if you move that cluster to N-1, eventually all
>> > clusters are for order-0 and you have no means of allocating high orders
>> > until a whole cluster becomes free. That logic certainly makes sense to me,
>> > so I think it's better for swap_cluster_info->order to remain static while
>> > the cluster is allocated. (I only skimmed that conversation, so apologies
>> > if I got the conclusion wrong!).
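
(The "remember the first cluster" guard discussed above could look
roughly like the untested sketch below; cluster_alloc_range() stands in
for the real per-cluster allocation attempt and is a made-up name.)

/* Untested sketch of bounding the nonfull-list walk: remember the
 * first cluster we rotate and give up once it comes around again.
 * si->lock is assumed held, so clusters cannot be freed under us. */
static struct swap_cluster_info *
scan_nonfull_bounded(struct swap_info_struct *p, int order)
{
        struct list_head *head = &p->nonfull_clusters[order];
        struct swap_cluster_info *ci, *first = NULL;

        while (!list_empty(head)) {
                ci = list_first_entry(head, struct swap_cluster_info, list);
                if (ci == first)
                        break;  /* wrapped around: every cluster was tried */
                if (!first)
                        first = ci;
                if (cluster_alloc_range(p, ci, order))  /* hypothetical */
                        return ci;
                /* No room for this order here; rotate and try the next. */
                list_move_tail(&ci->list, head);
        }
        return NULL;
}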
>> >
>> >>
>> >>> But I agree that a
>> >>> cluster should only be on the per-order nonfull list if we know there are
>> >>> at least enough free swap entries in that cluster to cover the order. Of
>> >>> course that doesn't tell us for sure, because they may not be contiguous.
>> >>
>> >> We can check that when freeing a swap entry, via checking adjacent swap
>> >> entries. IMHO, the performance should be acceptable.
>> >
>> > Would you then use the result of that scanning to "promote" a cluster's
>> > order, e.g. swap_cluster_info->order = N+1? That would be neat. But this
>> > all feels like a separate change on top of what Chris is doing here. For
>> > high orders there could be quite a bit of scanning required in the worst
>> > case for every page that gets freed.
>>
>> We can try to optimize it to control the overhead if necessary.
>>
>> >
>> >>>>
>> >>>> My question is whether it's so important to share the per-cpu cluster
>> >>>> among CPUs?
>> >>>
>> >>> My rationale for sharing is that the preference previously has been to
>> >>> favour efficient use of swap space; we don't want to fail a request for
>> >>> allocation of a given order if there are actually slots available just
>> >>> because they have been reserved by another CPU. And I'm still asserting
>> >>> that it should be ~zero cost to do this. If I'm wrong about the zero cost,
>> >>> or in practice the sharing doesn't actually help improve allocation
>> >>> success, then I'm happy to take the exclusive approach.
>> >>>
>> >>>> I suggest starting with a simple design, that is, the per-CPU
>> >>>> cluster will not be shared among CPUs in most cases.
>> >>>
>> >>> I'm all for starting simple; I think that's what I already proposed
>> >>> (exclusive in this patch, then shared in the "big rewrite"). I'm just
>> >>> objecting to the current half-and-half policy in this patch.
>> >>
>> >> Sounds good to me. We can start with the exclusive solution and evaluate
>> >> whether the shared solution is better.
>> >
>> > Yep. And also evaluate the dynamic order inc/dec idea too...
>>
>> Dynamic order inc/dec tries to solve a more fundamental problem. For
>> example,
>>
>> - Initially, almost only order-0 pages are swapped out, so most non-full
>>   clusters are order-0.
>>
>> - Later, quite some order-0 swap entries are freed, so that there are
>>   quite some order-4 swap entries available.
>
> If freed swap entries follow a random distribution, you need 16
> contiguous swap entries to be free at the same time at an aligned
> 16-entry base location. The total amount of allocatable order-4 swap
> space, added up, is much lower than the allocatable order-0 swap space.
> If one entry being free has 50% probability (swapfile half full), then
> the probability that 16 particular swap entries are all free is
> 0.5^16 ~= 1.5e-5. If the swapfile is 80% full, that number drops to
> 0.2^16 ~= 6.5e-12.

This depends on the workload. Quite some workloads will show some
degree of spatial locality. For a workload with no spatial locality at
all, as above, mTHP may not be a good choice in the first place.

>> - Order-4 pages need to be swapped out, but not enough order-4 non-full
>>   clusters are available.
>
> Exactly.
>
>>
>> So, we need a way to migrate non-full clusters among orders to adjust to
>> the various situations automatically.
>
> There is no easy way to migrate swap entries to different locations.
> That is why I would like to have discontiguous swap entry allocation
> for mTHP.

We suggest migrating non-full swap clusters among the different lists,
not swap entries.
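
(With the data structures in this patch, such a migration could look
roughly like the untested sketch below; deciding when to trigger it,
e.g. after detecting an aligned run of 1 << new_order free entries, is
the hard part and is omitted.  migrate_nonfull_cluster() is just an
illustrative name.)

/* Untested sketch: move a non-full cluster to a different per-order
 * nonfull list and retag its order.  si->lock is assumed held. */
static void migrate_nonfull_cluster(struct swap_info_struct *p,
                                    struct swap_cluster_info *ci,
                                    int new_order)
{
        if (ci->flags & CLUSTER_FLAG_NONFULL) {
                /* Already on some nonfull list: just relink it. */
                list_move_tail(&ci->list, &p->nonfull_clusters[new_order]);
        } else {
                list_add_tail(&ci->list, &p->nonfull_clusters[new_order]);
                ci->flags |= CLUSTER_FLAG_NONFULL;
        }
        ci->order = new_order;
}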
>>
>> But yes, data is needed for any performance related change.

BTW: I think non-full cluster isn't a good name.  Partial cluster is
much better and follows the same convention as partial slab.

--
Best Regards,
Huang, Ying