From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying"
To: Ryan Roberts
Cc: Chris Li, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh, Barry Song
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
In-Reply-To: (Ryan Roberts's message of "Mon, 22 Jul 2024 10:54:35 +0100")
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
 <20240711-swap-allocator-v4-2-0295a4d4c7aa@kernel.org>
 <874j8nxhiq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o76qjhqs.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <43f73463-af42-4a00-8996-5f63bdf264a3@arm.com>
 <87jzhdkdzv.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 23 Jul 2024 14:27:07 +0800
Message-ID: <87sew0ei84.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

Ryan Roberts writes:

> On 22/07/2024 09:49, Huang, Ying wrote:
>> Ryan Roberts writes:
>>
>>> On 22/07/2024 03:14, Huang, Ying wrote:
>>>> Ryan Roberts writes:
>>>>
>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>>>> Chris Li writes:
>>>>>>
>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts wrote:
>>>>>>>>
>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>>
>>>> [snip]
>>>>
>>>>>>>>>>> +
>>>>>>>>>>> +	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>>>> +		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>>>
>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): it is added
>>>>>>>>>> to the list whenever there is at least one free swap entry, if not already on the
>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>>>
>>>>>>>>>> So you could have this situation:
>>>>>>>>>>
>>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>>>
>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>>>
>>>>>>>>> The big rewrite in patch 3 does that, taking it off the free list and
>>>>>>>>> moving it into nonfull.
>>>>>>>>
>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>>>> scan_swap_map_slots()", I assumed that was just a refactoring of the code to
>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>>>> refactoring separated from behavioural changes.
>>>>>>>
>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>>>> using the cluster. Behavior change is expected. The goal is to completely
>>>>>>> remove the brute-force scanning of the swap_map[] array for cluster swap
>>>>>>> allocation.
>>>>>>>
>>>>>>>>
>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>>>> deferring review. But it sounds like it is actually required to realize the test
>>>>>>>> results quoted in the cover letter?
>>>>>>>
>>>>>>> Yes, required, because it handles the previous fallout case where try_ssd()
>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>>>> fixing. It is pretty stable now. The only reason I keep it as RFC is
>>>>>>> because it is not feature complete. Currently it does not do swap
>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>>>> remove the RFC.
>>>>>>>
>>>>>>>>
>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>>>
>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>>>
>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPUs
>>>>>>>>> sharing the cluster, because the cluster already has previous swap
>>>>>>>>> entries allocated from the previous CPU.
>>>>>>>>
>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>>>
>>>>>>> That happens to exist with the per-cpu next pointer already. When the other CPU
>>>>>>> advances to the next cluster pointer, it can cross with the other
>>>>>>> CPU's next cluster pointer.
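(For orientation, the fields being argued about look roughly like the sketch
below. This is only an approximation to help readers follow the thread, not
the exact definitions; see the patches in this series for those. The field
types and layout here are assumptions.)

struct swap_cluster_info {
	...
	unsigned int count;	/* allocated swap entries in this cluster */
	unsigned int order;	/* order this cluster is being used for */
	unsigned int flags;	/* e.g. CLUSTER_FLAG_FREE, CLUSTER_FLAG_NONFULL */
	struct list_head list;	/* links the cluster into si->nonfull_clusters[order] */
};

struct percpu_cluster {
	...
	unsigned int next;	/* next offset to try inside the current per-cpu cluster */
};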
>>>>>>
>>>>>> No. si->percpu_cluster[cpu].next will stay in the current per cpu
>>>>>> cluster only. If it doesn't do that, we should fix it.
>>>>>>
>>>>>> I agree with Ryan that we should make the per cpu cluster correct. A
>>>>>> cluster used as a per cpu cluster shouldn't be put in the nonfull list. When we
>>>>>> scan to the end of a per cpu cluster, we can put the cluster in the nonfull
>>>>>> list if necessary. And, we should make it correct in this patch instead
>>>>>> of later in the series. I understand that you want to make the patch itself
>>>>>> simple, but it's important to make the code simple to understand too.
>>>>>> A consistent design choice will do that.
>>>>>
>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>>>>
>>>> Sorry, I misunderstood your words.
>>>>
>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>>>>> considered exclusive to a single cpu when it's set as a per-cpu cluster, so it
>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>>>> case, in which case it should be added to the nonfull list.
>>>>>
>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>>>>> This neither-one-policy-nor-the-other seems odd to me.
>>>>>
>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>>>> per-cpu cluster.
>>>>
>>>> Yes.
>>>>
>>>>> I was arguing to make it always shared. Perhaps the best
>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>>>> the shared approach as part of the "big rewrite"?
>>>>>>
>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think we do want to
>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>>>> way to help that?
>>>>>>>
>>>>>>> Simply moving to the end of the list can create a possible dead loop
>>>>>>> when all clusters have been scanned and no available swap range has been
>>>>>>> found.
>>>>
>>>> I also think that the shared approach has the dead loop issue.
>>>
>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>>> won't know when to stop dequeuing/requeuing clusters on the nonfull list and will
>>> go on forever? That's surely just an implementation issue to solve? It's not a
>>> reason to avoid the design principle; if we agree that maintaining sharability
>>> of the cluster is preferred then the code must be written to guard against the
>>> dead loop problem. It could be done by remembering the first cluster you
>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stopping when you get back
>>> to it. (I think holding the si lock will protect against concurrently freeing
>>> the cluster, so it should definitely remain in the list?).
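Just to make that suggestion concrete, a minimal sketch of such a
scan-at-most-once guard could look like the code below. This is purely
illustrative, not the actual patch code: cluster_has_room() is an invented
helper standing in for whatever check decides that the cluster can still
serve an order-sized allocation, and the naming is approximate.

/* Illustrative sketch only: scan si->nonfull_clusters[order] at most once. */
static struct swap_cluster_info *
pick_nonfull_cluster(struct swap_info_struct *si, int order)
{
	struct swap_cluster_info *ci, *first = NULL;

	/* Caller holds si->lock, so clusters cannot be freed under us. */
	while (!list_empty(&si->nonfull_clusters[order])) {
		ci = list_first_entry(&si->nonfull_clusters[order],
				      struct swap_cluster_info, list);
		if (ci == first)
			break;	/* wrapped around: every cluster was tried once */
		if (!first)
			first = ci;
		/* Rotate to the tail so concurrent CPUs tend to pick different clusters. */
		list_move_tail(&ci->list, &si->nonfull_clusters[order]);
		if (cluster_has_room(ci, order))	/* invented helper */
			return ci;
	}
	return NULL;
}

The rotation also implements the "move the selected cluster to the end of the
list" idea above, so two CPUs allocating concurrently will usually end up on
different clusters.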
>>
>> I believe that you can find some way to avoid the dead loop issue,
>> although your suggestion may kill the performance by looping over a long list
>> of nonfull clusters.
>
> I don't agree; if the clusters are considered exclusive (i.e. removed from the
> list when made current for a cpu), that only reduces the size of the list by a
> maximum of the number of CPUs in the system, which I suspect is pretty small
> compared to the number of nonfull clusters.

Anyway, this depends on the details.  If we cannot allocate an order-N swap
entry from the cluster, we should remove it from the nonfull list for
order-N (this is the behavior of this patch too).  Your original suggestion
appears to be that you want to keep a cluster with order-N on the nonfull
list for order-N always, unless the number of free swap entries in it is
less than 1<<N.

>> And, I understand that in some situations it may
>> be better to share clusters among CPUs. So my suggestion is,
>>
>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> have free swap entries with that order even after we are sure that we
>> haven't.
>
> Is this patch pretending that today? I don't think so?

IIUC, in this patch swap_cluster_info->order is still "N" even if we are
sure that there are no order-N free swap entries in the cluster.

> But I agree that a
> cluster should only be on the per-order nonfull list if we know there are at
> least enough free swap entries in that cluster to cover the order. Of course
> that doesn't tell us for sure, because they may not be contiguous.

We can check that when freeing a swap entry, by checking the adjacent swap
entries.  IMHO, the performance should be acceptable.

>>
>> My question is whether it's so important to share the per-cpu cluster
>> among CPUs?
>
> My rationale for sharing is that the preference previously has been to favour
> efficient use of swap space; we don't want to fail a request for allocation of a
> given order if there are actually slots available, just because they have been
> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> actually help improve allocation success, then I'm happy to take the exclusive
> approach.
>
>> I suggest starting with a simple design, that is, the per-CPU
>> cluster will not be shared among CPUs in most cases.
>
> I'm all for starting simple; I think that's what I already proposed (exclusive
> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> current half-and-half policy in this patch.

Sounds good to me.  We can start with the exclusive solution and evaluate
whether the shared solution is worthwhile.

>>
>> Another choice for sharing is that, when we run short of free swap space, we
>> disable the per-CPU cluster and allocate from the shared non-full cluster
>> list directly.
>>
>>> Which actually makes me wonder; what is the mechanism that prevents the current
>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>>> raising the count when it's current. (If Chris has implemented that in the "big
>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>>
>> Yes.  We may need a flag for that.
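For example, such a flag could be shaped roughly like the sketch below.
CLUSTER_FLAG_PERCPU and its value are invented here purely for illustration
(nothing like this is in the series yet), and the hook points are approximate.

#define CLUSTER_FLAG_PERCPU	0x10	/* invented flag; value is arbitrary here */

/* Illustrative sketch only: pin a cluster while it is some CPU's current cluster. */
static inline bool cluster_is_pinned(struct swap_cluster_info *ci)
{
	return ci->flags & CLUSTER_FLAG_PERCPU;
}

static inline void cluster_pin(struct swap_cluster_info *ci)
{
	ci->flags |= CLUSTER_FLAG_PERCPU;	/* set when installed as the per-cpu cluster */
}

static inline void cluster_unpin(struct swap_cluster_info *ci)
{
	ci->flags &= ~CLUSTER_FLAG_PERCPU;	/* cleared when the CPU moves on */
}

The free/reclaim paths would then check cluster_is_pinned() before releasing
or reusing the cluster, instead of relying only on the conflict detection.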
>>
>>>>
>>>>>> This is another reason that we should put the cluster in
>>>>>> nonfull_clusters[order--] if there are no free swap entries with "order"
>>>>>> in the cluster.  It makes the design complex to keep it in
>>>>>> nonfull_clusters[order].
>>>>>>
>>>>>>> We have tried many different approaches, including moving to the end of
>>>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>>>> its swap slot cache (64 entries) from a different cluster.
>>>>>>>
>>>>>>>>> Those behaviors will be fine-tuned
>>>>>>>>> after the patch 3 big rewrite. Try to make this patch simple.
>>>>>>>
>>>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>>>
>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>>>> add/remove from the nonfull list when `total - count` crosses the (1 << order)
>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>>>
>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>>>> to introduce the nonfull list.
>>>>>>>>>
>>>>
>>>> [snip]

--
Best Regards,
Huang, Ying