From: "Huang, Ying" <ying.huang@intel.com>
To: Ryan Roberts
Cc: Chris Li, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
 linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
In-Reply-To: (Ryan Roberts's message of "Fri, 19 Jul 2024 11:30:25 +0100")
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
 <20240711-swap-allocator-v4-2-0295a4d4c7aa@kernel.org>
 <874j8nxhiq.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 22 Jul 2024 10:14:03 +0800
Message-ID: <87o76qjhqs.fsf@yhuang6-desk2.ccr.corp.intel.com>

Ryan Roberts writes:

> On 18/07/2024 08:53, Huang, Ying wrote:
>> Chris Li writes:
>>
>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts wrote:
>>>>
>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts wrote:
>>>>>>
>>>>>> On 11/07/2024 08:29, Chris Li wrote:

[snip]

>>>>>>> +
>>>>>>> +	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>> +		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>
>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): it is
>>>>>> added to the list whenever there is at least one free swap entry, if not
>>>>>> already on the list. But you take it off the list when assigning it as
>>>>>> the current cluster for a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>
>>>>>> So you could have this situation:
>>>>>>
>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>>
>>>>>> At this point both cpuA and cpuB share the same cluster as their
>>>>>> current cluster.
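To make that interleaving concrete, here is a minimal sketch of the two
paths in play, with simplified types and hypothetical helper names
(take_nonfull_cluster() and cluster_entry_freed() are illustrative, not
the kernel's exact API; locking is omitted):

#include <linux/list.h>

#define CLUSTER_FLAG_NONFULL	0x1	/* placeholder value */
#define SWAP_NR_ORDERS		1	/* placeholder; the kernel derives this */

struct swap_cluster_info {
	struct list_head list;
	unsigned int count;	/* allocated entries in this cluster */
	unsigned int order;
	unsigned int flags;
};

struct swap_info_struct {
	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
};

/* Alloc side: take a cluster off the nonfull list to use as this
 * cpu's current cluster, making it "exclusive". */
static struct swap_cluster_info *
take_nonfull_cluster(struct swap_info_struct *p, int order)
{
	struct swap_cluster_info *ci;

	ci = list_first_entry_or_null(&p->nonfull_clusters[order],
				      struct swap_cluster_info, list);
	if (ci) {
		list_del_init(&ci->list);
		ci->flags &= ~CLUSTER_FLAG_NONFULL;
	}
	return ci;
}

/* Free side, as in the hunk quoted above: a single freed entry puts
 * the cluster straight back on the list, where another cpu can pick
 * it up while the first cpu still has it as its current cluster. */
static void cluster_entry_freed(struct swap_info_struct *p,
				struct swap_cluster_info *ci)
{
	ci->count--;
	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
		ci->flags |= CLUSTER_FLAG_NONFULL;
	}
}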
>>>>>> So why not just put the cluster on the nonfull_clusters list at
>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>
>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>> moving it into nonfull.
>>>>
>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code
>>>> to separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>> refactoring separated from behavioural changes.
>>>
>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>> using the cluster. Behavior change is expected. The goal is to
>>> completely remove the brute-force scanning of the swap_map[] array for
>>> cluster swap allocation.
>>>
>>>> Since the patch was titled RFC and I thought it was just refactoring,
>>>> I was deferring review. But it sounds like it is actually required to
>>>> realize the test results quoted in the cover letter?
>>>
>>> Yes, it is required, because it handles the cases where try_ssd()
>>> previously failed. This big rewrite has gone through a lot of testing
>>> and bug fixing. It is pretty stable now. The only reason I keep it as
>>> RFC is that it is not feature complete. Currently it does not do swap
>>> cache reclaim. The next version will have swap cache reclaim and drop
>>> the RFC.
>>>
>>>>> I am only making the minimal change in this step so the big rewrite
>>>>> can land.
>>>>>
>>>>>> nonfull_clusters list when it is completely full (or at least
>>>>>> definitely doesn't have room for an `order` allocation)? Then you
>>>>>> allow "stealing" always instead of just sometimes. You would likely
>>>>>> want to move the cluster to the end of the nonfull list when
>>>>>> selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>> chances of multiple CPUs using the same cluster.
>>>>>
>>>>> For nonfull clusters it is less important to avoid multiple CPUs
>>>>> sharing the cluster, because the cluster already has swap entries
>>>>> allocated by a previous CPU.
>>>>
>>>> But if 2 CPUs have the same cluster, isn't there a pathological case
>>>> where cpuA could be slightly ahead of cpuB, so that cpuA allocates
>>>> all the free pages and
>>>
>>> That can already happen with the per-cpu next pointer. When one CPU
>>> advances to the next cluster, its pointer can cross the other CPU's
>>> next cluster pointer.
>>
>> No. si->percpu_cluster[cpu].next will stay within the current per-cpu
>> cluster only. If it doesn't do that, we should fix it.
>>
>> I agree with Ryan that we should make the per-cpu cluster correct. A
>> cluster used as a per-cpu cluster shouldn't be put on the nonfull list.
>> When we scan to the end of a per-cpu cluster, we can put the cluster on
>> the nonfull list if necessary. And we should make it correct in this
>> patch instead of later in the series. I understand that you want to
>> keep the patch itself simple, but it's important to make the code
>> simple to understand too. A consistent design choice will do that.
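A rough sketch of that rule, reusing the hypothetical types from the
sketch above (release_percpu_cluster() is illustrative, not the
kernel's API):

/* Release this cpu's current cluster once it has been scanned to the
 * end; only at that point may a partially-full cluster become visible
 * to other cpus on the nonfull list.  While the cluster is installed
 * as a per-cpu cluster, the free path would only decrement its count
 * and never re-add it to the list. */
static void release_percpu_cluster(struct swap_info_struct *p,
				   struct swap_cluster_info *ci,
				   unsigned int cluster_size)
{
	if (ci->count < cluster_size &&
	    !(ci->flags & CLUSTER_FLAG_NONFULL)) {
		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
		ci->flags |= CLUSTER_FLAG_NONFULL;
	}
}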
> I think I'm actually arguing for the opposite of what you suggest here.

Sorry, I misunderstood your words.

> As I see it, there are 2 possible approaches; either a cluster is always
> considered exclusive to a single cpu when it is set as a per-cpu cluster,
> so it does not appear on the nonfull list. Or a cluster is considered
> sharable in this case, in which case it should be added to the nonfull
> list.
>
> The code at the moment sort of does both; when a cpu decides to use a
> cluster on the nonfull list, it removes it from that list to make it
> exclusive. But as soon as a single swap entry is freed from that cluster,
> it is put back on the list. This neither-one-policy-nor-the-other seems
> odd to me.
>
> I think Huang, Ying is arguing to keep it always exclusive while
> installed as a per-cpu cluster.

Yes.

> I was arguing to make it always shared. Perhaps the best approach is to
> implement the exclusive policy in this patch (you'd need a flag to note
> if any pages were freed while in exclusive use; then, when exclusive use
> completes, put it back on the nonfull list if the flag was set). Then
> migrate to the shared approach as part of the "big rewrite"?
>>
>>>> cpuB just ends up scanning and finding nothing to allocate. I think we
>>>> do want to share the cluster when you really need to, but try to avoid
>>>> it if there are other options, and I think moving the cluster to the
>>>> end of the list might be a way to help that?
>>>
>>> Simply moving it to the end of the list can create a possible dead loop
>>> when all clusters have been scanned and no available swap range has
>>> been found.

I also think that the shared approach has the dead-loop issue.

>> This is another reason that we should put the cluster in
>> nonfull_clusters[order--] if there are no free swap entries of "order"
>> in the cluster. It makes the design complex to keep it in
>> nonfull_clusters[order].
>>
>>> We have tried many different approaches, including moving to the end of
>>> the list. It can cause more fragmentation, because each CPU allocates
>>> its swap slot cache (64 entries) from a different cluster.
>>>
>>>>> Those behaviors will be fine-tuned after the patch 3 big rewrite. Try
>>>>> to make this patch simple.
>>>
>>> Again, I want to keep it simple here so patch 3 can land.
>>>
>>>>>> Another potential optimization (which was in my hacked version IIRC)
>>>>>> is to only add/remove from the nonfull list when `total - count`
>>>>>> crosses the (1 << order) boundary rather than when becoming
>>>>>> completely full. You definitely won't be able to allocate order-2 if
>>>>>> there are only 3 pages available, for example.
>>>>>
>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>> to introduce the nonfull list.
>>>>>

[snip]

--
Best Regards,
Huang, Ying