From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
In-Reply-To: (Chris Li's message of "Fri, 26 Jul 2024 00:15:58 -0700")
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
 <20240711-swap-allocator-v4-2-0295a4d4c7aa@kernel.org>
 <874j8nxhiq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o76qjhqs.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <43f73463-af42-4a00-8996-5f63bdf264a3@arm.com>
 <87jzhdkdzv.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sew0ei84.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <4ec149fc-7c13-4777-bc97-58ee455a3d7e@arm.com>
 <87plr26kg2.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87v80s3nvs.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87jzh83d3o.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 26 Jul 2024 15:42:59 +0800
Message-ID: <87y15o1tvg.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Chris Li writes:

> On Thu, Jul 25, 2024 at 11:05 PM Huang, Ying wrote:
>>
>> Chris Li writes:
>>
>> > On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying wrote:
>> >> >
>> >> > The current proposed order also improves things step by step. The only
>> >> > disagreement here is which patch order we introduce yet another list
>> >> > in addition to the nonfull one. I just feel that it does not make
>> >> > sense to invest in new code if that new code is going to be
>> >> > completely rewritten anyway in the next two patches.
>> >> >
>> >> > Unless you mean we should not do the patch 3 big rewrite and should
>> >> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
>> >> > the allocation job and let scan_swap_map_slots() do the complex retry
>> >> > on top of try_ssd(). I feel the overall code is more complex and less
>> >> > maintainable.
>> >>
>> >> I haven't looked at [3/3], will wait for your next version for that. So,
>> >> I cannot say which order is better. Please consider reviewers' effort
>> >> too. A small-step patch is easier to understand and review.
>> >
>> > That is exactly the reason I don't want to introduce too much new code
>> > depending on the scan_swap_map_slots() behavior, which will be
>> > abandoned in the big rewrite. Their constraints are very different. I
>> > want to make the big rewrite patch 3 as small as possible, using
>> > incremental follow-up patches to improve it.
>> >
>> >> >> > That is why I want to make this change a patch after patch 3. There is
>> >> >> > also the long test cycle after the modification to make sure the swap
>> >> >> > code path is stable. I am not resisting a change of patch order; it
>> >> >> > is that the patch can't directly be removed before patch 3, the big
>> >> >> > rewrite.
>> >> >> >
>> >> >> >> > Your original
>> >> >> >> > suggestion appears to be that you want to keep all clusters with order-N
>> >> >> >> > on the nonfull list for order-N always, unless the number of free swap
>> >> >> >> > entries is less than 1<<N.
>> >> >> >>
>> >> >> >> Well I think that's certainly one of the conditions for removing it. But agree
>> >> >> >> that if a full scan of the cluster has been performed and no swap entries have
>> >> >> >> been freed since the scan started then it should also be removed from the list.
>> >> >> >
>> >> >> > Yes, in the later patches, beyond patch 3, we have the almost-full
>> >> >> > cluster list, for clusters that have been scanned and were not able to
>> >> >> > allocate an order-N entry.
>> >> >> >
>> >> >> >> >>> And, I understand that in some situations it may
>> >> >> >> >>> be better to share clusters among CPUs. So my suggestion is,
>> >> >> >> >>>
>> >> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >> >> >> >>>   have free swap entries with that order even after we are sure that we
>> >> >> >> >>>   haven't.
>> >> >> >> >>
>> >> >> >> >> Is this patch pretending that today? I don't think so?
>> >> >> >> >
>> >> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> >> >> > sure that there are no order-N free swap entries in the cluster.
>> >> >> >>
>> >> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> >> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> >> >> >> are for order-0 and you have no means of allocating high orders until a whole
>> >> >> >> cluster becomes free. That logic certainly makes sense to me, so I think it's
>> >> >> >> better for swap_cluster_info->order to remain static while the cluster is
>> >> >> >> allocated. (I only skimmed that conversation so apologies if I got the
>> >> >> >> conclusion wrong!)
>> >> >> >
>> >> >> > Yes, that is the original intent: keep the cluster order as much as possible.
>> >> >> >
>> >> >> >> >> But I agree that a
>> >> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> >> >> >> that doesn't tell us for sure, because they may not be contiguous.
>> >> >> >> >
>> >> >> >> > We can check that when freeing a swap entry, by checking the adjacent swap
>> >> >> >> > entries. IMHO, the performance should be acceptable.
>> >> >> >>
>> >> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
>> >> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> >> >> >> a separate change on top of what Chris is doing here. For high orders there
>> >> >> >> could be quite a bit of scanning required in the worst case for every page that
>> >> >> >> gets freed.
>> >> >> >
>> >> >> > Right, I feel that is a different set of patches. Even this series is
>> >> >> > hard enough to review. That order promotion and demotion is heading
>> >> >> > towards a buddy system design. I want to point out that even a buddy
>> >> >> > system is not able to handle the case where the swapfile is almost full and
>> >> >> > the recently freed swap entries are not contiguous.
>> >> >> >
>> >> >> > We can invest in the buddy system, which doesn't handle all the
>> >> >> > fragmentation issues. Or, as I prefer, go directly to the discontiguous
>> >> >> > swap entry. We pay a price for the indirect mapping of swap entries,
>> >> >> > but it will solve the fragmentation issue 100%.
>> >> >>
>> >> >> It's good if we can solve the fragmentation issue 100%. We just need to
>> >> >> pay attention to the cost.
>> >> >
>> >> > By cost, do you mean the development cost or the runtime cost (memory and CPU)?
>> >>
>> >> I mean the runtime cost.
>> >
>> > Thanks for the clarification. Agreed that we need to pay attention to
>> > the runtime cost. That is a given.
>> >
>> >> >> >> >>> My question is whether it's so important to share the per-cpu cluster
>> >> >> >> >>> among CPUs?
>> >> >> >> >>
>> >> >> >> >> My rationale for sharing is that the preference previously has been to favour
>> >> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> >> >> >> given order if there are actually slots available, just because they have been
>> >> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> >> >> >> approach.
>> >> >> >> >>
>> >> >> >> >>> I suggest starting with a simple design, that is, the per-CPU
>> >> >> >> >>> cluster will not be shared among CPUs in most cases.
>> >> >> >> >>
>> >> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> >> >> >> current half-and-half policy in this patch.
>> >> >> >> >
>> >> >> >> > Sounds good to me. We can start with the exclusive solution and evaluate
>> >> >> >> > whether the shared solution is good.
>> >> >> >>
>> >> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >> >> >
>> >> >> > It is not able to avoid fragmentation 100% of the time. I prefer the
>> >> >> > discontiguous swap entry as the next step, which guarantees forward
>> >> >> > progress: we will not be stuck in a situation where we are not able to
>> >> >> > allocate swap entries due to fragmentation.
>> >> >>
>> >> >> If my understanding is correct, the implementation complexity of the
>> >> >> order promotion/demotion isn't at the same level as that of the
>> >> >> discontiguous swap entry.
>> >> >
>> >> > The discontiguous swap entry has higher complexity but a higher payoff as
>> >> > well. It can get us to a place where cluster promotion/demotion
>> >> > can't.
>> >> >
>> >> > I also feel that if we implement something towards a buddy system
>> >> > allocator for swap, we should do a proper buddy allocator
>> >> > implementation of the data structures.
>> >>
>> >> I don't think that it's easy to implement a real buddy allocator for
>> >> swap entries. So, I avoid using "buddy" in my words.
>> >
>> > Then such a mix of cluster order promote/demote loses some benefit of
>> > the buddy system, because it lacks the proper data structures to
>> > support buddy allocation. The buddy allocator provides more general
>> > migration between orders. For the limited usage case, cluster
>> > promotion/demotion is supported (by luck). We need to evaluate whether
>> > it is worth the additional complexity.
>>
>> TBH, I believe that the complexity of order promote/demote is quite low,
>> both for development and at runtime. A real buddy allocator may need to
>> increase the per-swap-entry memory footprint considerably.
>
> My main concern is its effectiveness. Anyway, the series is already
> complex enough with the big rewrite and reclaim on swap cache.
>
> Let me know if you think it needs to be done before the big rewrite.

I hope so. But I will not force you to do that if you don't buy into it.

--
Best Regards,
Huang, Ying