From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
In-Reply-To: (Chris Li's message of "Fri, 26 Jul 2024 00:15:58 -0700")
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
 <20240711-swap-allocator-v4-2-0295a4d4c7aa@kernel.org>
 <874j8nxhiq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o76qjhqs.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <43f73463-af42-4a00-8996-5f63bdf264a3@arm.com>
 <87jzhdkdzv.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sew0ei84.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <4ec149fc-7c13-4777-bc97-58ee455a3d7e@arm.com>
 <87plr26kg2.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87v80s3nvs.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87jzh83d3o.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 26 Jul 2024 15:42:59 +0800
Message-ID: <87y15o1tvg.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Chris Li writes:

> On Thu, Jul 25, 2024 at 11:05 PM Huang, Ying wrote:
>>
>> Chris Li writes:
>>
>> > On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying wrote:
>> >> >
>> >> > The current proposed order also improves things step by step. The only
>> >> > disagreement here is which patch order we introduce yet another list
>> >> > in addition to the nonfull one. I just feel that it does not make
>> >> > sense to invest in new code if that new code is going to be
>> >> > completely rewritten anyway in the next two patches.
>> >> >
>> >> > Unless you mean we should not do the patch 3 big rewrite and should
>> >> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
>> >> > the allocation job and let scan_swap_map_slots() do the complex retry
>> >> > on top of try_ssd(). I feel the overall code is more complex and less
>> >> > maintainable.
>> >>
>> >> I haven't looked at [3/3], will wait for your next version for that. So,
>> >> I cannot say which order is better. Please consider reviewers' effort
>> >> too. A small-step patch is easier to understand and review.
>> >
>> > That is exactly the reason I don't want to introduce too much new code
>> > depending on the scan_swap_map_slots() behavior, which will be
>> > abandoned in the big rewrite. Their constraints are very different. I
>> > want to make the big rewrite patch 3 as small as possible, using
>> > incremental follow-up patches to improve it.
>> >
>> >> >> > That is why I want to make this change a patch after patch 3. There is
>> >> >> > also the long test cycle after the modification to make sure the swap
>> >> >> > code path is stable. I am not resisting a change of patch order; it
>> >> >> > is that the patch can't directly be removed before patch 3, the big
>> >> >> > rewrite.
>> >> >> >
>> >> >> >> > Your original
>> >> >> >> > suggestion appears to be that you want to keep all clusters with order-N
>> >> >> >> > on the nonfull list for order-N always, unless the number of free swap
>> >> >> >> > entries is less than 1<<N.
>> >> >> >>
>> >> >> >> Well I think that's certainly one of the conditions for removing it. But agree
>> >> >> >> that if a full scan of the cluster has been performed and no swap entries have
>> >> >> >> been freed since the scan started then it should also be removed from the list.
>> >> >> >
>> >> >> > Yes, in the later patches, beyond patch 3, we have the almost-full
>> >> >> > cluster list, for clusters that have been scanned and were not able to
>> >> >> > allocate an order-N entry.
>> >> >> >
>> >> >> >> >>> And, I understand that in some situations it may
>> >> >> >> >>> be better to share clusters among CPUs. So my suggestion is,
>> >> >> >> >>>
>> >> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >> >> >> >>>   have free swap entries with that order even after we are sure that we
>> >> >> >> >>>   haven't.
>> >> >> >> >>
>> >> >> >> >> Is this patch pretending that today? I don't think so?
>> >> >> >> >
>> >> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> >> >> > sure that there are no order-N free swap entries in the cluster.
>> >> >> >>
>> >> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> >> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> >> >> >> are for order-0 and you have no means of allocating high orders until a whole
>> >> >> >> cluster becomes free. That logic certainly makes sense to me, so I think it's
>> >> >> >> better for swap_cluster_info->order to remain static while the cluster is
>> >> >> >> allocated. (I only skimmed that conversation so apologies if I got the
>> >> >> >> conclusion wrong!)
>> >> >> >
>> >> >> > Yes, that is the original intent: keep the cluster order as much as possible.
>> >> >> >
>> >> >> >> >> But I agree that a
>> >> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> >> >> >> that doesn't tell us for sure, because they may not be contiguous.
>> >> >> >> >
>> >> >> >> > We can check that when freeing a swap entry, by checking the adjacent swap
>> >> >> >> > entries. IMHO, the performance should be acceptable.
>> >> >> >>
>> >> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
>> >> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> >> >> >> a separate change on top of what Chris is doing here. For high orders there
>> >> >> >> could be quite a bit of scanning required in the worst case for every page that
>> >> >> >> gets freed.
>> >> >> >
>> >> >> > Right, I feel that is a different set of patches. Even this series is
>> >> >> > hard enough to review. That order promotion and demotion is heading
>> >> >> > towards a buddy system design. I want to point out that even a buddy
>> >> >> > system is not able to handle the case where the swapfile is almost full and
>> >> >> > the recently freed swap entries are not contiguous.
>> >> >> >
>> >> >> > We can invest in the buddy system, which doesn't handle all the
>> >> >> > fragmentation issues. Or, as I prefer, go directly to the discontiguous
>> >> >> > swap entry. We pay a price for the indirect mapping of swap entries,
>> >> >> > but it will solve the fragmentation issue 100%.
>> >> >>
>> >> >> It's good if we can solve the fragmentation issue 100%. We just need to
>> >> >> pay attention to the cost.
>> >> >
>> >> > By cost, do you mean the development cost or the runtime cost (memory and CPU)?
>> >>
>> >> I mean the runtime cost.
>> >
>> > Thanks for the clarification. Agreed that we need to pay attention to
>> > the runtime cost. That is a given.
>> >
>> >> >> >> >>> My question is whether it's so important to share the per-cpu cluster
>> >> >> >> >>> among CPUs?
>> >> >> >> >>
>> >> >> >> >> My rationale for sharing is that the preference previously has been to favour
>> >> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> >> >> >> given order if there are actually slots available, just because they have been
>> >> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> >> >> >> approach.
>> >> >> >> >>
>> >> >> >> >>> I suggest starting with a simple design, that is, the per-CPU
>> >> >> >> >>> cluster will not be shared among CPUs in most cases.
>> >> >> >> >>
>> >> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> >> >> >> current half-and-half policy in this patch.
>> >> >> >> >
>> >> >> >> > Sounds good to me. We can start with the exclusive solution and evaluate
>> >> >> >> > whether the shared solution is good.
>> >> >> >>
>> >> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >> >> >
>> >> >> > It is not able to avoid fragmentation 100% of the time. I prefer the
>> >> >> > discontiguous swap entry as the next step, which guarantees forward
>> >> >> > progress: we will not be stuck in a situation where we are not able to
>> >> >> > allocate swap entries due to fragmentation.
>> >> >>
>> >> >> If my understanding is correct, the implementation complexity of the
>> >> >> order promotion/demotion isn't at the same level as that of the
>> >> >> discontiguous swap entry.
>> >> >
>> >> > The discontiguous swap entry has higher complexity but a higher payoff as
>> >> > well. It can get us to a place where cluster promotion/demotion
>> >> > can't.
>> >> >
>> >> > I also feel that if we implement something towards a buddy system
>> >> > allocator for swap, we should do a proper buddy allocator
>> >> > implementation of the data structures.
>> >>
>> >> I don't think that it's easy to implement a real buddy allocator for
>> >> swap entries. So, I avoid using "buddy" in my words.
>> >
>> > Then such a mix of cluster order promote/demote loses some benefit of
>> > the buddy system, because it lacks the proper data structures to
>> > support buddy allocation. The buddy allocator provides more general
>> > migration between orders. For the limited usage case, cluster
>> > promotion/demotion is supported (by luck). We need to evaluate whether
>> > it is worth the additional complexity.
>>
>> TBH, I believe that the complexity of order promote/demote is quite low,
>> both for development and at runtime. A real buddy allocator may need to
>> increase the per-swap-entry memory footprint considerably.
>
> My main concern is its effectiveness. Anyway, the series is already
> complex enough with the big rewrite and reclaim on swap cache.
>
> Let me know if you think it needs to be done before the big rewrite.

I hope so. But I will not force you to do that if you don't buy into it.

--
Best Regards,
Huang, Ying