From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
In-Reply-To: Chris Li's message of "Thu, 25 Jul 2024 22:09:25 -0700"
Date: Fri, 26 Jul 2024 14:02:19 +0800
Message-ID: <87jzh83d3o.fsf@yhuang6-desk2.ccr.corp.intel.com>
Chris Li writes:

> On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying wrote:
>> >
>> > The current proposed order also improves things step by step. The
>> > only disagreement here is in which patch order we introduce yet
>> > another list in addition to the nonfull one. I just feel that it
>> > does not make sense to invest in new code if that new code is going
>> > to be completely rewritten anyway in the next two patches.
>> >
>> > Unless you mean we should not do the patch 3 big rewrite and should
>> > continue the scan_swap_map_try_ssd_cluster() way of only doing half
>> > of the allocation job and let scan_swap_map_slots() do the complex
>> > retry on top of try_ssd(). I feel the overall code is more complex
>> > and less maintainable that way.
>>
>> I haven't looked at [3/3]; I will wait for your next version for
>> that. So I cannot say which order is better. Please consider
>> reviewers' effort too. Small-step patches are easier to understand
>> and review.
>
> That is exactly the reason I don't want to introduce too much new code
> depending on the scan_swap_map_slots() behavior, which will be
> abandoned in the big rewrite. Their constraints are very different. I
> want to make the big rewrite in patch 3 as small as possible, using
> incremental follow-up patches to improve it.
>
>> >> > That is why I want to make this change a patch after patch 3.
>> >> > There is also the long test cycle after the modification to make
>> >> > sure the swap code path is stable. I am not resisting a change of
>> >> > patch order; it is just that this patch can't simply be moved
>> >> > before patch 3, the big rewrite.
>> >> >
>> >> >>
>> >> >> > Your original suggestion appears to be that you want to keep
>> >> >> > all clusters with order-N on the nonfull list for order-N
>> >> >> > always, unless the number of free swap entries is less than
>> >> >> > 1<<N.
>> >> >>
>> >> >> Well I think that's certainly one of the conditions for removing
>> >> >> it. But I agree that if a full scan of the cluster has been
>> >> >> performed and no swap entries have been freed since the scan
>> >> >> started, then it should also be removed from the list.
>> >> >
>> >> > Yes, in a later patch, beyond patch 3, we have the "almost full"
>> >> > cluster list for clusters that have been scanned without being
>> >> > able to allocate an order-N entry.
>> >> >>
>> >> >> >
>> >> >> >>> And, I understand that in some situations it may be better
>> >> >> >>> to share clusters among CPUs. So my suggestion is,
>> >> >> >>>
>> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend
>> >> >> >>>   that we have free swap entries with that order even after
>> >> >> >>>   we are sure that we haven't.
>> >> >> >>
>> >> >> >> Is this patch pretending that today? I don't think so?
>> >> >> >
>> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even
>> >> >> > if we are sure that there are no order-N free swap entries in
>> >> >> > the cluster.
>> >> >>
>> >> >> Oh I see what you mean. I think you and Chris already discussed
>> >> >> this? IIRC Chris's point was that if you move that cluster to
>> >> >> N-1, eventually all clusters are for order-0 and you have no
>> >> >> means of allocating high orders until a whole cluster becomes
>> >> >> free. That logic certainly makes sense to me, so I think it's
>> >> >> better for swap_cluster_info->order to remain static while the
>> >> >> cluster is allocated. (I only skimmed that conversation, so
>> >> >> apologies if I got the conclusion wrong!)
>> >> >
>> >> > Yes, that is the original intent: keep the cluster order as long
>> >> > as possible.
>> >> >
>> >> >>
>> >> >> >> But I agree that a cluster should only be on the per-order
>> >> >> >> nonfull list if we know there are at least enough free swap
>> >> >> >> entries in that cluster to cover the order. Of course that
>> >> >> >> doesn't tell us for sure, because they may not be contiguous.
>> >> >> >
>> >> >> > We can check that when freeing a swap entry, by checking the
>> >> >> > adjacent swap entries. IMHO, the performance should be
>> >> >> > acceptable.
>> >> >>
>> >> >> Would you then use the result of that scanning to "promote" a
>> >> >> cluster's order? e.g. swap_cluster_info->order = N+1? That would
>> >> >> be neat. But this all feels like a separate change on top of
>> >> >> what Chris is doing here.
>> >> >> For high orders there could be quite a bit of scanning required
>> >> >> in the worst case for every page that gets freed.
>> >> >
>> >> > Right, I feel that is a different set of patches. Even this
>> >> > series is hard enough to review. That order promotion and
>> >> > demotion is heading towards a buddy system design. I want to
>> >> > point out that even a buddy system is not able to handle the case
>> >> > where the swapfile is almost full and the recently freed swap
>> >> > entries are not contiguous.
>> >> >
>> >> > We can invest in the buddy system, which doesn't handle all the
>> >> > fragmentation issues. Or, as I prefer, go directly to the
>> >> > discontiguous swap entry. We pay a price for the indirect mapping
>> >> > of swap entries, but it will solve the fragmentation issue 100%.
>> >>
>> >> It's good if we can solve the fragmentation issue 100%. We just
>> >> need to pay attention to the cost.
>> >
>> > By cost, do you mean the development cost or the run-time cost
>> > (memory and CPU)?
>>
>> I mean the run-time cost.
>
> Thanks for the clarification. Agreed that we need to pay attention to
> the run-time cost. That is a given.
>
>> >> >> >>> My question is whether it's so important to share the
>> >> >> >>> per-CPU cluster among CPUs?
>> >> >> >>
>> >> >> >> My rationale for sharing is that the preference previously
>> >> >> >> has been to favour efficient use of swap space; we don't want
>> >> >> >> to fail a request for allocation of a given order if there
>> >> >> >> are actually slots available, just because they have been
>> >> >> >> reserved by another CPU. And I'm still asserting that it
>> >> >> >> should be ~zero cost to do this. If I'm wrong about the zero
>> >> >> >> cost, or in practice the sharing doesn't actually help
>> >> >> >> improve allocation success, then I'm happy to take the
>> >> >> >> exclusive approach.
>> >> >> >>
>> >> >> >>> I suggest starting with a simple design, that is, the
>> >> >> >>> per-CPU cluster will not be shared among CPUs in most cases.
>> >> >> >>
>> >> >> >> I'm all for starting simple; I think that's what I already
>> >> >> >> proposed (exclusive in this patch, then shared in the "big
>> >> >> >> rewrite"). I'm just objecting to the current half-and-half
>> >> >> >> policy in this patch.
>> >> >> >
>> >> >> > Sounds good to me. We can start with the exclusive solution
>> >> >> > and evaluate whether the shared solution is good.
>> >> >>
>> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >> >
>> >> > It is not able to avoid fragmentation 100% of the time. I prefer
>> >> > the discontiguous swap entry as the next step, which guarantees
>> >> > forward progress; we will not be stuck in a situation where we
>> >> > are unable to allocate swap entries due to fragmentation.
>> >>
>> >> If my understanding is correct, the implementation complexity of
>> >> order promotion/demotion isn't at the same level as that of
>> >> discontiguous swap entries.
>> >
>> > Discontiguous swap entries have higher complexity but a higher
>> > payoff as well. They can get us to places where cluster
>> > promotion/demotion can't.
>> >
>> > I also feel that if we implement something towards a buddy system
>> > allocator for swap, we should do a proper buddy allocator
>> > implementation of the data structures.
>>
>> I don't think that it's easy to implement a real buddy allocator for
>> swap entries. So, I avoid using "buddy" in my wording.
>
> Then such a mix of cluster order promotion/demotion loses some of the
> benefit of the buddy system, because it lacks the proper data
> structures to support buddy allocation. The buddy allocator provides
> more general migration between orders; only the limited usage case of
> cluster promotion/demotion is supported (by luck). We need to evaluate
> whether it is worth the additional complexity.
TBH, I believe that the complexity of order promotion/demotion is quite
low, both for development and at runtime. A real buddy allocator may
need to increase the per-swap-entry memory footprint considerably.

--
Best Regards,
Huang, Ying