From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li <chrisl@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,  Kairui Song
 <kasong@tencent.com>,  Ryan Roberts <ryan.roberts@arm.com>,
  linux-kernel@vger.kernel.org,  linux-mm@kvack.org,  Barry Song
 <baohua@kernel.org>
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster
 order
In-Reply-To: <CAF8kJuMi198++-OHqE5pG1y3BnvRBPepG59zpq-wqjbgrrLdHw@mail.gmail.com>
	(Chris Li's message of "Mon, 17 Jun 2024 21:35:32 -0700")
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
	<CANeU7QkmQ+bJoFnr-ca-xp_dP1XgEKNSwb489MYVqynP_Q8Ddw@mail.gmail.com>
	<87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CAF8kJuN8HWLpv7=abVM2=M247KGZ92HLDxfgxWZD6JS47iZwZA@mail.gmail.com>
	<875xuw1062.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CAF8kJuMc3sXKarq3hMPYGFfeqyo81Q63HrE0XtztK9uQkcZacA@mail.gmail.com>
	<87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CAF8kJuPLhmJqMi-unDOm820c8_kRnQVA_dnSfgRzMXaHKnDHAQ@mail.gmail.com>
	<875xum96nn.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CANeU7Q=iYzyjDwgMRLtSZwKv414JqtZK8w=XWDd6bWZ7Ah-8jA@mail.gmail.com>
	<87wmmw6w9e.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CANeU7Q=Epa438LXEX4WEccxLt6WOziLg2sp_=RA3C4PxtHD5uw@mail.gmail.com>
	<87a5jp6xuo.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CAF8kJuMi198++-OHqE5pG1y3BnvRBPepG59zpq-wqjbgrrLdHw@mail.gmail.com>
Date: Tue, 18 Jun 2024 14:54:22 +0800
Message-ID: <8734pa68rl.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
List-ID: <linux-mm.kvack.org>

Chris Li <chrisl@kernel.org> writes:

> On Thu, Jun 13, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:
>> >>
>> >> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Chris Li <chrisl@kernel.org> writes:
>> >> >>
>> >> >
>> >> >> > In the page allocation side, we have the hugetlbfs which reserve some
>> >> >> > memory for high order pages.
>> >> >> > We should have similar things to allow reserve some high order swap
>> >> >> > entries without getting polluted by low order one.
>> >> >>
>> >> >> TBH, I don't like the idea of high order swap entries reservation.
>> >> > May I know more if you don't like the idea? I understand this can be
>> >> > controversial, because previously we like to take the THP as the best
>> >> > effort approach. If there is some reason we can't make THP, we use the
>> >> > order 0 as fall back.
>> >> >
>> >> > For discussion purpose, I want break it down to smaller steps:
>> >> >
>> >> > First, can we agree that the following usage case is reasonable:
>> >> > The usage case is that, as Barry has shown, zsmalloc compresses bigger
>> >> > size than 4K and can have both better compress ratio and CPU
>> >> > performance gain.
>> >> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
>> >> >
>> >> > So the goal is to make THP/mTHP have some reasonable success rate
>> >> > running in the mix size swap allocation, after either low order or
>> >> > high order swap requests can overflow the swap file size. The allocate
>> >> > can still recover from that, after some swap entries got free.
>> >> >
>> >> > Please let me know if you think the above usage case and goal are not
>> >> > reasonable for the kernel.
>> >>
>> >> I think that it's reasonable to improve the success rate of high-order
>> >
>> > Glad to hear that.
>> >
>> >> swap entries allocation.  I just think that it's hard to use the
>> >> reservation based method.  For example, how much should be reserved?
>> >
>> > Understand, it is harder to use than a fully transparent method, but
>> > still better than no solution at all. The alternative right now is we
>> > can't do it.
>> >
>> > Regarding how much we should reserve. Similarly, how much should you
>> > choose your swap file size? If you choose N, why not N*120% or N*80%?
>> > That did not stop us from having a swapfile, right?
>> >
>> >> Why system OOM when there's still swap space available?  And so forth.
>> >
>> > Keep in mind that the reservation is an option. If you prefer the old
>> > behavior, you don't have to use the reservation. That shouldn't be a
>> > reason to stop others who want to use it. We don't have an alternative
>> > solution for the long run mix size allocation yet. If there is, I like
>> > to hear it.
>>
>> It's not enough to make it optional.  When you run into issue, you need
>> to debug it.  And you may debug an issue on a system that is configured
>> by someone else.
>
> That is in general true with all kernel development regardless of
> using options or not. If there is a bug in my patch, I will need to
> debug and fix it or the patch might be reverted.
>
> I don't see that as a reason to take the option path or not. The
> option just means the user taking this option will need to understand
> the trade off and accept the defined behavior of that option.

User configuration knobs are not forbidden in the Linux kernel.  But we
are careful about them because they introduce ABI that we need to
maintain forever, and they are hard for users to use.  Optimizing
automatically is generally the better solution.  So, I suggest that you
think more about the automatic solution before diving into a new
option.

>>
>> >> So, I prefer the transparent methods.  Just like THP vs. hugetlbfs.
>> >
>> > Me too. I prefer transparent over reservation if it can achieve the
>> > same goal. Do we have a fully transparent method spec out? How to
>> > achieve fully transparent and also avoid fragmentation caused by mix
>> > order allocation/free?
>> >
>> > Keep in mind that we are still in the early stage of the mTHP swap
>> > development, I can have the reservation patch relatively easily. If
>> > you come up with a better transparent method patch which can achieve
>> > the same goal later, we can use it instead.
>>
>> Because we are still in the early stage, I think that we should try to
>> improve transparent solution firstly.  Personally, what I don't like is
>> that we don't work on the transparent solution because we have the
>> reservation solution.
>
> Do you have a road map or the design for the transparent solution you can share?
> I am interested to know what is the short term step(e.g. a month)  in
> this transparent solution you have in mind, so we can compare the
> different approaches. I can't reason much just by the name
> "transparent solution" itself. Need more technical details.
>
> Right now we have a clear usage case we want to support, the swap
> in/out mTHP with bigger zsmalloc buffers. We can start with the
> limited usage case first then move to more general ones.

TBH, this is what I don't like.  It appears that you refuse to think
about the transparent (or automatic) solution.

I haven't thought about this thoroughly, but at least we may consider

- promoting a low-order non-full cluster when we find a free high-order
  swap entry.

- stealing a low-order non-full cluster with a low usage count for
  high-order allocation.

- freeing more swap entries when swap devices become fragmented.
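
As a rough illustration of the second idea only, here is a minimal
sketch in plain user-space C.  It is a toy model, not kernel code:
struct toy_cluster, TOY_CLUSTER_SLOTS, toy_pick_victim() and the
max_count threshold are all hypothetical names invented for this
example, and what to do with the entries still in use in the chosen
cluster is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not kernel structures. */
#define TOY_CLUSTER_SLOTS 512	/* slots per cluster, assumed for the example */

struct toy_cluster {
	unsigned int order;	/* allocation order the cluster currently serves */
	unsigned int count;	/* slots in use, 0..TOY_CLUSTER_SLOTS */
};

/*
 * Pick a cluster to steal for a high-order allocation: among order-0
 * clusters that are neither empty nor full, choose the one with the
 * lowest usage count, provided that count is at most max_count.
 * Returns the index of the chosen cluster, or -1 if none qualifies.
 */
static int toy_pick_victim(const struct toy_cluster *c, size_t n,
			   unsigned int max_count)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(c[i].count <= TOY_CLUSTER_SLOTS);
		if (c[i].order != 0)
			continue;	/* only low-order clusters are candidates */
		if (c[i].count == 0 || c[i].count == TOY_CLUSTER_SLOTS)
			continue;	/* skip empty and full clusters */
		if (c[i].count > max_count)
			continue;	/* too many live entries to be worth stealing */
		if (best < 0 || c[i].count < c[best].count)
			best = (int)i;
	}
	return best;
}
```

In this model a caller would try the normal high-order allocation
first and fall back to toy_pick_victim() only when that fails.  The
threshold is the policy question: set it too high and the steal costs
more than it gains.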

>> >> >> that's really important for you, I think that it's better to design
>> >> >> something like hugetlbfs vs core mm, that is, be separated from the
>> >> >> normal swap subsystem as much as possible.
>> >> >
>> >> > I am giving hugetlbfs just to make the point using reservation, or
>> >> > isolation of the resource to prevent mixing fragmentation existing in
>> >> > core mm.
>> >> > I am not suggesting copying the hugetlbfs implementation to the swap
>> >> > system. Unlike hugetlbfs, the swap allocation is typically done from
>> >> > the kernel, it is transparent from the application. I don't think
>> >> > separate from the swap subsystem is a good way to go.
>> >> >
>> >> > This comes down to why you don't like the reservation. e.g. if we use
>> >> > two swapfile, one swapfile is purely allocate for high order, would
>> >> > that be better?
>> >>
>> >> Sorry, my words weren't accurate.  Personally, I just think that it's
>> >> better to make reservation related code not too intrusive.
>> >
>> > Yes. I will try to make it not too intrusive.
>> >
>> >> And, before reservation, we need to consider something else firstly.
>> >> Whether is it generally good to swap-in with swap-out order?  Should we
>> >
>> > When we have the reservation patch (or other means to sustain mix size
>> > swap allocation/free), we can test it out to get more data to reason
>> > about it.
>> > I consider the swap in size policy an orthogonal issue.
>>
>> No.  I don't think so.  If you swap-out in higher order, but swap-in in
>> lower order, you make the swap clusters fragmented.
>
> Sounds like that is the reason to apply swap-in the same order of the swap out.
> In any case, my original point still stands. We need to have the
> ability to allocate high order swap entries with reasonable success
> rate *before* we have the option to choose which size to swap in. If
> allocating a high order swap always fails, we will be forced to use
> the low order one, there is no option to choose from. We can't evaluate
> "is it generally good to swap-in with swap-out order?" by actual runs.

I don't think we need to fight over that.  Just prove the value of your
patchset with reasonable use cases and normal workloads.  Data will
persuade people.

--
Best Regards,
Huang, Ying