From: "Huang, Ying"
To: Chris Li
Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
In-Reply-To: (Chris Li's message of "Tue, 11 Jun 2024 00:11:42 -0700")
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
 <87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <875xuw1062.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <875xum96nn.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87wmmw6w9e.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 13 Jun 2024 16:38:55 +0800
Message-ID: <87a5jp6xuo.fsf@yhuang6-desk2.ccr.corp.intel.com>

Chris Li writes:

> On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying wrote:
>>
>> Chris Li writes:
>>
>> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying wrote:
>> >>
>> >> Chris Li writes:
>> >>
>> >> > On the page allocation side, we have hugetlbfs, which reserves some
>> >> > memory for high order pages.
>> >> > We should have something similar to allow reserving some high order
>> >> > swap entries without getting them polluted by low order ones.
>> >>
>> >> TBH, I don't like the idea of high order swap entries reservation.
>> > May I know more about why you don't like the idea? I understand this
>> > can be controversial, because previously we liked to take THP as a
>> > best effort approach. If there is some reason we can't make a THP, we
>> > use order 0 as the fallback.
>> >
>> > For discussion purposes, I want to break it down into smaller steps:
>> >
>> > First, can we agree that the following use case is reasonable:
>> > The use case is that, as Barry has shown, zsmalloc can compress sizes
>> > bigger than 4K with both a better compression ratio and a CPU
>> > performance gain.
>> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
>> >
>> > So the goal is to give THP/mTHP a reasonable success rate when running
>> > with mixed size swap allocation, after either low order or high order
>> > swap requests have overflowed the swap file size. The allocator can
>> > still recover from that after some swap entries get freed.
>> >
>> > Please let me know if you think the above use case and goal are not
>> > reasonable for the kernel.
>>
>> I think that it's reasonable to improve the success rate of high-order
>
> Glad to hear that.
>
>> swap entries allocation. I just think that it's hard to use the
>> reservation based method. For example, how much should be reserved?
>
> Understood, it is harder to use than a fully transparent method, but
> still better than no solution at all. The alternative right now is that
> we can't do it at all.
>
> Regarding how much we should reserve: similarly, how do you choose your
> swap file size? If you choose N, why not N*120% or N*80%?
> That did not stop us from having a swapfile, right?
>
>> Why does the system OOM when there's still swap space available? And so
>> forth.
>
> Keep in mind that the reservation is an option. If you prefer the old
> behavior, you don't have to use the reservation. That shouldn't be a
> reason to stop others who want to use it. We don't have an alternative
> solution for long running mixed size allocation yet. If there is one, I
> would like to hear it.

It's not enough to make it optional. When you run into an issue, you
need to debug it. And you may debug an issue on a system that is
configured by someone else.

>> So, I prefer the transparent methods. Just like THP vs. hugetlbfs.
>
> Me too. I prefer transparent over reservation if it can achieve the
> same goal. Do we have a fully transparent method specced out? How do we
> achieve full transparency and also avoid the fragmentation caused by
> mixed order allocation/free?
>
> Keep in mind that we are still in the early stage of the mTHP swap
> development; I can have the reservation patch ready relatively easily.
> If you come up with a better transparent method patch which can achieve
> the same goal later, we can use it instead.

Because we are still in the early stage, I think that we should try to
improve the transparent solution first. Personally, what I don't like is
that we stop working on the transparent solution because we already have
the reservation solution.

>> >>
>> >> that's really important for you, I think that it's better to design
>> >> something like hugetlbfs vs core mm, that is, be separated from the
>> >> normal swap subsystem as much as possible.
>> >
>> > I brought up hugetlbfs just to make the point of using reservation, or
>> > isolation of the resource, to prevent the mixed-order fragmentation
>> > that exists in core mm.
>> > I am not suggesting copying the hugetlbfs implementation to the swap
>> > system. Unlike hugetlbfs, swap allocation is typically done from
>> > the kernel; it is transparent to the application. I don't think
>> > separating it from the swap subsystem is a good way to go.
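
[Aside, for illustration only: a minimal user-space toy sketch of the kind
of per-cluster reservation being debated here, i.e. keeping a few free
clusters out of reach of order-0 requests so that high order requests can
still find whole free clusters after a burst of small allocations. All
names below (toy_cluster, NR_CLUSTERS, RESERVED_HIGH, toy_alloc_cluster)
are hypothetical and do not come from swapfile.c or from the patch series.]

/*
 * Toy model only.  Keep RESERVED_HIGH free clusters out of reach of
 * order-0 requests so high order requests still find free clusters.
 */
#include <stdbool.h>
#include <stddef.h>

#define NR_CLUSTERS	64
#define RESERVED_HIGH	8	/* how much to reserve is the open question */

struct toy_cluster {
	bool in_use;
	unsigned int order;	/* order this cluster was handed out for */
};

static struct toy_cluster clusters[NR_CLUSTERS];

static size_t free_clusters(void)
{
	size_t i, n = 0;

	for (i = 0; i < NR_CLUSTERS; i++)
		if (!clusters[i].in_use)
			n++;
	return n;
}

static struct toy_cluster *toy_alloc_cluster(unsigned int order)
{
	size_t i;

	/* Order-0 requests may not dip into the reserved pool. */
	if (order == 0 && free_clusters() <= RESERVED_HIGH)
		return NULL;	/* caller falls back to partially used clusters */

	for (i = 0; i < NR_CLUSTERS; i++) {
		if (!clusters[i].in_use) {
			clusters[i].in_use = true;
			clusters[i].order = order;
			return &clusters[i];
		}
	}
	return NULL;
}

[The sizing of RESERVED_HIGH is exactly the "how much should be reserved?"
question raised above, which is why a transparent scheme is preferred if
one can be found.]
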
>> >
>> > This comes down to why you don't like the reservation. e.g. if we use
>> > two swapfiles, one purely allocated for high order, would that be
>> > better?
>>
>> Sorry, my words weren't accurate. Personally, I just think that it's
>> better to make the reservation related code not too intrusive.
>
> Yes. I will try to make it not too intrusive.
>
>> And, before reservation, we need to consider something else first.
>> Is it generally good to swap in with the swap-out order? Should we
>
> When we have the reservation patch (or other means to sustain mixed size
> swap allocation/free), we can test it out to get more data to reason
> about it.
> I consider the swap-in size policy an orthogonal issue.

No, I don't think so. If you swap out in a higher order but swap in in a
lower order, you make the swap clusters fragmented.

>> consider memory wastage too? One static policy doesn't fit all; we may
>> need either a dynamic policy, or to make the policy configurable.
>> In general, I think that we need to do this step by step.
>
> The core swap layer needs to be able to sustain mixed size swap
> allocation/free in the long run. Without that, the swap-in size policy
> is meaningless.
>
> Yes, that is the step by step approach: allowing long running mixed size
> swap allocation as the first step.
>
>> >> >> > Do you see another way to protect the high order clusters from
>> >> >> > being polluted by lower order ones?
>> >> >>
>> >> >> If we use high-order page allocation as the reference, we need
>> >> >> something like compaction to guarantee high-order allocation
>> >> >> finally. But we are too far from that.
>> >> >
>> >> > We should consider reservation for high-order swap entry allocation,
>> >> > similar to hugetlbfs for memory.
>> >> > Swap compaction will be very complicated because it needs to scan the
>> >> > PTEs to migrate the swap entries. It might be easier to support
>> >> > writing folios out to compound discontiguous swap entries. That is
>> >> > another way to address the fragmentation issue. We are also too far
>> >> > from that right now.
>> >>
>> >> It's not easy to write out compound discontiguous swap entries either.
>> >> For example, how do we put folios in the swap cache?
>> >
>> > I proposed the idea in the recent LSF/MM discussion; the last few
>> > slides are about discontiguous swap, and they show discontiguous
>> > entries in the swap cache.
>> > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
>> >
>> > Agreed, it is not an easy change. The swap cache would have to drop
>> > the assumption that all offsets are contiguous.
>> > For swap, we already have some in-memory data associated with each
>> > offset, so it might provide an opportunity to combine the
>> > offset-related data structures for swap together. Another alternative
>> > might be using the xarray without the multi-entry property, just
>> > treating each offset like a single entry. I haven't dug deep into this
>> > direction yet.
>>
>> Thanks! I will study your idea.
>>
>
> I am happy to discuss if you have any questions.
>
>> > We can have more discussion, maybe arrange an upstream alignment
>> > meeting if there is interest.
>>
>> Sure.
>
> Ideally, if we can resolve our differences over the mailing list, then we
> don't need to have a separate meeting :-)

--
Best Regards,
Huang, Ying
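
[Aside, for illustration only: a sketch of the "xarray without the
multi-entry property" direction mentioned above, where a large folio
written to discontiguous swap slots gets one swap cache entry per offset
and lookup stays per offset. swap_cache_xa, toy_add_to_swap_cache and
toy_swap_cache_lookup are hypothetical names; the real swap cache is
organized per swap type and currently assumes a large folio occupies a
contiguous range of offsets.]

/*
 * Sketch: one swap cache entry per swap offset, no contiguity assumed.
 * A folio written to nr discontiguous slots is stored at each of its
 * offsets individually.
 */
#include <linux/xarray.h>
#include <linux/gfp.h>

struct folio;				/* only used via pointers here */

static DEFINE_XARRAY(swap_cache_xa);	/* hypothetical, for the sketch */

static int toy_add_to_swap_cache(struct folio *folio,
				 const unsigned long *offsets, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		/* Each offset is an independent entry pointing at the folio. */
		void *old = xa_store(&swap_cache_xa, offsets[i], folio,
				     GFP_KERNEL);
		if (xa_is_err(old))
			return xa_err(old);
	}
	return 0;
}

static struct folio *toy_swap_cache_lookup(unsigned long offset)
{
	return xa_load(&swap_cache_xa, offset);
}

[The cost, as noted in the thread, is that the swap cache and the
per-offset metadata can no longer rely on a folio's offsets being
adjacent.]
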