From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
In-Reply-To: (Chris Li's message of "Wed, 5 Jun 2024 00:08:12 -0700")
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
    <87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <875xuw1062.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 06 Jun 2024 09:55:24 +0800
Message-ID: <875xum96nn.fsf@yhuang6-desk2.ccr.corp.intel.com>

Chris Li writes:

> On Thu, May 30, 2024 at 7:37 PM Huang, Ying wrote:
>>
>> Chris Li writes:
>>
>> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying wrote:
>> > because Android does not have too many CPUs. We are talking about a
>> > handful of clusters, which might not justify the code complexity. It
>> > does not change the behavior that order 0 can pollute higher orders.
>>
>> I have a feeling that you don't really know why swap_map[] is scanned.
>> I suggest you do more testing and tracing to find out the reason. I
>> suspect that there are some non-full cluster collection issues.
>
> swap_map[] is scanned because we run out of non-full clusters. This
> can happen because Android tries to make full use of the swapfile.
> However, once the swap_map[] scan happens, the non-full cluster is
> polluted.
>
> I currently don't have a local reproduction of the issue Barry
> reported. However, here is one data point: two swap files, one for
> high order allocation only, with this patch. No fallback. If there
> were a non-full cluster collection issue, we should see the fallback
> in this case as well.
>
> BTW, with the same setup but without this patch series, it will fall
> back on the high order allocation as well.
>
>> >> Another issue is that nonfull_cluster[order1] cannot be used for
>> >> nonfull_cluster[order2]. By definition, we should not fail order 0
>> >> allocation; we need to steal nonfull_cluster[order>0] for order 0
>> >> allocation. This can avoid scanning swap_map[] too. This may not
>> >> be perfect, but it is the simplest first-step implementation. You
>> >> can optimize based on it further.
>> >
>> > Yes, that is listed as a limitation of this cluster order approach.
>> > Initially we need to support one order well first. We might choose
>> > which order that is: 16K or 64K folios. 4K pages are too small, and
>> > 2M pages are too big; the sweet spot might be somewhere in between.
>> > If we can support one order well, we can demonstrate the value of
>> > mTHP. We can worry about other mixed orders later.
>> >
>> > Do you have any suggestions for how to prevent order 0 from
>> > polluting the higher order clusters? If we allow that to happen, it
>> > defeats the goal of being able to allocate higher order swap
>> > entries. The tricky question is that we don't know how much swap
>> > space we should reserve for each order. We can always break higher
>> > order clusters into lower orders, but can't do the reserves.
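[Editorial sketch: the stealing fallback Ying describes above, modeled in
user space. This is illustrative only, not kernel code; the `struct
cluster`, the `nonfull[]` lists, and the function names are all invented
for the sketch.]

```c
#include <assert.h>
#include <stddef.h>

#define NR_ORDERS 4

/* Toy model of a swap cluster: 'order' is the allocation order the
 * cluster currently serves, 'free_slots' how many entries remain. */
struct cluster {
    int order;
    int free_slots;
    struct cluster *next;   /* link in nonfull[order] */
};

/* Per-order lists of clusters that still have free slots. */
static struct cluster *nonfull[NR_ORDERS];

static void push_nonfull(struct cluster *c)
{
    c->next = nonfull[c->order];
    nonfull[c->order] = c;
}

/* Order-0 fallback: prefer a nonfull order-0 cluster; otherwise steal
 * a nonfull cluster of any higher order and repurpose it for order 0,
 * instead of falling back to a slot-by-slot swap_map[] scan. */
static struct cluster *take_order0_cluster(void)
{
    for (int o = 0; o < NR_ORDERS; o++) {
        if (nonfull[o]) {
            struct cluster *c = nonfull[o];

            nonfull[o] = c->next;
            c->order = 0;   /* now serves order-0 entries */
            return c;
        }
    }
    return NULL;            /* caller would have to scan swap_map[] */
}
```

The trade-off under discussion still shows in the model: once a higher
order cluster is repurposed this way, it is lost to higher order
allocations until it becomes completely free again.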
>> > The current patch series lets the actual usage determine the
>> > percentage of the cluster space used for each order. However, that
>> > seems not to be enough for the test case Barry has. When the app
>> > gets OOM killed, that is when a large swing of order 0 swap shows
>> > up, with not enough higher order usage for that brief moment. The
>> > order 0 swap entries will pollute the high order clusters. We are
>> > currently debating a "knob" to be able to reserve a certain % of
>> > swap space for a certain order. Those reservations will be
>> > guaranteed, and order 0 swap entries can't pollute them even when
>> > swap space runs out. That can make mTHP at least usable for the
>> > Android case.
>>
>> IMO, the bottom line is that order-0 allocation is the first class
>> citizen; we must keep it optimized. And OOM with free swap space
>> isn't acceptable. Please consider the policy we used for page
>> allocation.
>
> We need to make both order-0 and high order allocation work after the
> initial pass of allocating from empty clusters. Having only order-0
> allocation work is not good enough.
>
> On the page allocation side, we have hugetlbfs, which reserves some
> memory for high order pages. We should have something similar that
> allows reserving some high order swap entries without them getting
> polluted by low order ones.

TBH, I don't like the idea of high order swap entry reservation. If
that's really important for you, I think that it's better to design
something like hugetlbfs vs. core mm, that is, separated from the
normal swap subsystem as much as possible.

>> > Do you see another way to protect the high order clusters from
>> > being polluted by lower order ones?
>>
>> If we use high-order page allocation as a reference, we need
>> something like compaction to guarantee high-order allocation
>> eventually. But we are too far from that.
>
> We should consider reservation for high-order swap entry allocation,
> similar to hugetlbfs for memory.
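[Editorial sketch: one possible behavior for the per-order reservation
"knob" debated above, again as a hypothetical user-space model. The pool
layout, the `reserved[]` numbers, and the stealing policy are assumptions
for illustration; no interface like this has been agreed on in the
thread.]

```c
#include <assert.h>
#include <stdbool.h>

#define NR_ORDERS 4

/* Hypothetical knob: 'reserved[o]' clusters are guaranteed to order-o
 * allocations; anything above that count is shared excess. */
static int reserved[NR_ORDERS]      = { 0, 0, 4, 2 };
static int free_clusters[NR_ORDERS] = { 10, 0, 4, 2 };

/* May an order-'order' allocation take a free cluster from pool 'from'?
 * A pool's own order may dig into its reservation; any other order
 * (notably order 0) may only take the unreserved excess. */
static bool can_take(int order, int from)
{
    if (free_clusters[from] == 0)
        return false;
    if (order == from)
        return true;
    return free_clusters[from] > reserved[from];
}

/* Returns the pool a cluster was taken from, or -1 on failure. */
static int take_cluster(int order)
{
    if (can_take(order, order)) {
        free_clusters[order]--;
        return order;
    }
    for (int from = NR_ORDERS - 1; from >= 0; from--) {
        if (from != order && can_take(order, from)) {
            free_clusters[from]--;
            return from;
        }
    }
    return -1;
}
```

With these numbers, order 0 can drain its own pool and any excess but
never dips below the order-2 and order-3 reservations, so those orders
keep succeeding under order-0 pressure. The cost is exactly the behavior
Ying objects to: order 0 can fail while reserved clusters sit free.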
> Swap compaction would be very complicated because it needs to scan
> the page tables to migrate the swap entries. It might be easier to
> support writing a folio out to compound discontiguous swap entries.
> That is another way to address the fragmentation issue. We are also
> too far from that right now.

It's not easy to write out compound discontiguous swap entries either.
For example, how would we put such folios in the swap cache?

>> For specific configurations, I believe that we can get a reasonable
>> high-order swap entry allocation success rate for specific use
>> cases. For example, if we only allow a limited maximum number of
>> order-0 swap entry allocations, can we keep high-order clusters?
>
> Yes, we can, by having a knob to reserve some high order swap space.
> Limiting order 0 is the same as having some high order swap entries
> reserved.
>
> That is a short term solution.

--
Best Regards,
Huang, Ying