From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
From: "Huang, Ying" <ying.huang@intel.com>
To: Zi Yan <ziy@nvidia.com>
Cc: <linux-mm@kvack.org>,  <linux-kernel@vger.kernel.org>,
 Ryan Roberts <ryan.roberts@arm.com>,  Andrew Morton
 <akpm@linux-foundation.org>,  "Matthew Wilcox (Oracle)"
 <willy@infradead.org>,  David Hildenbrand <david@redhat.com>,
 "Yin, Fengwei" <fengwei.yin@intel.com>, Yu Zhao <yuzhao@google.com>,
  Vlastimil Babka <vbabka@suse.cz>,  Johannes Weiner <hannes@cmpxchg.org>,
  Baolin Wang <baolin.wang@linux.alibaba.com>,  Kemeng Shi
 <shikemeng@huaweicloud.com>,  Mel Gorman <mgorman@techsingularity.net>,
  "Rohan Puri" <rohan.puri15@gmail.com>,  Mcgrof Chamberlain
 <mcgrof@kernel.org>,
  "Adam Manzanares" <a.manzanares@samsung.com>, John Hubbard
 <jhubbard@nvidia.com>
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
References: <20230912162815.440749-1-zi.yan@sent.com>
	<87a5ssjmld.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<14089E95-251E-43A4-AF32-C9773723C810@nvidia.com>
Date: Tue, 10 Oct 2023 14:08:08 +0800
In-Reply-To: <14089E95-251E-43A4-AF32-C9773723C810@nvidia.com> (Zi Yan's
	message of "Mon, 09 Oct 2023 09:43:38 -0400")
Message-ID: <87r0m3ggc7.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>


Something is wrong with my mailbox.  Sorry if you received duplicate
mail.

Zi Yan <ziy@nvidia.com> writes:

> On 9 Oct 2023, at 3:12, Huang, Ying wrote:
>
>> Hi, Zi,
>>
>> Thanks for your patch!
>>
>> Zi Yan <zi.yan@sent.com> writes:
>>
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> Hi all,
>>>
>>> This patchset enables >0 order folio memory compaction, which is one of
>>> the prerequisites for large folio support[1]. It is on top of
>>> mm-everything-2023-09-11-22-56.
>>>
>>> Overview
>>> ===
>>>
>>> To support >0 order folio compaction, the patchset changes how free pages used
>>> for migration are kept during compaction.
>>
>> migrate_pages() can split a large folio on allocation failure.  So
>> the minimal implementation could be
>>
>> - allow large folios to be migrated in compaction
>> - return -ENOMEM for order > 0 in compaction_alloc()
>>
>> The performance may not be desirable.  But that may be a baseline for
>> further optimization.
>
> I would imagine it might cause a regression since compaction might gradually
> split high order folios in the system.

I would not call it a pure regression, since large folios can be
migrated during compaction with that, but it's possible that this hurts
performance.

Anyway, this can be a not-so-good minimal baseline.
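
To make that baseline concrete, here is a rough user-space model of it
(not kernel code).  The allocation callback refuses any order > 0, and
the caller reacts by splitting the source folio and retrying with base
pages.  All names (struct model_freelist, model_compaction_alloc(),
model_migrate_folio()) are made up for illustration and only mimic the
shape of compaction_alloc() and migrate_pages().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model: only a count of isolated order-0 free pages. */
struct model_freelist {
	unsigned long nr_order0;
};

/*
 * Model of the baseline allocation callback: order-0 requests are
 * served while free pages remain, anything larger gets -ENOMEM.
 */
static int model_compaction_alloc(struct model_freelist *fl,
				  unsigned int order)
{
	if (order > 0 || fl->nr_order0 == 0)
		return -ENOMEM;
	fl->nr_order0--;
	return 0;
}

/*
 * Model of the caller: try the folio as a whole first; if that fails
 * and the folio is large, split it into 1 << order base pages and
 * migrate those one by one.  Returns the number of base pages that
 * got a target page; *was_split tells whether the fallback was used.
 */
static unsigned long model_migrate_folio(struct model_freelist *fl,
					 unsigned int order, int *was_split)
{
	unsigned long i, moved = 0;

	assert(order < 8 * sizeof(unsigned long));
	*was_split = 0;

	if (model_compaction_alloc(fl, order) == 0)
		return 1UL << order;	/* only reachable for order 0 */
	if (order == 0)
		return 0;		/* nothing to split */

	*was_split = 1;
	for (i = 0; i < (1UL << order); i++) {
		if (model_compaction_alloc(fl, 0) != 0)
			break;
		moved++;
	}
	return moved;
}
```

With 8 free base pages, migrating one order-2 folio in this model takes
the split path and consumes 4 of them, which is the cost we would want
to measure against the later optimizations.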

> But I can move Patch 4 first to make this the baseline and see how
> system performance changes.

Thanks!

>>
>> And, if we can measure the performance for each step of optimization,
>> that will be even better.
>
> Do you have any benchmark in mind for the performance tests? vm-scalability?

I remember Mel Gorman did some defragmentation tests before.
But those were for order-0 pages.

>>> Free pages used to be split into
>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>> on their order without post allocation processing. When migrate_pages() asks for
>>> a new page, one of the free pages, based on the requested page order, is
>>> then processed and given out.
>>>
>>>
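
The per-order bookkeeping described above can be pictured with a small
user-space sketch.  This is only a model under assumptions: struct
model_page, struct model_freepages, the helper names and the value of
MODEL_MAX_ORDER are hypothetical and are not taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_ORDER 10	/* assumed value, for this sketch only */

/* Hypothetical stand-in for a free page kept for migration. */
struct model_page {
	unsigned long pfn;
	unsigned int order;
	struct model_page *next;
};

/* One list per order, MODEL_MAX_ORDER + 1 lists in total. */
struct model_freepages {
	struct model_page *list[MODEL_MAX_ORDER + 1];
};

/* Keep a free page on the list that matches its order, unprocessed. */
static void model_add_free(struct model_freepages *fp,
			   struct model_page *page)
{
	assert(page->order <= MODEL_MAX_ORDER);
	page->next = fp->list[page->order];
	fp->list[page->order] = page;
}

/*
 * Hand out a free page of exactly the requested order, or NULL if the
 * matching list is empty.  Post allocation processing would happen
 * here, at hand-out time, in the scheme described above.
 */
static struct model_page *model_take_free(struct model_freepages *fp,
					  unsigned int order)
{
	struct model_page *page;

	if (order > MODEL_MAX_ORDER)
		return NULL;
	page = fp->list[order];
	if (!page)
		return NULL;
	fp->list[order] = page->next;
	page->next = NULL;
	return page;
}
```

In this model a request for an order with an empty list simply fails,
which is the case the free page split optimization is meant to cover.
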
>>> Optimizations
>>> ===
>>>
>>> 1. Free page split is added to increase migration success rate in case
>>> a source page does not have a matching free page in the free page lists.
>>> Free page merge is possible but not implemented, since existing
>>> PFN-based buddy page merge algorithm requires the identification of
>>> buddy pages, but free pages kept for memory compaction cannot have
>>> PageBuddy set to avoid confusing other PFN scanners.
>>>
>>> 2. Sort source pages in ascending order before migration is added to
>>
>> Trivial.
>>
>> s/ascending/descending/
>>
>>> reduce free page split. Otherwise, high order free pages might be
>>> prematurely split, causing undesired high order folio migration failures.
>>>
>>>
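
To illustrate why the migration order matters once free pages can be
split, here is a user-space sketch that tracks only a count of free
pages per order.  It assumes the smallest sufficient free page is the
one that gets split, which is a choice made for this model and not
necessarily what the patches do.  All names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_MAX_ORDER 10	/* assumed value, for this sketch only */

/* Hypothetical model: number of free pages kept per order. */
struct model_freepages {
	unsigned long nr[MODEL_MAX_ORDER + 1];
};

/*
 * Take one free page of the requested order.  If none exists, split
 * the smallest larger free page; every halving puts the unused half
 * back on the next lower list.  Returns 0 on success, -1 otherwise.
 */
static int model_alloc_split(struct model_freepages *fp, unsigned int order)
{
	unsigned int o;

	for (o = order; o <= MODEL_MAX_ORDER; o++)
		if (fp->nr[o])
			break;
	if (o > MODEL_MAX_ORDER)
		return -1;

	fp->nr[o]--;
	while (o > order) {
		o--;
		fp->nr[o]++;
	}
	return 0;
}

/* qsort() comparator that puts the highest order first. */
static int cmp_desc(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *)a;
	unsigned int y = *(const unsigned int *)b;

	return (x < y) - (x > y);
}

/*
 * Migrate source folios in the given sequence on a private copy of
 * the free lists.  Returns how many base pages found a target.
 */
static unsigned long model_migrate(struct model_freepages fp,
				   const unsigned int *orders, size_t n)
{
	unsigned long moved = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (model_alloc_split(&fp, orders[i]) == 0)
			moved += 1UL << orders[i];
	return moved;
}
```

With one order-2 and one order-0 free page and source orders {0, 0, 2},
the ascending sequence splits the order-2 page for the second base page
and the order-2 folio then fails, so only 2 base pages move.  Sorted
with cmp_desc, the order-2 folio is placed first and 5 base pages move.
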
>>> TODOs
>>> ===
>>>
>>> 1. Refactor free page post allocation and free page preparation code so
>>> that compaction_alloc() and compaction_free() can call functions instead
>>> of hard coding.
>>>
>>> 2. One possible optimization is to allow migrate_pages() to continue
>>> even if get_new_folio() returns a NULL. In general, that means there is
>>> not enough memory. But in >0 order folio compaction case, that means
>>> there is no suitable free page at source page order. It might be better
>>> to skip that page and finish the rest of migration to achieve a better
>>> compaction result.
>>
>> We can split the source folio if get_new_folio() returns NULL.  So, do
>> we really need this?
>
> It depends. It helps in the situation where the system is about to
> allocate a high order free page and triggers a compaction: it is possible to
> get the high order free page by migrating a bunch of base pages instead of
> splitting an existing high order folio.
>
>>
>> In general, we may reconsider all further optimizations given splitting
>> is available already.
>
> In my mind, split should be avoided as much as possible.

If so, should we use "nosplit" logic in migrate_pages_batch() in some
situations?

> But it really depends
> on the actual situation, e.g., how much effort and cost the compaction wants
> to pay to get memory defragmented. If the system really wants to get a high
> order free page at any cost, split can be used without any issue. But applications
> might lose performance because existing large folios are split just to
> form a new one.

Is it possible that splitting is desirable in some situations?  For
example, to allocate some large DMA buffers at the cost of large
anonymous folios?

> Like I said in the email, there are tons of optimizations and policies for us
> to explore. We can start with the bare minimum support (if no performance
> regression is observed, we can even start with splitting all high order folios like you
> suggested) and add optimizations one by one.

Sounds good to me!  Thanks!

>>
>>> 3. Another possible optimization is to enable free page merge. It is
>>> possible that a to-be-migrated page causes free page split then fails to
>>> migrate eventually. We would lose a high order free page without free
>>> page merge function. But a way of identifying free pages for memory
>>> compaction is needed to reuse existing PFN-based buddy page merge.
>>>
>>> 4. The implemented >0 order folio compaction algorithm is quite naive
>>> and does not consider all possible situations. A better algorithm can
>>> improve compaction success rate.
>>>
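
For reference, the PFN-based merge mentioned in TODO 3 depends on
finding the buddy at pfn ^ (1 << order) and on knowing that the buddy
is free at the same order.  The user-space sketch below uses a small
tag array as a stand-in for that identification.  The array, the sizes
and the function names are hypothetical and chosen for illustration.

```c
#include <assert.h>

#define MODEL_NR_PFNS	16	/* assumed size of the modeled PFN range */
#define MODEL_MAX_ORDER	4	/* one order-4 block covers the range */

/*
 * free_tag[pfn] == order + 1 if a free block of that order starts at
 * pfn, 0 otherwise.  This plays the role of the identification that
 * free pages kept for compaction currently lack.
 */
static unsigned char free_tag[MODEL_NR_PFNS];

/* The buddy of a block differs from it only in bit 'order' of the PFN. */
static unsigned long model_buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

/*
 * Return a block to the model and merge it upward for as long as its
 * buddy is tagged free at the same order.  Returns the final order.
 */
static unsigned int model_free_and_merge(unsigned long pfn,
					 unsigned int order)
{
	assert(pfn < MODEL_NR_PFNS);
	assert((pfn & ((1UL << order) - 1)) == 0);

	while (order < MODEL_MAX_ORDER) {
		unsigned long buddy = model_buddy_pfn(pfn, order);

		if (buddy >= MODEL_NR_PFNS || free_tag[buddy] != order + 1)
			break;
		free_tag[buddy] = 0;
		pfn &= buddy;	/* merged block starts at the lower PFN */
		order++;
	}
	free_tag[pfn] = order + 1;
	return order;
}
```

Freeing PFN 0 at order 0, then PFN 2 at order 1, then PFN 1 at order 0
ends with a single order-2 block at PFN 0 in this model.  Without the
tag check the loop has no safe way to decide whether a buddy may be
merged, which is the gap TODO 3 points at.
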
>>>
>>> Feel free to give comments and ask questions.
>>>
>>> Thanks.
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
>>>
>>> Zi Yan (4):
>>>   mm/compaction: add support for >0 order folio memory compaction.
>>>   mm/compaction: optimize >0 order folio compaction with free page
>>>     split.
>>>   mm/compaction: optimize >0 order folio compaction by sorting source
>>>     pages.
>>>   mm/compaction: enable compacting >0 order folios.
>>>
>>>  mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>>>  mm/internal.h   |   7 +-
>>>  2 files changed, 176 insertions(+), 36 deletions(-)

--
Best Regards,
Huang, Ying