From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li <chrisl@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,  Kairui Song
 <kasong@tencent.com>,  Ryan Roberts <ryan.roberts@arm.com>,  Kalesh Singh
 <kaleshsingh@google.com>,  linux-kernel@vger.kernel.org,
  linux-mm@kvack.org,  Barry Song <baohua@kernel.org>
Subject: Re: [PATCH v3 0/2] mm: swap: mTHP swap allocator base on swap
 cluster order
In-Reply-To: <CANeU7Qno3o-nDjYP7Pf5ZTB9Oh_zOGU0Sv_kV+aT=Z0j_tdKjg@mail.gmail.com>
	(Chris Li's message of "Fri, 21 Jun 2024 01:43:49 -0700")
References: <20240619-swap-allocator-v3-0-e973a3102444@kernel.org>
	<87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CANeU7Qno3o-nDjYP7Pf5ZTB9Oh_zOGU0Sv_kV+aT=Z0j_tdKjg@mail.gmail.com>
Date: Tue, 25 Jun 2024 10:34:42 +0800
Message-ID: <87bk3pzr5p.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Chris Li <chrisl@kernel.org> writes:

> On Wed, Jun 19, 2024 at 7:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > This is the short term solution "swap cluster order" listed
>> > in my "Swap Abstraction" discussion slide 8 in the recent
>> > LSF/MM conference.
>> >
>> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> > orders" is introduced, it only allocates the mTHP swap entries
>> > from new empty cluster list.  It has a fragmentation issue
>> > reported by Barry.
>> >
>> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>> >
>> > The reason is that all the empty clusters have been exhausted while
>> > there are plenty of free swap entries in clusters that are
>> > not 100% free.
>> >
>> > Remember the swap allocation order in the cluster.
>> > Keep track of the per order non full cluster list for later allocation.
>> >
>> > User impact: For users that allocate and free mixed-order mTHP swap entries,
>> > it greatly improves the success rate of mTHP swap allocation after the
>> > initial phase.
>> >
>> > Barry provides a test program to show the effect:
>> > https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/
>> >
>> > Without:
>> > $ mthp-swapout
>> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54%
>> > Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
>> > Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>> > Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
>> > Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
>> > Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00%
>> > Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
>> > Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
>> > Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00%
>> >
>> > $ mthp-swapout -s
>> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65%
>> > Iteration 6: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>> > Iteration 7: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
>> > Iteration 8: swpout inc: 0, swpout fallback inc: 219, Fallback percentage: 100.00%
>> > Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
>> >
>> > With:
>> > $ mthp-swapout
>> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > ...
>> > Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >
>> > $ mthp-swapout -s
>> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 5: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 6: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 7: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 8: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > ...
>> > Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> > Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
>>
>> Unfortunately, the data was obtained using a specially designed test program
>> which always swaps in pages at the swapped-out size.  I don't know whether
>> such workloads exist in reality.  Otherwise, you need to wait for mTHP
>
> The test program is designed to simulate mTHP swap behavior using
> zsmalloc and 64KB buffer.
> If we insist on only designing for existing workloads, then zsmalloc
> with a 64KB buffer will never be able to run, precisely because the
> kernel has a high failure rate allocating swap entries for 64KB. There
> is a bit of a chicken-and-egg problem there: such a usage cannot exist
> because the kernel can't support it yet, and the kernel can't add patches to
> support it because such simulation tests are not "real".
>
> We need to break this cycle to support something new.
>
>> swap-in to be merged firstly, and people reach consensus that we should
>> always swap-in pages with swapped-out size.
>
> We don't have to be always. We can identify the situation that makes
> sense. For the zram/zsmalloc 64K buffer usage case, swap out as the
> same swap in size makes sense.
> I think we have agreement on such zsmalloc 64K usage cases we do want
> to support.
>
>>
>> Alternately, we can make some design adjustment to make the patchset
>> work in current situation (mTHP swap-out, normal page swap-in).
>>
>> - One non-full cluster list for each order (same as current design)
>>
>> - When one swap entry is freed, check whether one "order+1" swap entry
>>   becomes free, if so, move the cluster to "order+1" non-full cluster
>>   list.
>
> In the intended zsmalloc usage case, there is no order+1 swap entry
> request.

This is my main concern about this series.  Only the Android use cases are
considered.  The general use cases are just ignored.  Is it hard to
consider or test a normal swap partition on your development machine?

> Moving the cluster to "order+1" will make fewer clusters available for "order".
> For that usage case it is negative gain.

An "order+1" cluster can be used to allocate "order" swap entries when
the existing "order" clusters are used up.

And in this way, we can protect clusters with more free space so that
they may eventually become completely free.
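
To make the fallback order concrete, here is a rough user-space sketch, not
kernel code.  The names (struct cluster, struct swap_lists, pick_cluster) and
the NR_ORDERS value are made up for illustration; the search order (free list,
then "order", "order+1", ...) follows the proposal in my previous email.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ORDERS 10	/* hypothetical: orders 0..9 */

/* Hypothetical, simplified stand-in for a swap cluster. */
struct cluster {
	struct cluster *next;
	int order;	/* which non-full list the cluster sits on */
};

struct swap_lists {
	struct cluster *free;			/* completely free clusters */
	struct cluster *nonfull[NR_ORDERS];	/* per-order non-full clusters */
};

/* Remove and return the first cluster of a list, or NULL if it is empty. */
struct cluster *pop(struct cluster **head)
{
	struct cluster *c = *head;

	if (c)
		*head = c->next;
	return c;
}

/*
 * Pick a cluster for an allocation of the given order: try the free list,
 * then the non-full list of "order", then "order+1", and so on.
 * Returns NULL when the caller has to fall back to order 0.
 */
struct cluster *pick_cluster(struct swap_lists *sl, int order)
{
	struct cluster *c;
	int o;

	assert(order >= 0 && order < NR_ORDERS);

	c = pop(&sl->free);
	if (c)
		return c;

	for (o = order; o < NR_ORDERS; o++) {
		c = pop(&sl->nonfull[o]);
		if (c)
			return c;
	}
	return NULL;
}
```

Because the "order" list is always tried before "order+1", the clusters with
larger free blocks are only touched when nothing smaller is left.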

>> - When allocate swap entry with "order", get cluster from free, "order",
>>   "order+1", ... non-full cluster list.  If all are empty, fallback to
>
> I don't see that it is useful for the zsmalloc 64K buffer usage case.
> There will be order 0 and order 4 and nothing else.
>
> How about we keep it simple for now? If we identify some workload
> this algorithm can help, we can do that as a follow-up step.

The simple design isn't flexible enough for your workloads either.  For
example,

- Initially, almost only order-0 pages are swapped out, most non-full
  clusters are order-0.

- Later, quite a few order-0 swap entries are freed so that there are
  quite a few order-4 swap entries available.

- Order-4 pages need to be swapped out, but not enough order-4 non-full
  clusters are available.

So, we need a way to migrate non-full clusters among orders to adapt to
such situations automatically.
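
The migration step could look roughly like the sketch below.  It is a
simplified user-space model under stated assumptions: a 512-slot cluster
tracked with a plain bool array, and hypothetical names (block_is_free,
free_slot_and_promote).  It only checks the naturally aligned block around
the freed slot, since that is the only block that can newly become free.

```c
#include <assert.h>
#include <stdbool.h>

#define CLUSTER_SLOTS	512	/* 512 4K slots per 2M cluster */
#define CLUSTER_ORDER	9	/* 2^9 == CLUSTER_SLOTS */

/* Hypothetical, simplified stand-in for a swap cluster. */
struct cluster {
	bool used[CLUSTER_SLOTS];
	int order;	/* non-full list the cluster currently sits on */
};

/* True if the naturally aligned 2^order block containing slot is all free. */
bool block_is_free(const struct cluster *c, int slot, int order)
{
	int start = slot & ~((1 << order) - 1);
	int i;

	for (i = start; i < start + (1 << order); i++)
		if (c->used[i])
			return false;
	return true;
}

/*
 * Free one slot, then move the cluster up one list at a time for as long
 * as the next larger aligned block around the freed slot is free.
 * Returns the order of the list the cluster ends up on.  Reaching
 * CLUSTER_ORDER means the whole cluster is free.
 */
int free_slot_and_promote(struct cluster *c, int slot)
{
	assert(slot >= 0 && slot < CLUSTER_SLOTS);
	c->used[slot] = false;

	while (c->order < CLUSTER_ORDER &&
	       block_is_free(c, slot, c->order + 1))
		c->order++;
	return c->order;
}
```

In the scenario above, a cluster on the order-0 list would move to the
order-4 list once 16 aligned neighbouring entries have been freed, without
any explicit tuning.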

>>   order 0.
>>
>> Do you think that this works?
>>
>> > Reported-by: Barry Song <21cnbao@gmail.com>
>> > Signed-off-by: Chris Li <chrisl@kernel.org>
>> > ---
>> > Changes in v3:
>> > - Using V1 as base.
>> > - Rename "next" to "list" for the list field, suggested by Ying.
>> > - Update comment for the locking rules for cluster fields and list,
>> >   suggested by Ying.
>> > - Allocate from the nonfull list before attempting free list, suggested
>> >   by Kairui.
>>
>> Haven't looked into this.  It appears that this breaks the original
>> discard behavior which helps performance of some SSD, please refer to
>
> Can you clarify whether by "discard" you mean the SSD discard command or
> just the way the swap allocator recycles free clusters?

The SSD discard command, like in the following URL,

https://en.wikipedia.org/wiki/Trim_(computing)

>> commit 2a8f94493432 ("swap: change block allocation algorithm for SSD").
>
> I did read that change log. Help me understand in more detail which
> discard behavior you have in mind. A lot of low end micro SD cards
> have proper FTL wear leveling now, ssd even better on that.

It's not about the FTL; it's the discard/trim for SSD mentioned above.

>> And as pointed out by Ryan, this may reduce the opportunity of the
>> sequential block device writing during swap-out, which may hurt
>> performance of SSD too.
>
> Only at the initial phase. If the swap IO continues, after the first
> pass fills up the swap file, the write will be random on the swapfile
> anyway. Because the swapfile only issues 2M discard commands when all
> 512 4K pages are free, the discarded area will be much smaller than
> the free area on the swapfile. That, combined with random page writes across
> the whole swap file, might produce worse internal write
> amplification for SSD, compared to only writing a subset of the
> swapfile area. I would love to hear from someone who understands SSD
> internals to confirm or deny my theory.

It depends on the workload.  Some workloads will have more severe
fragmentation than others.  For example, on quite a few machines, the
swap devices are kept far from full to avoid possible OOM.

> Even let's assume the SSD wants a free block over a nonfull cluster
> first. Zswap and zram swap are not subject to SSD property. We might
> want to have a kernel option to select using  nonfree clusters over
> the free one for zram and zswap (ghost swapfile). That will help
> contain the fragmented swap area.

I doubt that it will help fragmentation avoidance much.  Please prove
its effectiveness with data first.  It can be a further optimization
patch in the series.

Even if we really need it, we can try to do it without a kernel option.
For example, detect whether we are using zram and enable it for zram
automatically (through a general flag).
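
Something along these lines, as a sketch only.  The flag name, the struct and
the memory_backed parameter are hypothetical; how the kernel would actually
detect zram is left open here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit; not one of the kernel's SWP_* definitions. */
#define SKETCH_SWP_PREFER_NONFULL	(1u << 0)

/* Hypothetical, simplified stand-in for per-device swap state. */
struct swap_dev {
	unsigned int flags;
};

/*
 * Set the policy once at swapon time from a general property of the
 * backing device, instead of asking the user through a kernel option.
 */
void swap_dev_init(struct swap_dev *dev, bool memory_backed)
{
	assert(dev);
	dev->flags = 0;
	if (memory_backed)
		dev->flags |= SKETCH_SWP_PREFER_NONFULL;
}

/* The allocator consults the flag, not the device type. */
bool prefer_nonfull_first(const struct swap_dev *dev)
{
	return (dev->flags & SKETCH_SWP_PREFER_NONFULL) != 0;
}
```

The allocator then only tests the flag, so any other memory-backed device
could opt in later without another option.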

--
Best Regards,
Huang, Ying