From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Andrew Morton, Kairui Song, Ryan Roberts, Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH v3 0/2] mm: swap: mTHP swap allocator base on swap cluster order
In-Reply-To: (Chris Li's message of "Fri, 26 Jul 2024 00:22:39 -0700")
References: <20240619-swap-allocator-v3-0-e973a3102444@kernel.org>
 <87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87bk3pzr5p.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87frrw3aju.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 26 Jul 2024 15:29:13 +0800
Message-ID: <877cd8392u.fsf@yhuang6-desk2.ccr.corp.intel.com>

Chris Li writes:

> On Fri, Jul 26, 2024 at 12:01 AM Huang, Ying wrote:
>>
>> Chris Li writes:
>>
>> > On Mon, Jun 24, 2024 at 7:36 PM Huang, Ying wrote:
>> >>
>> >> Chris Li writes:
>> >>
>> >> > On Wed, Jun 19, 2024 at 7:32 PM Huang, Ying wrote:
>> >> >>
>> >> >> Chris Li writes:
>> >> >>
>> >> >> > This is the short term solution "swap cluster order" listed
>> >> >> > in my "Swap Abstraction" discussion, slide 8 in the recent
>> >> >> > LSF/MM conference.
>> >> >> >
>> >> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> >> >> > orders" was introduced, it only allocated the mTHP swap entries
>> >> >> > from the new empty cluster list. It has a fragmentation issue
>> >> >> > reported by Barry.
>> >> >> >
>> >> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>> >> >> >
>> >> >> > The reason is that all the empty clusters have been exhausted
>> >> >> > while there are plenty of free swap entries in the clusters that
>> >> >> > are not 100% free.
>> >> >> >
>> >> >> > Remember the swap allocation order in the cluster.
>> >> >> > Keep track of the per order non full cluster list for later allocation.
>> >> >> >
>> >> >> > User impact: For users that allocate and free mixed order mTHP
>> >> >> > swapping, it greatly improves the success rate of the mTHP swap
>> >> >> > allocation after the initial phase.
>> >> >> >
>> >> >> > Barry provides a test program to show the effect:
>> >> >> > https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/
>> >> >> >
>> >> >> > Without:
>> >> >> > $ mthp-swapout
>> >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54%
>> >> >> > Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
>> >> >> > Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>> >> >> > Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
>> >> >> > Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
>> >> >> > Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00%
>> >> >> > Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
>> >> >> > Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
>> >> >> > Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00%
>> >> >> >
>> >> >> > $ mthp-swapout -s
>> >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65%
>> >> >> > Iteration 6: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>> >> >> > Iteration 7: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
>> >> >> > Iteration 8: swpout inc: 0, swpout fallback inc: 219, Fallback percentage: 100.00%
>> >> >> > Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
>> >> >> >
>> >> >> > With:
>> >> >> > $ mthp-swapout
>> >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > ...
>> >> >> > Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> >
>> >> >> > $ mthp-swapout -s
>> >> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 5: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 6: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 7: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 8: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > ...
>> >> >> > Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >> > Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
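(Aside: to make the cover letter's bookkeeping concrete, here is a minimal
sketch in C of per-order non-full cluster lists. Every name below,
swap_info_sketch, swap_cluster_sk, SWAP_NR_ORDERS_SK, cluster_alloc_order,
is made up for illustration; this is not the patch's actual code. It
follows the v3 changelog behavior of trying the non-full list before the
free list.)

#include <linux/list.h>

/*
 * Sketch only: per-order non-full cluster lists as described in the
 * cover letter.  All identifiers are hypothetical.
 */
#define SWAP_NR_ORDERS_SK 5		/* e.g. order 0 (4K) .. order 4 (64K) */

struct swap_cluster_sk {
	struct list_head list;		/* on free_clusters or nonfull[] */
	unsigned int count;		/* allocated entries in this cluster */
	unsigned int order;		/* order remembered at first use */
};

struct swap_info_sketch {
	struct list_head free_clusters;			/* 100% free clusters */
	struct list_head nonfull[SWAP_NR_ORDERS_SK];	/* partly free, per order */
};

/* Prefer a non-full cluster of the same order, then an empty cluster. */
static struct swap_cluster_sk *
cluster_alloc_order(struct swap_info_sketch *si, unsigned int order)
{
	struct swap_cluster_sk *ci;

	if (!list_empty(&si->nonfull[order]))
		return list_first_entry(&si->nonfull[order],
					struct swap_cluster_sk, list);

	if (list_empty(&si->free_clusters))
		return NULL;	/* caller falls back to order 0 */

	ci = list_first_entry(&si->free_clusters,
			      struct swap_cluster_sk, list);
	ci->order = order;	/* remember the allocation order */
	list_move_tail(&ci->list, &si->nonfull[order]);
	return ci;
}

The point is just that a cluster, once used for a given order, stays
reachable by that order after it becomes partially free, instead of only
reappearing when it is 100% free again.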
>> >> >>
>> >> >> Unfortunately, the data was obtained using a specially designed test
>> >> >> program which always swaps pages in at their swapped-out size. I
>> >> >> don't know whether such workloads exist in reality. Otherwise, you
>> >> >> need to wait for mTHP
>> >> >
>> >> > The test program is designed to simulate mTHP swap behavior using
>> >> > zsmalloc and a 64KB buffer.
>> >> > If we insist on only designing for existing workloads, then the
>> >> > zsmalloc 64KB buffer usage will never be able to run, exactly because
>> >> > the kernel has a high failure rate allocating swap entries for 64KB.
>> >> > There is a bit of a chicken-and-egg problem there: such a usage cannot
>> >> > exist because the kernel can't support it yet, and the kernel can't
>> >> > add patches to support it because such simulation tests are not "real".
>> >> >
>> >> > We need to break this cycle to support something new.
>> >> >
>> >> >> swap-in to be merged first, and for people to reach consensus that
>> >> >> we should always swap pages in at their swapped-out size.
>> >> >
>> >> > It doesn't have to be always. We can identify the situations that make
>> >> > sense. For the zram/zsmalloc 64K buffer usage case, swapping out at
>> >> > the same size as swap-in makes sense.
>> >> > I think we have agreement that the zsmalloc 64K usage case is one we
>> >> > do want to support.
>> >> >
>> >> >>
>> >> >> Alternatively, we can make some design adjustments to make the
>> >> >> patchset work in the current situation (mTHP swap-out, normal page
>> >> >> swap-in).
>> >> >>
>> >> >> - One non-full cluster list for each order (same as the current design)
>> >> >>
>> >> >> - When one swap entry is freed, check whether one "order+1" swap entry
>> >> >>   becomes free; if so, move the cluster to the "order+1" non-full
>> >> >>   cluster list.
>> >> >
>> >> > In the intended zsmalloc usage case, there is no order+1 swap entry
>> >> > request.
>> >>
>> >> This is my main concern about this series. Only the Android use cases
>> >> are considered. The general use cases are just ignored. Is it hard to
>> >> consider or test a normal swap partition on your development machine?
>> >
>> > Please see the V4 cover letter. V4 already has the SSD, zram and HDD
>> > stress testing.
>> > Of course I want to make sure the allocator works well with Barry's
>> > mTHP test case as well.
>> >
>> >> > Moving the cluster to "order+1" will make fewer clusters available
>> >> > for "order". For that usage case it is a net loss.
>> >>
>> >> The "order+1" cluster can be used for "order" allocation when the
>> >> existing "order" clusters are used up.
>> >>
>> >> And in this way, we can protect the clusters with more free space so
>> >> that they may become free.
>> >>
>> >> >> - When allocating a swap entry of "order", get a cluster from the
>> >> >>   free, "order", "order+1", ... non-full cluster lists. If all are
>> >> >>   empty, fall back to
>> >> >
>> >> > I don't see that it is useful for the zsmalloc 64K buffer usage case.
>> >> > There will be order 0 and order 4 and nothing else.
>> >> >
>> >> > How about we keep it simple for now? If we identify some workload this
>> >> > algorithm can help, we can do that as a follow-up step.
>> >>
>> >> The simple design isn't flexible enough for your workloads either. For
>> >> example,
>> >>
>> >> - Initially, almost only order-0 pages are swapped out, so most
>> >>   non-full clusters are order-0.
>> >>
>> >> - Later, quite a few order-0 swap entries are freed, so that quite a
>> >>   few order-4 swap entries become available.
>> >>
>> >> - Order-4 pages need to be swapped out, but not enough order-4
>> >>   non-full clusters are available.
>> >>
>> >> So, we need a way to migrate non-full clusters among orders to adjust
>> >> to the situation automatically.
>> >
>> > That depends on how lucky we are in forming order-4 clusters naturally.
>> > The odds of forming an order-4 cluster naturally in the random swap
>> > allocation/free case are very low. I have the numbers in my other email
>> > thread.
>> > Anyway, if we are convinced the payoff justifies the complexity it
>> > introduces, we can do that as a follow-up step. Try to keep things
>> > simple at first, for the review's benefit.
>> >
>> >> >>
>> >> >> order 0.
>> >> >>
>> >> >> Do you think that this works?
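(Aside: the adjustment proposed above could be sketched as follows, reusing
the hypothetical swap_info_sketch / swap_cluster_sk types from the earlier
sketch. range_is_free() is an assumed helper, not a real kernel API, and
here the free list is tried first, as the bullets say.)

#include <linux/align.h>	/* ALIGN_DOWN() */

/* Assumed helper: scans the cluster's swap_map range for free entries. */
static bool range_is_free(struct swap_cluster_sk *ci,
			  unsigned long start, unsigned long nr);

/* On free: if the freed entry completes a fully free, aligned
 * "order+1" range, let the cluster serve order+1 allocations too. */
static void cluster_entry_freed(struct swap_info_sketch *si,
				struct swap_cluster_sk *ci,
				unsigned long offset, unsigned int order)
{
	unsigned long start = ALIGN_DOWN(offset, 1UL << (order + 1));

	if (order + 1 < SWAP_NR_ORDERS_SK &&
	    range_is_free(ci, start, 1UL << (order + 1))) {
		ci->order = order + 1;
		list_move_tail(&ci->list, &si->nonfull[order + 1]);
	}
}

/* On allocation: free list first, then the non-full lists of this order
 * and every higher order; if all are empty, the caller retries order 0. */
static struct swap_cluster_sk *
cluster_alloc_adjusted(struct swap_info_sketch *si, unsigned int order)
{
	unsigned int o;

	if (!list_empty(&si->free_clusters))
		return list_first_entry(&si->free_clusters,
					struct swap_cluster_sk, list);

	for (o = order; o < SWAP_NR_ORDERS_SK; o++)
		if (!list_empty(&si->nonfull[o]))
			return list_first_entry(&si->nonfull[o],
						struct swap_cluster_sk, list);

	return NULL;
}

A cluster promoted step by step this way drifts toward the higher-order
lists and finally the free list, which is what protects the emptier
clusters from being refilled with order-0 entries.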
>> >> >>
>> >> >> > Reported-by: Barry Song <21cnbao@gmail.com>
>> >> >> > Signed-off-by: Chris Li
>> >> >> > ---
>> >> >> > Changes in v3:
>> >> >> > - Using V1 as base.
>> >> >> > - Rename "next" to "list" for the list field, suggested by Ying.
>> >> >> > - Update comment for the locking rules for cluster fields and list,
>> >> >> >   suggested by Ying.
>> >> >> > - Allocate from the nonfull list before attempting the free list,
>> >> >> >   suggested by Kairui.
>> >> >>
>> >> >> I haven't looked into this, but it appears to break the original
>> >> >> discard behavior which helps the performance of some SSDs; please
>> >> >> refer to
>> >> >
>> >> > Can you clarify: by "discard" do you mean the SSD discard command, or
>> >> > just the way the swap allocator recycles free clusters?
>> >>
>> >> The SSD discard command, as in the following URL,
>> >>
>> >> https://en.wikipedia.org/wiki/Trim_(computing)
>> >
>> > Thanks. I know what an SSD discard command is. I want to understand why
>> > that behavior is preferred.
>> >
>> > So the reasoning for preferring a new free block over a recently
>> > partially-freed cluster is to give the previously written cluster a
>> > higher chance of having a discard command issued for it?
>> >
>> > This prefer-new-blocks behavior is actually not friendly to the SSD
>> > from a wear point of view.
>> > Take this example:
>> > Say data is continuously allocated to and freed from swap, and at any
>> > given time the swap usage is 1G. The swap SSD drive is 16G.
>> > Say the allocations and frees hit random 4K page locations. In total,
>> > 64G of swap data gets written to swap, but at any given time only 1G of
>> > data occupies the swapfile.
>> >
>> > a) If you always prefer new free blocks, then the swap data will
>> > eventually be written across the whole 16G drive, and then randomly
>> > over the full 16G. The chance of forming a fully free cluster, so that
>> > a discard command can be issued, is very low: (15/16)**512 = 4.4E-15.
>> > From the SSD's point of view, it does not know that most of the data
>> > written to the 16G drive is no longer in use. When a page is freed in
>> > the swapfile, the SSD doesn't know about it. It sees 4K random writes
>> > across all 16G of the drive, 64G of data written in total.
>> >
>> > b) If you always prefer a non-full cluster over a new cluster, the 64G
>> > of data concentrates the random writes into the first 1G of drive
>> > locations. Still 64G of data written in total.
>> >
>> > I consider b) more friendly to the SSD than a), because it concentrates
>> > the writes into the first 1G of locations. The SSD knows that the data
>> > overwritten within that 1G is internally obsolete, so it can internally
>> > GC the overwritten data without a discard command. Whereas in a), 4K
>> > writes hit the whole drive at random without much discard at all. A
>> > full SSD taking random writes is a bad combination from a wear point of
>> > view.
>> >
>> > Just my 2 cents. Anyway, I reverted V4 to use a free cluster before a
>> > nonfull cluster, just to behave the same as before.
>> >
>> >> >> commit 2a8f94493432 ("swap: change block allocation algorithm for SSD").
>> >> >
>> >> > I did read that change log. Help me understand in more detail which
>> >> > discard behavior you have in mind. A lot of low-end micro SD cards
>> >> > have proper FTL wear leveling now; SSDs are even better at that.
>> >>
>> >> It's not about the FTL; it's about discard/trim for the SSD, as above.
>> >
>> > Thanks for the clarification.
>> >
>> >> >> And as pointed out by Ryan, this may reduce the opportunity for
>> >> >> sequential block device writes during swap-out, which may hurt the
>> >> >> performance of SSDs too.
>> >> >
>> >> > Only in the initial phase. If the swap IO continues, after the first
>> >> > pass fills up the swap file, the writes will be random across the
>> >> > swapfile anyway, because the swapfile only issues a 2M discard
>> >> > command when all 512 4K pages are free. The discarded area will be
>> >> > much smaller than the free area on the swapfile. That, combined with
>> >> > random page writes over the whole swap file, might produce worse
>> >> > internal write amplification for the SSD compared to only writing a
>> >> > subset of the swapfile area. I would love to hear from someone who
>> >> > understands SSD internals to confirm or deny my theory.
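(Aside, a quick check of the (15/16)**512 figure above, under the stated
model: with 1G in use out of 16G, a uniformly random 4K page is occupied
with probability 1/16, so an aligned 2M cluster of 512 pages is entirely
free with probability

    (15/16)^512 = exp(512 * ln(15/16)) ~= exp(-33.0) ~= 4.5e-15,

which agrees with the 4.4E-15 quoted above up to rounding. This assumes
independent, uniform page occupancy, which is the idealization in the
example.)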
>> >>
>> >> It depends on the workload. Some workloads will have more severe
>> >> fragmentation than others. For example, on quite a few machines, the
>> >> swap devices will be kept far from full to avoid possible OOM.
>> >
>> > I suspect most SSD swap on client devices nowadays is only a backup,
>> > just in case something needs to be swapped out.
>> > There is not much SSD swap IO during normal use. zram and zswap are
>> > more actively used, in the data center and Android phone cases, from a
>> > swap IO ops point of view.
>>
>> I use a Linux laptop with 16GB DRAM for work. And I found that the swap
>> space is almost always in use.
>
> Just curious, how many swap OPS per second on average? I suspect it
> will be a very low number.

It depends on the workload. I have run some LLM pruning experiment
algorithms on the machine. The swap IOPS is high for that.

[snip]

--
Best Regards,
Huang, Ying