From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Hugh Dickins, Andrew Morton, Kairui Song, Ryan Roberts,
 Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 Barry Song
Subject: Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
In-Reply-To: (Chris Li's message of "Fri, 16 Aug 2024 00:47:37 -0700")
References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
 <87h6bw3gxl.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sevfza3w.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 19 Aug 2024 16:39:10 +0800
Message-ID: <87plq4hpox.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Chris Li writes:

> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying wrote:
>>
>> Chris Li writes:
>>
>> > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying wrote:
>> >>
>> >> Hi, Chris,
>> >>
>> >> Chris Li writes:
>> >>
>> >> > This is the short-term solution "swap cluster order" listed
>> >> > in my "Swap Abstraction" discussion, slide 8, at the recent
>> >> > LSF/MM conference.
>> >> >
>> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> >> > orders" was introduced, it only allocated mTHP swap entries
>> >> > from the new empty cluster list. It has a fragmentation issue
>> >> > reported by Barry.
>> >> >
>> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>> >> >
>> >> > The reason is that all the empty clusters have been exhausted while
>> >> > there are plenty of free swap entries in clusters that are
>> >> > not 100% free.
>> >> >
>> >> > Remember the swap allocation order in the cluster.
>> >> > Keep track of the per-order nonfull cluster lists for later allocation.
>> >> >
>> >> > This series gives the swap SSD allocation a new code path, separate
>> >> > from the HDD allocation. The new allocator uses cluster lists only
>> >> > and no longer does a global scan of swap_map[] without a lock.
>> >>
>> >> This sounds good. Can we use the SSD allocation method for HDD too?
>> >> We may not need a swap entry allocator optimized for HDD.
>> >
>> > Yes, that is the plan as well. That way we can completely get rid of
>> > the old scan_swap_map_slots() code.
>>
>> Good!
>>
>> > However, considering the size of the series, let's focus on the
>> > cluster allocation path first and get it tested and reviewed.
>>
>> OK.
>>
>> > For the HDD optimization, mostly just the new block allocation portion
>> > needs a code path separate from the new cluster allocator, to avoid
>> > the per-CPU allocation. Allocating from the nonfull list doesn't
>> > need to change too much.
>>
>> I suggest not considering HDD optimization at all. Just use the SSD
>> algorithm to simplify.
>
> Adding a global next-allocating CI rather than the per-CPU next CI
> pointer is pretty trivial as well. It is just a different way to fetch
> the next cluster pointer.

By HDD optimization, I mean the original scheme with no struct
swap_cluster_info, etc.

>>
>> >> Hi, Hugh,
>> >>
>> >> What do you think about this?
>> >>
>> >> > This streamlines the swap allocation for SSD. The code matches the
>> >> > execution flow much better.
>> >> >
>> >> > User impact: for users that allocate and free mixed-order mTHP swapping,
>> >> > it greatly improves the success rate of the mTHP swap allocation after
>> >> > the initial phase.
>> >> >
>> >> > It also performs faster when the swapfile is close to full, because the
>> >> > allocator can get a nonfull cluster from a list rather than scanning
>> >> > a lot of swap_map entries.
>> >>
>> >> Do you have some test results to prove this? Or which test below can
>> >> prove this?
>> >
>> > The two zram tests are already proving this. The system time
>> > improvement is about 2% on my low-CPU-count machine.
>> > Kairui has a machine with a higher core count and the difference is
>> > larger there. The theory is that a higher CPU count means more
>> > contention.
>>
>> I will interpret this as the performance being better in theory. But
>> there are almost no measurable results so far.
>
> I am trying to understand why you don't see the performance improvement
> in the zram setup in my cover letter as a measurable result?

IIUC, there's no benchmark score difference, just system time. And the
number is low too.

For Kairui's test, does all the performance improvement come from
"swapfile is close to full"?

>>
>> > The 2% system time number does not sound like much. But consider these
>> > two factors:
>> > 1) The swap allocator only takes a small percentage of the overall workload.
>> > 2) The new allocator does more work.
>> > The old allocator has a time tick budget. It will abort and fail to
>> > find an entry when it runs out of time budget, even though there are
>> > still some free entries in the swapfile.
>>
>> What is the time tick budget you mentioned?
>
> I was under the impression that the previous swap entry allocation
> code will not scan 100% of the swapfile if there is only one entry
> left.
> Please let me know if my understanding is not correct.
>
> /* time to take a break? */
> if (unlikely(--latency_ration < 0)) {
>         if (n_ret)
>                 goto done;
>         spin_unlock(&si->lock);
>         cond_resched();
>         spin_lock(&si->lock);
>         latency_ration = LATENCY_LIMIT;
> }

IIUC, this is to reduce latency via cond_resched(). If n_ret != 0, we
have already allocated some swap entries successfully, so it's OK to
return early to reduce allocation latency.

>
>>
>> > The new allocator can get to the last few free swap entries if they
>> > are available. If not, the new swap allocator will work harder on
>> > swap cache reclaim.
>> >
>> > From the swap cache reclaim aspect, it is very hard to optimize the
>> > swap cache reclaim in the old allocation path because the scan
>> > position is randomized.
>> > The full list and the frag list are both designed to help reduce
>> > repeated reclaim attempts on the swap cache.
>>
>> [snip]

--
Best Regards,
Huang, Ying