Date: Wed, 19 Feb 2025 18:55:35 +0800
From: Baoquan He <bhe@redhat.com>
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
    Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner, Kalesh Singh,
    linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
References: <20250214175709.76029-1-ryncsn@gmail.com> <20250214175709.76029-6-ryncsn@gmail.com>
On 02/19/25 at 04:34pm, Kairui Song wrote:
> On Wed, Feb 19, 2025 at 3:54 PM Baoquan He wrote:
>
> Hi Baoquan,
>
> Thanks for the review!
> >
> > On 02/15/25 at 01:57am, Kairui Song wrote:
> > > From: Kairui Song
> > >
> > > The current allocation workflow first traverses the plist with a
> > > global lock held; after choosing a device, it uses the percpu cluster
> > > on that swap device. This commit moves the percpu cluster variable
> > > out of the individual swap devices, making it a global percpu
> > > variable that is used directly as an allocation fast path.
> > >
> > > The global percpu cluster variable will never point to a HDD device,
> > > and allocation on HDD devices remains globally serialized.
> > >
> > > This improves allocator performance and prepares for the removal of
> > > the slot cache in later commits. There shouldn't be much observable
> > > behavior change, except for one thing: this changes how swap device
> > > rotation works.
> > >
> > > Currently, each allocation rotates the plist, and because of the slot
> > > cache (64 entries), swap devices of the same priority are rotated
> > > once for every 64 entries consumed. High order allocations are
> > > different: they bypass the slot cache, so the swap device is rotated
> > > for every 16K, 32K, or up to 2M allocated.
> > >
> > > The rotation rule was never clearly defined or documented; it has
> > > been changed several times without mention.
> > >
> > > After this commit, once the slot cache is gone in later commits, swap
> > > device rotation will happen for every consumed cluster. Ideally,
> > > non-HDD devices will be rotated once 2M of space has been consumed
> > > for each order; this seems
> >
> > This breaks the rule that the highest priority swap device is always
> > used for allocation as long as it has free space. After this patch,
> > allocation tries the percpu cluster first, which may point to a lower
> > priority device even though a higher priority device has free space.
> > However, this only happens once the higher priority device is
> > exhausted, so it is not the generic case.
> > If this is expected, it may at least need to be mentioned in the log
> > or documentation somewhere.
>
> Hmm, actually this rule was already broken if you are very strict about
> it. The current percpu slot cache does a pre-allocation, so the high
> priority device can be removed from the plist while some CPU's slot
> cache still holds usable entries.
>
> If the high priority device is exhausted, some CPUs' percpu clusters
> will indeed point to a low priority device, and keep using it until the
> percpu cluster is drained. I think this should be OK: the high priority
> device is already full, so the amount of swapout that falls back to the
> low priority device is only a performance issue. I think it's a tiny
> change for a rare case.
>
> > > reasonable. HDD devices are rotated for every allocation regardless
> > > of the allocation order, which should be OK and trivial.
> > >
> > > Signed-off-by: Kairui Song
> > > ---
> > >  include/linux/swap.h |  11 ++--
> > >  mm/swapfile.c        | 120 +++++++++++++++++++++++++++----------------
> > >  2 files changed, 79 insertions(+), 52 deletions(-)
> >
> > ......
> >
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index ae3bd0a862fc..791cd7ed5bdf 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
> >
> > ......snip....
> >
> > > int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > > {
> > >     int order = swap_entry_order(entry_order);
> > > @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > >     int n_ret = 0;
> > >     int node;
> > >
> > > +   /* Fast path using percpu cluster */
> > > +   local_lock(&percpu_swap_cluster.lock);
> > > +   n_ret = swap_alloc_fast(swp_entries,
> > > +                           SWAP_HAS_CACHE,
> > > +                           order, n_goal);
> > > +   if (n_ret == n_goal)
> > > +           goto out;
> > > +
> > > +   n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
> >
> > Here, the behaviour is changed too.
> > In the old code, a partial allocation jumps out and returns. With this
> > patch, you try the percpu cluster first, then call scan_swap_map_slots()
> > to try its best, and jump out even if only a partial allocation
> > succeeds. But the allocation from scan_swap_map_slots() could happen on
> > a different si device, which looks bizarre. Do you think we need to
> > reconsider the design?
>
> Right, that's a behavior change, but it only temporarily affects the
> slot cache. get_swap_pages() will only be called with size > 1 when
> order == 0, and only by the slot cache. (Large order allocations always
> use size == 1; other users only use order == 0 && size == 1.) So I
> didn't notice it, as this series is removing the slot cache.
>
> The partial side effects would be "returned slots may be from different
> devices" and "the slot cache may get drained faster, as
> get_swap_pages() may return fewer slots when the percpu cluster is
> drained". That might be a performance issue, but it seems slight and
> trivial, and the slot cache can still work. The next commit will just
> remove the slot cache, and the problem will be gone. I think I can add
> a comment about it here?

By the way, another thing I suddenly thought of is the percpu cluster
becoming global over all devices. If one order on the stored si of the
percpu cluster is used up, then the whole percpu cluster is drained and
rewritten. Won't this impact performance compared with the old scheme of
embedding a percpu cluster in each si? In the log you said "Ideally
non-HDD devices will be rotated if 2M space has been consumed for each
order, this seems reasonable.", while in reality it may be very hard for
a full 2M of space to be consumed for each order; more often the percpu
cluster will be rewritten when only a small part of an order's space has
been consumed. I wonder what I have missed about this point.