From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 77853C021AA for ; Wed, 19 Feb 2025 09:27:09 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id DA6FE2801F4; Wed, 19 Feb 2025 04:27:08 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id D57902801E9; Wed, 19 Feb 2025 04:27:08 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BF9AE2801F4; Wed, 19 Feb 2025 04:27:08 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 9DB7A2801E9 for ; Wed, 19 Feb 2025 04:27:08 -0500 (EST) Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 56A5848A0A for ; Wed, 19 Feb 2025 09:27:08 +0000 (UTC) X-FDA: 83136165336.27.7F57462 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf18.hostedemail.com (Postfix) with ESMTP id 7A4D81C0009 for ; Wed, 19 Feb 2025 09:27:06 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=iWMcrBGS; spf=pass (imf18.hostedemail.com: domain of bhe@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=bhe@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1739957226; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=ndnLFTfO36QdDrY2GKsCimhP7BsNniHfdFDqWlGMKsI=; b=Vrq0iRJW62/XsffHee52MEvX+spRB4mAI975tlUWABa/zceN9vXDYxn8dn8CJ+wvWQlFPo Pvj5QDV0m1onfqWrfi7BPflkOOVYna0cv4sFOHk2xoy7HB/CjRG0auitow96Y+DlJKby5P FfRtdhMPM33D5iRlFo8QAYQo3UBRRXs= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=iWMcrBGS; spf=pass (imf18.hostedemail.com: domain of bhe@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=bhe@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1739957226; a=rsa-sha256; cv=none; b=YUg7x953+IsVDyXGlucJfqeay901dJlRI0kh147duIw69vM3AJtsIVrdDseIXhhiYWE1o8 +vKl8KT3dxCVi+jv3sIZtAb1cMZ0ynRNGC6JGgtuZyvYasNQKHHfkkKciVyamtnhkGSQ/D 34YIGWRk30qz5dhvrni7s3iklXiAER8= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1739957225; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ndnLFTfO36QdDrY2GKsCimhP7BsNniHfdFDqWlGMKsI=; b=iWMcrBGSBdQlv0dDuMA8k+mV0UiZEG1UhqhmG7DKUG7mvtuIjgzMyHL+dGCT1/snyf6oP3 N/chXl8EQnaFFZX2FJ2q53+c0Dt9T2TS3STmhL1V/vTw2KpWyV1SSkT7h2J/Spr//4uMvD /77KWr0jiUgo08bHq0qbNwCQ54AIET4= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-590-E15bHH2QOPWMUuQ8IXuv7g-1; Wed, 19 Feb 2025 04:27:04 -0500 X-MC-Unique: E15bHH2QOPWMUuQ8IXuv7g-1 X-Mimecast-MFC-AGG-ID: E15bHH2QOPWMUuQ8IXuv7g_1739957222 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 1A5D0180087F; Wed, 19 Feb 2025 09:27:02 +0000 (UTC) Received: from localhost (unknown [10.72.112.127]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 606751800940; Wed, 19 Feb 2025 09:26:59 +0000 (UTC) Date: Wed, 19 Feb 2025 17:26:55 +0800 From: Baoquan He To: Kairui Song Cc: linux-mm@kvack.org, Andrew Morton , Chris Li , Barry Song , Hugh Dickins , Yosry Ahmed , "Huang, Ying" , Nhat Pham , Johannes Weiner , Kalesh Singh , linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path Message-ID: References: <20250214175709.76029-1-ryncsn@gmail.com> <20250214175709.76029-6-ryncsn@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 X-Rspam-User: X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: 7A4D81C0009 X-Stat-Signature: 9nutp43n4efatjrefxpmb33yiwd1nj7p X-HE-Tag: 1739957226-812273 X-HE-Meta: U2FsdGVkX19O7LWlJRxkE847dYlkQGLrP2NAI8lvtbEeU2cMzz6uvSW7D92uaHqHLq2S7kw/UIxsvJ0xICs7MERgF8jsSvVRlj/gkFvzHUCidYu4CGeB7HXNu3c9u8I+HYBkuovpM7QgxJrSl0vT4xCFdh+01QuDrjaXaZTmzYcgODYLfY460J6WTswAawHFuU3nddFF+OE21CVVLrmuJevW14sKdPJwUMrTZj5+i/R5aqz9jz/YdTY3QZzACAu8XOEmN0wMxX8nkKxAl+Cr5/v7XadHO4MMgiqnvTw7SktaQr9QBfnB9DMi26XftzRBTvxsdt8ArTsgh/fiex8FkNFNtY8JEKay++WUN+J48uFbmqNM1kjaJndabg/vaTMuwMQ/WDWiNfgc4PvkzO25Jn6fmMOJK0ETwvk/9skgpIdQ83gQDHS8XUS5MU2omd9IY+bYUZqbmTkzX3uKW8/ZTualJwfqmTT0FS1vX43HCWftcnpbsqDX5rVm9X8K9EwuBmAoJ6sqD6unwxJW3bJNEvXP28rJ487I/YNwIG0wTSNUWF/WBUU3qePUBHSELd+5R9OqKHjda1kplwA2mWd+S0JFPqCVwutJFnzDZTYNlu4A17S+4zhOwwLIZEnkYvV0LPM0xUYmZ+lKjbZWryXDHOqn51luOKelNZbYUgJxCT/wbGCs6buNVdryrVSaR9rNcCYJsT3+iizSq52mC6l/eVVWtiWTTID4lYL7CgMMmqBPspCk58uEk1EqPlo+yGYUvXnm4vwrFuipVnvpySQLNiDw757VDOL/TrV4bOyWWi0bbqDaxa93gQvIN8ix7JCB6h1Ty3Sww1RCiVWcDo2UEiX0mRgCqubIf6JJR6vsm1Xb8SdReasgGA6W0Hw5r+hWqzn6lGDa7RnLd44eTFW3pcanYB50ULi199ch9e34VZ4C5veUJZGmR1nvOfKezkA32dgGkiO80YGjyAlGiAw mmKWCrr/ PSfxbWXlq0kqhjrFlcVMdZpjK7gwxjrUgXaJiFaiREz/NTEMf0zCxlizRAn+Ez/F/ckVwus0i3d/BeHH2tt76arZYMMcN6IuQq0Wb1nTIBt2XbO6HOIm9SKDhfv6DzVqNShFLP9zhsJ3H1lqC9kiFVK0JKMiAMtLPG1OjHDUF/cL0gU0IoK6tgvR5ECHQ+Hfs5C5c3kOP7kIkqPQiWeeUPAb/UYl006ZT6c+FA7oTLAet0QPg+OPUSkcEL79KlajLYVQSIf2yX4OacO6Mdtp8acczXL9BK1MS6fdCRx4BxiD81vvGAqoPnVEO6Yiaxf4xhNAkoK1zbTjjAcoeg34q4ZDM2r1oCmDnOKHYA5W3yiswAJX6vFSE6ZzCLmObrR0rAL8om9cm59tL23sdZF6eaNo8pR3YnjoqlR2WLRfwaiGUndDfx6Fi+DTmeUABwnyORFM+k8qBdCB5POuh+hnUg4WBKA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 02/19/25 at 04:34pm, Kairui Song wrote: > On Wed, Feb 19, 2025 at 3:54 PM Baoquan He wrote: > > Hi Baoquan, > > Thanks for the review! > > > > > On 02/15/25 at 01:57am, Kairui Song wrote: > > > From: Kairui Song > > > > > > Current allocation workflow first traverses the plist with a global lock > > > held, after choosing a device, it uses the percpu cluster on that swap > > > device. This commit moves the percpu cluster variable out of being tied > > > to individual swap devices, making it a global percpu variable, and will > > > be used directly for allocation as a fast path. > > > > > > The global percpu cluster variable will never point to a HDD device, and > > > allocation on HDD devices is still globally serialized. > > > > > > This improves the allocator performance and prepares for removal of the > > > slot cache in later commits. There shouldn't be much observable behavior > > > change, except one thing: this changes how swap device allocation > > > rotation works. > > > > > > Currently, each allocation will rotate the plist, and because of the > > > existence of slot cache (64 entries), swap devices of the same priority > > > are rotated for every 64 entries consumed. And, high order allocations > > > are different, they will bypass the slot cache, and so swap device is > > > rotated for every 16K, 32K, or up to 2M allocation. > > > > > > The rotation rule was never clearly defined or documented, it was changed > > > several times without mentioning too. > > > > > > After this commit, once slot cache is gone in later commits, swap device > > > rotation will happen for every consumed cluster. Ideally non-HDD devices > > > will be rotated if 2M space has been consumed for each order, this seems > > > > This breaks the rule where the high priority swap device is always taken > > to allocate as long as there's free space in the device. After this patch, > > it will try the percpu cluster firstly which is lower priority even though > > the higher priority device has free space. However, this only happens when > > the higher priority device is exhausted, not a generic case. If this is > > expected, it may need be mentioned in log or doc somewhere at least. > > Hmm, actually this rule was already broken if you are very strict > about it. The current percpu slot cache does a pre-allocation, so the > high priority device will be removed from the plist while some CPU's > slot cache holding usable entries. Ah, right, I didn't think about this point. > > If the high priority device is exhausted, some CPU's percpu cluster > will point to a low priority device indeed, and keep using it until > the percpu cluster is drained. I think this should be > OK. The high priority device is already full, so the amount of > swapouts falls back to low priority device is only a performance > issue, I think it's a tiny change for a rare case. Agree, thanks for explanation. > > > > > > reasonable. HDD devices is rotated for every allocation regardless of the > > > allocation order, which should be OK and trivial. > > > > > > Signed-off-by: Kairui Song > > > --- > > > include/linux/swap.h | 11 ++-- > > > mm/swapfile.c | 120 +++++++++++++++++++++++++++---------------- > > > 2 files changed, 79 insertions(+), 52 deletions(-) > > ...... > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > > index ae3bd0a862fc..791cd7ed5bdf 100644 > > > --- a/mm/swapfile.c > > > +++ b/mm/swapfile.c > > > @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0); > > > > > ......snip.... > > > int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) > > > { > > > int order = swap_entry_order(entry_order); > > > @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) > > > int n_ret = 0; > > > int node; > > > > > > + /* Fast path using percpu cluster */ > > > + local_lock(&percpu_swap_cluster.lock); > > > + n_ret = swap_alloc_fast(swp_entries, > > > + SWAP_HAS_CACHE, > > > + order, n_goal); > > > + if (n_ret == n_goal) > > > + goto out; > > > + > > > + n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH); > > > > Here, the behaviour is changed too. In old allocation, partial > > allocation will jump out to return. In this patch, you try the percpu > > cluster firstly, then call scan_swap_map_slots() to try best and will > > jump out even though partial allocation succeed. But the allocation from > > scan_swap_map_slots() could happen on different si device, this looks > > bizarre. Do you think we need reconsider the design? > > Right, that's a behavior change, but only temporarily affects slot cache. > get_swap_pages will only be called with size > 1 when order == 0, and > only by slot cache. (Large order allocation always use size == 1, > other users only uses order == 0 && size == 1). So I didn't' notice > it, as this series is removing slot cache. Right, I am reviewing patch 6, noticed this is temporary. > > The partial side effect would be "returned slots will be from > different devices" and "slot_cache may get drained faster as > get_swap_pages may return less slots when percpu cluster is drained". > Might be a performance issue but seems slight and trivial, slot cache > can still work. And the next commit will just remove the slot cache, > and the problem will be gone. I think I can add a comment about it > here? Sounds good. Adding comment can avoid other people being confused when checking patch 5. Thanks.