Date: Wed, 19 Feb 2025 15:53:46 +0800
From: Baoquan He <bhe@redhat.com>
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
References: <20250214175709.76029-1-ryncsn@gmail.com> <20250214175709.76029-6-ryncsn@gmail.com>
In-Reply-To: <20250214175709.76029-6-ryncsn@gmail.com>
On 02/15/25 at 01:57am, Kairui Song wrote:
> From: Kairui Song
>
> The current allocation workflow first traverses the plist with a global
> lock held; after choosing a device, it uses the percpu cluster on that
> swap device.
> This commit moves the percpu cluster variable out of being tied to
> individual swap devices, making it a global percpu variable that is
> used directly as an allocation fast path.
>
> The global percpu cluster variable will never point to an HDD device,
> and allocation on HDD devices is still globally serialized.
>
> This improves allocator performance and prepares for removal of the
> slot cache in later commits. There shouldn't be much observable
> behavior change, except for one thing: this changes how swap device
> rotation works.
>
> Currently, each allocation rotates the plist, and because of the slot
> cache (64 entries), swap devices of the same priority are rotated for
> every 64 entries consumed. High-order allocations are different: they
> bypass the slot cache, so the swap device is rotated for every 16K,
> 32K, or up to 2M allocated.
>
> The rotation rule was never clearly defined or documented, and it has
> been changed several times without being mentioned.
>
> After this commit, once the slot cache is gone in later commits, swap
> device rotation will happen for every consumed cluster. Ideally,
> non-HDD devices will be rotated once 2M of space has been consumed for
> each order, which seems

This breaks the rule that the highest-priority swap device is always
used for allocation as long as it has free space. After this patch,
allocation tries the percpu cluster first, which may point to a
lower-priority device, even though a higher-priority device has free
space. However, this can only happen after the higher-priority device
has been exhausted at some point; it is not the generic case. If this
is expected, it should at least be mentioned in the commit log or
documentation somewhere.

> reasonable. HDD devices are rotated for every allocation regardless of
> the allocation order, which should be OK and trivial.
>
> Signed-off-by: Kairui Song
> ---
>  include/linux/swap.h |  11 ++--
>  mm/swapfile.c        | 120 +++++++++++++++++++++++++++----------------
>  2 files changed, 79 insertions(+), 52 deletions(-)
......
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ae3bd0a862fc..791cd7ed5bdf 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
> ......snip....
>  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  {
>  	int order = swap_entry_order(entry_order);
> @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  	int n_ret = 0;
>  	int node;
>
> +	/* Fast path using percpu cluster */
> +	local_lock(&percpu_swap_cluster.lock);
> +	n_ret = swap_alloc_fast(swp_entries,
> +				SWAP_HAS_CACHE,
> +				order, n_goal);
> +	if (n_ret == n_goal)
> +		goto out;
> +
> +	n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);

Here, the behaviour changes too. In the old allocation path, a partial
allocation jumps out and returns. With this patch, you try the percpu
cluster first, then call scan_swap_map_slots() to try its best, and
jump out even when only a partial allocation succeeds. But the
allocation from scan_swap_map_slots() could happen on a different si
device, which looks bizarre. Do you think we need to reconsider the
design?
> +	/* Rotate the device and switch to a new cluster */
>  	spin_lock(&swap_avail_lock);
>  start_over:
>  	node = numa_node_id();
>  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> -		/* requeue si to after same-priority siblings */
>  		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
>  		spin_unlock(&swap_avail_lock);
>  		if (get_swap_device_info(si)) {
> -			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> -						    n_goal, swp_entries, order);
> +			n_ret += scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal,
> +						     swp_entries + n_ret, order);
>  			put_swap_device(si);
>  			if (n_ret || size > 1)
> -				goto check_out;
> +				goto out;
>  		}
>
>  		spin_lock(&swap_avail_lock);
> @@ -1241,12 +1290,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  		if (plist_node_empty(&next->avail_lists[node]))
>  			goto start_over;
>  	}
> -
>  	spin_unlock(&swap_avail_lock);
> -
> -check_out:
> +out:
> +	local_unlock(&percpu_swap_cluster.lock);
>  	atomic_long_sub(n_ret * size, &nr_swap_pages);
> -
>  	return n_ret;
>  }
> ......snip...