From: Baoquan He <bhe@redhat.com>
To: Kairui Song <kasong@tencent.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>, Barry Song <v-songbaohua@oppo.com>,
Hugh Dickins <hughd@google.com>,
Yosry Ahmed <yosryahmed@google.com>,
"Huang, Ying" <ying.huang@linux.alibaba.com>,
Nhat Pham <nphamcs@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Kalesh Singh <kaleshsingh@google.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
Date: Wed, 19 Feb 2025 15:53:46 +0800 [thread overview]
Message-ID: <Z7WOCvQq3xi9wxnt@MiWiFi-R3L-srv> (raw)
In-Reply-To: <20250214175709.76029-6-ryncsn@gmail.com>
On 02/15/25 at 01:57am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Current allocation workflow first traverses the plist with a global lock
> held, after choosing a device, it uses the percpu cluster on that swap
> device. This commit moves the percpu cluster variable out of being tied
> to individual swap devices, making it a global percpu variable, and will
> be used directly for allocation as a fast path.
>
> The global percpu cluster variable will never point to a HDD device, and
> allocation on HDD devices is still globally serialized.
>
> This improves the allocator performance and prepares for removal of the
> slot cache in later commits. There shouldn't be much observable behavior
> change, except one thing: this changes how swap device allocation
> rotation works.
>
> Currently, each allocation will rotate the plist, and because of the
> existence of slot cache (64 entries), swap devices of the same priority
> are rotated for every 64 entries consumed. And, high order allocations
> are different, they will bypass the slot cache, and so swap device is
> rotated for every 16K, 32K, or up to 2M allocation.
>
> The rotation rule was never clearly defined or documented, it was changed
> several times without mentioning too.
>
> After this commit, once slot cache is gone in later commits, swap device
> rotation will happen for every consumed cluster. Ideally non-HDD devices
> will be rotated if 2M space has been consumed for each order, this seems
This breaks the rule where the high priority swap device is always taken
to allocate as long as there's free space in the device. After this patch,
it will try the percpu cluster firstly which is lower priority even though
the higher priority device has free space. However, this only happens when
the higher priority device is exhausted, not a generic case. If this is
expected, it may need be mentioned in log or doc somewhere at least.
> reasonable. HDD devices is rotated for every allocation regardless of the
> allocation order, which should be OK and trivial.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> include/linux/swap.h | 11 ++--
> mm/swapfile.c | 120 +++++++++++++++++++++++++++----------------
> 2 files changed, 79 insertions(+), 52 deletions(-)
......
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ae3bd0a862fc..791cd7ed5bdf 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
>
......snip....
> int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> {
> int order = swap_entry_order(entry_order);
> @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> int n_ret = 0;
> int node;
>
> + /* Fast path using percpu cluster */
> + local_lock(&percpu_swap_cluster.lock);
> + n_ret = swap_alloc_fast(swp_entries,
> + SWAP_HAS_CACHE,
> + order, n_goal);
> + if (n_ret == n_goal)
> + goto out;
> +
> + n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
Here, the behaviour is changed too. In old allocation, partial
allocation will jump out to return. In this patch, you try the percpu
cluster firstly, then call scan_swap_map_slots() to try best and will
jump out even though partial allocation succeed. But the allocation from
scan_swap_map_slots() could happen on different si device, this looks
bizarre. Do you think we need reconsider the design?
> + /* Rotate the device and switch to a new cluster */
> spin_lock(&swap_avail_lock);
> start_over:
> node = numa_node_id();
> plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> - /* requeue si to after same-priority siblings */
> plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> spin_unlock(&swap_avail_lock);
> if (get_swap_device_info(si)) {
> - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> - n_goal, swp_entries, order);
> + n_ret += scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal,
> + swp_entries + n_ret, order);
> put_swap_device(si);
> if (n_ret || size > 1)
> - goto check_out;
> + goto out;
> }
>
> spin_lock(&swap_avail_lock);
> @@ -1241,12 +1290,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> if (plist_node_empty(&next->avail_lists[node]))
> goto start_over;
> }
> -
> spin_unlock(&swap_avail_lock);
> -
> -check_out:
> +out:
> + local_unlock(&percpu_swap_cluster.lock);
> atomic_long_sub(n_ret * size, &nr_swap_pages);
> -
> return n_ret;
> }
>
......snip...
next prev parent reply other threads:[~2025-02-19 7:54 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-14 17:57 [PATCH 0/7] mm, swap: remove swap slot cache Kairui Song
2025-02-14 17:57 ` [PATCH 1/7] mm, swap: avoid reclaiming irrelevant swap cache Kairui Song
2025-02-19 2:11 ` Baoquan He
2025-02-14 17:57 ` [PATCH 2/7] mm, swap: drop the flag TTRS_DIRECT Kairui Song
2025-02-19 2:42 ` Baoquan He
2025-02-14 17:57 ` [PATCH 3/7] mm, swap: avoid redundant swap device pinning Kairui Song
2025-02-19 3:35 ` Baoquan He
2025-02-14 17:57 ` [PATCH 4/7] mm, swap: don't update the counter up-front Kairui Song
2025-02-14 17:57 ` [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path Kairui Song
2025-02-19 7:53 ` Baoquan He [this message]
2025-02-19 8:34 ` Kairui Song
2025-02-19 9:26 ` Baoquan He
2025-02-19 10:55 ` Baoquan He
2025-02-19 11:12 ` Kairui Song
2025-02-20 2:35 ` Baoquan He
2025-02-20 2:48 ` Kairui Song
2025-02-20 3:24 ` Baoquan He
2025-02-14 17:57 ` [PATCH 6/7] mm, swap: remove swap slot cache Kairui Song
2025-02-15 16:23 ` kernel test robot
2025-02-20 7:55 ` Baoquan He
2025-02-24 3:16 ` Kairui Song
2025-02-14 17:57 ` [PATCH 7/7] mm, swap: simplify folio swap allocation Kairui Song
2025-02-14 20:13 ` Matthew Wilcox
2025-02-15 6:40 ` Kairui Song
2025-02-15 16:43 ` kernel test robot
2025-02-15 16:54 ` kernel test robot
2025-02-20 10:41 ` Baoquan He
2025-02-15 10:27 ` [PATCH 0/7] mm, swap: remove swap slot cache Baoquan He
2025-02-15 13:34 ` Kairui Song
2025-02-15 15:07 ` Baoquan He
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z7WOCvQq3xi9wxnt@MiWiFi-R3L-srv \
--to=bhe@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=chrisl@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=kaleshsingh@google.com \
--cc=kasong@tencent.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nphamcs@gmail.com \
--cc=v-songbaohua@oppo.com \
--cc=ying.huang@linux.alibaba.com \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox