From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Kairui Song <ryncsn@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Kairui Song <kasong@tencent.com>, Nhat Pham <nphamcs@gmail.com>,
Baoquan He <bhe@redhat.com>, Barry Song <baohua@kernel.org>,
Chris Li <chrisl@kernel.org>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
David Hildenbrand <david@redhat.com>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Ying Huang <ying.huang@linux.alibaba.com>,
YoungJun Park <youngjun.park@lge.com>,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: [PATCH v2 1/5] mm, swap: do not perform synchronous discard during allocation
Date: Fri, 24 Oct 2025 02:34:11 +0800 [thread overview]
Message-ID: <20251024-swap-clean-after-swap-table-p1-v2-1-c5b0e1092927@tencent.com> (raw)
In-Reply-To: <20251024-swap-clean-after-swap-table-p1-v2-0-c5b0e1092927@tencent.com>
From: Kairui Song <kasong@tencent.com>
Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
fast path"), swap allocation is protected by a local lock, which means
we can't do any sleeping calls during allocation.
However, the discard routine is not taken well care of. When the swap
allocator failed to find any usable cluster, it would look at the
pending discard cluster and try to issue some blocking discards. It may
not necessarily sleep, but the cond_resched at the bio layer indicates
this is wrong when combined with a local lock. And the bio GFP flag used
for discard bio is also wrong (not atomic).
It's arguable whether this synchronous discard is helpful at all. In
most cases, the async discard is good enough. And the swap allocator is
doing very differently at organizing the clusters since the recent
change, so it is very rare to see discard clusters piling up.
So far, no issues have been observed or reported with typical SSD setups
under months of high pressure. This issue was found during my code
review. But by hacking the kernel a bit: adding a mdelay(500) in the
async discard path, this issue will be observable with WARNING triggered
by the wrong GFP and cond_resched in the bio layer for debug builds.
So now let's apply a hotfix for this issue: remove the synchronous
discard in the swap allocation path. And when order 0 is failing with
all cluster list drained on all swap devices, try to do a discard
following the swap device priority list. If any discards released some
cluster, try the allocation again. This way, we can still avoid OOM due
to swap failure if the hardware is very slow and memory pressure is
extremely high.
This may cause more fragmentation issues if the discarding hardware is
really slow. Ideally, we want to discard pending clusters before
continuing to iterate the fragment cluster lists. This can be
implemented in a cleaner way if we clean up the device list iteration
part first.
Cc: stable@vger.kernel.org
Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
Acked-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
1 file changed, 33 insertions(+), 7 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cb2392ed8e0e..33e0bd905c55 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
goto done;
}
- /*
- * We don't have free cluster but have some clusters in discarding,
- * do discard now and reclaim them.
- */
- if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
- goto new_cluster;
-
if (order)
goto done;
@@ -1394,6 +1387,33 @@ static bool swap_alloc_slow(swp_entry_t *entry,
return false;
}
+/*
+ * Discard pending clusters in a synchronized way when under high pressure.
+ * Return: true if any cluster is discarded.
+ */
+static bool swap_sync_discard(void)
+{
+ bool ret = false;
+ int nid = numa_node_id();
+ struct swap_info_struct *si, *next;
+
+ spin_lock(&swap_avail_lock);
+ plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
+ spin_unlock(&swap_avail_lock);
+ if (get_swap_device_info(si)) {
+ if (si->flags & SWP_PAGE_DISCARD)
+ ret = swap_do_scheduled_discard(si);
+ put_swap_device(si);
+ }
+ if (ret)
+ return true;
+ spin_lock(&swap_avail_lock);
+ }
+ spin_unlock(&swap_avail_lock);
+
+ return false;
+}
+
/**
* folio_alloc_swap - allocate swap space for a folio
* @folio: folio we want to move to swap
@@ -1432,11 +1452,17 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
}
}
+again:
local_lock(&percpu_swap_cluster.lock);
if (!swap_alloc_fast(&entry, order))
swap_alloc_slow(&entry, order);
local_unlock(&percpu_swap_cluster.lock);
+ if (unlikely(!order && !entry.val)) {
+ if (swap_sync_discard())
+ goto again;
+ }
+
/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
if (mem_cgroup_try_charge_swap(folio, entry))
goto out_free;
--
2.51.0
next parent reply other threads:[~2025-10-23 18:34 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20251024-swap-clean-after-swap-table-p1-v2-0-c5b0e1092927@tencent.com>
2025-10-23 18:34 ` Kairui Song [this message]
2025-10-31 17:45 ` Chris Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251024-swap-clean-after-swap-table-p1-v2-1-c5b0e1092927@tencent.com \
--to=ryncsn@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=bhe@redhat.com \
--cc=chrisl@kernel.org \
--cc=david@redhat.com \
--cc=kasong@tencent.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nphamcs@gmail.com \
--cc=shikemeng@huaweicloud.com \
--cc=stable@vger.kernel.org \
--cc=willy@infradead.org \
--cc=ying.huang@linux.alibaba.com \
--cc=youngjun.park@lge.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox