linux-mm.kvack.org archive mirror
* [RFC 0/3]  mm: process_mrelease: expedited reclaim and auto-kill support
@ 2026-04-13 22:39 Minchan Kim
  2026-04-13 22:39 ` [RFC 1/3] mm: process_mrelease: expedite clean file folio reclaim via mmu_gather Minchan Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Minchan Kim @ 2026-04-13 22:39 UTC (permalink / raw)
  To: akpm
  Cc: david, mhocko, brauner, linux-mm, linux-kernel, surenb,
	timmurray, Minchan Kim

This patch series introduces optimizations to expedite memory reclamation
in process_mrelease() and provides a secure, race-free "auto-kill"
mechanism for efficient container shutdown and OOM handling.

Currently, process_mrelease() unmaps pages but leaves clean file folios
on the LRU list, relying on standard memory reclaim to eventually free
them. Furthermore, requiring userspace to send a SIGKILL prior to
invoking process_mrelease() introduces scheduling race conditions where
the victim task may enter the exit path prematurely, bypassing expedited
reclamation hooks.

This series addresses these limitations in three logical steps.

Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
Integrates clean file folio eviction directly into the low-level TLB
batching (mmu_gather) infrastructure. Symmetrically truncates clean file
folios alongside anonymous pages during the unmap loop.

Patch #2: mm: process_mrelease: skip LRU movement for exclusive file folios
Skips costly LRU marking (folio_mark_accessed) for exclusive file-backed
folios undergoing process_mrelease reclaim. Perf profiling reveals that
LRU movement accounts for ~55% of overhead during unmap.

Patch #3: mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
Adds an auto-kill flag supporting atomic teardown. Utilizes a dedicated
signal code (KILL_MRELEASE) to guarantee MMF_UNSTABLE is marked in the
signal delivery path, preventing scheduling races.

Minchan Kim (3):
  mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
  mm: process_mrelease: skip LRU movement for exclusive file folios
  mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag

 arch/s390/include/asm/tlb.h        |  2 +-
 include/linux/swap.h               |  9 ++++++---
 include/uapi/asm-generic/siginfo.h |  6 ++++++
 include/uapi/linux/mman.h          |  4 ++++
 kernel/signal.c                    |  4 ++++
 mm/memory.c                        | 13 ++++++++++++-
 mm/mmu_gather.c                    |  8 +++++---
 mm/oom_kill.c                      | 20 +++++++++++++++++++-
 mm/swap_state.c                    | 19 +++++++++++++++++--
 9 files changed, 74 insertions(+), 11 deletions(-)

-- 
2.54.0.rc0.605.g598a273b03-goog



^ permalink raw reply	[flat|nested] 4+ messages in thread

* [RFC 1/3] mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
  2026-04-13 22:39 [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support Minchan Kim
@ 2026-04-13 22:39 ` Minchan Kim
  2026-04-13 22:39 ` [RFC 2/3] mm: process_mrelease: skip LRU movement for exclusive file folios Minchan Kim
  2026-04-13 22:39 ` [RFC 3/3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
  2 siblings, 0 replies; 4+ messages in thread
From: Minchan Kim @ 2026-04-13 22:39 UTC (permalink / raw)
  To: akpm
  Cc: david, mhocko, brauner, linux-mm, linux-kernel, surenb,
	timmurray, Minchan Kim

Currently, process_mrelease() unmaps the pages but leaves clean file
folios on the LRU list, relying on standard memory reclaim to eventually
free them. This delays the immediate recovery of system memory under OOM
or container shutdown scenarios.

This patch implements an expedited eviction mechanism for clean file
folios by integrating directly into the low-level TLB batching
infrastructure (mmu_gather).

Instead of repeatedly locking and evicting folios one by one inside the
unmap loop (zap_present_folio_ptes), we pass the MMF_UNSTABLE flag
status down to free_pages_and_swap_cache(). Within this single unified
loop, anonymous pages are released via free_swap_cache(), and
file-backed folios are symmetrically truncated via mapping_evict_folio().

This avoids introducing unnecessary data structures, preserves TLB flush
safety, and removes duplicate tree traversals, resulting in a lean and
responsive process_mrelease() implementation.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/s390/include/asm/tlb.h |  2 +-
 include/linux/swap.h        |  9 ++++++---
 mm/mmu_gather.c             |  8 +++++---
 mm/swap_state.c             | 19 +++++++++++++++++--
 4 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 619fd41e710e..554842345ccd 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -62,7 +62,7 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
 	VM_WARN_ON_ONCE(delay_rmap);
 	VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
 
-	free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
+	free_pages_and_caches(encoded_pages, ARRAY_SIZE(encoded_pages), false);
 	return false;
 }
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b408..e7b929b062f8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -433,7 +433,7 @@ static inline unsigned long total_swapcache_pages(void)
 
 void free_swap_cache(struct folio *folio);
 void free_folio_and_swap_cache(struct folio *folio);
-void free_pages_and_swap_cache(struct encoded_page **, int);
+void free_pages_and_caches(struct encoded_page **pages, int nr, bool free_unmapped_file);
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
@@ -510,8 +510,11 @@ static inline void put_swap_device(struct swap_info_struct *si)
 	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
 #define free_folio_and_swap_cache(folio) \
 	folio_put(folio)
-#define free_pages_and_swap_cache(pages, nr) \
-	release_pages((pages), (nr));
+static inline void free_pages_and_caches(struct encoded_page **pages,
+		int nr, bool free_unmapped_file)
+{
+	release_pages(pages, nr);
+}
 
 static inline void free_swap_cache(struct folio *folio)
 {
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index fe5b6a031717..5ce5824db07f 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -100,7 +100,8 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
  */
 #define MAX_NR_FOLIOS_PER_FREE		512
 
-static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
+static void __tlb_batch_free_encoded_pages(struct mm_struct *mm,
+		struct mmu_gather_batch *batch)
 {
 	struct encoded_page **pages = batch->encoded_pages;
 	unsigned int nr, nr_pages;
@@ -135,7 +136,8 @@ static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
 			}
 		}
 
-		free_pages_and_swap_cache(pages, nr);
+		free_pages_and_caches(pages, nr,
+				      mm_flags_test(MMF_UNSTABLE, mm));
 		pages += nr;
 		batch->nr -= nr;
 
@@ -148,7 +150,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 	struct mmu_gather_batch *batch;
 
 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
-		__tlb_batch_free_encoded_pages(batch);
+		__tlb_batch_free_encoded_pages(tlb->mm, batch);
 	tlb->active = &tlb->local;
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6d0eef7470be..e70a52ead6d3 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -400,11 +400,22 @@ void free_folio_and_swap_cache(struct folio *folio)
 		folio_put(folio);
 }
 
+static inline void free_file_cache(struct folio *folio)
+{
+	if (folio_trylock(folio)) {
+		mapping_evict_folio(folio_mapping(folio), folio);
+		folio_unlock(folio);
+	}
+}
+
 /*
  * Passed an array of pages, drop them all from swapcache and then release
  * them.  They are removed from the LRU and freed if this is their last use.
+ *
+ * If @free_unmapped_file is true, this function will proactively evict clean
+ * file-backed folios if they are no longer mapped.
  */
-void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
+void free_pages_and_caches(struct encoded_page **pages, int nr, bool free_unmapped_file)
 {
 	struct folio_batch folios;
 	unsigned int refs[PAGEVEC_SIZE];
@@ -413,7 +424,11 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 	for (int i = 0; i < nr; i++) {
 		struct folio *folio = page_folio(encoded_page_ptr(pages[i]));
 
-		free_swap_cache(folio);
+		if (folio_test_anon(folio))
+			free_swap_cache(folio);
+		else if (unlikely(free_unmapped_file))
+			free_file_cache(folio);
+
 		refs[folios.nr] = 1;
 		if (unlikely(encoded_page_flags(pages[i]) &
 			     ENCODED_PAGE_BIT_NR_PAGES_NEXT))
-- 
2.54.0.rc0.605.g598a273b03-goog




* [RFC 2/3] mm: process_mrelease: skip LRU movement for exclusive file folios
  2026-04-13 22:39 [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support Minchan Kim
  2026-04-13 22:39 ` [RFC 1/3] mm: process_mrelease: expedite clean file folio reclaim via mmu_gather Minchan Kim
@ 2026-04-13 22:39 ` Minchan Kim
  2026-04-13 22:39 ` [RFC 3/3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
  2 siblings, 0 replies; 4+ messages in thread
From: Minchan Kim @ 2026-04-13 22:39 UTC (permalink / raw)
  To: akpm
  Cc: david, mhocko, brauner, linux-mm, linux-kernel, surenb,
	timmurray, Minchan Kim

During process_mrelease reclaim, skip LRU handling for exclusive
file-backed folios: they will be freed shortly, so it is pointless
to move them around the LRU.

This avoids costly LRU movement which accounts for a significant portion
of the time during unmap_page_range.

-   91.31%     0.00%  mmap_exit_test   [kernel.kallsyms]  [.] exit_mm
     exit_mm
     __mmput
     exit_mmap
     unmap_vmas
   - unmap_page_range
      - 55.75% folio_mark_accessed
         + 48.79% __folio_batch_add_and_move
           4.23% workingset_activation
      + 12.94% folio_remove_rmap_ptes
      + 9.86% page_table_check_clear
      + 3.34% tlb_flush_mmu
        1.06% __page_table_check_pte_clear

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/memory.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2f815a34d924..25e17893c919 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1640,6 +1640,8 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
 	bool delay_rmap = false;
 
 	if (!folio_test_anon(folio)) {
+		bool skip_mark_accessed;
+
 		ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
 		if (pte_dirty(ptent)) {
 			folio_mark_dirty(folio);
@@ -1648,7 +1650,16 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
 				*force_flush = true;
 			}
 		}
-		if (pte_young(ptent) && likely(vma_has_recency(vma)))
+
+		/*
+		 * During process_mrelease reclaim, skip LRU handling for
+		 * exclusive file-backed folios: they will be freed shortly,
+		 * so it is pointless to move them around the LRU.
+		 */
+		skip_mark_accessed = mm_flags_test(MMF_UNSTABLE, mm) &&
+				     folio_mapcount(folio) < 2;
+		if (likely(!skip_mark_accessed) && pte_young(ptent) &&
+		    likely(vma_has_recency(vma)))
 			folio_mark_accessed(folio);
 		rss[mm_counter(folio)] -= nr;
 	} else {
-- 
2.54.0.rc0.605.g598a273b03-goog




* [RFC 3/3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-04-13 22:39 [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support Minchan Kim
  2026-04-13 22:39 ` [RFC 1/3] mm: process_mrelease: expedite clean file folio reclaim via mmu_gather Minchan Kim
  2026-04-13 22:39 ` [RFC 2/3] mm: process_mrelease: skip LRU movement for exclusive file folios Minchan Kim
@ 2026-04-13 22:39 ` Minchan Kim
  2 siblings, 0 replies; 4+ messages in thread
From: Minchan Kim @ 2026-04-13 22:39 UTC (permalink / raw)
  To: akpm
  Cc: david, mhocko, brauner, linux-mm, linux-kernel, surenb,
	timmurray, Minchan Kim

Currently, process_mrelease() requires userspace to send a SIGKILL signal
prior to invocation. This separation introduces a race window where the
victim task may receive the signal and enter the exit path before the
reaper can invoke process_mrelease().

In this case, the victim task frees its memory via the standard, unoptimized
exit path, bypassing the expedited clean file folio reclamation introduced
earlier in this series (which relies on the MMF_UNSTABLE flag).

This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
an integrated auto-kill mode. When specified, process_mrelease() directly
injects a SIGKILL into the target task.

Crucially, this patch utilizes a dedicated signal code (KILL_MRELEASE)
during signal injection, belonging to a new SIGKILL si_codes section.
This special code ensures that the kernel's signal delivery path reliably
intercepts the request and marks the target address space as unstable
(MMF_UNSTABLE). This mechanism guarantees that the MMF_UNSTABLE flag is set
before either the victim task or the reaper proceeds, ensuring that the
expedited reclamation optimization is utilized regardless of scheduling
order.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/uapi/asm-generic/siginfo.h |  6 ++++++
 include/uapi/linux/mman.h          |  4 ++++
 kernel/signal.c                    |  4 ++++
 mm/oom_kill.c                      | 20 +++++++++++++++++++-
 4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 5a1ca43b5fc6..0f59b791dab4 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -252,6 +252,12 @@ typedef struct siginfo {
 #define BUS_MCEERR_AO	5
 #define NSIGBUS		5
 
+/*
+ * SIGKILL si_codes
+ */
+#define KILL_MRELEASE	1	/* sent by process_mrelease */
+#define NSIGKILL	1
+
 /*
  * SIGTRAP si_codes
  */
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index e89d00528f2f..4266976b45ad 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -56,4 +56,8 @@ struct cachestat {
 	__u64 nr_recently_evicted;
 };
 
+/* Flags for process_mrelease */
+#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
+#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/kernel/signal.c b/kernel/signal.c
index d65d0fe24bfb..c21b2176dc5e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1134,6 +1134,10 @@ static int __send_signal_locked(int sig, struct kernel_siginfo *info,
 
 out_set:
 	signalfd_notify(t, sig);
+
+	if (sig == SIGKILL && !is_si_special(info) &&
+	    info->si_code == KILL_MRELEASE && t->mm)
+		mm_flags_set(MMF_UNSTABLE, t->mm);
 	sigaddset(&pending->signal, sig);
 
 	/* Let multiprocess signals appear after on-going forks */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..0b5da5208707 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -20,6 +20,8 @@
 
 #include <linux/oom.h>
 #include <linux/mm.h>
+#include <uapi/linux/mman.h>
+#include <linux/capability.h>
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/sched.h>
@@ -1218,13 +1220,29 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	bool reap = false;
 	long ret = 0;
 
-	if (flags)
+	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
 		return -EINVAL;
 
 	task = pidfd_get_task(pidfd, &f_flags);
 	if (IS_ERR(task))
 		return PTR_ERR(task);
 
+	if (flags & PROCESS_MRELEASE_REAP_KILL) {
+		struct kernel_siginfo info;
+
+		if (!capable(CAP_KILL)) {
+			ret = -EPERM;
+			goto put_task;
+		}
+		clear_siginfo(&info);
+		info.si_signo = SIGKILL;
+		info.si_code = KILL_MRELEASE;
+		info.si_pid = task_tgid_vnr(current);
+		info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
+
+		do_send_sig_info(SIGKILL, &info, task, PIDTYPE_TGID);
+	}
+
 	/*
 	 * Make sure to choose a thread which still has a reference to mm
 	 * during the group exit
-- 
2.54.0.rc0.605.g598a273b03-goog



