[PATCH v0 0/2] mm: swap: Gather swap entries and batch async release

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
@ 2025-09-09  6:53 Lei Liu
  2025-09-09  6:53 ` [PATCH v0 1/2] mm: swap: Gather swap entries and batch async release core Lei Liu
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Lei Liu @ 2025-09-09  6:53 UTC (permalink / raw)
  To: Michal Hocko, David Rientjes, Shakeel Butt, Andrew Morton,
	Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He, Barry Song,
	Chris Li, Johannes Weiner, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Brendan Jackman, Zi Yan, Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
  Cc: Lei Liu

1. Problem Scenario
On systems with ZRAM and swap enabled, simultaneous process exits create
contention. The primary bottleneck occurs during swap entry release
operations, causing exiting processes to monopolize CPU resources. This
leads to scheduling delays for high-priority processes.

2. Android Use Case
During camera launch, LMKD terminates background processes to free memory.
Exiting processes compete for CPU cycles, delaying the camera preview
thread and causing visible stuttering - directly impacting user
experience.

3. Root Cause Analysis
When background applications heavily utilize swap space, process exit
profiling reveals 55% of time spent in free_swap_and_cache_nr():

Function              Duration (ms)   Percentage
do_signal               791.813     **********100%
do_group_exit           791.813     **********100%
do_exit                 791.813     **********100%
exit_mm                 577.859        *******73%
exit_mmap               577.497        *******73%
zap_pte_range           558.645        *******71%
free_swap_and_cache_nr  433.381          *****55%
free_swap_slot          403.568          *****51%
swap_entry_free         393.863          *****50%
swap_range_free         372.602           ****47%

4. Optimization Approach
a) For processes exceeding swap entry threshold: aggregate and isolate
swap entries to enable fast exit
b) Asynchronously release batched entries when isolation reaches
configured threshold

5. Performance Gains (User Scenario: Camera Cold Launch)
a) 74% reduction in process exit latency (>500ms cases)
b) ~4% lower peak CPU load during concurrent process exits
c) ~70MB additional free memory during camera preview initialization
d) 40% reduction in camera preview stuttering probability

6. Prior Art & Improvements
Reference: Zhiguo Jiang's patch
(https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/)

Key enhancements:
a) Reimplemented logic moved from mmu_gather.c to swapfile.c for clarity
b) Async release delegated to workqueue kworkers with configurable
max_active for NUMA-optimized concurrency

Lei Liu (2):
  mm: swap: Gather swap entries and batch async release core
  mm: swap: Forced swap entries release under memory pressure

 include/linux/oom.h           |  23 ++++++
 include/linux/swapfile.h      |   2 +
 include/linux/vm_event_item.h |   1 +
 kernel/exit.c                 |   2 +
 mm/memcontrol.c               |   6 --
 mm/memory.c                   |   4 +-
 mm/page_alloc.c               |   4 +
 mm/swapfile.c                 | 134 ++++++++++++++++++++++++++++++++++
 mm/vmstat.c                   |   1 +
 9 files changed, 170 insertions(+), 7 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v0 1/2] mm: swap: Gather swap entries and batch async release core
  2025-09-09  6:53 [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release Lei Liu
@ 2025-09-09  6:53 ` Lei Liu
  2025-09-10  1:39   ` kernel test robot
  2025-09-10  3:12   ` kernel test robot
  2025-09-09  6:53 ` [PATCH v0 2/2] mm: swap: Forced swap entries release under memory pressure Lei Liu
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 26+ messages in thread
From: Lei Liu @ 2025-09-09  6:53 UTC (permalink / raw)
  To: Michal Hocko, David Rientjes, Shakeel Butt, Andrew Morton,
	Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He, Barry Song,
	Chris Li, Johannes Weiner, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Chen Yu,
	Peter Zijlstra (Intel),
	Usama Arif, Hao Jia, Kirill A. Shutemov, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
  Cc: Lei Liu

Core functionality implementation for asynchronous release of swap entries:
1. For eligible processes, swap pages are first asynchronously aggregated
to a global list
2. Batch release occurs once a defined threshold is reached
3. Asynchronous release is executed by kworkers of a workqueue, with a
max_active configuration macro provided to control concurrent work item
numbers and address NUMA release efficiency issues

Signed-off-by: Lei Liu <liulei.rjpt@vivo.com>
---
 include/linux/oom.h           |  23 ++++++
 include/linux/swapfile.h      |   1 +
 include/linux/vm_event_item.h |   1 +
 kernel/exit.c                 |   2 +
 mm/memcontrol.c               |   6 --
 mm/memory.c                   |   4 +-
 mm/swapfile.c                 | 134 ++++++++++++++++++++++++++++++++++
 mm/vmstat.c                   |   1 +
 8 files changed, 165 insertions(+), 7 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 1e0fc6931ce9..aa34429cc83b 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -56,6 +56,23 @@ struct oom_control {
 extern struct mutex oom_lock;
 extern struct mutex oom_adj_mutex;
 
+extern atomic_t exiting_task_count;  // exiting task counts
+
+static inline int get_exiting_task_count(void)
+{
+	return atomic_read(&exiting_task_count);
+}
+
+static inline void inc_exiting_task_count(void)
+{
+	atomic_inc(&exiting_task_count);
+}
+
+static inline void dec_exiting_task_count(void)
+{
+	atomic_dec(&exiting_task_count);
+}
+
 static inline void set_current_oom_origin(void)
 {
 	current->signal->oom_flag_origin = true;
@@ -76,6 +93,12 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
 	return tsk->signal->oom_mm;
 }
 
+static inline bool task_is_dying(void)
+{
+	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
+		(current->flags & PF_EXITING);
+}
+
 /*
  * Checks whether a page fault on the given mm is still reliable.
  * This is no longer true if the oom reaper started to reap the
diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index 99e3ed469e88..dc43464cd838 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -4,6 +4,7 @@
 
 extern unsigned long generic_max_swapfile_size(void);
 unsigned long arch_max_swapfile_size(void);
+int add_to_swap_gather_cache(struct mm_struct *mm, swp_entry_t entry, int nr);
 
 /* Maximum swapfile size supported for the arch (not inclusive). */
 extern unsigned long swapfile_maximum_size;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9e15a088ba38..05f33d26d459 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -186,6 +186,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSTACK_REST,
 #endif
 #endif /* CONFIG_DEBUG_STACK_USAGE */
+		ASYNC_SWAP_COUNTS,
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 343eb97543d5..c879fe32aa0e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -897,6 +897,7 @@ void __noreturn do_exit(long code)
 	WARN_ON(irqs_disabled());
 	WARN_ON(tsk->plug);
 
+	inc_exiting_task_count();
 	kcov_task_exit(tsk);
 	kmsan_task_exit(tsk);
 
@@ -1001,6 +1002,7 @@ void __noreturn do_exit(long code)
 	exit_tasks_rcu_finish();
 
 	lockdep_free_task(tsk);
+	dec_exiting_task_count();
 	do_task_dead();
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8dd7fbed5a94..79bc4321cbb3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -102,12 +102,6 @@ static struct kmem_cache *memcg_pn_cachep;
 static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 #endif
 
-static inline bool task_is_dying(void)
-{
-	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
-		(current->flags & PF_EXITING);
-}
-
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..e09db2932b25 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -75,6 +75,7 @@
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/sysctl.h>
+#include <linux/swapfile.h>
 
 #include <trace/events/kmem.h>
 
@@ -1617,7 +1618,8 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 
 		nr = swap_pte_batch(pte, max_nr, ptent);
 		rss[MM_SWAPENTS] -= nr;
-		free_swap_and_cache_nr(entry, nr);
+		if (add_to_swap_gather_cache(tlb->mm, entry, nr))
+			free_swap_and_cache_nr(entry, nr);
 	} else if (is_migration_entry(entry)) {
 		struct folio *folio = pfn_swap_entry_folio(entry);
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b4f3cc712580..7c69e726b075 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,6 +42,10 @@
 #include <linux/suspend.h>
 #include <linux/zswap.h>
 #include <linux/plist.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/vmstat.h>
 
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
@@ -170,6 +174,136 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
 /* Reclaim the swap entry if swap is getting full */
 #define TTRS_FULL		0x4
 
+/* Minimum number of exiting processes, adjustable based on system load */
+#define MIN_EXITING_TASKS_THRESHOLD 1
+/* Number of active work items for asynchronously releasing swap cache.
+ * Defaults to zero and is determined by the system itself, it can also
+ * be configured manually based on system load.
+ */
+#define NUM_ASYNC_SWAP_WORK_ITEMS 0
+
+static struct workqueue_struct *release_wq;
+static LIST_HEAD(swap_cache_list);
+static spinlock_t swap_cache_lock;
+static int cache_count;
+static int max_cache_entries = 32;
+static struct kmem_cache *swap_entry_cachep;
+atomic_t exiting_task_count = ATOMIC_INIT(0);
+
+/* Represents a cache entry for swap operations */
+struct swap_entry_cache {
+	swp_entry_t entry;
+	int nr;
+	struct list_head list;
+};
+
+static int async_swap_free_counts_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "exiting_tasks:%d cache_counts:%d\n",
+		   get_exiting_task_count(), cache_count);
+	return 0;
+}
+
+static void async_release_func(struct work_struct *work)
+{
+	struct swap_entry_cache *sec, *tmp;
+	unsigned int counts = 0;
+	LIST_HEAD(temp_list);
+
+	if (cache_count) {
+		spin_lock_irq(&swap_cache_lock);
+		list_splice_init(&swap_cache_list, &temp_list);
+		cache_count = 0;
+		spin_unlock_irq(&swap_cache_lock);
+	} else {
+		goto out;
+	}
+
+	list_for_each_entry_safe(sec, tmp, &temp_list, list) {
+		free_swap_and_cache_nr(sec->entry, sec->nr);
+		kmem_cache_free(swap_entry_cachep, sec);
+		counts++;
+	}
+	count_vm_events(ASYNC_SWAP_COUNTS, counts);
+out:
+	kfree(work);
+}
+
+static void flush_cache_if_needed(bool check_cache_count)
+{
+	struct work_struct *release_work;
+
+	if ((!check_cache_count && cache_count) ||
+	    cache_count >= max_cache_entries) {
+		release_work = kmalloc(sizeof(*release_work), GFP_ATOMIC);
+		if (release_work) {
+			INIT_WORK(release_work, async_release_func);
+			queue_work(release_wq, release_work);
+		}
+	}
+}
+
+/*
+ * add_to_swap_gather_cache - Add a swap entry to the cache.
+ * @mm: Memory descriptor.
+ * @entry: Swap entry to add.
+ * @nr: Associated number.
+ *
+ * Returns 0 on success, -1 for unmet conditions, -ENOMEM on allocation failure.
+ *
+ * Checks task exiting counts, allocates cache entry, adds it to the swap cache
+ * list, and may trigger a cache flush.
+ */
+int add_to_swap_gather_cache(struct mm_struct *mm, swp_entry_t entry, int nr)
+{
+	struct swap_entry_cache *sec;
+
+	if (!mm || get_exiting_task_count() < MIN_EXITING_TASKS_THRESHOLD)
+		return -1;
+
+	if (!task_is_dying() ||
+	    get_mm_counter(mm, MM_SWAPENTS) < (100 * SWAP_CLUSTER_MAX))
+		return -1;
+
+	sec = kmem_cache_alloc(swap_entry_cachep, GFP_ATOMIC);
+	if (!sec)
+		return -ENOMEM;
+
+	sec->entry = entry;
+	sec->nr = nr;
+	INIT_LIST_HEAD(&sec->list);
+
+	spin_lock_irq(&swap_cache_lock);
+	list_add_tail(&sec->list, &swap_cache_list);
+	cache_count++;
+	spin_unlock_irq(&swap_cache_lock);
+
+	flush_cache_if_needed(true);
+
+	return 0;
+}
+
+static int __init swap_async_free_setup(void)
+{
+	release_wq = alloc_workqueue("async_swap_free",
+				     WQ_UNBOUND | WQ_HIGHPRI | WQ_MEM_RECLAIM,
+				     NUM_ASYNC_SWAP_WORK_ITEMS);
+	if (!release_wq)
+		return -ENOMEM;
+
+	swap_entry_cachep = KMEM_CACHE(swap_entry_cache, SLAB_ACCOUNT);
+	if (!swap_entry_cachep)
+		return -ENOMEM;
+
+	spin_lock_init(&swap_cache_lock);
+	proc_create_single("aswap_free_counts", 0, NULL,
+			   async_swap_free_counts_show);
+
+	return 0;
+}
+
+postcore_initcall(swap_async_free_setup);
+
 static bool swap_only_has_cache(struct swap_info_struct *si,
 			      unsigned long offset, int nr_pages)
 {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 71cd1ceba191..fa7fe910becf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1494,6 +1494,7 @@ const char * const vmstat_text[] = {
 	[I(KSTACK_REST)]			= "kstack_rest",
 #endif
 #endif
+	[I(ASYNC_SWAP_COUNTS)]			= "async_swap_count",
 #undef I
 #endif /* CONFIG_VM_EVENT_COUNTERS */
 };
-- 
2.34.1



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v0 2/2] mm: swap: Forced swap entries release under memory pressure
  2025-09-09  6:53 [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release Lei Liu
  2025-09-09  6:53 ` [PATCH v0 1/2] mm: swap: Gather swap entries and batch async release core Lei Liu
@ 2025-09-09  6:53 ` Lei Liu
  2025-09-10  5:36   ` kernel test robot
  2025-09-09  7:30 ` [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release Kairui Song
  2025-09-09 19:21 ` Shakeel Butt
  3 siblings, 1 reply; 26+ messages in thread
From: Lei Liu @ 2025-09-09  6:53 UTC (permalink / raw)
  To: Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	open list:MEMORY MANAGEMENT - SWAP, open list
  Cc: Lei Liu

When there is memory pressure causing OOM, fully reclaim objects from the
global list that have not reached the threshold.

Signed-off-by: Lei Liu <liulei.rjpt@vivo.com>
---
 include/linux/swapfile.h | 1 +
 mm/page_alloc.c          | 4 ++++
 mm/swapfile.c            | 2 +-
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index dc43464cd838..04c660aae7a0 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -5,6 +5,7 @@
 extern unsigned long generic_max_swapfile_size(void);
 unsigned long arch_max_swapfile_size(void);
 int add_to_swap_gather_cache(struct mm_struct *mm, swp_entry_t entry, int nr);
+void flush_cache_if_needed(bool check_ache_entries);
 
 /* Maximum swapfile size supported for the arch (not inclusive). */
 extern unsigned long swapfile_maximum_size;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d1d037f97c5f..7c5990c24df7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -55,6 +55,7 @@
 #include <linux/delayacct.h>
 #include <linux/cacheinfo.h>
 #include <linux/pgalloc_tag.h>
+#include <linux/swapfile.h>
 #include <asm/div64.h>
 #include "internal.h"
 #include "shuffle.h"
@@ -3967,6 +3968,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 
 	*did_some_progress = 0;
 
+	/* flash async swap cache pool */
+	flush_cache_if_needed(false);
+
 	/*
 	 * Acquire the oom lock.  If that fails, somebody else is
 	 * making progress for us.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7c69e726b075..26640ec34fc6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -229,7 +229,7 @@ static void async_release_func(struct work_struct *work)
 	kfree(work);
 }
 
-static void flush_cache_if_needed(bool check_cache_count)
+void flush_cache_if_needed(bool check_cache_count)
 {
 	struct work_struct *release_work;
 
-- 
2.34.1



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09  6:53 [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release Lei Liu
  2025-09-09  6:53 ` [PATCH v0 1/2] mm: swap: Gather swap entries and batch async release core Lei Liu
  2025-09-09  6:53 ` [PATCH v0 2/2] mm: swap: Forced swap entries release under memory pressure Lei Liu
@ 2025-09-09  7:30 ` Kairui Song
  2025-09-09  9:24   ` Barry Song
                     ` (2 more replies)
  2025-09-09 19:21 ` Shakeel Butt
  3 siblings, 3 replies; 26+ messages in thread
From: Kairui Song @ 2025-09-09  7:30 UTC (permalink / raw)
  To: Lei Liu
  Cc: Michal Hocko, David Rientjes, Shakeel Butt, Andrew Morton,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Chris Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Tue, Sep 9, 2025 at 3:04 PM Lei Liu <liulei.rjpt@vivo.com> wrote:
>

Hi Lei,

> 1. Problem Scenario
> On systems with ZRAM and swap enabled, simultaneous process exits create
> contention. The primary bottleneck occurs during swap entry release
> operations, causing exiting processes to monopolize CPU resources. This
> leads to scheduling delays for high-priority processes.
>
> 2. Android Use Case
> During camera launch, LMKD terminates background processes to free memory.
> Exiting processes compete for CPU cycles, delaying the camera preview
> thread and causing visible stuttering - directly impacting user
> experience.
>
> 3. Root Cause Analysis
> When background applications heavily utilize swap space, process exit
> profiling reveals 55% of time spent in free_swap_and_cache_nr():
>
> Function              Duration (ms)   Percentage
> do_signal               791.813     **********100%
> do_group_exit           791.813     **********100%
> do_exit                 791.813     **********100%
> exit_mm                 577.859        *******73%
> exit_mmap               577.497        *******73%
> zap_pte_range           558.645        *******71%
> free_swap_and_cache_nr  433.381          *****55%
> free_swap_slot          403.568          *****51%

Thanks for sharing this case.

One problem is that now the free_swap_slot function no longer exists
after 0ff67f990bd4. Have you tested the latest kernel? Or what is the
actual overhead here?

Some batch freeing optimizations are introduced. And we have reworked
the whole locking mechanism for swap, so even on a system with 96t the
contention seems barely observable with common workloads.

And another series is further reducing the contention and the overall
overhead (24% faster freeing for phase 1):
https://lore.kernel.org/linux-mm/20250905191357.78298-1-ryncsn@gmail.com/

Will these be helpful for you? I think optimizing the root problem is
better than just deferring the overhead with async workers, which may
increase the overall overhead and complexity.


> swap_entry_free         393.863          *****50%
> swap_range_free         372.602           ****47%
>
> 4. Optimization Approach
> a) For processes exceeding swap entry threshold: aggregate and isolate
> swap entries to enable fast exit
> b) Asynchronously release batched entries when isolation reaches
> configured threshold
>
> 5. Performance Gains (User Scenario: Camera Cold Launch)
> a) 74% reduction in process exit latency (>500ms cases)
> b) ~4% lower peak CPU load during concurrent process exits
> c) ~70MB additional free memory during camera preview initialization
> d) 40% reduction in camera preview stuttering probability
>
> 6. Prior Art & Improvements
> Reference: Zhiguo Jiang's patch
> (https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/)
>
> Key enhancements:
> a) Reimplemented logic moved from mmu_gather.c to swapfile.c for clarity
> b) Async release delegated to workqueue kworkers with configurable
> max_active for NUMA-optimized concurrency
>
> Lei Liu (2):
>   mm: swap: Gather swap entries and batch async release core
>   mm: swap: Forced swap entries release under memory pressure
>
>  include/linux/oom.h           |  23 ++++++
>  include/linux/swapfile.h      |   2 +
>  include/linux/vm_event_item.h |   1 +
>  kernel/exit.c                 |   2 +
>  mm/memcontrol.c               |   6 --
>  mm/memory.c                   |   4 +-
>  mm/page_alloc.c               |   4 +
>  mm/swapfile.c                 | 134 ++++++++++++++++++++++++++++++++++
>  mm/vmstat.c                   |   1 +
>  9 files changed, 170 insertions(+), 7 deletions(-)
>
> --
> 2.34.1
>
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09  7:30 ` [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release Kairui Song
@ 2025-09-09  9:24   ` Barry Song
  2025-09-09 16:15     ` Chris Li
  2025-09-10 14:07     ` Lei Liu
  2025-09-09 15:38   ` Chris Li
  2025-09-10 14:01   ` Lei Liu
  2 siblings, 2 replies; 26+ messages in thread
From: Barry Song @ 2025-09-09  9:24 UTC (permalink / raw)
  To: Kairui Song
  Cc: Lei Liu, Michal Hocko, David Rientjes, Shakeel Butt,
	Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Chris Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Tue, Sep 9, 2025 at 3:30 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 3:04 PM Lei Liu <liulei.rjpt@vivo.com> wrote:
> >
>
> Hi Lei,
>
> > 1. Problem Scenario
> > On systems with ZRAM and swap enabled, simultaneous process exits create
> > contention. The primary bottleneck occurs during swap entry release
> > operations, causing exiting processes to monopolize CPU resources. This
> > leads to scheduling delays for high-priority processes.
> >
> > 2. Android Use Case
> > During camera launch, LMKD terminates background processes to free memory.
> > Exiting processes compete for CPU cycles, delaying the camera preview
> > thread and causing visible stuttering - directly impacting user
> > experience.
> >
> > 3. Root Cause Analysis
> > When background applications heavily utilize swap space, process exit
> > profiling reveals 55% of time spent in free_swap_and_cache_nr():
> >
> > Function              Duration (ms)   Percentage
> > do_signal               791.813     **********100%
> > do_group_exit           791.813     **********100%
> > do_exit                 791.813     **********100%
> > exit_mm                 577.859        *******73%
> > exit_mmap               577.497        *******73%
> > zap_pte_range           558.645        *******71%
> > free_swap_and_cache_nr  433.381          *****55%
> > free_swap_slot          403.568          *****51%
>
> Thanks for sharing this case.
>
> One problem is that now the free_swap_slot function no longer exists
> after 0ff67f990bd4. Have you tested the latest kernel? Or what is the
> actual overhead here?
>
> Some batch freeing optimizations are introduced. And we have reworked
> the whole locking mechanism for swap, so even on a system with 96t the
> contention seems barely observable with common workloads.
>
> And another series is further reducing the contention and the overall
> overhead (24% faster freeing for phase 1):
> https://lore.kernel.org/linux-mm/20250905191357.78298-1-ryncsn@gmail.com/
>
> Will these be helpful for you? I think optimizing the root problem is
> better than just deferring the overhead with async workers, which may
> increase the overall overhead and complexity.
>

I feel the cover letter does not clearly describe where the bottleneck
occurs or where the performance gains originate. To be honest, even
the versions submitted last year did not present the bottleneck clearly.

For example, is this due to lock contention (in which case we would
need performance data to see how much CPU time is spent waiting for
locks), or simply because we can simultaneously zap present and
non-present PTEs?

Thanks
Barry


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09  7:30 ` [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release Kairui Song
  2025-09-09  9:24   ` Barry Song
@ 2025-09-09 15:38   ` Chris Li
  2025-09-10 14:01   ` Lei Liu
  2 siblings, 0 replies; 26+ messages in thread
From: Chris Li @ 2025-09-09 15:38 UTC (permalink / raw)
  To: Kairui Song
  Cc: Lei Liu, Michal Hocko, David Rientjes, Shakeel Butt,
	Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Tue, Sep 9, 2025 at 12:31 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 3:04 PM Lei Liu <liulei.rjpt@vivo.com> wrote:
> >
>
> Hi Lei,
>
> > 1. Problem Scenario
> > On systems with ZRAM and swap enabled, simultaneous process exits create
> > contention. The primary bottleneck occurs during swap entry release
> > operations, causing exiting processes to monopolize CPU resources. This
> > leads to scheduling delays for high-priority processes.
> >
> > 2. Android Use Case
> > During camera launch, LMKD terminates background processes to free memory.
> > Exiting processes compete for CPU cycles, delaying the camera preview
> > thread and causing visible stuttering - directly impacting user
> > experience.
> >
> > 3. Root Cause Analysis
> > When background applications heavily utilize swap space, process exit
> > profiling reveals 55% of time spent in free_swap_and_cache_nr():
> >
> > Function              Duration (ms)   Percentage
> > do_signal               791.813     **********100%
> > do_group_exit           791.813     **********100%
> > do_exit                 791.813     **********100%
> > exit_mm                 577.859        *******73%
> > exit_mmap               577.497        *******73%
> > zap_pte_range           558.645        *******71%
> > free_swap_and_cache_nr  433.381          *****55%
> > free_swap_slot          403.568          *****51%
>
> Thanks for sharing this case.
>
> One problem is that now the free_swap_slot function no longer exists
> after 0ff67f990bd4. Have you tested the latest kernel? Or what is the
> actual overhead here?
>
> Some batch freeing optimizations are introduced. And we have reworked
> the whole locking mechanism for swap, so even on a system with 96t the
> contention seems barely observable with common workloads.
>
> And another series is further reducing the contention and the overall
> overhead (24% faster freeing for phase 1):
> https://lore.kernel.org/linux-mm/20250905191357.78298-1-ryncsn@gmail.com/
>
> Will these be helpful for you? I think optimizing the root problem is
> better than just deferring the overhead with async workers, which may
> increase the overall overhead and complexity.

+100.

Hi Lei,

This CC list is very long :-)

Is it similar to this one a while back?

https://lore.kernel.org/linux-mm/20240213-async-free-v3-1-b89c3cc48384@kernel.org/

I ultimately abandoned this approach and considered it harmful. Yes, I
can be as harsh as I like for my own previous bad ideas. The better
solution is as Kairui did, just remove the swap slot caching
completely. It is the harder path to take and get better results. I
recall having a discussion with Kairui on this and we are aligned on
removing the swap slot caching eventually . Thanks Kairui for the
heavy lifting of actually removing the swap slot cache. I am just
cheerleading on the side :-)

So no, we are not getting the async free of swap slot caching again.
We shouldn't need to.

Chris




>
>
> > swap_entry_free         393.863          *****50%
> > swap_range_free         372.602           ****47%
> >
> > 4. Optimization Approach
> > a) For processes exceeding swap entry threshold: aggregate and isolate
> > swap entries to enable fast exit
> > b) Asynchronously release batched entries when isolation reaches
> > configured threshold
> >
> > 5. Performance Gains (User Scenario: Camera Cold Launch)
> > a) 74% reduction in process exit latency (>500ms cases)
> > b) ~4% lower peak CPU load during concurrent process exits
> > c) ~70MB additional free memory during camera preview initialization
> > d) 40% reduction in camera preview stuttering probability
> >
> > 6. Prior Art & Improvements
> > Reference: Zhiguo Jiang's patch
> > (https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/)
> >
> > Key enhancements:
> > a) Reimplemented logic moved from mmu_gather.c to swapfile.c for clarity
> > b) Async release delegated to workqueue kworkers with configurable
> > max_active for NUMA-optimized concurrency
> >
> > Lei Liu (2):
> >   mm: swap: Gather swap entries and batch async release core
> >   mm: swap: Forced swap entries release under memory pressure
> >
> >  include/linux/oom.h           |  23 ++++++
> >  include/linux/swapfile.h      |   2 +
> >  include/linux/vm_event_item.h |   1 +
> >  kernel/exit.c                 |   2 +
> >  mm/memcontrol.c               |   6 --
> >  mm/memory.c                   |   4 +-
> >  mm/page_alloc.c               |   4 +
> >  mm/swapfile.c                 | 134 ++++++++++++++++++++++++++++++++++
> >  mm/vmstat.c                   |   1 +
> >  9 files changed, 170 insertions(+), 7 deletions(-)
> >
> > --
> > 2.34.1
> >
> >
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09  9:24   ` Barry Song
@ 2025-09-09 16:15     ` Chris Li
  2025-09-09 18:01       ` Chris Li
  2025-09-10 14:07     ` Lei Liu
  1 sibling, 1 reply; 26+ messages in thread
From: Chris Li @ 2025-09-09 16:15 UTC (permalink / raw)
  To: Barry Song
  Cc: Kairui Song, Lei Liu, Michal Hocko, David Rientjes, Shakeel Butt,
	Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Tue, Sep 9, 2025 at 2:24 AM Barry Song <21cnbao@gmail.com> wrote:
> I feel the cover letter does not clearly describe where the bottleneck
> occurs or where the performance gains originate. To be honest, even
> the versions submitted last year did not present the bottleneck clearly.
>
> For example, is this due to lock contention (in which case we would
> need performance data to see how much CPU time is spent waiting for
> locks), or simply because we can simultaneously zap present and
> non-present PTEs?

I have done some long tail analysis of the zswap page fault a while
back, before zswap converting to xarray. For the zswap page fault, in
the long tail a good chunk is a bath free swap slot. The breakdown
inside  shows a huge chunk is the clear_shadow() followed by
memsw_uncharge(). I will post the link to the breakdown image once it
is available.

Chris


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09 16:15     ` Chris Li
@ 2025-09-09 18:01       ` Chris Li
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Li @ 2025-09-09 18:01 UTC (permalink / raw)
  To: Barry Song
  Cc: Kairui Song, Lei Liu, Michal Hocko, David Rientjes, Shakeel Butt,
	Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Tue, Sep 9, 2025 at 9:15 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Sep 9, 2025 at 2:24 AM Barry Song <21cnbao@gmail.com> wrote:
> > I feel the cover letter does not clearly describe where the bottleneck
> > occurs or where the performance gains originate. To be honest, even
> > the versions submitted last year did not present the bottleneck clearly.
> >
> > For example, is this due to lock contention (in which case we would
> > need performance data to see how much CPU time is spent waiting for
> > locks), or simply because we can simultaneously zap present and
> > non-present PTEs?
>
> I have done some long tail analysis of the zswap page fault a while
> back, before zswap converting to xarray. For the zswap page fault, in
> the long tail a good chunk is a bath free swap slot. The breakdown
> inside  shows a huge chunk is the clear_shadow() followed by
> memsw_uncharge(). I will post the link to the breakdown image once it
> is available.

Here is a graph, high level breakdown shows the batch free swap slot
contribute to the long tail:
https://services.google.com/fh/files/misc/zswap-breakdown.png

The detail breakdown inside bath free swap slots:
https://services.google.com/fh/files/misc/zswap-breakdown-detail.png

Those data are on pretty old data, before zswap uses the xarray.

Now the batch freeing the swap entries is gone. I am wondering if the
new kernel shows any bottleneck for Lei's zram test case.

Hi Lei, Please report back on your new findings. In this case, with
removal of swap slot cache, the performance profile will likely be
very different. Let me know if you have difficulties running the
latest kernel on your test bench.

Chris


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09  6:53 [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release Lei Liu
                   ` (2 preceding siblings ...)
  2025-09-09  7:30 ` [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release Kairui Song
@ 2025-09-09 19:21 ` Shakeel Butt
  2025-09-09 19:48   ` Suren Baghdasaryan
  3 siblings, 1 reply; 26+ messages in thread
From: Shakeel Butt @ 2025-09-09 19:21 UTC (permalink / raw)
  To: Lei Liu
  Cc: Michal Hocko, David Rientjes, Andrew Morton, Kemeng Shi,
	Kairui Song, Nhat Pham, Baoquan He, Barry Song, Chris Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> 1. Problem Scenario
> On systems with ZRAM and swap enabled, simultaneous process exits create
> contention. The primary bottleneck occurs during swap entry release
> operations, causing exiting processes to monopolize CPU resources. This
> leads to scheduling delays for high-priority processes.
> 
> 2. Android Use Case
> During camera launch, LMKD terminates background processes to free memory.

How does LMKD trigger the kills? SIGKILL or cgroup.kill?

> Exiting processes compete for CPU cycles, delaying the camera preview
> thread and causing visible stuttering - directly impacting user
> experience.

Since the exit/kill is due to low memory situation, punting the memory
freeing to a low priority async mechanism will help in improving user
experience. Most probably the application (camera preview here) will get
into global reclaim and will compete for CPU with the async memory
freeing.

What we really need is faster memory freeing and we should explore all
possible ways. As others suggested fix/improve the bottleneck in the
memory freeing path. In addition I think we should explore parallelizing
this as well.

On Android, I suppose most of the memory is associated with single or
small set of processes and parallelizing memory freeing would be
challenging. BTW is LMKD using process_mrelease() to release the killed
process memory? 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09 19:21 ` Shakeel Butt
@ 2025-09-09 19:48   ` Suren Baghdasaryan
  2025-09-10 14:14     ` Lei Liu
                       ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Suren Baghdasaryan @ 2025-09-09 19:48 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Lei Liu, Michal Hocko, David Rientjes, Andrew Morton, Kemeng Shi,
	Kairui Song, Nhat Pham, Baoquan He, Barry Song, Chris Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Brendan Jackman, Zi Yan, Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > 1. Problem Scenario
> > On systems with ZRAM and swap enabled, simultaneous process exits create
> > contention. The primary bottleneck occurs during swap entry release
> > operations, causing exiting processes to monopolize CPU resources. This
> > leads to scheduling delays for high-priority processes.
> >
> > 2. Android Use Case
> > During camera launch, LMKD terminates background processes to free memory.
>
> How does LMKD trigger the kills? SIGKILL or cgroup.kill?

SIGKILL

>
> > Exiting processes compete for CPU cycles, delaying the camera preview
> > thread and causing visible stuttering - directly impacting user
> > experience.
>
> Since the exit/kill is due to low memory situation, punting the memory
> freeing to a low priority async mechanism will help in improving user
> experience. Most probably the application (camera preview here) will get
> into global reclaim and will compete for CPU with the async memory
> freeing.
>
> What we really need is faster memory freeing and we should explore all
> possible ways. As others suggested fix/improve the bottleneck in the
> memory freeing path. In addition I think we should explore parallelizing
> this as well.
>
> On Android, I suppose most of the memory is associated with single or
> small set of processes and parallelizing memory freeing would be
> challenging. BTW is LMKD using process_mrelease() to release the killed
> process memory?

Yes, LMKD has a reaper thread which wakes up and calls
process_mrelease() after the main LMKD thread issued SIGKILL.

>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 1/2] mm: swap: Gather swap entries and batch async release core
  2025-09-09  6:53 ` [PATCH v0 1/2] mm: swap: Gather swap entries and batch async release core Lei Liu
@ 2025-09-10  1:39   ` kernel test robot
  2025-09-10  3:12   ` kernel test robot
  1 sibling, 0 replies; 26+ messages in thread
From: kernel test robot @ 2025-09-10  1:39 UTC (permalink / raw)
  To: Lei Liu, Michal Hocko, David Rientjes, Shakeel Butt,
	Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Johannes Weiner, Roman Gushchin,
	Muchun Song, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Chen Yu,
	Usama Arif, Hao Jia, Kirill A. Shutemov, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko
  Cc: oe-kbuild-all, Linux Memory Management List

Hi Lei,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Lei-Liu/mm-swap-Gather-swap-entries-and-batch-async-release-core/20250909-145620
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250909065349.574894-2-liulei.rjpt%40vivo.com
patch subject: [PATCH v0 1/2] mm: swap: Gather swap entries and batch async release core
config: x86_64-buildonly-randconfig-003-20250910 (https://download.01.org/0day-ci/archive/20250910/202509100935.w5zKofdt-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250910/202509100935.w5zKofdt-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509100935.w5zKofdt-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/swapfile.c:206:12: warning: 'async_swap_free_counts_show' defined but not used [-Wunused-function]
     206 | static int async_swap_free_counts_show(struct seq_file *m, void *v)
         |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/async_swap_free_counts_show +206 mm/swapfile.c

   205	
 > 206	static int async_swap_free_counts_show(struct seq_file *m, void *v)
   207	{
   208		seq_printf(m, "exiting_tasks:%d cache_counts:%d\n",
   209			   get_exiting_task_count(), cache_count);
   210		return 0;
   211	}
   212	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 1/2] mm: swap: Gather swap entries and batch async release core
  2025-09-09  6:53 ` [PATCH v0 1/2] mm: swap: Gather swap entries and batch async release core Lei Liu
  2025-09-10  1:39   ` kernel test robot
@ 2025-09-10  3:12   ` kernel test robot
  1 sibling, 0 replies; 26+ messages in thread
From: kernel test robot @ 2025-09-10  3:12 UTC (permalink / raw)
  To: Lei Liu, Michal Hocko, David Rientjes, Shakeel Butt,
	Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Johannes Weiner, Roman Gushchin,
	Muchun Song, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Chen Yu,
	Peter Zijlstra (Intel),
	Usama Arif, Hao Jia, Kirill A. Shutemov, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko
  Cc: oe-kbuild-all, Linux Memory Management List

Hi Lei,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Lei-Liu/mm-swap-Gather-swap-entries-and-batch-async-release-core/20250909-145620
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250909065349.574894-2-liulei.rjpt%40vivo.com
patch subject: [PATCH v0 1/2] mm: swap: Gather swap entries and batch async release core
config: sparc-randconfig-002-20250910 (https://download.01.org/0day-ci/archive/20250910/202509101043.MFfNJgKH-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250910/202509101043.MFfNJgKH-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509101043.MFfNJgKH-lkp@intel.com/

All errors (new ones prefixed by >>):

   sparc64-linux-ld: kernel/exit.o: in function `do_exit':
>> exit.c:(.text+0x19f0): undefined reference to `exiting_task_count'
>> sparc64-linux-ld: exit.c:(.text+0x1a14): undefined reference to `exiting_task_count'
   sparc64-linux-ld: exit.c:(.text+0x2124): undefined reference to `exiting_task_count'
   sparc64-linux-ld: mm/memory.o: in function `unmap_page_range':
>> memory.c:(.text+0x46a4): undefined reference to `add_to_swap_gather_cache'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 2/2] mm: swap: Forced swap entries release under memory pressure
  2025-09-09  6:53 ` [PATCH v0 2/2] mm: swap: Forced swap entries release under memory pressure Lei Liu
@ 2025-09-10  5:36   ` kernel test robot
  0 siblings, 0 replies; 26+ messages in thread
From: kernel test robot @ 2025-09-10  5:36 UTC (permalink / raw)
  To: Lei Liu, Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham,
	Baoquan He, Barry Song, Chris Li, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, open list
  Cc: oe-kbuild-all, Linux Memory Management List, Lei Liu

Hi Lei,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Lei-Liu/mm-swap-Gather-swap-entries-and-batch-async-release-core/20250909-145620
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250909065349.574894-3-liulei.rjpt%40vivo.com
patch subject: [PATCH v0 2/2] mm: swap: Forced swap entries release under memory pressure
config: sparc-randconfig-002-20250910 (https://download.01.org/0day-ci/archive/20250910/202509101302.pptHT8X8-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250910/202509101302.pptHT8X8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509101302.pptHT8X8-lkp@intel.com/

All errors (new ones prefixed by >>):

   sparc64-linux-ld: kernel/exit.o: in function `do_exit':
   exit.c:(.text+0x19f0): undefined reference to `exiting_task_count'
   sparc64-linux-ld: exit.c:(.text+0x1a14): undefined reference to `exiting_task_count'
   sparc64-linux-ld: exit.c:(.text+0x2124): undefined reference to `exiting_task_count'
   sparc64-linux-ld: mm/memory.o: in function `unmap_page_range':
   memory.c:(.text+0x46a4): undefined reference to `add_to_swap_gather_cache'
   sparc64-linux-ld: mm/page_alloc.o: in function `__alloc_pages_slowpath.constprop.124':
>> page_alloc.c:(.text+0xb44c): undefined reference to `flush_cache_if_needed'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09  7:30 ` [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release Kairui Song
  2025-09-09  9:24   ` Barry Song
  2025-09-09 15:38   ` Chris Li
@ 2025-09-10 14:01   ` Lei Liu
  2 siblings, 0 replies; 26+ messages in thread
From: Lei Liu @ 2025-09-10 14:01 UTC (permalink / raw)
  To: Kairui Song
  Cc: Michal Hocko, David Rientjes, Shakeel Butt, Andrew Morton,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Chris Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)


On 2025/9/9 15:30, Kairui Song wrote:
> [You don't often get email from ryncsn@gmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> On Tue, Sep 9, 2025 at 3:04 PM Lei Liu <liulei.rjpt@vivo.com> wrote:
> Hi Lei,
>
>> 1. Problem Scenario
>> On systems with ZRAM and swap enabled, simultaneous process exits create
>> contention. The primary bottleneck occurs during swap entry release
>> operations, causing exiting processes to monopolize CPU resources. This
>> leads to scheduling delays for high-priority processes.
>>
>> 2. Android Use Case
>> During camera launch, LMKD terminates background processes to free memory.
>> Exiting processes compete for CPU cycles, delaying the camera preview
>> thread and causing visible stuttering - directly impacting user
>> experience.
>>
>> 3. Root Cause Analysis
>> When background applications heavily utilize swap space, process exit
>> profiling reveals 55% of time spent in free_swap_and_cache_nr():
>>
>> Function              Duration (ms)   Percentage
>> do_signal               791.813     **********100%
>> do_group_exit           791.813     **********100%
>> do_exit                 791.813     **********100%
>> exit_mm                 577.859        *******73%
>> exit_mmap               577.497        *******73%
>> zap_pte_range           558.645        *******71%
>> free_swap_and_cache_nr  433.381          *****55%
>> free_swap_slot          403.568          *****51%
> Thanks for sharing this case.
>
> One problem is that now the free_swap_slot function no longer exists
> after 0ff67f990bd4. Have you tested the latest kernel? Or what is the
> actual overhead here?
>
> Some batch freeing optimizations are introduced. And we have reworked
> the whole locking mechanism for swap, so even on a system with 96t the
> contention seems barely observable with common workloads.
>
> And another series is further reducing the contention and the overall
> overhead (24% faster freeing for phase 1):
> https://lore.kernel.org/linux-mm/20250905191357.78298-1-ryncsn@gmail.com/
>
> Will these be helpful for you? I think optimizing the root problem is
> better than just deferring the overhead with async workers, which may
> increase the overall overhead and complexity.

Hi Kairui

Thank you for your optimization suggestions. We believe your patch may 
help ou
r scenario. We'll try integrating it to evaluate benefits. However, it 
may not
fully solve our issue. Below is our problem description:

Flame graph of time distribution for TikTok process exit (~400MB swapped):
do_notify_resume         3.89%
get_signal               3.89%
do_signal_exit           3.88%
do_exit                  3.88%
mmput                    3.22%
exit_mmap                3.22%
unmap_vmas               3.08%
unmap_page_range         3.07%
free_swap_and_cache_nr   1.31%****
swap_entry_range_free    1.17%****
zram_slot_free_notify    1.11%****
zram_free_hw_entry_dc    0.43%
free_zspage[zsmalloc]    0.09%

CPU: 8-core ARM64 (14.21GHz+33.5GHz+4*2.7GHz), 12GB RAM

Process with ~400MB swap exit situation:
Exit takes 200-300ms, ~4% CPU load
With more zram compression/swap, exit time increases to 400-500ms
free_swap_and_cache_nr avg: 0.5ms, max: ~1.5ms (running time)
free_swap_and_cache_nr dominates exit time (33%, up to 50% in worst cases
). Main time is zram resource freeing (0.25ms per operation). With dozens
of simultaneous exits, cumulative time becomes significant.

Optimization approach:
Focus isn't on optimizing hot functions (limited improvement potential).
High load comes from too many simultaneous exits. We'll make time-consumin
g interfaces in do_exit asynchronous to accelerate exit completion while
allowing non-swap page (file/anonymous) freeing by other processes.

Camera startup scenario:
20-30 background apps, anonymous pages compressed to zram (200-500MB).
Camera launch triggers lmkd to kill 10+ apps - their exits consume 25%+
CPU. System services/third-party processes use 60%+ CPU, leaving camera
startup process CPU-starved and delayed.


Sincere wishes,
Lei


>
>
>> swap_entry_free         393.863          *****50%
>> swap_range_free         372.602           ****47%
>>
>> 4. Optimization Approach
>> a) For processes exceeding swap entry threshold: aggregate and isolate
>> swap entries to enable fast exit
>> b) Asynchronously release batched entries when isolation reaches
>> configured threshold
>>
>> 5. Performance Gains (User Scenario: Camera Cold Launch)
>> a) 74% reduction in process exit latency (>500ms cases)
>> b) ~4% lower peak CPU load during concurrent process exits
>> c) ~70MB additional free memory during camera preview initialization
>> d) 40% reduction in camera preview stuttering probability
>>
>> 6. Prior Art & Improvements
>> Reference: Zhiguo Jiang's patch
>> (https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/)
>>
>> Key enhancements:
>> a) Reimplemented logic moved from mmu_gather.c to swapfile.c for clarity
>> b) Async release delegated to workqueue kworkers with configurable
>> max_active for NUMA-optimized concurrency
>>
>> Lei Liu (2):
>>    mm: swap: Gather swap entries and batch async release core
>>    mm: swap: Forced swap entries release under memory pressure
>>
>>   include/linux/oom.h           |  23 ++++++
>>   include/linux/swapfile.h      |   2 +
>>   include/linux/vm_event_item.h |   1 +
>>   kernel/exit.c                 |   2 +
>>   mm/memcontrol.c               |   6 --
>>   mm/memory.c                   |   4 +-
>>   mm/page_alloc.c               |   4 +
>>   mm/swapfile.c                 | 134 ++++++++++++++++++++++++++++++++++
>>   mm/vmstat.c                   |   1 +
>>   9 files changed, 170 insertions(+), 7 deletions(-)
>>
>> --
>> 2.34.1
>>
>>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09  9:24   ` Barry Song
  2025-09-09 16:15     ` Chris Li
@ 2025-09-10 14:07     ` Lei Liu
  2025-10-14 20:42       ` Barry Song
  1 sibling, 1 reply; 26+ messages in thread
From: Lei Liu @ 2025-09-10 14:07 UTC (permalink / raw)
  To: Barry Song, Kairui Song
  Cc: Michal Hocko, David Rientjes, Shakeel Butt, Andrew Morton,
	Kemeng Shi, Nhat Pham, Baoquan He, Chris Li, Johannes Weiner,
	Roman Gushchin, Muchun Song, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)


On 2025/9/9 17:24, Barry Song wrote:
> [You don't often get email from 21cnbao@gmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> On Tue, Sep 9, 2025 at 3:30 PM Kairui Song <ryncsn@gmail.com> wrote:
>> On Tue, Sep 9, 2025 at 3:04 PM Lei Liu <liulei.rjpt@vivo.com> wrote:
>> Hi Lei,
>>
>>> 1. Problem Scenario
>>> On systems with ZRAM and swap enabled, simultaneous process exits create
>>> contention. The primary bottleneck occurs during swap entry release
>>> operations, causing exiting processes to monopolize CPU resources. This
>>> leads to scheduling delays for high-priority processes.
>>>
>>> 2. Android Use Case
>>> During camera launch, LMKD terminates background processes to free memory.
>>> Exiting processes compete for CPU cycles, delaying the camera preview
>>> thread and causing visible stuttering - directly impacting user
>>> experience.
>>>
>>> 3. Root Cause Analysis
>>> When background applications heavily utilize swap space, process exit
>>> profiling reveals 55% of time spent in free_swap_and_cache_nr():
>>>
>>> Function              Duration (ms)   Percentage
>>> do_signal               791.813     **********100%
>>> do_group_exit           791.813     **********100%
>>> do_exit                 791.813     **********100%
>>> exit_mm                 577.859        *******73%
>>> exit_mmap               577.497        *******73%
>>> zap_pte_range           558.645        *******71%
>>> free_swap_and_cache_nr  433.381          *****55%
>>> free_swap_slot          403.568          *****51%
>> Thanks for sharing this case.
>>
>> One problem is that now the free_swap_slot function no longer exists
>> after 0ff67f990bd4. Have you tested the latest kernel? Or what is the
>> actual overhead here?
>>
>> Some batch freeing optimizations are introduced. And we have reworked
>> the whole locking mechanism for swap, so even on a system with 96t the
>> contention seems barely observable with common workloads.
>>
>> And another series is further reducing the contention and the overall
>> overhead (24% faster freeing for phase 1):
>> https://lore.kernel.org/linux-mm/20250905191357.78298-1-ryncsn@gmail.com/
>>
>> Will these be helpful for you? I think optimizing the root problem is
>> better than just deferring the overhead with async workers, which may
>> increase the overall overhead and complexity.
>>
> I feel the cover letter does not clearly describe where the bottleneck
> occurs or where the performance gains originate. To be honest, even
> the versions submitted last year did not present the bottleneck clearly.
>
> For example, is this due to lock contention (in which case we would
> need performance data to see how much CPU time is spent waiting for
> locks), or simply because we can simultaneously zap present and
> non-present PTEs?
>
> Thanks
> Barry

Hi Barry

Thank you for your question. Here is the issue we are encountering:

Flame graph of time distribution for douyin process exit (~400MB swapped):
do_notify_resume         3.89%
get_signal               3.89%
do_signal_exit           3.88%
do_exit                  3.88%
mmput                    3.22%
exit_mmap                3.22%
unmap_vmas               3.08%
unmap_page_range         3.07%
free_swap_and_cache_nr   1.31%****
swap_entry_range_free    1.17%****
zram_slot_free_notify    1.11%****
zram_free_hw_entry_dc    0.43%
free_zspage[zsmalloc]    0.09%

CPU: 8-core ARM64 (14.21GHz+33.5GHz+4*2.7GHz), 12GB RAM

Process with ~400MB swap exit situation:
Exit takes 200-300ms, ~4% CPU load
With more zram compression/swap, exit time increases to 400-500ms
free_swap_and_cache_nr avg: 0.5ms, max: ~1.5ms (running time)
free_swap_and_cache_nr dominates exit time (33%, up to 50% in worst cases
). Main time is zram resource freeing (0.25ms per operation). With dozens
of simultaneous exits, cumulative time becomes significant.

Optimization approach:
Focus isn't on optimizing hot functions (limited improvement potential).
High load comes from too many simultaneous exits. We'll make time-consumin
g interfaces in do_exit asynchronous to accelerate exit completion while
allowing non-swap page (file/anonymous) freeing by other processes.

Camera startup scenario:
20-30 background apps, anonymous pages compressed to zram (200-500MB).
Camera launch triggers lmkd to kill 10+ apps - their exits consume 25%+
CPU. System services/third-party processes use 60%+ CPU, leaving camera
startup process CPU-starved and delayed.

Sincere wishes,
Lei




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09 19:48   ` Suren Baghdasaryan
@ 2025-09-10 14:14     ` Lei Liu
  2025-09-10 14:56       ` Suren Baghdasaryan
                         ` (2 more replies)
  2025-09-10 15:40     ` Chris Li
  2025-09-10 20:10     ` Shakeel Butt
  2 siblings, 3 replies; 26+ messages in thread
From: Lei Liu @ 2025-09-10 14:14 UTC (permalink / raw)
  To: Suren Baghdasaryan, Shakeel Butt
  Cc: Michal Hocko, David Rientjes, Andrew Morton, Kemeng Shi,
	Kairui Song, Nhat Pham, Baoquan He, Barry Song, Chris Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Brendan Jackman, Zi Yan, Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)


On 2025/9/10 3:48, Suren Baghdasaryan wrote:
> On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>> On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
>>> 1. Problem Scenario
>>> On systems with ZRAM and swap enabled, simultaneous process exits create
>>> contention. The primary bottleneck occurs during swap entry release
>>> operations, causing exiting processes to monopolize CPU resources. This
>>> leads to scheduling delays for high-priority processes.
>>>
>>> 2. Android Use Case
>>> During camera launch, LMKD terminates background processes to free memory.
>> How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> SIGKILL
>
>>> Exiting processes compete for CPU cycles, delaying the camera preview
>>> thread and causing visible stuttering - directly impacting user
>>> experience.
>> Since the exit/kill is due to low memory situation, punting the memory
>> freeing to a low priority async mechanism will help in improving user
>> experience. Most probably the application (camera preview here) will get
>> into global reclaim and will compete for CPU with the async memory
>> freeing.
>>
>> What we really need is faster memory freeing and we should explore all
>> possible ways. As others suggested fix/improve the bottleneck in the
>> memory freeing path. In addition I think we should explore parallelizing
>> this as well.
>>
>> On Android, I suppose most of the memory is associated with single or
>> small set of processes and parallelizing memory freeing would be
>> challenging. BTW is LMKD using process_mrelease() to release the killed
>> process memory?
> Yes, LMKD has a reaper thread which wakes up and calls
> process_mrelease() after the main LMKD thread issued SIGKILL.

Hi Suren

our current issue is that after lmkd kills a process,|exit_mm|takes 
considerable time. The interface you provided might help quickly free 
memory, potentially allowing us to release some memory from processes 
before lmkd kills them. This could be a good idea.

We will take your suggestion into consideration.


Thank you




>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-10 14:14     ` Lei Liu
@ 2025-09-10 14:56       ` Suren Baghdasaryan
  2025-09-10 16:05       ` Chris Li
  2025-09-10 20:12       ` Shakeel Butt
  2 siblings, 0 replies; 26+ messages in thread
From: Suren Baghdasaryan @ 2025-09-10 14:56 UTC (permalink / raw)
  To: Lei Liu
  Cc: Shakeel Butt, Michal Hocko, David Rientjes, Andrew Morton,
	Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He, Barry Song,
	Chris Li, Johannes Weiner, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Wed, Sep 10, 2025 at 7:14 AM Lei Liu <liulei.rjpt@vivo.com> wrote:
>
>
> On 2025/9/10 3:48, Suren Baghdasaryan wrote:
> > On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >> On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> >>> 1. Problem Scenario
> >>> On systems with ZRAM and swap enabled, simultaneous process exits create
> >>> contention. The primary bottleneck occurs during swap entry release
> >>> operations, causing exiting processes to monopolize CPU resources. This
> >>> leads to scheduling delays for high-priority processes.
> >>>
> >>> 2. Android Use Case
> >>> During camera launch, LMKD terminates background processes to free memory.
> >> How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> > SIGKILL
> >
> >>> Exiting processes compete for CPU cycles, delaying the camera preview
> >>> thread and causing visible stuttering - directly impacting user
> >>> experience.
> >> Since the exit/kill is due to low memory situation, punting the memory
> >> freeing to a low priority async mechanism will help in improving user
> >> experience. Most probably the application (camera preview here) will get
> >> into global reclaim and will compete for CPU with the async memory
> >> freeing.
> >>
> >> What we really need is faster memory freeing and we should explore all
> >> possible ways. As others suggested fix/improve the bottleneck in the
> >> memory freeing path. In addition I think we should explore parallelizing
> >> this as well.
> >>
> >> On Android, I suppose most of the memory is associated with single or
> >> small set of processes and parallelizing memory freeing would be
> >> challenging. BTW is LMKD using process_mrelease() to release the killed
> >> process memory?
> > Yes, LMKD has a reaper thread which wakes up and calls
> > process_mrelease() after the main LMKD thread issued SIGKILL.
>
> Hi Suren
>
> our current issue is that after lmkd kills a process,|exit_mm|takes
> considerable time. The interface you provided might help quickly free
> memory, potentially allowing us to release some memory from processes
> before lmkd kills them. This could be a good idea.
>
> We will take your suggestion into consideration.

I wasn't really suggesting anything, just explaining how LMKD works today.

>
>
> Thank you
>
>
>
>
> >


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09 19:48   ` Suren Baghdasaryan
  2025-09-10 14:14     ` Lei Liu
@ 2025-09-10 15:40     ` Chris Li
  2025-09-10 20:10     ` Shakeel Butt
  2 siblings, 0 replies; 26+ messages in thread
From: Chris Li @ 2025-09-10 15:40 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Shakeel Butt, Lei Liu, Michal Hocko, David Rientjes,
	Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Tue, Sep 9, 2025 at 12:48 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > thread and causing visible stuttering - directly impacting user
> > > experience.
> >
> > Since the exit/kill is due to low memory situation, punting the memory
> > freeing to a low priority async mechanism will help in improving user
> > experience. Most probably the application (camera preview here) will get
> > into global reclaim and will compete for CPU with the async memory
> > freeing.
> >
> > What we really need is faster memory freeing and we should explore all
> > possible ways. As others suggested fix/improve the bottleneck in the
> > memory freeing path. In addition I think we should explore parallelizing
> > this as well.
> >
> > On Android, I suppose most of the memory is associated with single or
> > small set of processes and parallelizing memory freeing would be
> > challenging. BTW is LMKD using process_mrelease() to release the killed
> > process memory?
>
> Yes, LMKD has a reaper thread which wakes up and calls
> process_mrelease() after the main LMKD thread issued SIGKILL.

I feel this is a better solution to address the exit process that is too slow.

We are basically optimizing the exit() system call, I feel there
should be something we can do in the userspace before exit() to help
us without the kernel putting too much complexity into exit().
process_mrelease() souds fit the bill pretty well.

Chris


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-10 14:14     ` Lei Liu
  2025-09-10 14:56       ` Suren Baghdasaryan
@ 2025-09-10 16:05       ` Chris Li
  2025-09-10 20:12       ` Shakeel Butt
  2 siblings, 0 replies; 26+ messages in thread
From: Chris Li @ 2025-09-10 16:05 UTC (permalink / raw)
  To: Lei Liu
  Cc: Suren Baghdasaryan, Shakeel Butt, Michal Hocko, David Rientjes,
	Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Wed, Sep 10, 2025 at 7:14 AM Lei Liu <liulei.rjpt@vivo.com> wrote:
> >> On Android, I suppose most of the memory is associated with single or
> >> small set of processes and parallelizing memory freeing would be
> >> challenging. BTW is LMKD using process_mrelease() to release the killed
> >> process memory?
> > Yes, LMKD has a reaper thread which wakes up and calls
> > process_mrelease() after the main LMKD thread issued SIGKILL.
>
> Hi Suren
>
> our current issue is that after lmkd kills a process,|exit_mm|takes
> considerable time. The interface you provided might help quickly free
> memory, potentially allowing us to release some memory from processes
> before lmkd kills them. This could be a good idea.
>
> We will take your suggestion into consideration.

Hi Lei,

I do want to help your usage case. With my previous analysis of the
swap fault time breakdown. The amount of time it spends on batching
freeing the swap entry is not that much. Yes, it has a long tail, but
that is on a very small percentage of page faults. It shouldn't have
such a huge impact on the global average time.

https://services.google.com/fh/files/misc/zswap-breakdown.png
https://services.google.com/fh/files/misc/zswap-breakdown-detail.png

That is what I am trying to get to, the batch free of swap entry is
just the surface level. By itself it does not contribute much. Your
exit latency is largely a different issue.

However, the approach you take, (I briefly go over your patch) is to
add another batch layer for the swap entry free. Which impacts not
only the exit() path, it impacts other non exit() freeing of swap
entry as well. The swap entry is a resource best managed by the swap
allocator. The swap allocator knows best when it is the best place to
cache it vs freeing it under pressure. The extra batch of swap entry
free (before triggering the threshold) is just swap entry seating in
the batch queue. The allocator has no internal knowledge of this batch
behavior and it is interfering with the global view of swap entry
allocator. You need to address this before your patch can be
re-considered.

It feels like a CFO needs to do a company wide budget and revenue
projection. The sales department is having a side pocket account to
defer the revenue and sand bagging the sales number, which can
jeopardize the CFO's ability to budget and project . BTW, what I
describe is probably illegal for public companies. Kids, don't try
this at home.

I think you can do some of the following:
1) redo the test with the latest kernel which does not have the swap
slot caching batching any more. Report back what you got.
2) try out the process_mrelease().

Please share your findings, I am happy to work with you to address the
problem you encounter.

Chris

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-09 19:48   ` Suren Baghdasaryan
  2025-09-10 14:14     ` Lei Liu
  2025-09-10 15:40     ` Chris Li
@ 2025-09-10 20:10     ` Shakeel Butt
  2025-09-10 20:41       ` Suren Baghdasaryan
  2 siblings, 1 reply; 26+ messages in thread
From: Shakeel Butt @ 2025-09-10 20:10 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Lei Liu, Michal Hocko, David Rientjes, Andrew Morton, Kemeng Shi,
	Kairui Song, Nhat Pham, Baoquan He, Barry Song, Chris Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Brendan Jackman, Zi Yan, Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Tue, Sep 09, 2025 at 12:48:02PM -0700, Suren Baghdasaryan wrote:
> On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > > 1. Problem Scenario
> > > On systems with ZRAM and swap enabled, simultaneous process exits create
> > > contention. The primary bottleneck occurs during swap entry release
> > > operations, causing exiting processes to monopolize CPU resources. This
> > > leads to scheduling delays for high-priority processes.
> > >
> > > 2. Android Use Case
> > > During camera launch, LMKD terminates background processes to free memory.
> >
> > How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> 
> SIGKILL
> 
> >
> > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > thread and causing visible stuttering - directly impacting user
> > > experience.
> >
> > Since the exit/kill is due to low memory situation, punting the memory
> > freeing to a low priority async mechanism will help in improving user
> > experience. Most probably the application (camera preview here) will get
> > into global reclaim and will compete for CPU with the async memory
> > freeing.
> >
> > What we really need is faster memory freeing and we should explore all
> > possible ways. As others suggested fix/improve the bottleneck in the
> > memory freeing path. In addition I think we should explore parallelizing
> > this as well.
> >
> > On Android, I suppose most of the memory is associated with single or
> > small set of processes and parallelizing memory freeing would be
> > challenging. BTW is LMKD using process_mrelease() to release the killed
> > process memory?
> 
> Yes, LMKD has a reaper thread which wakes up and calls
> process_mrelease() after the main LMKD thread issued SIGKILL.
> 

Thanks Suren. I remember Android is planning to use Apps in cgroup. Is
that still the plan? I am actually looking into cgroup.kill, beside
sending SIGKILL, putting the processes of the target cgroup in the oom
reaper list. In addition, making oom reaper able to reap processes in
parallel. I am hoping that functionality to be useful to Android as
well.
> >


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-10 14:14     ` Lei Liu
  2025-09-10 14:56       ` Suren Baghdasaryan
  2025-09-10 16:05       ` Chris Li
@ 2025-09-10 20:12       ` Shakeel Butt
  2025-09-11  3:04         ` Lei Liu
  2 siblings, 1 reply; 26+ messages in thread
From: Shakeel Butt @ 2025-09-10 20:12 UTC (permalink / raw)
  To: Lei Liu
  Cc: Suren Baghdasaryan, Michal Hocko, David Rientjes, Andrew Morton,
	Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He, Barry Song,
	Chris Li, Johannes Weiner, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Wed, Sep 10, 2025 at 10:14:04PM +0800, Lei Liu wrote:
> 
> On 2025/9/10 3:48, Suren Baghdasaryan wrote:
> > On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > > > 1. Problem Scenario
> > > > On systems with ZRAM and swap enabled, simultaneous process exits create
> > > > contention. The primary bottleneck occurs during swap entry release
> > > > operations, causing exiting processes to monopolize CPU resources. This
> > > > leads to scheduling delays for high-priority processes.
> > > > 
> > > > 2. Android Use Case
> > > > During camera launch, LMKD terminates background processes to free memory.
> > > How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> > SIGKILL
> > 
> > > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > > thread and causing visible stuttering - directly impacting user
> > > > experience.
> > > Since the exit/kill is due to low memory situation, punting the memory
> > > freeing to a low priority async mechanism will help in improving user
> > > experience. Most probably the application (camera preview here) will get
> > > into global reclaim and will compete for CPU with the async memory
> > > freeing.
> > > 
> > > What we really need is faster memory freeing and we should explore all
> > > possible ways. As others suggested fix/improve the bottleneck in the
> > > memory freeing path. In addition I think we should explore parallelizing
> > > this as well.
> > > 
> > > On Android, I suppose most of the memory is associated with single or
> > > small set of processes and parallelizing memory freeing would be
> > > challenging. BTW is LMKD using process_mrelease() to release the killed
> > > process memory?
> > Yes, LMKD has a reaper thread which wakes up and calls
> > process_mrelease() after the main LMKD thread issued SIGKILL.
> 
> Hi Suren
> 
> our current issue is that after lmkd kills a process,|exit_mm|takes
> considerable time. The interface you provided might help quickly free
> memory, potentially allowing us to release some memory from processes before
> lmkd kills them. This could be a good idea.
> 
> We will take your suggestion into consideration.

But LMKD already does the process_mrelease(). Is that not happening on
your setup?


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-10 20:10     ` Shakeel Butt
@ 2025-09-10 20:41       ` Suren Baghdasaryan
  2025-09-10 22:10         ` T.J. Mercier
  0 siblings, 1 reply; 26+ messages in thread
From: Suren Baghdasaryan @ 2025-09-10 20:41 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Lei Liu, Michal Hocko, David Rientjes, Andrew Morton, Kemeng Shi,
	Kairui Song, Nhat Pham, Baoquan He, Barry Song, Chris Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Brendan Jackman, Zi Yan, Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	T.J. Mercier

On Wed, Sep 10, 2025 at 1:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Sep 09, 2025 at 12:48:02PM -0700, Suren Baghdasaryan wrote:
> > On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > > > 1. Problem Scenario
> > > > On systems with ZRAM and swap enabled, simultaneous process exits create
> > > > contention. The primary bottleneck occurs during swap entry release
> > > > operations, causing exiting processes to monopolize CPU resources. This
> > > > leads to scheduling delays for high-priority processes.
> > > >
> > > > 2. Android Use Case
> > > > During camera launch, LMKD terminates background processes to free memory.
> > >
> > > How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> >
> > SIGKILL
> >
> > >
> > > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > > thread and causing visible stuttering - directly impacting user
> > > > experience.
> > >
> > > Since the exit/kill is due to low memory situation, punting the memory
> > > freeing to a low priority async mechanism will help in improving user
> > > experience. Most probably the application (camera preview here) will get
> > > into global reclaim and will compete for CPU with the async memory
> > > freeing.
> > >
> > > What we really need is faster memory freeing and we should explore all
> > > possible ways. As others suggested fix/improve the bottleneck in the
> > > memory freeing path. In addition I think we should explore parallelizing
> > > this as well.
> > >
> > > On Android, I suppose most of the memory is associated with single or
> > > small set of processes and parallelizing memory freeing would be
> > > challenging. BTW is LMKD using process_mrelease() to release the killed
> > > process memory?
> >
> > Yes, LMKD has a reaper thread which wakes up and calls
> > process_mrelease() after the main LMKD thread issued SIGKILL.
> >
>
> Thanks Suren. I remember Android is planning to use Apps in cgroup. Is
> that still the plan? I am actually looking into cgroup.kill, beside
> sending SIGKILL, putting the processes of the target cgroup in the oom
> reaper list. In addition, making oom reaper able to reap processes in
> parallel. I am hoping that functionality to be useful to Android as
> well.

Yes, cgroups v2 with per-app hierarchy is already enabled on Android
as of about a year or so ago. The first usecase was the freezer. TJ
(CC'ing him here) also changed how ActivityManager Service (AMS) kills
process groups to use cgroup.kill (think when you force-stop an app
that's what will happen). LMKD has not been changed to use cgroup.kill
but that might be worth doing now. TJ, WDYT?


> > >


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-10 20:41       ` Suren Baghdasaryan
@ 2025-09-10 22:10         ` T.J. Mercier
  2025-09-10 22:33           ` Shakeel Butt
  0 siblings, 1 reply; 26+ messages in thread
From: T.J. Mercier @ 2025-09-10 22:10 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Shakeel Butt, Lei Liu, Michal Hocko, David Rientjes,
	Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Johannes Weiner, Roman Gushchin,
	Muchun Song, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Wed, Sep 10, 2025 at 1:41 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Sep 10, 2025 at 1:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Sep 09, 2025 at 12:48:02PM -0700, Suren Baghdasaryan wrote:
> > > On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > > > > 1. Problem Scenario
> > > > > On systems with ZRAM and swap enabled, simultaneous process exits create
> > > > > contention. The primary bottleneck occurs during swap entry release
> > > > > operations, causing exiting processes to monopolize CPU resources. This
> > > > > leads to scheduling delays for high-priority processes.
> > > > >
> > > > > 2. Android Use Case
> > > > > During camera launch, LMKD terminates background processes to free memory.
> > > >
> > > > How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> > >
> > > SIGKILL
> > >
> > > >
> > > > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > > > thread and causing visible stuttering - directly impacting user
> > > > > experience.
> > > >
> > > > Since the exit/kill is due to low memory situation, punting the memory
> > > > freeing to a low priority async mechanism will help in improving user
> > > > experience. Most probably the application (camera preview here) will get
> > > > into global reclaim and will compete for CPU with the async memory
> > > > freeing.
> > > >
> > > > What we really need is faster memory freeing and we should explore all
> > > > possible ways. As others suggested fix/improve the bottleneck in the
> > > > memory freeing path. In addition I think we should explore parallelizing
> > > > this as well.
> > > >
> > > > On Android, I suppose most of the memory is associated with single or
> > > > small set of processes and parallelizing memory freeing would be
> > > > challenging. BTW is LMKD using process_mrelease() to release the killed
> > > > process memory?
> > >
> > > Yes, LMKD has a reaper thread which wakes up and calls
> > > process_mrelease() after the main LMKD thread issued SIGKILL.
> > >
> >
> > Thanks Suren. I remember Android is planning to use Apps in cgroup. Is
> > that still the plan? I am actually looking into cgroup.kill, beside
> > sending SIGKILL, putting the processes of the target cgroup in the oom
> > reaper list. In addition, making oom reaper able to reap processes in
> > parallel. I am hoping that functionality to be useful to Android as
> > well.
>
> Yes, cgroups v2 with per-app hierarchy is already enabled on Android
> as of about a year or so ago. The first usecase was the freezer. TJ
> (CC'ing him here) also changed how ActivityManager Service (AMS) kills
> process groups to use cgroup.kill (think when you force-stop an app
> that's what will happen). LMKD has not been changed to use cgroup.kill
> but that might be worth doing now. TJ, WDYT?

Sounds like it's worth trying here [1].

One potential downside of cgroup.kill is that it requires taking the
cgroup_mutex, which is one of our most heavily contended locks.

We already have logic that waits for exits in libprocessgroup's
KillProcessGroup [2], but I don't think LMKD needs or wants that from
its main thread. I think we'll still want process_mrelease [3] from
LMKD's reaper thread.

[1] https://cs.android.com/android/platform/superproject/main/+/main:system/memory/lmkd/reaper.cpp;drc=88ca1a4963004011669da415bc421b846936071f;l=233
[2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libprocessgroup/processgroup.cpp;drc=61197364367c9e404c7da6900658f1b16c42d0da;l=537
[3] https://cs.android.com/android/platform/superproject/main/+/main:system/memory/lmkd/reaper.cpp;drc=88ca1a4963004011669da415bc421b846936071f;l=123

Shakeel could we not also invoke the oom reaper's help for regular
kill(SIGKILL)s?


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-10 22:10         ` T.J. Mercier
@ 2025-09-10 22:33           ` Shakeel Butt
  0 siblings, 0 replies; 26+ messages in thread
From: Shakeel Butt @ 2025-09-10 22:33 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Suren Baghdasaryan, Lei Liu, Michal Hocko, David Rientjes,
	Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Johannes Weiner, Roman Gushchin,
	Muchun Song, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Wed, Sep 10, 2025 at 03:10:29PM -0700, T.J. Mercier wrote:
> On Wed, Sep 10, 2025 at 1:41 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 1:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Tue, Sep 09, 2025 at 12:48:02PM -0700, Suren Baghdasaryan wrote:
> > > > On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > > >
> > > > > On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > > > > > 1. Problem Scenario
> > > > > > On systems with ZRAM and swap enabled, simultaneous process exits create
> > > > > > contention. The primary bottleneck occurs during swap entry release
> > > > > > operations, causing exiting processes to monopolize CPU resources. This
> > > > > > leads to scheduling delays for high-priority processes.
> > > > > >
> > > > > > 2. Android Use Case
> > > > > > During camera launch, LMKD terminates background processes to free memory.
> > > > >
> > > > > How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> > > >
> > > > SIGKILL
> > > >
> > > > >
> > > > > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > > > > thread and causing visible stuttering - directly impacting user
> > > > > > experience.
> > > > >
> > > > > Since the exit/kill is due to low memory situation, punting the memory
> > > > > freeing to a low priority async mechanism will help in improving user
> > > > > experience. Most probably the application (camera preview here) will get
> > > > > into global reclaim and will compete for CPU with the async memory
> > > > > freeing.
> > > > >
> > > > > What we really need is faster memory freeing and we should explore all
> > > > > possible ways. As others suggested fix/improve the bottleneck in the
> > > > > memory freeing path. In addition I think we should explore parallelizing
> > > > > this as well.
> > > > >
> > > > > On Android, I suppose most of the memory is associated with single or
> > > > > small set of processes and parallelizing memory freeing would be
> > > > > challenging. BTW is LMKD using process_mrelease() to release the killed
> > > > > process memory?
> > > >
> > > > Yes, LMKD has a reaper thread which wakes up and calls
> > > > process_mrelease() after the main LMKD thread issued SIGKILL.
> > > >
> > >
> > > Thanks Suren. I remember Android is planning to use Apps in cgroup. Is
> > > that still the plan? I am actually looking into cgroup.kill, beside
> > > sending SIGKILL, putting the processes of the target cgroup in the oom
> > > reaper list. In addition, making oom reaper able to reap processes in
> > > parallel. I am hoping that functionality to be useful to Android as
> > > well.
> >
> > Yes, cgroups v2 with per-app hierarchy is already enabled on Android
> > as of about a year or so ago. The first usecase was the freezer. TJ
> > (CC'ing him here) also changed how ActivityManager Service (AMS) kills
> > process groups to use cgroup.kill (think when you force-stop an app
> > that's what will happen). LMKD has not been changed to use cgroup.kill
> > but that might be worth doing now. TJ, WDYT?
> 
> Sounds like it's worth trying here [1].
> 
> One potential downside of cgroup.kill is that it requires taking the
> cgroup_mutex, which is one of our most heavily contended locks.

Oh let me look into that and see if we can remove cgroup_mutex from that
interface.

> 
> We already have logic that waits for exits in libprocessgroup's
> KillProcessGroup [2], but I don't think LMKD needs or wants that from
> its main thread. I think we'll still want process_mrelease [3] from
> LMKD's reaper thread.

I imagine once kernel oom reaper can work on killed processes
transparently, it would be much easier to let it do the job instead of
manual process_mrelease() on all the processes in a cgroup.

> 
> [1] https://cs.android.com/android/platform/superproject/main/+/main:system/memory/lmkd/reaper.cpp;drc=88ca1a4963004011669da415bc421b846936071f;l=233
> [2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libprocessgroup/processgroup.cpp;drc=61197364367c9e404c7da6900658f1b16c42d0da;l=537
> [3] https://cs.android.com/android/platform/superproject/main/+/main:system/memory/lmkd/reaper.cpp;drc=88ca1a4963004011669da415bc421b846936071f;l=123
> 
> Shakeel could we not also invoke the oom reaper's help for regular
> kill(SIGKILL)s?

I don't see why this can not be done. I will take a look.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-10 20:12       ` Shakeel Butt
@ 2025-09-11  3:04         ` Lei Liu
  0 siblings, 0 replies; 26+ messages in thread
From: Lei Liu @ 2025-09-11  3:04 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Suren Baghdasaryan, Michal Hocko, David Rientjes, Andrew Morton,
	Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He, Barry Song,
	Chris Li, Johannes Weiner, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)


on 2025/9/11 4:12, Shakeel Butt wrote:
> On Wed, Sep 10, 2025 at 10:14:04PM +0800, Lei Liu wrote:
>> On 2025/9/10 3:48, Suren Baghdasaryan wrote:
>>> On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>> On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
>>>>> 1. Problem Scenario
>>>>> On systems with ZRAM and swap enabled, simultaneous process exits create
>>>>> contention. The primary bottleneck occurs during swap entry release
>>>>> operations, causing exiting processes to monopolize CPU resources. This
>>>>> leads to scheduling delays for high-priority processes.
>>>>>
>>>>> 2. Android Use Case
>>>>> During camera launch, LMKD terminates background processes to free memory.
>>>> How does LMKD trigger the kills? SIGKILL or cgroup.kill?
>>> SIGKILL
>>>
>>>>> Exiting processes compete for CPU cycles, delaying the camera preview
>>>>> thread and causing visible stuttering - directly impacting user
>>>>> experience.
>>>> Since the exit/kill is due to low memory situation, punting the memory
>>>> freeing to a low priority async mechanism will help in improving user
>>>> experience. Most probably the application (camera preview here) will get
>>>> into global reclaim and will compete for CPU with the async memory
>>>> freeing.
>>>>
>>>> What we really need is faster memory freeing and we should explore all
>>>> possible ways. As others suggested fix/improve the bottleneck in the
>>>> memory freeing path. In addition I think we should explore parallelizing
>>>> this as well.
>>>>
>>>> On Android, I suppose most of the memory is associated with single or
>>>> small set of processes and parallelizing memory freeing would be
>>>> challenging. BTW is LMKD using process_mrelease() to release the killed
>>>> process memory?
>>> Yes, LMKD has a reaper thread which wakes up and calls
>>> process_mrelease() after the main LMKD thread issued SIGKILL.
>> Hi Suren
>>
>> our current issue is that after lmkd kills a process,|exit_mm|takes
>> considerable time. The interface you provided might help quickly free
>> memory, potentially allowing us to release some memory from processes before
>> lmkd kills them. This could be a good idea.
>>
>> We will take your suggestion into consideration.
> But LMKD already does the process_mrelease(). Is that not happening on
> your setup?

Hi Shakeel

Thank you for your consideration.

In our product, we have observed that in scenarios where multiple

processes are being killed, the load on the lmkd_reaper thread can
become very heavy, leading to issues with power consumption and lag.

This problem also occurs in the current camera launch scenario.


Best regards,
Lei




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
  2025-09-10 14:07     ` Lei Liu
@ 2025-10-14 20:42       ` Barry Song
  0 siblings, 0 replies; 26+ messages in thread
From: Barry Song @ 2025-10-14 20:42 UTC (permalink / raw)
  To: Lei Liu
  Cc: Kairui Song, Michal Hocko, David Rientjes, Shakeel Butt,
	Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Chris Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Brendan Jackman, Zi Yan,
	Peter Zijlstra (Intel),
	Chen Yu, Hao Jia, Kirill A. Shutemov, Usama Arif, Oleg Nesterov,
	Christian Brauner, Mateusz Guzik, Steven Rostedt,
	Andrii Nakryiko, Al Viro, Fushuai Wang,
	open list:MEMORY MANAGEMENT - OOM KILLER, open list,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

>
> Hi Barry
>
> Thank you for your question. Here is the issue we are encountering:
>
> Flame graph of time distribution for douyin process exit (~400MB swapped):
> do_notify_resume         3.89%
> get_signal               3.89%
> do_signal_exit           3.88%
> do_exit                  3.88%
> mmput                    3.22%
> exit_mmap                3.22%
> unmap_vmas               3.08%
> unmap_page_range         3.07%
> free_swap_and_cache_nr   1.31%****
> swap_entry_range_free    1.17%****
> zram_slot_free_notify    1.11%****

If 1.11/1.31, or 85% of free_swap_and_cache_nr, comes from zram_free,
it’s clear that the swap/mm core is not the right place for this optimization.

As it involves too much complexity—for example, synchronization between
swapoff and your new threads.

> zram_free_hw_entry_dc    0.43%
> free_zspage[zsmalloc]    0.09%

Thanks
Barry


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-10-14 20:42 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-09  6:53 [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release Lei Liu
2025-09-09  6:53 ` [PATCH v0 1/2] mm: swap: Gather swap entries and batch async release core Lei Liu
2025-09-10  1:39   ` kernel test robot
2025-09-10  3:12   ` kernel test robot
2025-09-09  6:53 ` [PATCH v0 2/2] mm: swap: Forced swap entries release under memory pressure Lei Liu
2025-09-10  5:36   ` kernel test robot
2025-09-09  7:30 ` [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release Kairui Song
2025-09-09  9:24   ` Barry Song
2025-09-09 16:15     ` Chris Li
2025-09-09 18:01       ` Chris Li
2025-09-10 14:07     ` Lei Liu
2025-10-14 20:42       ` Barry Song
2025-09-09 15:38   ` Chris Li
2025-09-10 14:01   ` Lei Liu
2025-09-09 19:21 ` Shakeel Butt
2025-09-09 19:48   ` Suren Baghdasaryan
2025-09-10 14:14     ` Lei Liu
2025-09-10 14:56       ` Suren Baghdasaryan
2025-09-10 16:05       ` Chris Li
2025-09-10 20:12       ` Shakeel Butt
2025-09-11  3:04         ` Lei Liu
2025-09-10 15:40     ` Chris Li
2025-09-10 20:10     ` Shakeel Butt
2025-09-10 20:41       ` Suren Baghdasaryan
2025-09-10 22:10         ` T.J. Mercier
2025-09-10 22:33           ` Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox