[PATCH v2 0/3] mm: tlb swap entries batch async release

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/3] mm: tlb swap entries batch async release
@ 2024-07-31 13:33 Zhiguo Jiang
  2024-07-31 13:33 ` [PATCH v2 1/3] mm: move task_is_dying to h headfile Zhiguo Jiang
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Zhiguo Jiang @ 2024-07-31 13:33 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel, Will Deacon,
	Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, linux-arch, cgroups, Barry Song, kernel test robot,
	Zhiguo Jiang
  Cc: opensource.kernel

The main reasons for the prolonged exit of a background process is the
time-consuming release of its swap entries. The proportion of swap memory
occupied by the background process increases with its duration in the
background, and after a period of time, this value can reach 60% or more.
Additionally, the relatively lengthy path for releasing swap entries
further contributes to the longer time required for the background process
to release its swap entries.

In the multiple background applications scenario, when launching a large
memory application such as a camera, system may enter a low memory state,
which will triggers the killing of multiple background processes at the
same time. Due to multiple exiting processes occupying multiple CPUs for
concurrent execution, the current foreground application's CPU resources
are tight and may cause issues such as lagging.

To solve this problem, we have introduced the multiple exiting process
asynchronous swap memory release mechanism, which isolates and caches
swap entries occupied by multiple exit processes, and hands them over
to an asynchronous kworker to complete the release. This allows the
exiting processes to complete quickly and release CPU resources. We have
validated this modification on the products and achieved the expected
benefits.

It offers several benefits:
1. Alleviate the high system cpu load caused by multiple exiting
   processes running simultaneously.
2. Reduce lock competition in swap entry free path by an asynchronous
   kworker instead of multiple exiting processes parallel execution.
3. Release memory occupied by exiting processes more efficiently.

-v2:
1. fix arch s390 config compilation warning.
 Reported-by: kernel test robot <lkp@intel.com>
 Closes: https://lore.kernel.org/oe-kbuild-all/202407311703.8q8sDQ2p-lkp@intel.com/
 Reported-by: kernel test robot <lkp@intel.com>
 Closes: https://lore.kernel.org/oe-kbuild-all/202407311947.VPJNRqad-lkp@intel.com/

-v1:
 https://lore.kernel.org/linux-mm/20240730114426.511-1-justinjiang@vivo.com/

Zhiguo Jiang (3):
  mm: move task_is_dying to h headfile
  mm: tlb: add tlb swap entries batch async release
  mm: s390: fix compilation warning

 arch/s390/include/asm/tlb.h |   8 +
 include/asm-generic/tlb.h   |  44 ++++++
 include/linux/mm_types.h    |  58 +++++++
 include/linux/oom.h         |   6 +
 mm/memcontrol.c             |   6 -
 mm/memory.c                 |   3 +-
 mm/mmu_gather.c             | 297 ++++++++++++++++++++++++++++++++++++
 7 files changed, 415 insertions(+), 7 deletions(-)

-- 
2.39.0

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/3] mm: move task_is_dying to h headfile
  2024-07-31 13:33 [PATCH v2 0/3] mm: tlb swap entries batch async release Zhiguo Jiang
@ 2024-07-31 13:33 ` Zhiguo Jiang
  2024-07-31 13:33 ` [PATCH v2 2/3] mm: tlb: add tlb swap entries batch async release Zhiguo Jiang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Zhiguo Jiang @ 2024-07-31 13:33 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel, Will Deacon,
	Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, linux-arch, cgroups, Barry Song, kernel test robot,
	Zhiguo Jiang
  Cc: opensource.kernel

Move task_is_dying() to include/linux/oom.h so that it can be
referenced elsewhere.

Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
---
 include/linux/oom.h | 6 ++++++
 mm/memcontrol.c     | 6 ------
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7d0c9c48a0c5..a3a58463c0d5
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -77,6 +77,12 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
 	return tsk->signal->oom_mm;
 }
 
+static inline bool task_is_dying(void)
+{
+	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
+		(current->flags & PF_EXITING);
+}
+
 /*
  * Checks whether a page fault on the given mm is still reliable.
  * This is no longer true if the oom reaper started to reap the
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9b3ef3a70833..c54a8aea19b0
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -98,12 +98,6 @@ static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 #define THRESHOLDS_EVENTS_TARGET 128
 #define SOFTLIMIT_EVENTS_TARGET 1024
 
-static inline bool task_is_dying(void)
-{
-	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
-		(current->flags & PF_EXITING);
-}
-
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
-- 
2.39.0



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 2/3] mm: tlb: add tlb swap entries batch async release
  2024-07-31 13:33 [PATCH v2 0/3] mm: tlb swap entries batch async release Zhiguo Jiang
  2024-07-31 13:33 ` [PATCH v2 1/3] mm: move task_is_dying to h headfile Zhiguo Jiang
@ 2024-07-31 13:33 ` Zhiguo Jiang
  2024-07-31 13:33 ` [PATCH v2 3/3] mm: s390: fix compilation warning Zhiguo Jiang
  2024-07-31 16:17 ` [PATCH v2 0/3] mm: tlb swap entries batch async release Andrew Morton
  3 siblings, 0 replies; 12+ messages in thread
From: Zhiguo Jiang @ 2024-07-31 13:33 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel, Will Deacon,
	Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, linux-arch, cgroups, Barry Song, kernel test robot,
	Zhiguo Jiang
  Cc: opensource.kernel

The main reasons for the prolonged exit of a background process is the
time-consuming release of its swap entries. The proportion of swap memory
occupied by the background process increases with its duration in the
background, and after a period of time, this value can reach 60% or more.
Additionally, the relatively lengthy path for releasing swap entries
further contributes to the longer time required for the background process
to release its swap entries.

In the multiple background applications scenario, when launching a large
memory application such as a camera, system may enter a low memory state,
which will triggers the killing of multiple background processes at the
same time. Due to multiple exiting processes occupying multiple CPUs for
concurrent execution, the current foreground application's CPU resources
are tight and may cause issues such as lagging.

To solve this problem, we have introduced the multiple exiting process
asynchronous swap memory release mechanism, which isolates and caches
swap entries occupied by multiple exit processes, and hands them over
to an asynchronous kworker to complete the release. This allows the
exiting processes to complete quickly and release CPU resources. We have
validated this modification on the products and achieved the expected
benefits.

It offers several benefits:
1. Alleviate the high system cpu load caused by multiple exiting
   processes running simultaneously.
2. Reduce lock competition in swap entry free path by an asynchronous
   kworker instead of multiple exiting processes parallel execution.
3. Release memory occupied by exiting processes more efficiently.

Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
---
 include/asm-generic/tlb.h |  44 ++++++
 include/linux/mm_types.h  |  58 ++++++++
 mm/memory.c               |   3 +-
 mm/mmu_gather.c           | 297 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 401 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 709830274b75..8b4d516b35b8
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -294,6 +294,37 @@ extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
 static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
 #endif
 
+#ifndef CONFIG_MMU_GATHER_NO_GATHER
+struct mmu_swap_batch {
+	struct mmu_swap_batch *next;
+	unsigned int nr;
+	unsigned int max;
+	encoded_swpentry_t encoded_entrys[];
+};
+
+#define MAX_SWAP_GATHER_BATCH	\
+	((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void *))
+
+#define MAX_SWAP_GATHER_BATCH_COUNT	(10000UL / MAX_SWAP_GATHER_BATCH)
+
+struct mmu_swap_gather {
+	/*
+	 * the asynchronous kworker to batch
+	 * release swap entries
+	 */
+	struct work_struct free_work;
+
+	/* batch cache swap entries */
+	unsigned int batch_count;
+	struct mmu_swap_batch *active;
+	struct mmu_swap_batch local;
+	encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE];
+};
+
+bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
+		swp_entry_t entry, int nr);
+#endif
+
 /*
  * struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
@@ -343,6 +374,18 @@ struct mmu_gather {
 	unsigned int		vma_exec : 1;
 	unsigned int		vma_huge : 1;
 	unsigned int		vma_pfn  : 1;
+#ifndef CONFIG_MMU_GATHER_NO_GATHER
+	/*
+	 * Two states of releasing swap entries
+	 * asynchronously:
+	 * swp_freeable - have opportunity to
+	 * release asynchronously future
+	 * swp_freeing - be releasing asynchronously.
+	 */
+	unsigned int		swp_freeable : 1;
+	unsigned int		swp_freeing : 1;
+	unsigned int		swp_disable : 1;
+#endif
 
 	unsigned int		batch_count;
 
@@ -354,6 +397,7 @@ struct mmu_gather {
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	unsigned int page_size;
 #endif
+	struct mmu_swap_gather *swp;
 #endif
 };
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 485424979254..f26fbff93ff4
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -283,6 +283,64 @@ typedef struct {
 	unsigned long val;
 } swp_entry_t;
 
+/*
+ * encoded_swpentry_t - a type marking the encoded swp_entry_t.
+ *
+ * An 'encoded_swpentry_t' represents a 'swp_enrty_t' with its the highest
+ * bit indicating extra context-dependent information. Only used in swp_entry
+ * asynchronous release path by mmu_swap_gather.
+ */
+typedef struct {
+	unsigned long val;
+} encoded_swpentry_t;
+
+/*
+ * The next item in an encoded_swpentry_t array is the "nr" argument, specifying the
+ * total number of consecutive swap entries associated with the same folio. If this
+ * bit is not set, "nr" is implicitly 1.
+ *
+ * Refer to include\asm\pgtable.h, swp_offset bits: 0 ~ 57, swp_type bits: 58 ~ 62.
+ * Bit63 can be used here.
+ */
+#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT (1UL << (BITS_PER_LONG - 1))
+
+static __always_inline encoded_swpentry_t
+encode_swpentry(swp_entry_t entry, unsigned long flags)
+{
+	encoded_swpentry_t ret;
+
+	VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
+	ret.val = flags | entry.val;
+	return ret;
+}
+
+static inline unsigned long encoded_swpentry_flags(encoded_swpentry_t entry)
+{
+	return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
+}
+
+static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_t entry)
+{
+	swp_entry_t ret;
+
+	ret.val = ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
+	return ret;
+}
+
+static __always_inline encoded_swpentry_t encode_nr_swpentrys(unsigned long nr)
+{
+	encoded_swpentry_t ret;
+
+	VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
+	ret.val = nr;
+	return ret;
+}
+
+static __always_inline unsigned long encoded_nr_swpentrys(encoded_swpentry_t entry)
+{
+	return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val);
+}
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.
diff --git a/mm/memory.c b/mm/memory.c
index b9f5cc0db3eb..bfa1995558d2
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1650,7 +1650,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (!should_zap_cows(details))
 				continue;
 			rss[MM_SWAPENTS] -= nr;
-			free_swap_and_cache_nr(entry, nr);
+			if (!__tlb_remove_swap_entries(tlb, entry, nr))
+				free_swap_and_cache_nr(entry, nr);
 		} else if (is_migration_entry(entry)) {
 			folio = pfn_swap_entry_folio(entry);
 			if (!should_zap_folio(details, folio))
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99b3e9408aa0..2bb413d052bd
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -9,11 +9,304 @@
 #include <linux/smp.h>
 #include <linux/swap.h>
 #include <linux/rmap.h>
+#include <linux/oom.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
+/*
+ * The swp_entry asynchronous release mechanism for multiple processes exiting
+ * simultaneously.
+ *
+ * During the multiple exiting processes releasing their own mm simultaneously,
+ * the swap entries in the exiting processes are handled by isolating, caching
+ * and handing over to an asynchronous kworker to complete the release.
+ *
+ * The conditions for the exiting process entering the swp_entry asynchronous
+ * release path:
+ * 1. The exiting process's MM_SWAPENTS count is >= SWAP_CLUSTER_MAX, avoiding
+ *    to alloc struct mmu_swap_gather frequently.
+ * 2. The number of exiting processes is >= NR_MIN_EXITING_PROCESSES.
+ *
+ * Since the time for determining the number of exiting processes is dynamic,
+ * the exiting process may start to enter the swp_entry asynchronous release
+ * at the beginning or middle stage of the exiting process's swp_entry release
+ * path.
+ *
+ * Once an exiting process enters the swp_entry asynchronous release, all remaining
+ * swap entries in this exiting process need to be fully released by asynchronous
+ * kworker theoretically.
+ *
+ * The function of the swp_entry asynchronous release:
+ * 1. Alleviate the high system cpu load caused by multiple exiting processes
+ *    running simultaneously.
+ * 2. Reduce lock competition in swap entry free path by an asynchronous kworker
+ *    instead of multiple exiting processes parallel execution.
+ * 3. Release memory occupied by exiting processes more efficiently.
+ */
+
+/*
+ * The min number of exiting processes required for swp_entry asynchronous release
+ */
+#define NR_MIN_EXITING_PROCESSES 2
+
+atomic_t nr_exiting_processes = ATOMIC_INIT(0);
+static struct kmem_cache *swap_gather_cachep;
+static struct workqueue_struct *swapfree_wq;
+static DEFINE_STATIC_KEY_TRUE(tlb_swap_asyncfree_disabled);
+
+static int __init tlb_swap_async_free_setup(void)
+{
+	swapfree_wq = alloc_workqueue("smfree_wq", WQ_UNBOUND |
+		WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
+	if (!swapfree_wq)
+		goto fail;
+
+	swap_gather_cachep = kmem_cache_create("swap_gather",
+		sizeof(struct mmu_swap_gather),
+		0, SLAB_TYPESAFE_BY_RCU | SLAB_PANIC | SLAB_ACCOUNT,
+		NULL);
+	if (!swap_gather_cachep)
+		goto kcache_fail;
+
+	static_branch_disable(&tlb_swap_asyncfree_disabled);
+	return 0;
+
+kcache_fail:
+	destroy_workqueue(swapfree_wq);
+fail:
+	return -ENOMEM;
+}
+postcore_initcall(tlb_swap_async_free_setup);
+
+static void __tlb_swap_gather_free(struct mmu_swap_gather *swap_gather)
+{
+	struct mmu_swap_batch *swap_batch, *next;
+
+	for (swap_batch = swap_gather->local.next; swap_batch; swap_batch = next) {
+		next = swap_batch->next;
+		free_page((unsigned long)swap_batch);
+	}
+	swap_gather->local.next = NULL;
+	kmem_cache_free(swap_gather_cachep, swap_gather);
+}
+
+static void tlb_swap_async_free_work(struct work_struct *w)
+{
+	int i, nr_multi, nr_free;
+	swp_entry_t start_entry;
+	struct mmu_swap_batch *swap_batch;
+	struct mmu_swap_gather *swap_gather = container_of(w,
+		struct mmu_swap_gather, free_work);
+
+	/* Release swap entries cached in mmu_swap_batch. */
+	for (swap_batch = &swap_gather->local; swap_batch && swap_batch->nr;
+	    swap_batch = swap_batch->next) {
+		nr_free = 0;
+		for (i = 0; i < swap_batch->nr; i++) {
+			if (unlikely(encoded_swpentry_flags(swap_batch->encoded_entrys[i]) &
+			    ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT)) {
+				start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
+				nr_multi = encoded_nr_swpentrys(swap_batch->encoded_entrys[++i]);
+				free_swap_and_cache_nr(start_entry, nr_multi);
+				nr_free += 2;
+			} else {
+				start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
+				free_swap_and_cache_nr(start_entry, 1);
+				nr_free++;
+			}
+		}
+		swap_batch->nr -= nr_free;
+		WARN_ON_ONCE(swap_batch->nr);
+	}
+	__tlb_swap_gather_free(swap_gather);
+}
+
+static bool __tlb_swap_gather_mmu_check(struct mmu_gather *tlb)
+{
+	/*
+	 * Only the exiting processes with the MM_SWAPENTS counter >=
+	 * SWAP_CLUSTER_MAX have the opportunity to release their swap
+	 * entries by asynchronous kworker.
+	 */
+	if (!task_is_dying() ||
+	    get_mm_counter(tlb->mm, MM_SWAPENTS) < SWAP_CLUSTER_MAX)
+		return true;
+
+	atomic_inc(&nr_exiting_processes);
+	if (atomic_read(&nr_exiting_processes) < NR_MIN_EXITING_PROCESSES)
+		tlb->swp_freeable = 1;
+	else
+		tlb->swp_freeing = 1;
+
+	return false;
+}
+
+/**
+ * __tlb_swap_gather_init - Initialize an mmu_swap_gather structure
+ * for swp_entry tear-down.
+ * @tlb: the mmu_swap_gather structure belongs to tlb
+ */
+static bool __tlb_swap_gather_init(struct mmu_gather *tlb)
+{
+	tlb->swp = kmem_cache_alloc(swap_gather_cachep, GFP_ATOMIC | GFP_NOWAIT);
+	if (unlikely(!tlb->swp))
+		return false;
+
+	tlb->swp->local.next  = NULL;
+	tlb->swp->local.nr    = 0;
+	tlb->swp->local.max   = ARRAY_SIZE(tlb->swp->__encoded_entrys);
+
+	tlb->swp->active      = &tlb->swp->local;
+	tlb->swp->batch_count = 0;
+
+	INIT_WORK(&tlb->swp->free_work, tlb_swap_async_free_work);
+	return true;
+}
+
+static void __tlb_swap_gather_mmu(struct mmu_gather *tlb)
+{
+	if (static_branch_unlikely(&tlb_swap_asyncfree_disabled))
+		return;
+
+	tlb->swp = NULL;
+	tlb->swp_freeable = 0;
+	tlb->swp_freeing = 0;
+	tlb->swp_disable = 0;
+
+	if (__tlb_swap_gather_mmu_check(tlb))
+		return;
+
+	/*
+	 * If the exiting process meets the conditions of
+	 * swp_entry asynchronous release, an mmu_swap_gather
+	 * structure will be initialized.
+	 */
+	if (tlb->swp_freeing)
+		__tlb_swap_gather_init(tlb);
+}
+
+static void __tlb_swap_gather_queuework(struct mmu_gather *tlb, bool finish)
+{
+	queue_work(swapfree_wq, &tlb->swp->free_work);
+	tlb->swp = NULL;
+	if (!finish)
+		__tlb_swap_gather_init(tlb);
+}
+
+static bool __tlb_swap_next_batch(struct mmu_gather *tlb)
+{
+	struct mmu_swap_batch *swap_batch;
+
+	if (tlb->swp->batch_count == MAX_SWAP_GATHER_BATCH_COUNT)
+		goto free;
+
+	swap_batch = (void *)__get_free_page(GFP_ATOMIC | GFP_NOWAIT);
+	if (unlikely(!swap_batch))
+		goto free;
+
+	swap_batch->next = NULL;
+	swap_batch->nr   = 0;
+	swap_batch->max  = MAX_SWAP_GATHER_BATCH;
+
+	tlb->swp->active->next = swap_batch;
+	tlb->swp->active = swap_batch;
+	tlb->swp->batch_count++;
+	return true;
+free:
+	/* batch move to wq */
+	__tlb_swap_gather_queuework(tlb, false);
+	return false;
+}
+
+/**
+ * __tlb_remove_swap_entries - the swap entries in exiting process are
+ * isolated, batch cached in struct mmu_swap_batch.
+ * @tlb: the current mmu_gather
+ * @entry: swp_entry to be isolated and cached
+ * @nr: the number of consecutive entries starting from entry parameter.
+ */
+bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
+			     swp_entry_t entry, int nr)
+{
+	struct mmu_swap_batch *swap_batch;
+	unsigned long flags = 0;
+	bool ret = false;
+
+	if (tlb->swp_disable)
+		return ret;
+
+	if (!tlb->swp_freeable && !tlb->swp_freeing)
+		return ret;
+
+
+	if (tlb->swp_freeable) {
+		if (atomic_read(&nr_exiting_processes) <
+		    NR_MIN_EXITING_PROCESSES)
+			return ret;
+		/*
+		 * If the current number of exiting processes
+		 * is >= NR_MIN_EXITING_PROCESSES, the exiting
+		 * process with swp_freeable state will enter
+		 * swp_freeing state to start releasing its
+		 * remaining swap entries by the asynchronous
+		 * kworker.
+		 */
+		tlb->swp_freeable = 0;
+		tlb->swp_freeing = 1;
+	}
+
+	VM_BUG_ON(tlb->swp_freeable || !tlb->swp_freeing);
+	if (!tlb->swp && !__tlb_swap_gather_init(tlb))
+		return ret;
+
+	swap_batch = tlb->swp->active;
+	if (unlikely(swap_batch->nr >= swap_batch->max - 1)) {
+		__tlb_swap_gather_queuework(tlb, false);
+		return ret;
+	}
+
+	if (likely(nr == 1)) {
+		swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
+	} else {
+		flags |= ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT;
+		swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
+		swap_batch->encoded_entrys[swap_batch->nr++] = encode_nr_swpentrys(nr);
+	}
+	ret = true;
+
+	if (swap_batch->nr >= swap_batch->max - 1) {
+		if (!__tlb_swap_next_batch(tlb))
+			goto exit;
+		swap_batch = tlb->swp->active;
+	}
+	VM_BUG_ON(swap_batch->nr > swap_batch->max - 1);
+exit:
+	return ret;
+}
+
+static void __tlb_batch_swap_finish(struct mmu_gather *tlb)
+{
+	if (tlb->swp_disable)
+		return;
+
+	if (!tlb->swp_freeable && !tlb->swp_freeing)
+		return;
+
+	if (tlb->swp_freeable) {
+		tlb->swp_freeable = 0;
+		VM_BUG_ON(tlb->swp_freeing);
+		goto exit;
+	}
+	tlb->swp_freeing = 0;
+	if (unlikely(!tlb->swp))
+		goto exit;
+
+	__tlb_swap_gather_queuework(tlb, true);
+exit:
+	atomic_dec(&nr_exiting_processes);
+}
 
 static bool tlb_next_batch(struct mmu_gather *tlb)
 {
@@ -386,6 +679,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
 	tlb->active     = &tlb->local;
 	tlb->batch_count = 0;
+
+	tlb->swp_disable = 1;
+	__tlb_swap_gather_mmu(tlb);
 #endif
 	tlb->delayed_rmap = 0;
 
@@ -466,6 +762,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
 	tlb_batch_list_free(tlb);
+	__tlb_batch_swap_finish(tlb);
 #endif
 	dec_tlb_flush_pending(tlb->mm);
 }
-- 
2.39.0



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 3/3] mm: s390: fix compilation warning
  2024-07-31 13:33 [PATCH v2 0/3] mm: tlb swap entries batch async release Zhiguo Jiang
  2024-07-31 13:33 ` [PATCH v2 1/3] mm: move task_is_dying to h headfile Zhiguo Jiang
  2024-07-31 13:33 ` [PATCH v2 2/3] mm: tlb: add tlb swap entries batch async release Zhiguo Jiang
@ 2024-07-31 13:33 ` Zhiguo Jiang
  2024-08-05 12:04   ` David Hildenbrand
  2024-07-31 16:17 ` [PATCH v2 0/3] mm: tlb swap entries batch async release Andrew Morton
  3 siblings, 1 reply; 12+ messages in thread
From: Zhiguo Jiang @ 2024-07-31 13:33 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel, Will Deacon,
	Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, linux-arch, cgroups, Barry Song, kernel test robot,
	Zhiguo Jiang
  Cc: opensource.kernel

Define static inline bool __tlb_remove_page_size() to fix arch s390
config compilation Warning.

Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
---

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407311703.8q8sDQ2p-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407311947.VPJNRqad-lkp@intel.com/

 arch/s390/include/asm/tlb.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index e95b2c8081eb..3f681f63390f
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 		struct page *page, bool delay_rmap, int page_size);
 static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
 		struct page *page, unsigned int nr_pages, bool delay_rmap);
+static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
+		swp_entry_t entry, int nr);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -69,6 +71,12 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
 	return false;
 }
 
+static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
+		swp_entry_t entry, int nr)
+{
+	return false;
+}
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	__tlb_flush_mm_lazy(tlb->mm);
-- 
2.39.0



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/3] mm: tlb swap entries batch async release
  2024-07-31 13:33 [PATCH v2 0/3] mm: tlb swap entries batch async release Zhiguo Jiang
                   ` (2 preceding siblings ...)
  2024-07-31 13:33 ` [PATCH v2 3/3] mm: s390: fix compilation warning Zhiguo Jiang
@ 2024-07-31 16:17 ` Andrew Morton
  2024-08-01  6:30   ` zhiguojiang
  3 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2024-07-31 16:17 UTC (permalink / raw)
  To: Zhiguo Jiang
  Cc: linux-mm, linux-kernel, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Arnd Bergmann, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	linux-arch, cgroups, Barry Song, kernel test robot,
	opensource.kernel

On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:

> The main reasons for the prolonged exit of a background process is the

The kernel really doesn't have a concept of a "background process". 
It's a userspace concept - perhaps "the parent process isn't waiting on
this process via wait()".

I assume here you're referring to an Android userspace concept?  I
expect that when Android "backgrounds" a process, it does lots of
things to that process.  Perhaps scheduling priority, perhaps
alteration of various MM tunables, etc.

So rather than referring to "backgrounding" it would be better to
identify what tuning alterations are made to such processes to bring
about this behavior.

> time-consuming release of its swap entries. The proportion of swap memory
> occupied by the background process increases with its duration in the
> background, and after a period of time, this value can reach 60% or more.

Again, what is it about the tuning of such processes which causes this
behavior?

> Additionally, the relatively lengthy path for releasing swap entries
> further contributes to the longer time required for the background process
> to release its swap entries.
> 
> In the multiple background applications scenario, when launching a large
> memory application such as a camera, system may enter a low memory state,
> which will triggers the killing of multiple background processes at the
> same time. Due to multiple exiting processes occupying multiple CPUs for
> concurrent execution, the current foreground application's CPU resources
> are tight and may cause issues such as lagging.
> 
> To solve this problem, we have introduced the multiple exiting process
> asynchronous swap memory release mechanism, which isolates and caches
> swap entries occupied by multiple exit processes, and hands them over
> to an asynchronous kworker to complete the release. This allows the
> exiting processes to complete quickly and release CPU resources. We have
> validated this modification on the products and achieved the expected
> benefits.

Dumb question: why can't this be done in userspace?  The exiting
process does fork/exit and lets the child do all this asynchronous freeing?

> It offers several benefits:
> 1. Alleviate the high system cpu load caused by multiple exiting
>    processes running simultaneously.
> 2. Reduce lock competition in swap entry free path by an asynchronous
>    kworker instead of multiple exiting processes parallel execution.

Why is lock contention reduced?  The same amount of work needs to be
done.

> 3. Release memory occupied by exiting processes more efficiently.

Probably it's slightly less efficient.

There are potential problems with this approach of passing work to a
kernel thread:

- The process will exit while its resources are still allocated.  But
  its parent process assumes those resources are now all freed and the
  parent process then proceeds to allocate resources.  This results in
  a time period where peak resource consumption is higher than it was
  before such a change.

- If all CPUs are running in userspace with realtime policy
  (SCHED_FIFO, for example) then the kworker thread will not run,
  indefinitely.

- Work which should have been accounted to the exiting process will
  instead go unaccounted.  

So please fully address all these potential issues.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/3] mm: tlb swap entries batch async release
  2024-07-31 16:17 ` [PATCH v2 0/3] mm: tlb swap entries batch async release Andrew Morton
@ 2024-08-01  6:30   ` zhiguojiang
  2024-08-01  7:36     ` Barry Song
  0 siblings, 1 reply; 12+ messages in thread
From: zhiguojiang @ 2024-08-01  6:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Arnd Bergmann, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	linux-arch, cgroups, Barry Song, kernel test robot,
	opensource.kernel



在 2024/8/1 0:17, Andrew Morton 写道:
> [Some people who received this message don't often get email from akpm@linux-foundation.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
>
>> The main reasons for the prolonged exit of a background process is the
> The kernel really doesn't have a concept of a "background process".
> It's a userspace concept - perhaps "the parent process isn't waiting on
> this process via wait()".
>
> I assume here you're referring to an Android userspace concept?  I
> expect that when Android "backgrounds" a process, it does lots of
> things to that process.  Perhaps scheduling priority, perhaps
> alteration of various MM tunables, etc.
>
> So rather than referring to "backgrounding" it would be better to
> identify what tuning alterations are made to such processes to bring
> about this behavior.
Hi Andrew Morton,

Thank you for your review and comments.

You are right. The "background process" here refers to the process
corresponding to an Android application switched to the background.
In fact, this patch is applicable to any exiting process.

The further explaination the concept of "multiple exiting processes",
is that it refers to different processes owning independent mm rather
than sharing the same mm.

I will use "mm" to describe process instead of "background" in next
version.
>
>> time-consuming release of its swap entries. The proportion of swap memory
>> occupied by the background process increases with its duration in the
>> background, and after a period of time, this value can reach 60% or more.
> Again, what is it about the tuning of such processes which causes this
> behavior?
When system is low memory, memory recycling will be trigged, where
anonymous folios in the process will be continuously reclaimed, resulting
in an increase of swap entries occupies by this process. So when the
process is killed, it takes more time to release it's swap entries over
time.

Testing datas of process occuping different physical memory sizes at
different time points:
Testing Platform: 8GB RAM
Testing procedure:
After booting up, start 15 processes first, and then observe the
physical memory size occupied by the last launched process at
different time points.

Example:
The process launched last: com.qiyi.video
|  memory type  |  0min  |  1min  | BG 5min | BG 10min | BG 15min |
-------------------------------------------------------------------
|     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
|   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
|   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
|  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
|    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
| Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
min - minute.

Based on the above datas, we can know that the swap ratio occupied by
the process gradually increases over time.
>
>> Additionally, the relatively lengthy path for releasing swap entries
>> further contributes to the longer time required for the background process
>> to release its swap entries.
>>
>> In the multiple background applications scenario, when launching a large
>> memory application such as a camera, system may enter a low memory state,
>> which will triggers the killing of multiple background processes at the
>> same time. Due to multiple exiting processes occupying multiple CPUs for
>> concurrent execution, the current foreground application's CPU resources
>> are tight and may cause issues such as lagging.
>>
>> To solve this problem, we have introduced the multiple exiting process
>> asynchronous swap memory release mechanism, which isolates and caches
>> swap entries occupied by multiple exit processes, and hands them over
>> to an asynchronous kworker to complete the release. This allows the
>> exiting processes to complete quickly and release CPU resources. We have
>> validated this modification on the products and achieved the expected
>> benefits.
> Dumb question: why can't this be done in userspace?  The exiting
> process does fork/exit and lets the child do all this asynchronous freeing?
The logic optimization for kernel releasing swap entries cannot be
implemented in userspace. The multiple exiting processes here own
their independent mm, rather than parent and child processes share the
same mm. Therefore, when the kernel executes multiple exiting process
simultaneously, they will definitely occupy multiple CPU core resources
to complete it.
>> It offers several benefits:
>> 1. Alleviate the high system cpu load caused by multiple exiting
>>     processes running simultaneously.
>> 2. Reduce lock competition in swap entry free path by an asynchronous
>>     kworker instead of multiple exiting processes parallel execution.
> Why is lock contention reduced?  The same amount of work needs to be
> done.
When multiple CPU cores run to release the different swap entries belong
to different exiting processes simultaneously, cluster lock or swapinfo
lock may encounter lock contention issues, and while an asynchronous
kworker that only occupies one CPU core is used to complete this work,
it can reduce the probability of lock contention and free up the
remaining CPU core resources for other non-exiting processes to use.
>
>> 3. Release memory occupied by exiting processes more efficiently.
> Probably it's slightly less efficient.
We observed that using an asynchronous kworker can result in more free
memory earlier. When multiple processes exit simultaneously, due to CPU
core resources competition, these exiting processes remain in a
runnable state for a long time and cannot release their occupied memory
resources timely.
>
> There are potential problems with this approach of passing work to a
> kernel thread:
>
> - The process will exit while its resources are still allocated.  But
>    its parent process assumes those resources are now all freed and the
>    parent process then proceeds to allocate resources.  This results in
>    a time period where peak resource consumption is higher than it was
>    before such a change.
- I don't think this modification will cause such a problem. Perhaps I
   haven't fully understood your meaning yet. Can you give me a specific
   example?
> - If all CPUs are running in userspace with realtime policy
>    (SCHED_FIFO, for example) then the kworker thread will not run,
>    indefinitely.
- In my clumsy understanding, the execution priority of kernel threads
   should not be lower than that of the exiting process, and the
   asynchronous kworker execution should only be triggered when the
   process exits. The exiting process should not be set to SCHED_LFO,
   so when the exiting process is executed, the asynchronous kworker
   should also have opportunity to get timely execution.
> - Work which should have been accounted to the exiting process will
>    instead go unaccounted.
- You are right, the statistics of process exit time may no longer be
   complete.
> So please fully address all these potential issues.
Thanks
Zhiguo



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/3] mm: tlb swap entries batch async release
  2024-08-01  6:30   ` zhiguojiang
@ 2024-08-01  7:36     ` Barry Song
  2024-08-01 10:33       ` zhiguojiang
  0 siblings, 1 reply; 12+ messages in thread
From: Barry Song @ 2024-08-01  7:36 UTC (permalink / raw)
  To: zhiguojiang
  Cc: Andrew Morton, linux-mm, linux-kernel, Will Deacon,
	Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, linux-arch, cgroups, kernel test robot,
	opensource.kernel

On Thu, Aug 1, 2024 at 2:31 PM zhiguojiang <justinjiang@vivo.com> wrote:
>
>
>
> 在 2024/8/1 0:17, Andrew Morton 写道:
> > [Some people who received this message don't often get email from akpm@linux-foundation.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> >
> > On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
> >
> >> The main reasons for the prolonged exit of a background process is the
> > The kernel really doesn't have a concept of a "background process".
> > It's a userspace concept - perhaps "the parent process isn't waiting on
> > this process via wait()".
> >
> > I assume here you're referring to an Android userspace concept?  I
> > expect that when Android "backgrounds" a process, it does lots of
> > things to that process.  Perhaps scheduling priority, perhaps
> > alteration of various MM tunables, etc.
> >
> > So rather than referring to "backgrounding" it would be better to
> > identify what tuning alterations are made to such processes to bring
> > about this behavior.
> Hi Andrew Morton,
>
> Thank you for your review and comments.
>
> You are right. The "background process" here refers to the process
> corresponding to an Android application switched to the background.
> In fact, this patch is applicable to any exiting process.
>
> The further explaination the concept of "multiple exiting processes",
> is that it refers to different processes owning independent mm rather
> than sharing the same mm.
>
> I will use "mm" to describe process instead of "background" in next
> version.
> >
> >> time-consuming release of its swap entries. The proportion of swap memory
> >> occupied by the background process increases with its duration in the
> >> background, and after a period of time, this value can reach 60% or more.
> > Again, what is it about the tuning of such processes which causes this
> > behavior?
> When system is low memory, memory recycling will be trigged, where
> anonymous folios in the process will be continuously reclaimed, resulting
> in an increase of swap entries occupies by this process. So when the
> process is killed, it takes more time to release it's swap entries over
> time.
>
> Testing datas of process occuping different physical memory sizes at
> different time points:
> Testing Platform: 8GB RAM
> Testing procedure:
> After booting up, start 15 processes first, and then observe the
> physical memory size occupied by the last launched process at
> different time points.
>
> Example:
> The process launched last: com.qiyi.video
> |  memory type  |  0min  |  1min  | BG 5min | BG 10min | BG 15min |
> -------------------------------------------------------------------
> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
> min - minute.
>
> Based on the above datas, we can know that the swap ratio occupied by
> the process gradually increases over time.

If I understand correctly, during zap_pte_range(), if 64.72% of the anonymous
pages are actually swapped out, you end up zapping 100 PTEs but only freeing
36.28 pages of memory. By doing this asynchronously, you prevent the
swap_release operation from blocking the process of zapping normal
PTEs that are mapping to memory.

Could you provide data showing the improvements after implementing
asynchronous freeing of swap entries?


> >
> >> Additionally, the relatively lengthy path for releasing swap entries
> >> further contributes to the longer time required for the background process
> >> to release its swap entries.
> >>
> >> In the multiple background applications scenario, when launching a large
> >> memory application such as a camera, system may enter a low memory state,
> >> which will triggers the killing of multiple background processes at the
> >> same time. Due to multiple exiting processes occupying multiple CPUs for
> >> concurrent execution, the current foreground application's CPU resources
> >> are tight and may cause issues such as lagging.
> >>
> >> To solve this problem, we have introduced the multiple exiting process
> >> asynchronous swap memory release mechanism, which isolates and caches
> >> swap entries occupied by multiple exit processes, and hands them over
> >> to an asynchronous kworker to complete the release. This allows the
> >> exiting processes to complete quickly and release CPU resources. We have
> >> validated this modification on the products and achieved the expected
> >> benefits.
> > Dumb question: why can't this be done in userspace?  The exiting
> > process does fork/exit and lets the child do all this asynchronous freeing?
> The logic optimization for kernel releasing swap entries cannot be
> implemented in userspace. The multiple exiting processes here own
> their independent mm, rather than parent and child processes share the
> same mm. Therefore, when the kernel executes multiple exiting process
> simultaneously, they will definitely occupy multiple CPU core resources
> to complete it.
> >> It offers several benefits:
> >> 1. Alleviate the high system cpu load caused by multiple exiting
> >>     processes running simultaneously.
> >> 2. Reduce lock competition in swap entry free path by an asynchronous
> >>     kworker instead of multiple exiting processes parallel execution.
> > Why is lock contention reduced?  The same amount of work needs to be
> > done.
> When multiple CPU cores run to release the different swap entries belong
> to different exiting processes simultaneously, cluster lock or swapinfo
> lock may encounter lock contention issues, and while an asynchronous
> kworker that only occupies one CPU core is used to complete this work,
> it can reduce the probability of lock contention and free up the
> remaining CPU core resources for other non-exiting processes to use.
> >
> >> 3. Release memory occupied by exiting processes more efficiently.
> > Probably it's slightly less efficient.
> We observed that using an asynchronous kworker can result in more free
> memory earlier. When multiple processes exit simultaneously, due to CPU
> core resources competition, these exiting processes remain in a
> runnable state for a long time and cannot release their occupied memory
> resources timely.
> >
> > There are potential problems with this approach of passing work to a
> > kernel thread:
> >
> > - The process will exit while its resources are still allocated.  But
> >    its parent process assumes those resources are now all freed and the
> >    parent process then proceeds to allocate resources.  This results in
> >    a time period where peak resource consumption is higher than it was
> >    before such a change.
> - I don't think this modification will cause such a problem. Perhaps I
>    haven't fully understood your meaning yet. Can you give me a specific
>    example?

Normally, after completing zap_pte_range, your swap slots are returned to
the swap file, except for a few slot caches. However, with the asynchronous
approach, it means that even after your process has completely exited,
 some swap slots might still not be released to the system. This could
potentially starve other processes waiting for swap slots to perform
swap-outs. I assume this isn't a critical issue for you because, in the
case of killing processes, freeing up memory is more important than
releasing swap entries?


> > - If all CPUs are running in userspace with realtime policy
> >    (SCHED_FIFO, for example) then the kworker thread will not run,
> >    indefinitely.
> - In my clumsy understanding, the execution priority of kernel threads
>    should not be lower than that of the exiting process, and the
>    asynchronous kworker execution should only be triggered when the
>    process exits. The exiting process should not be set to SCHED_LFO,
>    so when the exiting process is executed, the asynchronous kworker
>    should also have opportunity to get timely execution.
> > - Work which should have been accounted to the exiting process will
> >    instead go unaccounted.
> - You are right, the statistics of process exit time may no longer be
>    complete.
> > So please fully address all these potential issues.
> Thanks
> Zhiguo
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/3] mm: tlb swap entries batch async release
  2024-08-01  7:36     ` Barry Song
@ 2024-08-01 10:33       ` zhiguojiang
  2024-08-02 10:42         ` Barry Song
  0 siblings, 1 reply; 12+ messages in thread
From: zhiguojiang @ 2024-08-01 10:33 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, linux-mm, linux-kernel, Will Deacon,
	Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, linux-arch, cgroups, kernel test robot,
	opensource.kernel



在 2024/8/1 15:36, Barry Song 写道:
> On Thu, Aug 1, 2024 at 2:31 PM zhiguojiang <justinjiang@vivo.com> wrote:
>>
>> 在 2024/8/1 0:17, Andrew Morton 写道:
>>> [Some people who received this message don't often get email from akpm@linux-foundation.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>>>
>>> On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
>>>
>>>> The main reasons for the prolonged exit of a background process is the
>>> The kernel really doesn't have a concept of a "background process".
>>> It's a userspace concept - perhaps "the parent process isn't waiting on
>>> this process via wait()".
>>>
>>> I assume here you're referring to an Android userspace concept?  I
>>> expect that when Android "backgrounds" a process, it does lots of
>>> things to that process.  Perhaps scheduling priority, perhaps
>>> alteration of various MM tunables, etc.
>>>
>>> So rather than referring to "backgrounding" it would be better to
>>> identify what tuning alterations are made to such processes to bring
>>> about this behavior.
>> Hi Andrew Morton,
>>
>> Thank you for your review and comments.
>>
>> You are right. The "background process" here refers to the process
>> corresponding to an Android application switched to the background.
>> In fact, this patch is applicable to any exiting process.
>>
>> The further explaination the concept of "multiple exiting processes",
>> is that it refers to different processes owning independent mm rather
>> than sharing the same mm.
>>
>> I will use "mm" to describe process instead of "background" in next
>> version.
>>>> time-consuming release of its swap entries. The proportion of swap memory
>>>> occupied by the background process increases with its duration in the
>>>> background, and after a period of time, this value can reach 60% or more.
>>> Again, what is it about the tuning of such processes which causes this
>>> behavior?
>> When system is low memory, memory recycling will be trigged, where
>> anonymous folios in the process will be continuously reclaimed, resulting
>> in an increase of swap entries occupies by this process. So when the
>> process is killed, it takes more time to release it's swap entries over
>> time.
>>
>> Testing datas of process occuping different physical memory sizes at
>> different time points:
>> Testing Platform: 8GB RAM
>> Testing procedure:
>> After booting up, start 15 processes first, and then observe the
>> physical memory size occupied by the last launched process at
>> different time points.
>>
>> Example:
>> The process launched last: com.qiyi.video
>> |  memory type  |  0min  |  1min  | BG 5min | BG 10min | BG 15min |
>> -------------------------------------------------------------------
>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
>> min - minute.
>>
>> Based on the above datas, we can know that the swap ratio occupied by
>> the process gradually increases over time.
> If I understand correctly, during zap_pte_range(), if 64.72% of the anonymous
> pages are actually swapped out, you end up zapping 100 PTEs but only freeing
> 36.28 pages of memory. By doing this asynchronously, you prevent the
> swap_release operation from blocking the process of zapping normal
> PTEs that are mapping to memory.
>
> Could you provide data showing the improvements after implementing
> asynchronous freeing of swap entries?
Hi Barry,

Your understanding is correct. From the perspective of the benefits of
releasing the physical memory occupied by the exiting process, an
asynchronous kworker releasing swap entries can indeed accelerate
the exiting process to release its pte_present memory (e.g. file and
anonymous folio) faster.

In addition, from the perspective of CPU resources, for scenarios where
multiple exiting processes are running simultaneously, an asynchronous
kworker instead of multiple exiting processes is used to release swap
entries can release more CPU core resources for the current non-exiting
and important processes to use, thereby improving the user experience
of the current non-exiting and important processes. I think this is the
main contribution of this modification.

Example:
When there are multiple processes and the system memory is low, if
the camera processes are started at this time, it will trigger the
instantaneous killing of many processes because the camera processes
need to alloc a large amount of memory, resulting in multiple exiting
processes running simultaneously. These exiting processes will compete
with the current camera processes for CPU resources, and the release of
physical memory occupied by multiple exiting processes due to scheduling
is slow, ultimately affecting the slow execution of the camera process.

By using this optimization modification, multiple exiting processes can
quickly exit, freeing up their CPU resources and physical memory of
pte_preset, improving the running speed of camera processes.

Testing Platform: 8GB RAM
Testing procedure:
After restarting the machine, start 15 app processes first, and then
start the camera app processes, we monitor the cold start and preview
time datas of the camera app processes.

Test datas of camera processes cold start time (unit: millisecond):
|  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
| before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
| after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |

Test datas of camera processes preview time (unit: millisecond):
|  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
| before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
| after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |

Base on the average of the six sets of test datas above, we can see that
the benefit datas of the modified patch:
1. The cold start time of camera app processes has reduced by about 20%.
2. The preview time of camera app processes has reduced by about 42%.
>
>>>> Additionally, the relatively lengthy path for releasing swap entries
>>>> further contributes to the longer time required for the background process
>>>> to release its swap entries.
>>>>
>>>> In the multiple background applications scenario, when launching a large
>>>> memory application such as a camera, system may enter a low memory state,
>>>> which will triggers the killing of multiple background processes at the
>>>> same time. Due to multiple exiting processes occupying multiple CPUs for
>>>> concurrent execution, the current foreground application's CPU resources
>>>> are tight and may cause issues such as lagging.
>>>>
>>>> To solve this problem, we have introduced the multiple exiting process
>>>> asynchronous swap memory release mechanism, which isolates and caches
>>>> swap entries occupied by multiple exit processes, and hands them over
>>>> to an asynchronous kworker to complete the release. This allows the
>>>> exiting processes to complete quickly and release CPU resources. We have
>>>> validated this modification on the products and achieved the expected
>>>> benefits.
>>> Dumb question: why can't this be done in userspace?  The exiting
>>> process does fork/exit and lets the child do all this asynchronous freeing?
>> The logic optimization for kernel releasing swap entries cannot be
>> implemented in userspace. The multiple exiting processes here own
>> their independent mm, rather than parent and child processes share the
>> same mm. Therefore, when the kernel executes multiple exiting process
>> simultaneously, they will definitely occupy multiple CPU core resources
>> to complete it.
>>>> It offers several benefits:
>>>> 1. Alleviate the high system cpu load caused by multiple exiting
>>>>      processes running simultaneously.
>>>> 2. Reduce lock competition in swap entry free path by an asynchronous
>>>>      kworker instead of multiple exiting processes parallel execution.
>>> Why is lock contention reduced?  The same amount of work needs to be
>>> done.
>> When multiple CPU cores run to release the different swap entries belong
>> to different exiting processes simultaneously, cluster lock or swapinfo
>> lock may encounter lock contention issues, and while an asynchronous
>> kworker that only occupies one CPU core is used to complete this work,
>> it can reduce the probability of lock contention and free up the
>> remaining CPU core resources for other non-exiting processes to use.
>>>> 3. Release memory occupied by exiting processes more efficiently.
>>> Probably it's slightly less efficient.
>> We observed that using an asynchronous kworker can result in more free
>> memory earlier. When multiple processes exit simultaneously, due to CPU
>> core resources competition, these exiting processes remain in a
>> runnable state for a long time and cannot release their occupied memory
>> resources timely.
>>> There are potential problems with this approach of passing work to a
>>> kernel thread:
>>>
>>> - The process will exit while its resources are still allocated.  But
>>>     its parent process assumes those resources are now all freed and the
>>>     parent process then proceeds to allocate resources.  This results in
>>>     a time period where peak resource consumption is higher than it was
>>>     before such a change.
>> - I don't think this modification will cause such a problem. Perhaps I
>>     haven't fully understood your meaning yet. Can you give me a specific
>>     example?
> Normally, after completing zap_pte_range, your swap slots are returned to
> the swap file, except for a few slot caches. However, with the asynchronous
> approach, it means that even after your process has completely exited,
>   some swap slots might still not be released to the system. This could
> potentially starve other processes waiting for swap slots to perform
> swap-outs. I assume this isn't a critical issue for you because, in the
> case of killing processes, freeing up memory is more important than
> releasing swap entries?
  I did not encounter issues caused by the slow release of swap entries
by asynchronous kworker during our testing. Normally, asynchronous
kworker can also release cached swap entries in a short period of time.
Of course, if the system allows, it is necessary to increase the running
priority of the asynchronous kworker appropriately in order to release
swap entries faster, which is also beneficial for the system.

The swap-out datas for swap entries is also compressed and stored in the
zram memory space, so it is relatively important to release the zram
memory space corresponding to swap entries as soon as possible.
>
>>> - If all CPUs are running in userspace with realtime policy
>>>     (SCHED_FIFO, for example) then the kworker thread will not run,
>>>     indefinitely.
>> - In my clumsy understanding, the execution priority of kernel threads
>>     should not be lower than that of the exiting process, and the
>>     asynchronous kworker execution should only be triggered when the
>>     process exits. The exiting process should not be set to SCHED_LFO,
>>     so when the exiting process is executed, the asynchronous kworker
>>     should also have opportunity to get timely execution.
>>> - Work which should have been accounted to the exiting process will
>>>     instead go unaccounted.
>> - You are right, the statistics of process exit time may no longer be
>>     complete.
>>> So please fully address all these potential issues.
>> Thanks
>> Zhiguo
>>
> Thanks
> Barry
Thanks
Zhiguo



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/3] mm: tlb swap entries batch async release
  2024-08-01 10:33       ` zhiguojiang
@ 2024-08-02 10:42         ` Barry Song
  2024-08-02 14:42           ` zhiguojiang
  0 siblings, 1 reply; 12+ messages in thread
From: Barry Song @ 2024-08-02 10:42 UTC (permalink / raw)
  To: zhiguojiang
  Cc: Andrew Morton, linux-mm, linux-kernel, Will Deacon,
	Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, linux-arch, cgroups, kernel test robot,
	opensource.kernel

On Thu, Aug 1, 2024 at 10:33 PM zhiguojiang <justinjiang@vivo.com> wrote:
>
>
>
> 在 2024/8/1 15:36, Barry Song 写道:
> > On Thu, Aug 1, 2024 at 2:31 PM zhiguojiang <justinjiang@vivo.com> wrote:
> >>
> >> 在 2024/8/1 0:17, Andrew Morton 写道:
> >>> [Some people who received this message don't often get email from akpm@linux-foundation.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> >>>
> >>> On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
> >>>
> >>>> The main reasons for the prolonged exit of a background process is the
> >>> The kernel really doesn't have a concept of a "background process".
> >>> It's a userspace concept - perhaps "the parent process isn't waiting on
> >>> this process via wait()".
> >>>
> >>> I assume here you're referring to an Android userspace concept?  I
> >>> expect that when Android "backgrounds" a process, it does lots of
> >>> things to that process.  Perhaps scheduling priority, perhaps
> >>> alteration of various MM tunables, etc.
> >>>
> >>> So rather than referring to "backgrounding" it would be better to
> >>> identify what tuning alterations are made to such processes to bring
> >>> about this behavior.
> >> Hi Andrew Morton,
> >>
> >> Thank you for your review and comments.
> >>
> >> You are right. The "background process" here refers to the process
> >> corresponding to an Android application switched to the background.
> >> In fact, this patch is applicable to any exiting process.
> >>
> >> The further explaination the concept of "multiple exiting processes",
> >> is that it refers to different processes owning independent mm rather
> >> than sharing the same mm.
> >>
> >> I will use "mm" to describe process instead of "background" in next
> >> version.
> >>>> time-consuming release of its swap entries. The proportion of swap memory
> >>>> occupied by the background process increases with its duration in the
> >>>> background, and after a period of time, this value can reach 60% or more.
> >>> Again, what is it about the tuning of such processes which causes this
> >>> behavior?
> >> When system is low memory, memory recycling will be trigged, where
> >> anonymous folios in the process will be continuously reclaimed, resulting
> >> in an increase of swap entries occupies by this process. So when the
> >> process is killed, it takes more time to release it's swap entries over
> >> time.
> >>
> >> Testing datas of process occuping different physical memory sizes at
> >> different time points:
> >> Testing Platform: 8GB RAM
> >> Testing procedure:
> >> After booting up, start 15 processes first, and then observe the
> >> physical memory size occupied by the last launched process at
> >> different time points.
> >>
> >> Example:
> >> The process launched last: com.qiyi.video
> >> |  memory type  |  0min  |  1min  | BG 5min | BG 10min | BG 15min |
> >> -------------------------------------------------------------------
> >> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
> >> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
> >> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
> >> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
> >> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
> >> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
> >> min - minute.
> >>
> >> Based on the above datas, we can know that the swap ratio occupied by
> >> the process gradually increases over time.
> > If I understand correctly, during zap_pte_range(), if 64.72% of the anonymous
> > pages are actually swapped out, you end up zapping 100 PTEs but only freeing
> > 36.28 pages of memory. By doing this asynchronously, you prevent the
> > swap_release operation from blocking the process of zapping normal
> > PTEs that are mapping to memory.
> >
> > Could you provide data showing the improvements after implementing
> > asynchronous freeing of swap entries?
> Hi Barry,
>
> Your understanding is correct. From the perspective of the benefits of
> releasing the physical memory occupied by the exiting process, an
> asynchronous kworker releasing swap entries can indeed accelerate
> the exiting process to release its pte_present memory (e.g. file and
> anonymous folio) faster.
>
> In addition, from the perspective of CPU resources, for scenarios where
> multiple exiting processes are running simultaneously, an asynchronous
> kworker instead of multiple exiting processes is used to release swap
> entries can release more CPU core resources for the current non-exiting
> and important processes to use, thereby improving the user experience
> of the current non-exiting and important processes. I think this is the
> main contribution of this modification.
>
> Example:
> When there are multiple processes and the system memory is low, if
> the camera processes are started at this time, it will trigger the
> instantaneous killing of many processes because the camera processes
> need to alloc a large amount of memory, resulting in multiple exiting
> processes running simultaneously. These exiting processes will compete
> with the current camera processes for CPU resources, and the release of
> physical memory occupied by multiple exiting processes due to scheduling
> is slow, ultimately affecting the slow execution of the camera process.
>
> By using this optimization modification, multiple exiting processes can
> quickly exit, freeing up their CPU resources and physical memory of
> pte_preset, improving the running speed of camera processes.
>
> Testing Platform: 8GB RAM
> Testing procedure:
> After restarting the machine, start 15 app processes first, and then
> start the camera app processes, we monitor the cold start and preview
> time datas of the camera app processes.
>
> Test datas of camera processes cold start time (unit: millisecond):
> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>
> Test datas of camera processes preview time (unit: millisecond):
> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>
> Base on the average of the six sets of test datas above, we can see that
> the benefit datas of the modified patch:
> 1. The cold start time of camera app processes has reduced by about 20%.
> 2. The preview time of camera app processes has reduced by about 42%.

This sounds quite promising. I understand that asynchronous releasing
of swap entries can help killed processes free memory more quickly,
allowing your camera app to access it faster. However, I’m unsure
about the impact of swap-related lock contention. My intuition is that
it might not be significant, given that the cluster uses a single lock
and its relatively small size helps distribute the swap locks.

Anyway, I’m very interested in your patchset and can certainly
appreciate its benefits from my own experience working on phones. I’m
quite busy with other issues at the moment, but I hope to provide you
with detailed comments in about two weeks.

> >
> >>>> Additionally, the relatively lengthy path for releasing swap entries
> >>>> further contributes to the longer time required for the background process
> >>>> to release its swap entries.
> >>>>
> >>>> In the multiple background applications scenario, when launching a large
> >>>> memory application such as a camera, system may enter a low memory state,
> >>>> which will triggers the killing of multiple background processes at the
> >>>> same time. Due to multiple exiting processes occupying multiple CPUs for
> >>>> concurrent execution, the current foreground application's CPU resources
> >>>> are tight and may cause issues such as lagging.
> >>>>
> >>>> To solve this problem, we have introduced the multiple exiting process
> >>>> asynchronous swap memory release mechanism, which isolates and caches
> >>>> swap entries occupied by multiple exit processes, and hands them over
> >>>> to an asynchronous kworker to complete the release. This allows the
> >>>> exiting processes to complete quickly and release CPU resources. We have
> >>>> validated this modification on the products and achieved the expected
> >>>> benefits.
> >>> Dumb question: why can't this be done in userspace?  The exiting
> >>> process does fork/exit and lets the child do all this asynchronous freeing?
> >> The logic optimization for kernel releasing swap entries cannot be
> >> implemented in userspace. The multiple exiting processes here own
> >> their independent mm, rather than parent and child processes share the
> >> same mm. Therefore, when the kernel executes multiple exiting process
> >> simultaneously, they will definitely occupy multiple CPU core resources
> >> to complete it.
> >>>> It offers several benefits:
> >>>> 1. Alleviate the high system cpu load caused by multiple exiting
> >>>>      processes running simultaneously.
> >>>> 2. Reduce lock competition in swap entry free path by an asynchronous
> >>>>      kworker instead of multiple exiting processes parallel execution.
> >>> Why is lock contention reduced?  The same amount of work needs to be
> >>> done.
> >> When multiple CPU cores run to release the different swap entries belong
> >> to different exiting processes simultaneously, cluster lock or swapinfo
> >> lock may encounter lock contention issues, and while an asynchronous
> >> kworker that only occupies one CPU core is used to complete this work,
> >> it can reduce the probability of lock contention and free up the
> >> remaining CPU core resources for other non-exiting processes to use.
> >>>> 3. Release memory occupied by exiting processes more efficiently.
> >>> Probably it's slightly less efficient.
> >> We observed that using an asynchronous kworker can result in more free
> >> memory earlier. When multiple processes exit simultaneously, due to CPU
> >> core resources competition, these exiting processes remain in a
> >> runnable state for a long time and cannot release their occupied memory
> >> resources timely.
> >>> There are potential problems with this approach of passing work to a
> >>> kernel thread:
> >>>
> >>> - The process will exit while its resources are still allocated.  But
> >>>     its parent process assumes those resources are now all freed and the
> >>>     parent process then proceeds to allocate resources.  This results in
> >>>     a time period where peak resource consumption is higher than it was
> >>>     before such a change.
> >> - I don't think this modification will cause such a problem. Perhaps I
> >>     haven't fully understood your meaning yet. Can you give me a specific
> >>     example?
> > Normally, after completing zap_pte_range, your swap slots are returned to
> > the swap file, except for a few slot caches. However, with the asynchronous
> > approach, it means that even after your process has completely exited,
> >   some swap slots might still not be released to the system. This could
> > potentially starve other processes waiting for swap slots to perform
> > swap-outs. I assume this isn't a critical issue for you because, in the
> > case of killing processes, freeing up memory is more important than
> > releasing swap entries?
>   I did not encounter issues caused by the slow release of swap entries
> by asynchronous kworker during our testing. Normally, asynchronous
> kworker can also release cached swap entries in a short period of time.
> Of course, if the system allows, it is necessary to increase the running
> priority of the asynchronous kworker appropriately in order to release
> swap entries faster, which is also beneficial for the system.
>
> The swap-out datas for swap entries is also compressed and stored in the
> zram memory space, so it is relatively important to release the zram
> memory space corresponding to swap entries as soon as possible.
> >
> >>> - If all CPUs are running in userspace with realtime policy
> >>>     (SCHED_FIFO, for example) then the kworker thread will not run,
> >>>     indefinitely.
> >> - In my clumsy understanding, the execution priority of kernel threads
> >>     should not be lower than that of the exiting process, and the
> >>     asynchronous kworker execution should only be triggered when the
> >>     process exits. The exiting process should not be set to SCHED_LFO,
> >>     so when the exiting process is executed, the asynchronous kworker
> >>     should also have opportunity to get timely execution.
> >>> - Work which should have been accounted to the exiting process will
> >>>     instead go unaccounted.
> >> - You are right, the statistics of process exit time may no longer be
> >>     complete.
> >>> So please fully address all these potential issues.
> >> Thanks
> >> Zhiguo

Thanks
Barry


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/3] mm: tlb swap entries batch async release
  2024-08-02 10:42         ` Barry Song
@ 2024-08-02 14:42           ` zhiguojiang
  0 siblings, 0 replies; 12+ messages in thread
From: zhiguojiang @ 2024-08-02 14:42 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, linux-mm, linux-kernel, Will Deacon,
	Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, linux-arch, cgroups, kernel test robot,
	opensource.kernel



在 2024/8/2 18:42, Barry Song 写道:
> On Thu, Aug 1, 2024 at 10:33 PM zhiguojiang <justinjiang@vivo.com> wrote:
>>
>>
>> 在 2024/8/1 15:36, Barry Song 写道:
>>> On Thu, Aug 1, 2024 at 2:31 PM zhiguojiang <justinjiang@vivo.com> wrote:
>>>> 在 2024/8/1 0:17, Andrew Morton 写道:
>>>>> [Some people who received this message don't often get email from akpm@linux-foundation.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>>>>>
>>>>> On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
>>>>>
>>>>>> The main reasons for the prolonged exit of a background process is the
>>>>> The kernel really doesn't have a concept of a "background process".
>>>>> It's a userspace concept - perhaps "the parent process isn't waiting on
>>>>> this process via wait()".
>>>>>
>>>>> I assume here you're referring to an Android userspace concept?  I
>>>>> expect that when Android "backgrounds" a process, it does lots of
>>>>> things to that process.  Perhaps scheduling priority, perhaps
>>>>> alteration of various MM tunables, etc.
>>>>>
>>>>> So rather than referring to "backgrounding" it would be better to
>>>>> identify what tuning alterations are made to such processes to bring
>>>>> about this behavior.
>>>> Hi Andrew Morton,
>>>>
>>>> Thank you for your review and comments.
>>>>
>>>> You are right. The "background process" here refers to the process
>>>> corresponding to an Android application switched to the background.
>>>> In fact, this patch is applicable to any exiting process.
>>>>
>>>> The further explaination the concept of "multiple exiting processes",
>>>> is that it refers to different processes owning independent mm rather
>>>> than sharing the same mm.
>>>>
>>>> I will use "mm" to describe process instead of "background" in next
>>>> version.
>>>>>> time-consuming release of its swap entries. The proportion of swap memory
>>>>>> occupied by the background process increases with its duration in the
>>>>>> background, and after a period of time, this value can reach 60% or more.
>>>>> Again, what is it about the tuning of such processes which causes this
>>>>> behavior?
>>>> When system is low memory, memory recycling will be trigged, where
>>>> anonymous folios in the process will be continuously reclaimed, resulting
>>>> in an increase of swap entries occupies by this process. So when the
>>>> process is killed, it takes more time to release it's swap entries over
>>>> time.
>>>>
>>>> Testing datas of process occuping different physical memory sizes at
>>>> different time points:
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After booting up, start 15 processes first, and then observe the
>>>> physical memory size occupied by the last launched process at
>>>> different time points.
>>>>
>>>> Example:
>>>> The process launched last: com.qiyi.video
>>>> |  memory type  |  0min  |  1min  | BG 5min | BG 10min | BG 15min |
>>>> -------------------------------------------------------------------
>>>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
>>>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
>>>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
>>>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
>>>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
>>>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
>>>> min - minute.
>>>>
>>>> Based on the above datas, we can know that the swap ratio occupied by
>>>> the process gradually increases over time.
>>> If I understand correctly, during zap_pte_range(), if 64.72% of the anonymous
>>> pages are actually swapped out, you end up zapping 100 PTEs but only freeing
>>> 36.28 pages of memory. By doing this asynchronously, you prevent the
>>> swap_release operation from blocking the process of zapping normal
>>> PTEs that are mapping to memory.
>>>
>>> Could you provide data showing the improvements after implementing
>>> asynchronous freeing of swap entries?
>> Hi Barry,
>>
>> Your understanding is correct. From the perspective of the benefits of
>> releasing the physical memory occupied by the exiting process, an
>> asynchronous kworker releasing swap entries can indeed accelerate
>> the exiting process to release its pte_present memory (e.g. file and
>> anonymous folio) faster.
>>
>> In addition, from the perspective of CPU resources, for scenarios where
>> multiple exiting processes are running simultaneously, an asynchronous
>> kworker instead of multiple exiting processes is used to release swap
>> entries can release more CPU core resources for the current non-exiting
>> and important processes to use, thereby improving the user experience
>> of the current non-exiting and important processes. I think this is the
>> main contribution of this modification.
>>
>> Example:
>> When there are multiple processes and the system memory is low, if
>> the camera processes are started at this time, it will trigger the
>> instantaneous killing of many processes because the camera processes
>> need to alloc a large amount of memory, resulting in multiple exiting
>> processes running simultaneously. These exiting processes will compete
>> with the current camera processes for CPU resources, and the release of
>> physical memory occupied by multiple exiting processes due to scheduling
>> is slow, ultimately affecting the slow execution of the camera process.
>>
>> By using this optimization modification, multiple exiting processes can
>> quickly exit, freeing up their CPU resources and physical memory of
>> pte_preset, improving the running speed of camera processes.
>>
>> Testing Platform: 8GB RAM
>> Testing procedure:
>> After restarting the machine, start 15 app processes first, and then
>> start the camera app processes, we monitor the cold start and preview
>> time datas of the camera app processes.
>>
>> Test datas of camera processes cold start time (unit: millisecond):
>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>>
>> Test datas of camera processes preview time (unit: millisecond):
>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>>
>> Base on the average of the six sets of test datas above, we can see that
>> the benefit datas of the modified patch:
>> 1. The cold start time of camera app processes has reduced by about 20%.
>> 2. The preview time of camera app processes has reduced by about 42%.
> This sounds quite promising. I understand that asynchronous releasing
> of swap entries can help killed processes free memory more quickly,
> allowing your camera app to access it faster. However, I’m unsure
> about the impact of swap-related lock contention. My intuition is that
> it might not be significant, given that the cluster uses a single lock
> and its relatively small size helps distribute the swap locks.
>
> Anyway, I’m very interested in your patchset and can certainly
> appreciate its benefits from my own experience working on phones. I’m
> quite busy with other issues at the moment, but I hope to provide you
> with detailed comments in about two weeks.
>
>>>>>> Additionally, the relatively lengthy path for releasing swap entries
>>>>>> further contributes to the longer time required for the background process
>>>>>> to release its swap entries.
>>>>>>
>>>>>> In the multiple background applications scenario, when launching a large
>>>>>> memory application such as a camera, system may enter a low memory state,
>>>>>> which will triggers the killing of multiple background processes at the
>>>>>> same time. Due to multiple exiting processes occupying multiple CPUs for
>>>>>> concurrent execution, the current foreground application's CPU resources
>>>>>> are tight and may cause issues such as lagging.
>>>>>>
>>>>>> To solve this problem, we have introduced the multiple exiting process
>>>>>> asynchronous swap memory release mechanism, which isolates and caches
>>>>>> swap entries occupied by multiple exit processes, and hands them over
>>>>>> to an asynchronous kworker to complete the release. This allows the
>>>>>> exiting processes to complete quickly and release CPU resources. We have
>>>>>> validated this modification on the products and achieved the expected
>>>>>> benefits.
>>>>> Dumb question: why can't this be done in userspace?  The exiting
>>>>> process does fork/exit and lets the child do all this asynchronous freeing?
>>>> The logic optimization for kernel releasing swap entries cannot be
>>>> implemented in userspace. The multiple exiting processes here own
>>>> their independent mm, rather than parent and child processes share the
>>>> same mm. Therefore, when the kernel executes multiple exiting process
>>>> simultaneously, they will definitely occupy multiple CPU core resources
>>>> to complete it.
>>>>>> It offers several benefits:
>>>>>> 1. Alleviate the high system cpu load caused by multiple exiting
>>>>>>       processes running simultaneously.
>>>>>> 2. Reduce lock competition in swap entry free path by an asynchronous
>>>>>>       kworker instead of multiple exiting processes parallel execution.
>>>>> Why is lock contention reduced?  The same amount of work needs to be
>>>>> done.
>>>> When multiple CPU cores run to release the different swap entries belong
>>>> to different exiting processes simultaneously, cluster lock or swapinfo
>>>> lock may encounter lock contention issues, and while an asynchronous
>>>> kworker that only occupies one CPU core is used to complete this work,
>>>> it can reduce the probability of lock contention and free up the
>>>> remaining CPU core resources for other non-exiting processes to use.
>>>>>> 3. Release memory occupied by exiting processes more efficiently.
>>>>> Probably it's slightly less efficient.
>>>> We observed that using an asynchronous kworker can result in more free
>>>> memory earlier. When multiple processes exit simultaneously, due to CPU
>>>> core resources competition, these exiting processes remain in a
>>>> runnable state for a long time and cannot release their occupied memory
>>>> resources timely.
>>>>> There are potential problems with this approach of passing work to a
>>>>> kernel thread:
>>>>>
>>>>> - The process will exit while its resources are still allocated.  But
>>>>>      its parent process assumes those resources are now all freed and the
>>>>>      parent process then proceeds to allocate resources.  This results in
>>>>>      a time period where peak resource consumption is higher than it was
>>>>>      before such a change.
>>>> - I don't think this modification will cause such a problem. Perhaps I
>>>>      haven't fully understood your meaning yet. Can you give me a specific
>>>>      example?
>>> Normally, after completing zap_pte_range, your swap slots are returned to
>>> the swap file, except for a few slot caches. However, with the asynchronous
>>> approach, it means that even after your process has completely exited,
>>>    some swap slots might still not be released to the system. This could
>>> potentially starve other processes waiting for swap slots to perform
>>> swap-outs. I assume this isn't a critical issue for you because, in the
>>> case of killing processes, freeing up memory is more important than
>>> releasing swap entries?
>>    I did not encounter issues caused by the slow release of swap entries
>> by asynchronous kworker during our testing. Normally, asynchronous
>> kworker can also release cached swap entries in a short period of time.
>> Of course, if the system allows, it is necessary to increase the running
>> priority of the asynchronous kworker appropriately in order to release
>> swap entries faster, which is also beneficial for the system.
>>
>> The swap-out datas for swap entries is also compressed and stored in the
>> zram memory space, so it is relatively important to release the zram
>> memory space corresponding to swap entries as soon as possible.
Thank you for your attention and look forward to your reply.

You are correct, it might not be significant for cluster lock contention
due to its relatively small size of entries. However, asynchronous
kworker should have some benefits for swapinfo lock contention, because
when multiple exiting processes release their respective entries at the
same time, there will have swapinfo lock contention issue in the
swapcache_free_entries().
>>>>> - If all CPUs are running in userspace with realtime policy
>>>>>      (SCHED_FIFO, for example) then the kworker thread will not run,
>>>>>      indefinitely.
>>>> - In my clumsy understanding, the execution priority of kernel threads
>>>>      should not be lower than that of the exiting process, and the
>>>>      asynchronous kworker execution should only be triggered when the
>>>>      process exits. The exiting process should not be set to SCHED_LFO,
>>>>      so when the exiting process is executed, the asynchronous kworker
>>>>      should also have opportunity to get timely execution.
>>>>> - Work which should have been accounted to the exiting process will
>>>>>      instead go unaccounted.
>>>> - You are right, the statistics of process exit time may no longer be
>>>>      complete.
>>>>> So please fully address all these potential issues.
>>>> Thanks
>>>> Zhiguo
> Thanks
> Barry
Thanks
Zhiguo



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/3] mm: s390: fix compilation warning
  2024-07-31 13:33 ` [PATCH v2 3/3] mm: s390: fix compilation warning Zhiguo Jiang
@ 2024-08-05 12:04   ` David Hildenbrand
  2024-08-05 12:10     ` zhiguojiang
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2024-08-05 12:04 UTC (permalink / raw)
  To: Zhiguo Jiang, Andrew Morton, linux-mm, linux-kernel, Will Deacon,
	Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, linux-arch, cgroups, Barry Song, kernel test robot
  Cc: opensource.kernel

On 31.07.24 15:33, Zhiguo Jiang wrote:
> Define static inline bool __tlb_remove_page_size() to fix arch s390
> config compilation Warning.

This should be squashed into patch #2, no?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/3] mm: s390: fix compilation warning
  2024-08-05 12:04   ` David Hildenbrand
@ 2024-08-05 12:10     ` zhiguojiang
  0 siblings, 0 replies; 12+ messages in thread
From: zhiguojiang @ 2024-08-05 12:10 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, linux-mm, linux-kernel,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Arnd Bergmann, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, linux-arch, cgroups, Barry Song,
	kernel test robot
  Cc: opensource.kernel



在 2024/8/5 20:04, David Hildenbrand 写道:
> On 31.07.24 15:33, Zhiguo Jiang wrote:
>> Define static inline bool __tlb_remove_page_size() to fix arch s390
>> config compilation Warning.
>
> This should be squashed into patch #2, no?
Ok, thank you for your nice guidance, I will squash it into patch #2 in 
next version.

Thanks
Zhiguo



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-08-05 12:10 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-31 13:33 [PATCH v2 0/3] mm: tlb swap entries batch async release Zhiguo Jiang
2024-07-31 13:33 ` [PATCH v2 1/3] mm: move task_is_dying to h headfile Zhiguo Jiang
2024-07-31 13:33 ` [PATCH v2 2/3] mm: tlb: add tlb swap entries batch async release Zhiguo Jiang
2024-07-31 13:33 ` [PATCH v2 3/3] mm: s390: fix compilation warning Zhiguo Jiang
2024-08-05 12:04   ` David Hildenbrand
2024-08-05 12:10     ` zhiguojiang
2024-07-31 16:17 ` [PATCH v2 0/3] mm: tlb swap entries batch async release Andrew Morton
2024-08-01  6:30   ` zhiguojiang
2024-08-01  7:36     ` Barry Song
2024-08-01 10:33       ` zhiguojiang
2024-08-02 10:42         ` Barry Song
2024-08-02 14:42           ` zhiguojiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox