linux-mm.kvack.org archive mirror
* [PATCH v5 00/12] fold per-CPU vmstats remotely
@ 2023-03-13 16:25 Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 01/12] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
                   ` (12 more replies)
  0 siblings, 13 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka

This patch series addresses the following two problems:

    1. A customer provided evidence indicating that the idle
       tick was stopped, yet the CPU-specific vmstat counters
       remained populated.

       One can only assume quiet_vmstat() was not invoked on
       return to the idle loop. If I understand correctly,
       this divergence might erroneously prevent a reclaim
       attempt by kswapd: if the number of zone-specific free
       pages is below the per-cpu drift value, then
       zone_page_state_snapshot() is used to compute a more
       accurate view of that statistic, and any task blocked
       on the NUMA-node-specific pfmemalloc_wait queue will be
       unable to make significant progress via direct reclaim
       unless it is killed after being woken up by kswapd
       (see throttle_direct_reclaim()).

    2. With a SCHED_FIFO task busy-looping on a given CPU, and
       that CPU's kworker running at SCHED_OTHER priority,
       work queued to sync the per-CPU vmstats either never
       executes, or stalld (the stall daemon) boosts the
       kworker's priority, which causes a latency violation.

Both problems are addressed by having vmstat_shepherd flush the
per-CPU counters to the global counters from remote CPUs.

This is done using cmpxchg to manipulate the counters,
both CPU-locally (via the accounting functions)
and remotely (via cpu_vm_stats_fold).

Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.

v5:
- Drop "mm/vmstat: remove remote node draining"        (Vlastimil Babka)
- Implement remote node draining for cpu_vm_stats_fold (Vlastimil Babka)

v4:
- Switch per-CPU vmstat counters to s32, as required
  by the RISC-V and ARC architectures

v3:
- Removed unused drain_zone_pages and "changes" variable (David Hildenbrand)
- Use xchg instead of cmpxchg in refresh_cpu_vm_stats  (Peter Xu)
- Add drain_all_pages to vmstat_refresh to make
  stats more accurate				       (Peter Xu)
- Improve changelog of
  "mm/vmstat: switch counter modification to cmpxchg"  (Peter Xu / David)
- Improve changelog of
  "mm/vmstat: remove remote node draining"	       (David Hildenbrand)


v2:
- actually use LOCK CMPXCHG on counter mod/inc/dec functions
  (Christoph Lameter)
- use try_cmpxchg for cmpxchg loops
  (Uros Bizjak / Matthew Wilcox)


 arch/arm64/include/asm/percpu.h     |   16 ++
 arch/loongarch/include/asm/percpu.h |   23 +++-
 arch/s390/include/asm/percpu.h      |    5 
 arch/x86/include/asm/percpu.h       |   39 +++----
 include/asm-generic/percpu.h        |   17 +++
 include/linux/mmzone.h              |    3 
 include/linux/percpu-defs.h         |    2 
 kernel/fork.c                       |    2 
 kernel/scs.c                        |    2 
 mm/page_alloc.c                     |   23 ----
 mm/vmstat.c                         |  424 +++++++++++++++++++++++++++++++++++++++++------------------------------------
 11 files changed, 307 insertions(+), 249 deletions(-)




^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 01/12] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 02/12] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the
per-CPU counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
change ARM64's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
+++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
@@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
 	_pcp_protect_return(xchg_relaxed, pcp, val)
 
 #define this_cpu_cmpxchg_1(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_2(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_4(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg, pcp, o, n)
+
+#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
 	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+
 
 #ifdef __KVM_NVHE_HYPERVISOR__
 extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);





* [PATCH v5 02/12] this_cpu_cmpxchg: loongarch: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 01/12] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 03/12] this_cpu_cmpxchg: S390: " Marcelo Tosatti
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the
per-CPU counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
add this_cpu_cmpxchg_local helpers to LoongArch.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/loongarch/include/asm/percpu.h
+++ linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
@@ -150,6 +150,16 @@ static inline unsigned long __percpu_xch
 }
 
 /* this_cpu_cmpxchg */
+#define _protect_cmpxchg(pcp, o, n)				\
+({								\
+	typeof(*raw_cpu_ptr(&(pcp))) __ret;			\
+	preempt_disable_notrace();				\
+	__ret = cmpxchg(raw_cpu_ptr(&(pcp)), o, n);		\
+	preempt_enable_notrace();				\
+	__ret;							\
+})
+
+/* this_cpu_cmpxchg_local */
 #define _protect_cmpxchg_local(pcp, o, n)			\
 ({								\
 	typeof(*raw_cpu_ptr(&(pcp))) __ret;			\
@@ -222,10 +232,15 @@ do {									\
 #define this_cpu_xchg_4(pcp, val) _percpu_xchg(pcp, val)
 #define this_cpu_xchg_8(pcp, val) _percpu_xchg(pcp, val)
 
-#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+
+#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg(ptr, o, n)
 
 #include <asm-generic/percpu.h>
 





* [PATCH v5 03/12] this_cpu_cmpxchg: S390: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 01/12] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 02/12] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 04/12] this_cpu_cmpxchg: x86: " Marcelo Tosatti
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the
per-CPU counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
add S390's this_cpu_cmpxchg_local.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/s390/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/s390/include/asm/percpu.h
+++ linux-vmstat-remote/arch/s390/include/asm/percpu.h
@@ -148,6 +148,11 @@
 #define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+
 #define arch_this_cpu_xchg(pcp, nval)					\
 ({									\
 	typeof(pcp) *ptr__;						\





* [PATCH v5 04/12] this_cpu_cmpxchg: x86: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2023-03-13 16:25 ` [PATCH v5 03/12] this_cpu_cmpxchg: S390: " Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 05/12] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the
per-CPU counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
change x86's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/x86/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/x86/include/asm/percpu.h
+++ linux-vmstat-remote/arch/x86/include/asm/percpu.h
@@ -197,11 +197,11 @@ do {									\
  * cmpxchg has no such implied lock semantics as a result it is much
  * more efficient for cpu local operations.
  */
-#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval)		\
+#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval, lockp)	\
 ({									\
 	__pcpu_type_##size pco_old__ = __pcpu_cast_##size(_oval);	\
 	__pcpu_type_##size pco_new__ = __pcpu_cast_##size(_nval);	\
-	asm qual (__pcpu_op2_##size("cmpxchg", "%[nval]",		\
+	asm qual (__pcpu_op2_##size(lockp "cmpxchg", "%[nval]",		\
 				    __percpu_arg([var]))		\
 		  : [oval] "+a" (pco_old__),				\
 		    [var] "+m" (_var)					\
@@ -279,16 +279,20 @@ do {									\
 #define raw_cpu_add_return_1(pcp, val)		percpu_add_return_op(1, , pcp, val)
 #define raw_cpu_add_return_2(pcp, val)		percpu_add_return_op(2, , pcp, val)
 #define raw_cpu_add_return_4(pcp, val)		percpu_add_return_op(4, , pcp, val)
-#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, , pcp, oval, nval, "")
 
 #define this_cpu_add_return_1(pcp, val)		percpu_add_return_op(1, volatile, pcp, val)
 #define this_cpu_add_return_2(pcp, val)		percpu_add_return_op(2, volatile, pcp, val)
 #define this_cpu_add_return_4(pcp, val)		percpu_add_return_op(4, volatile, pcp, val)
-#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval, "")
+
+#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval, LOCK_PREFIX)
 
 #ifdef CONFIG_X86_CMPXCHG64
 #define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2)		\
@@ -319,16 +323,17 @@ do {									\
 #define raw_cpu_or_8(pcp, val)			percpu_to_op(8, , "or", (pcp), val)
 #define raw_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, , pcp, val)
 #define raw_cpu_xchg_8(pcp, nval)		raw_percpu_xchg_op(pcp, nval)
-#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval, "")
 
-#define this_cpu_read_8(pcp)			percpu_from_op(8, volatile, "mov", pcp)
-#define this_cpu_write_8(pcp, val)		percpu_to_op(8, volatile, "mov", (pcp), val)
-#define this_cpu_add_8(pcp, val)		percpu_add_op(8, volatile, (pcp), val)
-#define this_cpu_and_8(pcp, val)		percpu_to_op(8, volatile, "and", (pcp), val)
-#define this_cpu_or_8(pcp, val)			percpu_to_op(8, volatile, "or", (pcp), val)
-#define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, volatile, pcp, val)
-#define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(8, volatile, pcp, nval)
-#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
+#define this_cpu_read_8(pcp)				percpu_from_op(8, volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val)			percpu_to_op(8, volatile, "mov", (pcp), val)
+#define this_cpu_add_8(pcp, val)			percpu_add_op(8, volatile, (pcp), val)
+#define this_cpu_and_8(pcp, val)			percpu_to_op(8, volatile, "and", (pcp), val)
+#define this_cpu_or_8(pcp, val)				percpu_to_op(8, volatile, "or", (pcp), val)
+#define this_cpu_add_return_8(pcp, val)			percpu_add_return_op(8, volatile, pcp, val)
+#define this_cpu_xchg_8(pcp, nval)			percpu_xchg_op(8, volatile, pcp, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_8(pcp, oval, nval)		percpu_cmpxchg_op(8, volatile, pcp, oval, nval, LOCK_PREFIX)
 
 /*
  * Pretty complex macro to generate cmpxchg16 instruction.  The instruction





* [PATCH v5 05/12] add this_cpu_cmpxchg_local and asm-generic definitions
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2023-03-13 16:25 ` [PATCH v5 04/12] this_cpu_cmpxchg: x86: " Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 06/12] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the
per-CPU counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Add this_cpu_cmpxchg_local_ helpers to asm-generic/percpu.h.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/include/asm-generic/percpu.h
===================================================================
--- linux-vmstat-remote.orig/include/asm-generic/percpu.h
+++ linux-vmstat-remote/include/asm-generic/percpu.h
@@ -424,6 +424,23 @@ do {									\
 	this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
+#ifndef this_cpu_cmpxchg_local_1
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_2
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_4
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_8
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+
 #ifndef this_cpu_cmpxchg_double_1
 #define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
 	this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
Index: linux-vmstat-remote/include/linux/percpu-defs.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/percpu-defs.h
+++ linux-vmstat-remote/include/linux/percpu-defs.h
@@ -513,6 +513,8 @@ do {									\
 #define this_cpu_xchg(pcp, nval)	__pcpu_size_call_return2(this_cpu_xchg_, pcp, nval)
 #define this_cpu_cmpxchg(pcp, oval, nval) \
 	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
+#define this_cpu_cmpxchg_local(pcp, oval, nval) \
+	__pcpu_size_call_return2(this_cpu_cmpxchg_local_, pcp, oval, nval)
 #define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
 	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
 





* [PATCH v5 06/12] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2023-03-13 16:25 ` [PATCH v5 05/12] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 07/12] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Peter Xu, Marcelo Tosatti

this_cpu_cmpxchg was modified to an atomic version, which
can be more costly than the non-atomic version.

Switch users of this_cpu_cmpxchg to this_cpu_cmpxchg_local
(which preserves the previous non-atomic this_cpu_cmpxchg behaviour).

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/kernel/fork.c
===================================================================
--- linux-vmstat-remote.orig/kernel/fork.c
+++ linux-vmstat-remote/kernel/fork.c
@@ -203,7 +203,7 @@ static bool try_release_thread_stack_to_
 	unsigned int i;
 
 	for (i = 0; i < NR_CACHED_STACKS; i++) {
-		if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) != NULL)
+		if (this_cpu_cmpxchg_local(cached_stacks[i], NULL, vm) != NULL)
 			continue;
 		return true;
 	}
Index: linux-vmstat-remote/kernel/scs.c
===================================================================
--- linux-vmstat-remote.orig/kernel/scs.c
+++ linux-vmstat-remote/kernel/scs.c
@@ -83,7 +83,7 @@ void scs_free(void *s)
 	 */
 
 	for (i = 0; i < NR_CACHED_SCS; i++)
-		if (this_cpu_cmpxchg(scs_cache[i], 0, s) == NULL)
+		if (this_cpu_cmpxchg_local(scs_cache[i], 0, s) == NULL)
 			return;
 
 	kasan_unpoison_vmalloc(s, SCS_SIZE, KASAN_VMALLOC_PROT_NORMAL);





* [PATCH v5 07/12] mm/vmstat: switch counter modification to cmpxchg
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2023-03-13 16:25 ` [PATCH v5 06/12] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 08/12] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Marcelo Tosatti

In preparation for switching vmstat shepherd to flush
per-CPU counters remotely, switch the __{mod,inc,dec} functions that
modify the counters to use cmpxchg.

To facilitate reviewing, functions are ordered in the file as follows:

__{mod,inc,dec}_{zone,node}_page_state
#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
{mod,inc,dec}_{zone,node}_page_state
#else
{mod,inc,dec}_{zone,node}_page_state
#endif

This patch defines the __ versions for the
CONFIG_HAVE_CMPXCHG_LOCAL case to be their non-"__" counterparts:

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
{mod,inc,dec}_{zone,node}_page_state
__{mod,inc,dec}_{zone,node}_page_state = {mod,inc,dec}_{zone,node}_page_state
#else
{mod,inc,dec}_{zone,node}_page_state
__{mod,inc,dec}_{zone,node}_page_state
#endif

To test the performance difference, a page allocator microbenchmark:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
with loops=1000000 was used, on an Intel Core i7-11850H @ 2.50GHz.

For the single_page_alloc_free test, which does

        /** Loop to measure **/
        for (i = 0; i < rec->loops; i++) {
                my_page = alloc_page(gfp_mask);
                if (unlikely(my_page == NULL))
                        return 0;
                __free_page(my_page);
        }

Unit is cycles.

Vanilla			Patched		Diff
115.25			117		1.4%

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/*
+ * If we have cmpxchg_local support then we do not need to incur the overhead
+ * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
+ *
+ * mod_state() modifies the zone counter state through atomic per cpu
+ * operations.
+ *
+ * Overstep mode specifies how overstep should handled:
+ *     0       No overstepping
+ *     1       Overstepping half of threshold
+ *     -1      Overstepping minus half of threshold
+ */
+static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
+				  long delta, int overstep_mode)
+{
+	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;  /* overflow to zone counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a zone.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to zone counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		zone_page_state_add(z, zone, item);
+}
+
+void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			 long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_zone_page_state);
+
+void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			   long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_zone_page_state);
+
+void inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_zone_page_state);
+
+void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_zone_page_state);
+
+void dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_zone_page_state);
+
+void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+				  enum node_stat_item item,
+				  int delta, int overstep_mode)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long o, n, t, z;
+
+	if (vmstat_item_in_bytes(item)) {
+		/*
+		 * Only cgroups use subpage accounting right now; at
+		 * the global level, these items still change in
+		 * multiples of whole pages. Store them as pages
+		 * internally to keep the per-cpu counters compact.
+		 */
+		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
+		delta >>= PAGE_SHIFT;
+	}
+
+	do {
+		z = 0;  /* overflow to node counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a node.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to node counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+
+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+#else
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -541,149 +723,6 @@ void __dec_node_page_state(struct page *
 }
 EXPORT_SYMBOL(__dec_node_page_state);
 
-#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
-/*
- * If we have cmpxchg_local support then we do not need to incur the overhead
- * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
- *
- * mod_state() modifies the zone counter state through atomic per cpu
- * operations.
- *
- * Overstep mode specifies how overstep should handled:
- *     0       No overstepping
- *     1       Overstepping half of threshold
- *     -1      Overstepping minus half of threshold
-*/
-static inline void mod_zone_state(struct zone *zone,
-       enum zone_stat_item item, long delta, int overstep_mode)
-{
-	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	long o, n, t, z;
-
-	do {
-		z = 0;  /* overflow to zone counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a zone.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to zone counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		zone_page_state_add(z, zone, item);
-}
-
-void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
-			 long delta)
-{
-	mod_zone_state(zone, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_zone_page_state);
-
-void inc_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_zone_page_state);
-
-void dec_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_zone_page_state);
-
-static inline void mod_node_state(struct pglist_data *pgdat,
-       enum node_stat_item item, int delta, int overstep_mode)
-{
-	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	long o, n, t, z;
-
-	if (vmstat_item_in_bytes(item)) {
-		/*
-		 * Only cgroups use subpage accounting right now; at
-		 * the global level, these items still change in
-		 * multiples of whole pages. Store them as pages
-		 * internally to keep the per-cpu counters compact.
-		 */
-		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
-		delta >>= PAGE_SHIFT;
-	}
-
-	do {
-		z = 0;  /* overflow to node counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a node.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to node counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		node_page_state_add(z, pgdat, item);
-}
-
-void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
-					long delta)
-{
-	mod_node_state(pgdat, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_node_page_state);
-
-void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
-{
-	mod_node_state(pgdat, item, 1, 1);
-}
-
-void inc_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_node_page_state);
-
-void dec_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_node_page_state);
-#else
 /*
  * Use interrupt disable to serialize counter updates
  */
Index: linux-vmstat-remote/mm/page_alloc.c
===================================================================
--- linux-vmstat-remote.orig/mm/page_alloc.c
+++ linux-vmstat-remote/mm/page_alloc.c
@@ -8628,9 +8628,6 @@ static int page_alloc_cpu_dead(unsigned
 	/*
 	 * Zero the differential counters of the dead processor
 	 * so that the vm statistics are consistent.
-	 *
-	 * This is only okay since the processor is dead and cannot
-	 * race with what we are doing.
 	 */
 	cpu_vm_stats_fold(cpu);
 




^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 08/12] vmstat: switch per-cpu vmstat counters to 32-bits
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2023-03-13 16:25 ` [PATCH v5 07/12] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 09/12] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Marcelo Tosatti

Some architectures only provide xchg/cmpxchg in 32/64-bit quantities.

Since the next patch is about to use xchg on per-CPU vmstat counters,
switch them to s32.
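
As a rough illustration of why a word-sized counter is convenient here, below is a minimal C11 sketch of a cmpxchg-style update loop, using <stdatomic.h> as a userspace stand-in for the kernel's this_cpu_cmpxchg() (the names are hypothetical, not the series' code). A plain word-sized compare-and-swap is available on every architecture, whereas an s8 counter would need sub-word atomics:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical miniature of a per-CPU counter update: the diff is kept
 * in a 32-bit quantity so a word-sized compare-and-swap can update it. */
static _Atomic int32_t vm_stat_diff32;

static void mod_diff(long delta)
{
	int32_t o, n;

	do {
		o = atomic_load(&vm_stat_diff32);
		n = o + (int32_t)delta;
		/* retry if another updater changed the counter meanwhile */
	} while (!atomic_compare_exchange_weak(&vm_stat_diff32, &o, n));
}
```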

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/include/linux/mmzone.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/mmzone.h
+++ linux-vmstat-remote/include/linux/mmzone.h
@@ -689,8 +689,8 @@ struct per_cpu_pages {
 
 struct per_cpu_zonestat {
 #ifdef CONFIG_SMP
-	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
-	s8 stat_threshold;
+	s32 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
+	s32 stat_threshold;
 #endif
 #ifdef CONFIG_NUMA
 	/*
@@ -703,8 +703,8 @@ struct per_cpu_zonestat {
 };
 
 struct per_cpu_nodestat {
-	s8 stat_threshold;
-	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+	s32 stat_threshold;
+	s32 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
 };
 
 #endif /* !__GENERATING_BOUNDS.H */
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -351,7 +351,7 @@ static inline void mod_zone_state(struct
 				  long delta, int overstep_mode)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
 	long o, n, t, z;
 
 	do {
@@ -428,7 +428,7 @@ static inline void mod_node_state(struct
 				  int delta, int overstep_mode)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long o, n, t, z;
 
 	if (vmstat_item_in_bytes(item)) {
@@ -525,7 +525,7 @@ void __mod_zone_page_state(struct zone *
 			   long delta)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
 	long x;
 	long t;
 
@@ -556,7 +556,7 @@ void __mod_node_page_state(struct pglist
 				long delta)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long x;
 	long t;
 
@@ -614,8 +614,8 @@ EXPORT_SYMBOL(__mod_node_page_state);
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
+	s32 v, t;
 
 	/* See __mod_node_page_state */
 	preempt_disable_nested();
@@ -623,7 +623,7 @@ void __inc_zone_state(struct zone *zone,
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		zone_page_state_add(v + overstep, zone, item);
 		__this_cpu_write(*p, -overstep);
@@ -635,8 +635,8 @@ void __inc_zone_state(struct zone *zone,
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
@@ -646,7 +646,7 @@ void __inc_node_state(struct pglist_data
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
@@ -670,8 +670,8 @@ EXPORT_SYMBOL(__inc_node_page_state);
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
+	s32 v, t;
 
 	/* See __mod_node_page_state */
 	preempt_disable_nested();
@@ -679,7 +679,7 @@ void __dec_zone_state(struct zone *zone,
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		zone_page_state_add(v - overstep, zone, item);
 		__this_cpu_write(*p, overstep);
@@ -691,8 +691,8 @@ void __dec_zone_state(struct zone *zone,
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
@@ -702,7 +702,7 @@ void __dec_node_state(struct pglist_data
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);




^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 09/12] mm/vmstat: use xchg in cpu_vm_stats_fold
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (7 preceding siblings ...)
  2023-03-13 16:25 ` [PATCH v5 08/12] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 10/12] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Marcelo Tosatti

In preparation for switching the vmstat shepherd to flush
per-CPU counters remotely, use xchg instead of a
pair of read/write instructions.
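
The reason the read/write pair is insufficient can be sketched in a few lines of C11 (illustrative only; atomic_exchange stands in for the kernel's xchg(), and the names are hypothetical). The remote reader must take the pending diff and zero it in one atomic step, otherwise an update between the separate read and write would be lost:

```c
#include <assert.h>
#include <stdatomic.h>

/* Pending per-CPU diff, as folded by cpu_vm_stats_fold() in the patch. */
static _Atomic int pending_diff = 7;

static int fold_one(_Atomic int *diff)
{
	/* single atomic read-and-zero, as done with xchg() in the patch;
	 * "v = *diff; *diff = 0;" would race with a concurrent updater */
	return atomic_exchange(diff, 0);
}
```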

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -924,7 +924,7 @@ static int refresh_cpu_vm_stats(bool do_
 }
 
 /*
- * Fold the data for an offline cpu into the global array.
+ * Fold the data for a cpu into the global array.
  * There cannot be any access by the offline cpu and therefore
  * synchronization is simplified.
  */
@@ -945,8 +945,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_stat_diff[i]) {
 				int v;
 
-				v = pzstats->vm_stat_diff[i];
-				pzstats->vm_stat_diff[i] = 0;
+				v = xchg(&pzstats->vm_stat_diff[i], 0);
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
@@ -956,8 +955,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_numa_event[i]) {
 				unsigned long v;
 
-				v = pzstats->vm_numa_event[i];
-				pzstats->vm_numa_event[i] = 0;
+				v = xchg(&pzstats->vm_numa_event[i], 0);
 				zone_numa_event_add(v, zone, i);
 			}
 		}
@@ -973,8 +971,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (p->vm_node_stat_diff[i]) {
 				int v;
 
-				v = p->vm_node_stat_diff[i];
-				p->vm_node_stat_diff[i] = 0;
+				v = xchg(&p->vm_node_stat_diff[i], 0);
 				atomic_long_add(v, &pgdat->vm_stat[i]);
 				global_node_diff[i] += v;
 			}




^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 10/12] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (8 preceding siblings ...)
  2023-03-13 16:25 ` [PATCH v5 09/12] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 11/12] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Marcelo Tosatti

Now that the counters are modified via cmpxchg both CPU-locally
(via the account functions) and remotely (via cpu_vm_stats_fold),
it is possible to switch vmstat_shepherd to perform the per-CPU
vmstats folding remotely.

This fixes the following two problems:

 1. A customer provided some evidence which indicates that
    the idle tick was stopped; albeit, CPU-specific vmstat
    counters still remained populated.

    Thus one can only assume quiet_vmstat() was not
    invoked on return to the idle loop. If I understand
    correctly, I suspect this divergence might erroneously
    prevent a reclaim attempt by kswapd. If the number of
    zone specific free pages are below their per-cpu drift
    value then zone_page_state_snapshot() is used to
    compute a more accurate view of the aforementioned
    statistic.  Thus any task blocked on the NUMA node
    specific pfmemalloc_wait queue will be unable to make
    significant progress via direct reclaim unless it is
    killed after being woken up by kswapd
    (see throttle_direct_reclaim())

 2. With a SCHED_FIFO task that busy loops on a given CPU,
    and kworker for that CPU at SCHED_OTHER priority,
    queuing work to sync per-vmstats will either cause that
    work to never execute, or stalld (i.e. stall daemon)
    boosts kworker priority which causes a latency
    violation

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -2043,6 +2043,23 @@ static void vmstat_shepherd(struct work_
 
 static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		cpu_vm_stats_fold(cpu);
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+	schedule_delayed_work(&shepherd,
+		round_jiffies_relative(sysctl_stat_interval));
+}
+#else
 static void vmstat_shepherd(struct work_struct *w)
 {
 	int cpu;
@@ -2062,6 +2079,7 @@ static void vmstat_shepherd(struct work_
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }
+#endif
 
 static void __init start_shepherd_timer(void)
 {




^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 11/12] mm/vmstat: refresh stats remotely instead of via work item
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (9 preceding siblings ...)
  2023-03-13 16:25 ` [PATCH v5 10/12] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-13 16:25 ` [PATCH v5 12/12] vmstat: add pcp remote node draining via cpu_vm_stats_fold Marcelo Tosatti
  2023-03-14 12:25 ` [PATCH v5 00/12] fold per-CPU vmstats remotely Michal Hocko
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Marcelo Tosatti

Refresh per-CPU stats remotely, instead of queueing 
work items, for the stat_refresh procfs method.

This fixes a sosreport hang (sosreport uses vmstat_refresh) in the
presence of a spinning SCHED_FIFO process.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -1901,11 +1901,20 @@ static DEFINE_PER_CPU(struct delayed_wor
 int sysctl_stat_interval __read_mostly = HZ;
 
 #ifdef CONFIG_PROC_FS
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+static int refresh_all_vm_stats(void);
+#else
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats(true);
 }
 
+static int refresh_all_vm_stats(void)
+{
+	return schedule_on_each_cpu(refresh_vm_stats);
+}
+#endif
+
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -1925,7 +1934,7 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
+	err = refresh_all_vm_stats();
 	if (err)
 		return err;
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
@@ -2045,7 +2054,7 @@ static DECLARE_DEFERRABLE_WORK(shepherd,
 
 #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
 /* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
-static void vmstat_shepherd(struct work_struct *w)
+static int refresh_all_vm_stats(void)
 {
 	int cpu;
 
@@ -2055,7 +2064,12 @@ static void vmstat_shepherd(struct work_
 		cond_resched();
 	}
 	cpus_read_unlock();
+	return 0;
+}
 
+static void vmstat_shepherd(struct work_struct *w)
+{
+	refresh_all_vm_stats();
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }




^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 12/12] vmstat: add pcp remote node draining via cpu_vm_stats_fold
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (10 preceding siblings ...)
  2023-03-13 16:25 ` [PATCH v5 11/12] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
@ 2023-03-13 16:25 ` Marcelo Tosatti
  2023-03-14 12:25 ` [PATCH v5 00/12] fold per-CPU vmstats remotely Michal Hocko
  12 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Marcelo Tosatti

Large NUMA systems might have significant portions of system memory
trapped in pcp queues. The number of pcp lists is determined by the
number of processors and nodes in a system. A system with 4 processors
and 2 nodes has 8 pcps, which is okay. But a system with 1024
processors and 512 nodes has 512k pcps, with a high potential for a
large amount of memory being caught in them.
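
The arithmetic above can be made explicit (a back-of-the-envelope sketch; the function name is hypothetical, and it assumes one per_cpu_pages instance per (cpu, node) pair as in the changelog's counting):

```c
#include <assert.h>

/* Number of pcp instances: one per (cpu, node) pair. */
static long nr_pcps(long cpus, long nodes)
{
	return cpus * nodes;
}
```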

Enable remote node draining for the CONFIG_HAVE_CMPXCHG_LOCAL case,
where vmstat_shepherd will perform the aging and draining via
cpu_vm_stats_fold.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -928,7 +928,7 @@ static int refresh_cpu_vm_stats(bool do_
  * There cannot be any access by the offline cpu and therefore
  * synchronization is simplified.
  */
-void cpu_vm_stats_fold(int cpu)
+void cpu_vm_stats_fold(int cpu, bool do_pagesets)
 {
 	struct pglist_data *pgdat;
 	struct zone *zone;
@@ -938,6 +938,9 @@ void cpu_vm_stats_fold(int cpu)
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat *pzstats;
+#ifdef CONFIG_NUMA
+		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+#endif
 
 		pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
 
@@ -948,6 +951,11 @@ void cpu_vm_stats_fold(int cpu)
 				v = xchg(&pzstats->vm_stat_diff[i], 0);
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
+#ifdef CONFIG_NUMA
+				/* 3 seconds idle till flush */
+				if (do_pagesets)
+					pcp->expire = 3;
+#endif
 			}
 		}
 #ifdef CONFIG_NUMA
@@ -959,6 +967,38 @@ void cpu_vm_stats_fold(int cpu)
 				zone_numa_event_add(v, zone, i);
 			}
 		}
+
+		if (do_pagesets) {
+			cond_resched();
+			/*
+			 * Deal with draining the remote pageset of a
+			 * processor
+			 *
+			 * Check if there are pages remaining in this pageset
+			 * if not then there is nothing to expire.
+			 */
+			if (!pcp->expire || !pcp->count)
+				continue;
+
+			/*
+			 * We never drain zones local to this processor.
+			 */
+			if (zone_to_nid(zone) == cpu_to_node(cpu)) {
+				pcp->expire = 0;
+				continue;
+			}
+
+			WARN_ON(pcp->expire < 0);
+			/*
+			 * pcp->expire is only accessed from vmstat_shepherd context,
+			 * therefore no locking is required.
+			 */
+			if (--pcp->expire)
+				continue;
+
+			if (pcp->count)
+				drain_zone_pages(zone, pcp);
+		}
 #endif
 	}
 
@@ -2060,7 +2100,7 @@ static int refresh_all_vm_stats(void)
 
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
-		cpu_vm_stats_fold(cpu);
+		cpu_vm_stats_fold(cpu, true);
 		cond_resched();
 	}
 	cpus_read_unlock();
Index: linux-vmstat-remote/include/linux/vmstat.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/vmstat.h
+++ linux-vmstat-remote/include/linux/vmstat.h
@@ -291,7 +291,7 @@ extern void __dec_zone_state(struct zone
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
-void cpu_vm_stats_fold(int cpu);
+void cpu_vm_stats_fold(int cpu, bool do_pagesets);
 void refresh_zone_stat_thresholds(void);
 
 struct ctl_table;
Index: linux-vmstat-remote/mm/page_alloc.c
===================================================================
--- linux-vmstat-remote.orig/mm/page_alloc.c
+++ linux-vmstat-remote/mm/page_alloc.c
@@ -8629,7 +8629,7 @@ static int page_alloc_cpu_dead(unsigned
 	 * Zero the differential counters of the dead processor
 	 * so that the vm statistics are consistent.
 	 */
-	cpu_vm_stats_fold(cpu);
+	cpu_vm_stats_fold(cpu, false);
 
 	for_each_populated_zone(zone)
 		zone_pcp_update(zone, 0);




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 00/12] fold per-CPU vmstats remotely
  2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (11 preceding siblings ...)
  2023-03-13 16:25 ` [PATCH v5 12/12] vmstat: add pcp remote node draining via cpu_vm_stats_fold Marcelo Tosatti
@ 2023-03-14 12:25 ` Michal Hocko
  2023-03-14 12:59   ` Marcelo Tosatti
  12 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-03-14 12:25 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> This patch series addresses the following two problems:
> 
>     1. A customer provided some evidence which indicates that
>        the idle tick was stopped; albeit, CPU-specific vmstat
>        counters still remained populated.
> 
>        Thus one can only assume quiet_vmstat() was not
>        invoked on return to the idle loop. If I understand
>        correctly, I suspect this divergence might erroneously
>        prevent a reclaim attempt by kswapd. If the number of
>        zone specific free pages are below their per-cpu drift
>        value then zone_page_state_snapshot() is used to
>        compute a more accurate view of the aforementioned
>        statistic.  Thus any task blocked on the NUMA node
>        specific pfmemalloc_wait queue will be unable to make
>        significant progress via direct reclaim unless it is
>        killed after being woken up by kswapd
>        (see throttle_direct_reclaim())

I have a hard time following the actual problem described above. Are you
suggesting that a lack of pcp vmstat counter updates has led to
reclaim issues? What is the said "evidence"? Could you share more of the
story please?

>     2. With a SCHED_FIFO task that busy loops on a given CPU,
>        and kworker for that CPU at SCHED_OTHER priority,
>        queuing work to sync per-vmstats will either cause that
>        work to never execute, or stalld (i.e. stall daemon)
>        boosts kworker priority which causes a latency
>        violation

Why is that a problem? Out-of-sync stats shouldn't cause major problems.
Or can they?

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 00/12] fold per-CPU vmstats remotely
  2023-03-14 12:25 ` [PATCH v5 00/12] fold per-CPU vmstats remotely Michal Hocko
@ 2023-03-14 12:59   ` Marcelo Tosatti
  2023-03-14 13:00     ` Marcelo Tosatti
  2023-03-14 14:31     ` Michal Hocko
  0 siblings, 2 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-14 12:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Tue, Mar 14, 2023 at 01:25:53PM +0100, Michal Hocko wrote:
> On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> > This patch series addresses the following two problems:
> > 
> >     1. A customer provided some evidence which indicates that
> >        the idle tick was stopped; albeit, CPU-specific vmstat
> >        counters still remained populated.
> > 
> >        Thus one can only assume quiet_vmstat() was not
> >        invoked on return to the idle loop. If I understand
> >        correctly, I suspect this divergence might erroneously
> >        prevent a reclaim attempt by kswapd. If the number of
> >        zone specific free pages are below their per-cpu drift
> >        value then zone_page_state_snapshot() is used to
> >        compute a more accurate view of the aforementioned
> >        statistic.  Thus any task blocked on the NUMA node
> >        specific pfmemalloc_wait queue will be unable to make
> >        significant progress via direct reclaim unless it is
> >        killed after being woken up by kswapd
> >        (see throttle_direct_reclaim())
> 
> I have hard time to follow the actual problem described above. Are you
> suggesting that a lack of pcp vmstat counters update has led to
> reclaim issues? What is the said "evidence"? Could you share more of the
> story please?


  - The process was trapped in throttle_direct_reclaim().
    The function wait_event_killable() was called to wait for the
    condition allow_direct_reclaim(pgdat) to become true for the
    current node. allow_direct_reclaim(pgdat) examined the number of
    free pages on the node via zone_page_state(), which just returns
    the value in zone->vm_stat[NR_FREE_PAGES].

  - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
    However, the freelist on this node was not empty.

  - This inconsistency of the vmstat value was caused by percpu vmstat
    on nohz_full cpus. Every increment/decrement of vmstat is performed
    on a percpu vmstat counter first, then the pooled diffs are
    cumulated to the zone's vmstat counter in a timely manner. However,
    on nohz_full cpus (in the case of this customer's system, 48 of 52
    cpus) these pooled diffs were not cumulated once the cpu had no
    events on it, so the cpu started sleeping indefinitely.
    I checked percpu vmstat and found there were in total 69 counts
    not yet cumulated to the zone's vmstat counter.

  - In this situation, kswapd did not help the trapped process.
    In pgdat_balanced(), zone_watermark_ok_safe() examined the number
    of free pages on the node via zone_page_state_snapshot(), which
    checks the pending counts in percpu vmstat.
    Therefore kswapd correctly saw the 69 free pages.
    Since zone->_watermark = {8, 20, 32}, kswapd did not work because
    69 was greater than 32, the high watermark.

  - As a result:
    - The process waited for allow_direct_reclaim(pgdat) to become true.
      - allow_direct_reclaim() saw 0 via zone_page_state().
        It woke kswapd since 0 was lower than the min watermark.
    - kswapd did nothing.
      - kswapd saw 69 via zone_page_state_snapshot().
        It woke the waiters without performing memory reclaim,
        since 69 is greater than the high watermark.
    - The process woken by kswapd soon restarted waiting for kswapd.
      - allow_direct_reclaim() still saw 0 via zone_page_state().
        It woke kswapd since 0 was lower than the min watermark.
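
The divergence between the two accessors reduces to a few lines; here is an illustrative sketch, not kernel code (the view_* names are hypothetical stand-ins for zone_page_state() and zone_page_state_snapshot(), with the report's numbers hard-coded):

```c
#include <assert.h>

#define NR_CPUS 4

static long global_nr_free;   /* zone->vm_stat[NR_FREE_PAGES], stuck at 0 */
static long percpu_diff[NR_CPUS] = { 20, 19, 15, 15 };  /* 69 pending */

/* like zone_page_state(): reads only the global counter */
static long view_page_state(void)
{
	return global_nr_free;
}

/* like zone_page_state_snapshot(): also sums the pending per-CPU diffs */
static long view_page_state_snapshot(void)
{
	long v = global_nr_free;

	for (int i = 0; i < NR_CPUS; i++)
		v += percpu_diff[i];
	return v;
}
```

With these numbers the throttled task's check (against the min watermark of 8) fails forever, while kswapd's check (against the high watermark of 32) always passes, which is exactly the livelock described above.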

> 
> >     2. With a SCHED_FIFO task that busy loops on a given CPU,
> >        and kworker for that CPU at SCHED_OTHER priority,
> >        queuing work to sync per-vmstats will either cause that
> >        work to never execute, or stalld (i.e. stall daemon)
> >        boosts kworker priority which causes a latency
> >        violation
> 
> Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> Or can they?

Consider a SCHED_FIFO task that is polling the network queue (say,
testpmd).

	do {
		if (net_registers->state & DATA_AVAILABLE) {
			process_data();
		}
	} while (!stopped);

Since this task runs at SCHED_FIFO priority, the kworker won't
be scheduled to run (and therefore the per-CPU vmstats won't be
flushed to the global vmstats).

Or, if testpmd runs at SCHED_OTHER, then the work item to
flush per-CPU vmstats causes

	testpmd -> kworker
	kworker: flush per-CPU vmstats
	kworker -> testpmd

And this might cause undesired latencies to the packets being
processed by the testpmd task.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 00/12] fold per-CPU vmstats remotely
  2023-03-14 12:59   ` Marcelo Tosatti
@ 2023-03-14 13:00     ` Marcelo Tosatti
  2023-03-14 14:31     ` Michal Hocko
  1 sibling, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-14 13:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Tue, Mar 14, 2023 at 09:59:37AM -0300, Marcelo Tosatti wrote:
> > Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> > Or can they?
> 
> Consider SCHED_FIFO task that is polling the network queue (say
> testpmd).
> 
> 	do {
> 	 	if (net_registers->state & DATA_AVAILABLE) {
> 			process_data();
> 		}
> 	 } while (!stopped);
> 
> Since this task runs at SCHED_FIFO priority, kworker won't 
> be scheduled to run (therefore per-CPU vmstats won't be
> flushed to global vmstats). 
> 
> Or, if testpmd runs at SCHED_OTHER, then the work item to
> flush per-CPU vmstats causes
> 
> 	testpmd -> kworker
> 	kworker: flush per-CPU vmstats
> 	kworker -> testpmd
> 
> And this might cause undesired latencies to the packets being
> processed by the testpmd task.

This problem is unrelated to the kswapd problem, but both are addressed
by the patchset.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 00/12] fold per-CPU vmstats remotely
  2023-03-14 12:59   ` Marcelo Tosatti
  2023-03-14 13:00     ` Marcelo Tosatti
@ 2023-03-14 14:31     ` Michal Hocko
  2023-03-14 18:49       ` Marcelo Tosatti
  1 sibling, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-03-14 14:31 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Tue 14-03-23 09:59:37, Marcelo Tosatti wrote:
> On Tue, Mar 14, 2023 at 01:25:53PM +0100, Michal Hocko wrote:
> > On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> > > This patch series addresses the following two problems:
> > > 
> > >     1. A customer provided some evidence which indicates that
> > >        the idle tick was stopped; albeit, CPU-specific vmstat
> > >        counters still remained populated.
> > > 
> > >        Thus one can only assume quiet_vmstat() was not
> > >        invoked on return to the idle loop. If I understand
> > >        correctly, I suspect this divergence might erroneously
> > >        prevent a reclaim attempt by kswapd. If the number of
> > >        zone specific free pages are below their per-cpu drift
> > >        value then zone_page_state_snapshot() is used to
> > >        compute a more accurate view of the aforementioned
> > >        statistic.  Thus any task blocked on the NUMA node
> > >        specific pfmemalloc_wait queue will be unable to make
> > >        significant progress via direct reclaim unless it is
> > >        killed after being woken up by kswapd
> > >        (see throttle_direct_reclaim())
> > 
> > I have a hard time following the actual problem described above. Are you
> > suggesting that a lack of pcp vmstat counters update has led to
> > reclaim issues? What is the said "evidence"? Could you share more of the
> > story please?
> 
> 
>   - The process was trapped in throttle_direct_reclaim().
>     wait_event_killable() was called to wait for the condition
>     allow_direct_reclaim(pgdat) to become true for the current node.
>     allow_direct_reclaim(pgdat) examined the number of free pages
>     on the node via zone_page_state(), which just returns the value
>     in zone->vm_stat[NR_FREE_PAGES].
>
>   - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
>     However, the freelist on this node was not empty.
>
>   - This inconsistency of the vmstat value was caused by per-CPU
>     vmstat on nohz_full CPUs. Every increment/decrement of vmstat is
>     performed on a per-CPU vmstat counter first, then the pooled
>     diffs are accumulated into the zone's vmstat counter in a timely
>     manner. However, on nohz_full CPUs (48 of 52 CPUs on this
>     customer's system) these pooled diffs were not accumulated once
>     the CPU had no events on it, and the CPU slept indefinitely.
>     I checked the per-CPU vmstat and found a total of 69 counts not
>     yet accumulated into the zone's vmstat counter.
>
>   - In this situation, kswapd did not help the trapped process.
>     In pgdat_balanced(), zone_watermark_ok_safe() examined the number
>     of free pages on the node via zone_page_state_snapshot(), which
>     includes the pending counts in the per-CPU vmstat.
>     Therefore kswapd correctly saw that there were 69 free pages.
>     Since zone->_watermark = {8, 20, 32}, kswapd did not do any work
>     because 69 was greater than the high watermark of 32.

If the imprecision of allow_direct_reclaim is the underlying problem why
haven't you used zone_page_state_snapshot instead?

Anyway, this is kind of information that is really helpful to have in
the patch description.

[...]
> > >     2. With a SCHED_FIFO task that busy loops on a given CPU,
> > >        and kworker for that CPU at SCHED_OTHER priority,
> > >        queuing work to sync per-vmstats will either cause that
> > >        work to never execute, or stalld (i.e. stall daemon)
> > >        boosts kworker priority which causes a latency
> > >        violation
> > 
> > Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> > Or can they?
> 
> Consider SCHED_FIFO task that is polling the network queue (say
> testpmd).
> 
> 	do {
> 	 	if (net_registers->state & DATA_AVAILABLE) {
> 			process_data();
> 		}
> 	 } while (!stopped);
> 
> Since this task runs at SCHED_FIFO priority, kworker won't 
> be scheduled to run (therefore per-CPU vmstats won't be
> flushed to global vmstats). 

Yes, that is certainly possible. But my main point is that vmstat
imprecision shouldn't cause functional problems. That is why we have
_snapshot readers to get an exact value where it matters for
consistency.

> Or, if testpmd runs at SCHED_OTHER, then the work item to
> flush per-CPU vmstats causes
> 
> 	testpmd -> kworker
> 	kworker: flush per-CPU vmstats
> 	kworker -> testpmd
> 
> And this might cause undesired latencies to the packets being
> processed by the testpmd task.

Right, but can you have any latency expectations in a situation like
that?

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v5 00/12] fold per-CPU vmstats remotely
  2023-03-14 14:31     ` Michal Hocko
@ 2023-03-14 18:49       ` Marcelo Tosatti
  2023-03-14 21:01         ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-14 18:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Tue, Mar 14, 2023 at 03:31:21PM +0100, Michal Hocko wrote:
> On Tue 14-03-23 09:59:37, Marcelo Tosatti wrote:
> > On Tue, Mar 14, 2023 at 01:25:53PM +0100, Michal Hocko wrote:
> > > On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> > > > This patch series addresses the following two problems:
> > > > 
> > > >     1. A customer provided some evidence which indicates that
> > > >        the idle tick was stopped; albeit, CPU-specific vmstat
> > > >        counters still remained populated.
> > > > 
> > > >        Thus one can only assume quiet_vmstat() was not
> > > >        invoked on return to the idle loop. If I understand
> > > >        correctly, I suspect this divergence might erroneously
> > > >        prevent a reclaim attempt by kswapd. If the number of
> > > >        zone specific free pages is below their per-cpu drift
> > > >        value then zone_page_state_snapshot() is used to
> > > >        compute a more accurate view of the aforementioned
> > > >        statistic.  Thus any task blocked on the NUMA node
> > > >        specific pfmemalloc_wait queue will be unable to make
> > > >        significant progress via direct reclaim unless it is
> > > >        killed after being woken up by kswapd
> > > >        (see throttle_direct_reclaim())
> > > 
> > > I have a hard time following the actual problem described above. Are you
> > > suggesting that a lack of pcp vmstat counters update has led to
> > > reclaim issues? What is the said "evidence"? Could you share more of the
> > > story please?
> > 
> > 
> >   - The process was trapped in throttle_direct_reclaim().
> >     wait_event_killable() was called to wait for the condition
> >     allow_direct_reclaim(pgdat) to become true for the current node.
> >     allow_direct_reclaim(pgdat) examined the number of free pages
> >     on the node via zone_page_state(), which just returns the value
> >     in zone->vm_stat[NR_FREE_PAGES].
> >
> >   - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
> >     However, the freelist on this node was not empty.
> >
> >   - This inconsistency of the vmstat value was caused by per-CPU
> >     vmstat on nohz_full CPUs. Every increment/decrement of vmstat is
> >     performed on a per-CPU vmstat counter first, then the pooled
> >     diffs are accumulated into the zone's vmstat counter in a timely
> >     manner. However, on nohz_full CPUs (48 of 52 CPUs on this
> >     customer's system) these pooled diffs were not accumulated once
> >     the CPU had no events on it, and the CPU slept indefinitely.
> >     I checked the per-CPU vmstat and found a total of 69 counts not
> >     yet accumulated into the zone's vmstat counter.
> >
> >   - In this situation, kswapd did not help the trapped process.
> >     In pgdat_balanced(), zone_watermark_ok_safe() examined the number
> >     of free pages on the node via zone_page_state_snapshot(), which
> >     includes the pending counts in the per-CPU vmstat.
> >     Therefore kswapd correctly saw that there were 69 free pages.
> >     Since zone->_watermark = {8, 20, 32}, kswapd did not do any work
> >     because 69 was greater than the high watermark of 32.
> 
> If the imprecision of allow_direct_reclaim is the underlying problem why
> haven't you used zone_page_state_snapshot instead?

It might have dealt with problem #1 for this particular case. However,
looking at the callers of zone_page_state:

   5   2227  mm/compaction.c <<compaction_suitable>>
             zone_page_state(zone, NR_FREE_PAGES));
   6    124  mm/highmem.c <<__nr_free_highpages>>
             pages += zone_page_state(zone, NR_FREE_PAGES);
   7    283  mm/page-writeback.c <<node_dirtyable_memory>>
             nr_pages += zone_page_state(zone, NR_FREE_PAGES);
   8    318  mm/page-writeback.c <<highmem_dirtyable_memory>>
             nr_pages = zone_page_state(z, NR_FREE_PAGES);
   9    321  mm/page-writeback.c <<highmem_dirtyable_memory>>
             nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
  10    322  mm/page-writeback.c <<highmem_dirtyable_memory>>
             nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
  11   3091  mm/page_alloc.c <<__rmqueue>>
             zone_page_state(zone, NR_FREE_CMA_PAGES) >
  12   3092  mm/page_alloc.c <<__rmqueue>>
             zone_page_state(zone, NR_FREE_PAGES) / 2) {

The suggested patchset fixes the problem where, due to nohz_full,
the delayed timer for vmstat_work can be armed but never executed (which
means the per-CPU counters can stay out of sync for as long as the CPU
is idle in nohz_full mode).

You probably do not want to convert all callers of zone_page_state
into zone_page_state_snapshot (as a justification for the proposed
patchset).

> Anyway, this is kind of information that is really helpful to have in
> the patch description.

Agree: resending a new version with updated commit.

> [...]
> > > >     2. With a SCHED_FIFO task that busy loops on a given CPU,
> > > >        and kworker for that CPU at SCHED_OTHER priority,
> > > >        queuing work to sync per-vmstats will either cause that
> > > >        work to never execute, or stalld (i.e. stall daemon)
> > > >        boosts kworker priority which causes a latency
> > > >        violation
> > > 
> > > Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> > > Or can they?
> > 
> > Consider SCHED_FIFO task that is polling the network queue (say
> > testpmd).
> > 
> > 	do {
> > 	 	if (net_registers->state & DATA_AVAILABLE) {
> > 			process_data();
> > 		}
> > 	 } while (!stopped);
> > 
> > Since this task runs at SCHED_FIFO priority, kworker won't 
> > be scheduled to run (therefore per-CPU vmstats won't be
> > flushed to global vmstats). 
> 
> Yes, that is certainly possible. But my main point is that vmstat
> imprecision shouldn't cause functional problems. That is why we have
> _snapshot readers to get an exact value where it matters for
> consistency.

Understood. Perhaps allow_direct_reclaim should use
zone_page_state_snapshot, as otherwise it is only precise
at sysctl_stat_interval intervals?

> 
> > Or, if testpmd runs at SCHED_OTHER, then the work item to
> > flush per-CPU vmstats causes
> > 
> > 	testpmd -> kworker
> > 	kworker: flush per-CPU vmstats
> > 	kworker -> testpmd
> >
> > And this might cause undesired latencies to the packets being
> > processed by the testpmd task.

> Right, but can you have any latency expectations in a situation like
> that?

Not sure I understand what you mean. Example:

https://www.gabrieleara.it/assets/documents/papers/conferences/2021-ieee-nfv-sdn.pdf

In general, UDPDK exhibits a much lower
latency than the in-kernel UDP stack used through the POSIX
API (e.g., a 69 % reduction from 95 µs down to 29 µs), thanks
to its ability to bypass the kernel entirely, which in turn
outperforms the in-kernel TCP stack as available through the
POSIX API, as expected.
...
Alternatively, application processes can use UDPDK
with the non-blocking API calls (using the O_NONBLOCK flag)
and perform some other action while waiting for packets to
be ready to be sent/received to/from the UDPDK Process,
instead of performing continuous busy-loops on packet queues.
However, in this case the cost of a single CPU fully busy due
> to the UDPDK Process itself is anyway unavoidable.




* Re: [PATCH v5 00/12] fold per-CPU vmstats remotely
  2023-03-14 18:49       ` Marcelo Tosatti
@ 2023-03-14 21:01         ` Michal Hocko
  2023-03-15  0:29           ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-03-14 21:01 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Tue 14-03-23 15:49:09, Marcelo Tosatti wrote:
> On Tue, Mar 14, 2023 at 03:31:21PM +0100, Michal Hocko wrote:
> > On Tue 14-03-23 09:59:37, Marcelo Tosatti wrote:
> > > On Tue, Mar 14, 2023 at 01:25:53PM +0100, Michal Hocko wrote:
> > > > On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> > > > > This patch series addresses the following two problems:
> > > > > 
> > > > >     1. A customer provided some evidence which indicates that
> > > > >        the idle tick was stopped; albeit, CPU-specific vmstat
> > > > >        counters still remained populated.
> > > > > 
> > > > >        Thus one can only assume quiet_vmstat() was not
> > > > >        invoked on return to the idle loop. If I understand
> > > > >        correctly, I suspect this divergence might erroneously
> > > > >        prevent a reclaim attempt by kswapd. If the number of
> > > > >        zone specific free pages is below their per-cpu drift
> > > > >        value then zone_page_state_snapshot() is used to
> > > > >        compute a more accurate view of the aforementioned
> > > > >        statistic.  Thus any task blocked on the NUMA node
> > > > >        specific pfmemalloc_wait queue will be unable to make
> > > > >        significant progress via direct reclaim unless it is
> > > > >        killed after being woken up by kswapd
> > > > >        (see throttle_direct_reclaim())
> > > > 
> > > > I have a hard time following the actual problem described above. Are you
> > > > suggesting that a lack of pcp vmstat counters update has led to
> > > > reclaim issues? What is the said "evidence"? Could you share more of the
> > > > story please?
> > > 
> > > 
> > >   - The process was trapped in throttle_direct_reclaim().
> > >     wait_event_killable() was called to wait for the condition
> > >     allow_direct_reclaim(pgdat) to become true for the current node.
> > >     allow_direct_reclaim(pgdat) examined the number of free pages
> > >     on the node via zone_page_state(), which just returns the value
> > >     in zone->vm_stat[NR_FREE_PAGES].
> > >
> > >   - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
> > >     However, the freelist on this node was not empty.
> > >
> > >   - This inconsistency of the vmstat value was caused by per-CPU
> > >     vmstat on nohz_full CPUs. Every increment/decrement of vmstat is
> > >     performed on a per-CPU vmstat counter first, then the pooled
> > >     diffs are accumulated into the zone's vmstat counter in a timely
> > >     manner. However, on nohz_full CPUs (48 of 52 CPUs on this
> > >     customer's system) these pooled diffs were not accumulated once
> > >     the CPU had no events on it, and the CPU slept indefinitely.
> > >     I checked the per-CPU vmstat and found a total of 69 counts not
> > >     yet accumulated into the zone's vmstat counter.
> > >
> > >   - In this situation, kswapd did not help the trapped process.
> > >     In pgdat_balanced(), zone_watermark_ok_safe() examined the number
> > >     of free pages on the node via zone_page_state_snapshot(), which
> > >     includes the pending counts in the per-CPU vmstat.
> > >     Therefore kswapd correctly saw that there were 69 free pages.
> > >     Since zone->_watermark = {8, 20, 32}, kswapd did not do any work
> > >     because 69 was greater than the high watermark of 32.
> > 
> > If the imprecision of allow_direct_reclaim is the underlying problem why
> > haven't you used zone_page_state_snapshot instead?
> 
> It might have dealt with problem #1 for this particular case. However,
> looking at the callers of zone_page_state:
> 
>    5   2227  mm/compaction.c <<compaction_suitable>>
>              zone_page_state(zone, NR_FREE_PAGES));
>    6    124  mm/highmem.c <<__nr_free_highpages>>
>              pages += zone_page_state(zone, NR_FREE_PAGES);
>    7    283  mm/page-writeback.c <<node_dirtyable_memory>>
>              nr_pages += zone_page_state(zone, NR_FREE_PAGES);
>    8    318  mm/page-writeback.c <<highmem_dirtyable_memory>>
>              nr_pages = zone_page_state(z, NR_FREE_PAGES);
>    9    321  mm/page-writeback.c <<highmem_dirtyable_memory>>
>              nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
>   10    322  mm/page-writeback.c <<highmem_dirtyable_memory>>
>              nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
>   11   3091  mm/page_alloc.c <<__rmqueue>>
>              zone_page_state(zone, NR_FREE_CMA_PAGES) >
>   12   3092  mm/page_alloc.c <<__rmqueue>>
>              zone_page_state(zone, NR_FREE_PAGES) / 2) {
> 
> The suggested patchset fixes the problem where, due to nohz_full,
> the delayed timer for vmstat_work can be armed but never executed (which
> means the per-CPU counters can stay out of sync for as long as the CPU
> is idle in nohz_full mode).
> 
> You probably do not want to convert all callers of zone_page_state
> into zone_page_state_snapshot (as a justification for the proposed
> patchset).

Yes, I do not really think we want or even need to convert all of them.
But it seems that your direct reclaim throttling example really requires
that. The thing with the remote flushing is that it would suffer from
similar imprecision, as the flushing could be deferred and, under
certain conditions, really starved. So it is definitely worth fixing the
issue you are seeing without such a complex scheme.

> > Anyway, this is kind of information that is really helpful to have in
> > the patch description.
> 
> Agree: resending a new version with updated commit.

I would really recommend trying out the simple fix and see if it changes
the behavior.

> > [...]
> > > > >     2. With a SCHED_FIFO task that busy loops on a given CPU,
> > > > >        and kworker for that CPU at SCHED_OTHER priority,
> > > > >        queuing work to sync per-vmstats will either cause that
> > > > >        work to never execute, or stalld (i.e. stall daemon)
> > > > >        boosts kworker priority which causes a latency
> > > > >        violation
> > > > 
> > > > Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> > > > Or can they?
> > > 
> > > Consider SCHED_FIFO task that is polling the network queue (say
> > > testpmd).
> > > 
> > > 	do {
> > > 	 	if (net_registers->state & DATA_AVAILABLE) {
> > > 			process_data();
> > > 		}
> > > 	 } while (!stopped);
> > > 
> > > Since this task runs at SCHED_FIFO priority, kworker won't 
> > > be scheduled to run (therefore per-CPU vmstats won't be
> > > flushed to global vmstats). 
> > 
> > Yes, that is certainly possible. But my main point is that vmstat
> > imprecision shouldn't cause functional problems. That is why we have
> > _snapshot readers to get an exact value where it matters for
> > consistency.
> 
> Understood. Perhaps allow_direct_reclaim should use
> zone_page_state_snapshot, as otherwise it is only precise
> at sysctl_stat_interval intervals?

or even much less than that. The flusher uses WQ infrastructure and even
when a WQ_MEM_RECLAIM one is used this doesn't mean that the workers
cannot be jammed.
 
> > > Or, if testpmd runs at SCHED_OTHER, then the work item to
> > > flush per-CPU vmstats causes
> > > 
> > > 	testpmd -> kworker
> > > 	kworker: flush per-CPU vmstats
> > > 	kworker -> testpmd
> > >
> > > And this might cause undesired latencies to the packets being
> > > processed by the testpmd task.
> 
> > Right, but can you have any latency expectations in a situation like
> > that?
> 
> Not sure I understand what you mean. Example:
> 
> https://www.gabrieleara.it/assets/documents/papers/conferences/2021-ieee-nfv-sdn.pdf
> 
> In general, UDPDK exhibits a much lower
> latency than the in-kernel UDP stack used through the POSIX
> API (e.g., a 69 % reduction from 95 µs down to 29 µs), thanks
> to its ability to bypass the kernel entirely, which in turn
> outperforms the in-kernel TCP stack as available through the
> POSIX API, as expected.
> ...
> Alternatively, application processes can use UDPDK
> with the non-blocking API calls (using the O_NONBLOCK flag)
> and perform some other action while waiting for packets to
> be ready to be sent/received to/from the UDPDK Process,
> instead of performing continuous busy-loops on packet queues.
> However, in this case the cost of a single CPU fully busy due
> to the UDPDK Process itself is anyway unavoidable.

If the userspace workload avoids the kernel completely then it is quite
unlikely that there is any pcp work to be flushed for in-kernel
counters.

That being said, I am not saying remote flushing is not useful. I just
think that the issue you are reporting here could be fixed by a much
simpler fix that doesn't change the way how the flushing is performed.
Such a large rework should be justified by performance numbers. It
should also be explained how we end up doing a lot of work on
isolated CPUs or for a pure user space workload.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v5 00/12] fold per-CPU vmstats remotely
  2023-03-14 21:01         ` Michal Hocko
@ 2023-03-15  0:29           ` Marcelo Tosatti
  0 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2023-03-15  0:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Tue, Mar 14, 2023 at 10:01:06PM +0100, Michal Hocko wrote:
> On Tue 14-03-23 15:49:09, Marcelo Tosatti wrote:
> > On Tue, Mar 14, 2023 at 03:31:21PM +0100, Michal Hocko wrote:
> > > On Tue 14-03-23 09:59:37, Marcelo Tosatti wrote:
> > > > On Tue, Mar 14, 2023 at 01:25:53PM +0100, Michal Hocko wrote:
> > > > > On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> > > > > > This patch series addresses the following two problems:
> > > > > > 
> > > > > >     1. A customer provided some evidence which indicates that
> > > > > >        the idle tick was stopped; albeit, CPU-specific vmstat
> > > > > >        counters still remained populated.
> > > > > > 
> > > > > >        Thus one can only assume quiet_vmstat() was not
> > > > > >        invoked on return to the idle loop. If I understand
> > > > > >        correctly, I suspect this divergence might erroneously
> > > > > >        prevent a reclaim attempt by kswapd. If the number of
> > > > > >        zone specific free pages is below their per-cpu drift
> > > > > >        value then zone_page_state_snapshot() is used to
> > > > > >        compute a more accurate view of the aforementioned
> > > > > >        statistic.  Thus any task blocked on the NUMA node
> > > > > >        specific pfmemalloc_wait queue will be unable to make
> > > > > >        significant progress via direct reclaim unless it is
> > > > > >        killed after being woken up by kswapd
> > > > > >        (see throttle_direct_reclaim())
> > > > > 
> > > > > I have a hard time following the actual problem described above. Are you
> > > > > suggesting that a lack of pcp vmstat counters update has led to
> > > > > reclaim issues? What is the said "evidence"? Could you share more of the
> > > > > story please?
> > > > 
> > > > 
> > > >   - The process was trapped in throttle_direct_reclaim().
> > > >     wait_event_killable() was called to wait for the condition
> > > >     allow_direct_reclaim(pgdat) to become true for the current node.
> > > >     allow_direct_reclaim(pgdat) examined the number of free pages
> > > >     on the node via zone_page_state(), which just returns the value
> > > >     in zone->vm_stat[NR_FREE_PAGES].
> > > >
> > > >   - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
> > > >     However, the freelist on this node was not empty.
> > > >
> > > >   - This inconsistency of the vmstat value was caused by per-CPU
> > > >     vmstat on nohz_full CPUs. Every increment/decrement of vmstat is
> > > >     performed on a per-CPU vmstat counter first, then the pooled
> > > >     diffs are accumulated into the zone's vmstat counter in a timely
> > > >     manner. However, on nohz_full CPUs (48 of 52 CPUs on this
> > > >     customer's system) these pooled diffs were not accumulated once
> > > >     the CPU had no events on it, and the CPU slept indefinitely.
> > > >     I checked the per-CPU vmstat and found a total of 69 counts not
> > > >     yet accumulated into the zone's vmstat counter.
> > > >
> > > >   - In this situation, kswapd did not help the trapped process.
> > > >     In pgdat_balanced(), zone_watermark_ok_safe() examined the number
> > > >     of free pages on the node via zone_page_state_snapshot(), which
> > > >     includes the pending counts in the per-CPU vmstat.
> > > >     Therefore kswapd correctly saw that there were 69 free pages.
> > > >     Since zone->_watermark = {8, 20, 32}, kswapd did not do any work
> > > >     because 69 was greater than the high watermark of 32.
> > > 
> > > If the imprecision of allow_direct_reclaim is the underlying problem why
> > > haven't you used zone_page_state_snapshot instead?
> > 
> > It might have dealt with problem #1 for this particular case. However,
> > looking at the callers of zone_page_state:
> > 
> >    5   2227  mm/compaction.c <<compaction_suitable>>
> >              zone_page_state(zone, NR_FREE_PAGES));
> >    6    124  mm/highmem.c <<__nr_free_highpages>>
> >              pages += zone_page_state(zone, NR_FREE_PAGES);
> >    7    283  mm/page-writeback.c <<node_dirtyable_memory>>
> >              nr_pages += zone_page_state(zone, NR_FREE_PAGES);
> >    8    318  mm/page-writeback.c <<highmem_dirtyable_memory>>
> >              nr_pages = zone_page_state(z, NR_FREE_PAGES);
> >    9    321  mm/page-writeback.c <<highmem_dirtyable_memory>>
> >              nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
> >   10    322  mm/page-writeback.c <<highmem_dirtyable_memory>>
> >              nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
> >   11   3091  mm/page_alloc.c <<__rmqueue>>
> >              zone_page_state(zone, NR_FREE_CMA_PAGES) >
> >   12   3092  mm/page_alloc.c <<__rmqueue>>
> >              zone_page_state(zone, NR_FREE_PAGES) / 2) {
> > 
> > The suggested patchset fixes the problem where, due to nohz_full,
> > the delayed timer for vmstat_work can be armed but never executed (which
> > means the per-CPU counters can stay out of sync for as long as the CPU
> > is idle in nohz_full mode).
> > 
> > You probably do not want to convert all callers of zone_page_state
> > into zone_page_state_snapshot (as a justification for the proposed
> > patchset).
> 
> Yes, I do not really think we want or even need to convert all of them.

OK.

> But it seems that your direct reclaim throttling example really requires
> that. The thing with the remote flushing is that it would suffer from
> similar imprecision, as the flushing could be deferred and, under
> certain conditions, really starved.

> So it is definitely worth fixing the
> issue you are seeing without such a complex scheme.

The scheme is necessary for other reasons.

> > > Anyway, this is kind of information that is really helpful to have in
> > > the patch description.
> > 
> > Agree: resending a new version with updated commit.
> 
> I would really recommend trying out the simple fix and see if it changes
> the behavior.
>
> > > [...]
> > > > > >     2. With a SCHED_FIFO task that busy loops on a given CPU,
> > > > > >        and kworker for that CPU at SCHED_OTHER priority,
> > > > > >        queuing work to sync per-vmstats will either cause that
> > > > > >        work to never execute, or stalld (i.e. stall daemon)
> > > > > >        boosts kworker priority which causes a latency
> > > > > >        violation
> > > > > 
> > > > > Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> > > > > Or can they?
> > > > 
> > > > Consider SCHED_FIFO task that is polling the network queue (say
> > > > testpmd).
> > > > 
> > > > 	do {
> > > > 	 	if (net_registers->state & DATA_AVAILABLE) {
> > > > 			process_data();
> > > > 		}
> > > > 	 } while (!stopped);
> > > > 
> > > > Since this task runs at SCHED_FIFO priority, kworker won't 
> > > > be scheduled to run (therefore per-CPU vmstats won't be
> > > > flushed to global vmstats). 
> > > 
> > > Yes, that is certainly possible. But my main point is that vmstat
> > > imprecision shouldn't cause functional problems. That is why we have
> > > _snapshot readers to get an exact value where it matters for
> > > consistency.
> > 
> > Understood. Perhaps allow_direct_reclaim should use
> > zone_page_state_snapshot, as otherwise it is only precise
> > at sysctl_stat_interval intervals?
> 
> or even much less than that. The flusher uses WQ infrastructure and even
> when a WQ_MEM_RECLAIM one is used this doesn't mean that the workers
> cannot be jammed.
>  
> > > > Or, if testpmd runs at SCHED_OTHER, then the work item to
> > > > flush per-CPU vmstats causes
> > > > 
> > > > 	testpmd -> kworker
> > > > 	kworker: flush per-CPU vmstats
> > > > 	kworker -> testpmd
> > > >
> > > > And this might cause undesired latencies to the packets being
> > > > processed by the testpmd task.
> > >
> > > Right, but can you have any latency expectations in a situation like
> > > that?
> > 
> > Not sure I understand what you mean. Example:
> > 
> > https://www.gabrieleara.it/assets/documents/papers/conferences/2021-ieee-nfv-sdn.pdf
> > 
> > In general, UDPDK exhibits a much lower
> > latency than the in-kernel UDP stack used through the POSIX
> > API (e.g., a 69 % reduction from 95 µs down to 29 µs), thanks
> > to its ability to bypass the kernel entirely, which in turn
> > outperforms the in-kernel TCP stack as available through the
> > POSIX API, as expected.
> > ...
> > Alternatively, application processes can use UDPDK
> > with the non-blocking API calls (using the O_NONBLOCK flag)
> > and perform some other action while waiting for packets to
> > be ready to be sent/received to/from the UDPDK Process,
> > instead of performing continuous busy-loops on packet queues.
> > However, in this case the cost of a single CPU fully busy due
> > to the UDPDK Process itself is anyway unavoidable.
> 
> If the userspace workload avoids the kernel completely then it is quite
> unlikely that there is any pcp work to be flushed for in-kernel
> counters.

This particular workload avoids the kernel. Others (where latency is
still a concern) don't.

> That being said, I am not saying remote flushing is not useful.

> I just think that the issue you are reporting here could be fixed by
> a much simpler fix that doesn't change the way how the flushing is
> performed.

OK. Must change flushing anyway, but fixing allow_direct_reclaim to 
use zone_page_state_snapshot won't hurt.

> Such a large rework should be justified by performance numbers.

OK.

> It should also be explained how we end up doing a lot of work on
> isolated CPUs or for a pure user space workload.

Again, a pure user space workload is one example where latency matters, in
response to the "can you have any latency expectations in a situation like
that?" question.

Will resend -v8 with allow_direct_reclaim fix.




end of thread, other threads:[~2023-03-15 14:23 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-13 16:25 [PATCH v5 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 01/12] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 02/12] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 03/12] this_cpu_cmpxchg: S390: " Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 04/12] this_cpu_cmpxchg: x86: " Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 05/12] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 06/12] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 07/12] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 08/12] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 09/12] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 10/12] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 11/12] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
2023-03-13 16:25 ` [PATCH v5 12/12] vmstat: add pcp remote node draining via cpu_vm_stats_fold Marcelo Tosatti
2023-03-14 12:25 ` [PATCH v5 00/12] fold per-CPU vmstats remotely Michal Hocko
2023-03-14 12:59   ` Marcelo Tosatti
2023-03-14 13:00     ` Marcelo Tosatti
2023-03-14 14:31     ` Michal Hocko
2023-03-14 18:49       ` Marcelo Tosatti
2023-03-14 21:01         ` Michal Hocko
2023-03-15  0:29           ` Marcelo Tosatti
