* [PATCH v3 00/11] fold per-CPU vmstats remotely
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86
This patch series addresses the following two problems:
1. A customer provided evidence indicating that the idle tick
was stopped, yet CPU-specific vmstat counters remained
populated. One can only assume quiet_vmstat() was not
invoked on return to the idle loop. If I understand
correctly, I suspect this divergence might erroneously
prevent a reclaim attempt by kswapd. If the number of
zone-specific free pages is below the per-cpu drift
value, then zone_page_state_snapshot() is used to
compute a more accurate view of the aforementioned
statistic. Thus any task blocked on the NUMA-node-specific
pfmemalloc_wait queue will be unable to make
significant progress via direct reclaim unless it is
killed after being woken up by kswapd
(see throttle_direct_reclaim()).
2. With a SCHED_FIFO task that busy loops on a given CPU,
and the kworker for that CPU at SCHED_OTHER priority,
queuing work to sync the per-CPU vmstats will either cause
that work never to execute, or stalld (i.e. the stall daemon)
will boost the kworker priority, which causes a latency
violation.
Both problems are addressed by having vmstat_shepherd flush the
per-CPU counters to the global counters from remote CPUs.
This is done using cmpxchg to manipulate the counters,
both CPU-locally (via the account functions)
and remotely (via cpu_vm_stats_fold).
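In rough terms, the scheme looks like the sketch below. This is a
simplified illustration of what patches 08-10 do in mod_zone_state()
and cpu_vm_stats_fold(); the threshold/overstep handling and the
node/NUMA counters are omitted, and the helper names here are
illustrative only, not part of the series:

	/* CPU-local update path (the account functions): */
	static void account_delta(s8 __percpu *p, long delta)
	{
		long o, n;

		do {
			o = this_cpu_read(*p);
			n = o + delta;
		} while (this_cpu_cmpxchg(*p, o, n) != o);	/* fully atomic */
	}

	/* Remote fold path (vmstat_shepherd -> cpu_vm_stats_fold()): */
	static void fold_counter(struct zone *zone, enum zone_stat_item item,
				 s8 __percpu *p, int cpu)
	{
		int v;

		/* atomically grab-and-clear the remote CPU's diff ... */
		v = xchg(per_cpu_ptr(p, cpu), 0);
		/* ... and move it into the global counter */
		atomic_long_add(v, &zone->vm_stat[item]);
	}

Because both sides use atomic operations on the same per-CPU diff,
an update on the target CPU can no longer race with the remote fold.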
Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.
v3:
- Removed unused drain_zone_pages and the changes variable (David Hildenbrand)
- Use xchg instead of cmpxchg in refresh_cpu_vm_stats (Peter Xu)
- Add drain_all_pages to vmstat_refresh to make
stats more accurate (Peter Xu)
- Improve changelog of
"mm/vmstat: switch counter modification to cmpxchg" (Peter Xu / David)
- Improve changelog of
"mm/vmstat: remove remote node draining" (David Hildenbrand)
v2:
- actually use LOCK CMPXCHG on counter mod/inc/dec functions
(Christoph Lameter)
- use try_cmpxchg for cmpxchg loops
(Uros Bizjak / Matthew Wilcox)
arch/arm64/include/asm/percpu.h | 16 ++
arch/loongarch/include/asm/percpu.h | 23 +++-
arch/s390/include/asm/percpu.h | 5
arch/x86/include/asm/percpu.h | 39 +++----
include/asm-generic/percpu.h | 17 +++
include/linux/mmzone.h | 3
include/linux/percpu-defs.h | 2
kernel/fork.c | 2
kernel/scs.c | 2
mm/page_alloc.c | 23 ----
mm/vmstat.c | 424 +++++++++++++++++++++++++++++++++++++++++------------------------------------
11 files changed, 307 insertions(+), 249 deletions(-)
* [PATCH v3 01/11] mm/vmstat: remove remote node draining
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
David Hildenbrand, Marcelo Tosatti
Draining of pages from the local pcp for a remote zone should not be
necessary, since once the system is low on memory (or compaction on a
zone is in effect), drain_all_pages() should be called, freeing any
unused pcps.
For reference, the original commit which introduced remote node
draining is 4037d452202e34214e8a939fa5621b2b3bbb45b7.
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-vmstat-remote/include/linux/mmzone.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/mmzone.h
+++ linux-vmstat-remote/include/linux/mmzone.h
@@ -679,9 +679,6 @@ struct per_cpu_pages {
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
short free_factor; /* batch scaling factor during free */
-#ifdef CONFIG_NUMA
- short expire; /* When 0, remote pagesets are drained */
-#endif
/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[NR_PCP_LISTS];
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -803,20 +803,16 @@ static int fold_diff(int *zone_diff, int
*
* The function returns the number of global counters updated.
*/
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static int refresh_cpu_vm_stats(void)
{
struct pglist_data *pgdat;
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
- int changes = 0;
for_each_populated_zone(zone) {
struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-#ifdef CONFIG_NUMA
- struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
-#endif
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
int v;
@@ -826,44 +822,8 @@ static int refresh_cpu_vm_stats(bool do_
atomic_long_add(v, &zone->vm_stat[i]);
global_zone_diff[i] += v;
-#ifdef CONFIG_NUMA
- /* 3 seconds idle till flush */
- __this_cpu_write(pcp->expire, 3);
-#endif
}
}
-#ifdef CONFIG_NUMA
-
- if (do_pagesets) {
- cond_resched();
- /*
- * Deal with draining the remote pageset of this
- * processor
- *
- * Check if there are pages remaining in this pageset
- * if not then there is nothing to expire.
- */
- if (!__this_cpu_read(pcp->expire) ||
- !__this_cpu_read(pcp->count))
- continue;
-
- /*
- * We never drain zones local to this processor.
- */
- if (zone_to_nid(zone) == numa_node_id()) {
- __this_cpu_write(pcp->expire, 0);
- continue;
- }
-
- if (__this_cpu_dec_return(pcp->expire))
- continue;
-
- if (__this_cpu_read(pcp->count)) {
- drain_zone_pages(zone, this_cpu_ptr(pcp));
- changes++;
- }
- }
-#endif
}
for_each_online_pgdat(pgdat) {
@@ -880,8 +840,7 @@ static int refresh_cpu_vm_stats(bool do_
}
}
- changes += fold_diff(global_zone_diff, global_node_diff);
- return changes;
+ return fold_diff(global_zone_diff, global_node_diff);
}
/*
@@ -1867,7 +1826,7 @@ int sysctl_stat_interval __read_mostly =
#ifdef CONFIG_PROC_FS
static void refresh_vm_stats(struct work_struct *work)
{
- refresh_cpu_vm_stats(true);
+ refresh_cpu_vm_stats();
}
int vmstat_refresh(struct ctl_table *table, int write,
@@ -1877,6 +1836,8 @@ int vmstat_refresh(struct ctl_table *tab
int err;
int i;
+ drain_all_pages(NULL);
+
/*
* The regular update, every sysctl_stat_interval, may come later
* than expected: leaving a significant amount in per_cpu buckets.
@@ -1931,7 +1892,7 @@ int vmstat_refresh(struct ctl_table *tab
static void vmstat_update(struct work_struct *w)
{
- if (refresh_cpu_vm_stats(true)) {
+ if (refresh_cpu_vm_stats()) {
/*
* Counters were updated so we expect more updates
* to occur in the future. Keep on running the
@@ -1994,7 +1955,7 @@ void quiet_vmstat(void)
* it would be too expensive from this path.
* vmstat_shepherd will take care about that for us.
*/
- refresh_cpu_vm_stats(false);
+ refresh_cpu_vm_stats();
}
/*
Index: linux-vmstat-remote/mm/page_alloc.c
===================================================================
--- linux-vmstat-remote.orig/mm/page_alloc.c
+++ linux-vmstat-remote/mm/page_alloc.c
@@ -3176,26 +3176,6 @@ static int rmqueue_bulk(struct zone *zon
return allocated;
}
-#ifdef CONFIG_NUMA
-/*
- * Called from the vmstat counter updater to drain pagesets of this
- * currently executing processor on remote nodes after they have
- * expired.
- */
-void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
-{
- int to_drain, batch;
-
- batch = READ_ONCE(pcp->batch);
- to_drain = min(pcp->count, batch);
- if (to_drain > 0) {
- spin_lock(&pcp->lock);
- free_pcppages_bulk(zone, to_drain, pcp, 0);
- spin_unlock(&pcp->lock);
- }
-}
-#endif
-
/*
* Drain pcplists of the indicated processor and zone.
*/
* [PATCH v3 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
Marcelo Tosatti
The goal is to have vmstat_shepherd transfer from
per-CPU counters to global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.
Following the kernel convention for cmpxchg/cmpxchg_local,
change arm64's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.
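To illustrate the intended split between the two variants (an
illustrative sketch only: the per-CPU variables and functions below
are hypothetical, and this_cpu_cmpxchg_local() is the generic API
added later in this series):

	static DEFINE_PER_CPU(long, local_only_ctr);		/* hypothetical */
	static DEFINE_PER_CPU(long, remote_folded_ctr);		/* hypothetical */

	static void bump_local_only(void)
	{
		long o, n;

		/* only this CPU ever touches it: the relaxed/_local form
		 * is enough, it only protects against local preemption */
		do {
			o = this_cpu_read(local_only_ctr);
			n = o + 1;
		} while (this_cpu_cmpxchg_local(local_only_ctr, o, n) != o);
	}

	static void bump_remote_folded(void)
	{
		long o, n;

		/* may also be modified from another CPU (e.g. folded
		 * remotely): the fully atomic form is required */
		do {
			o = this_cpu_read(remote_folded_ctr);
			n = o + 1;
		} while (this_cpu_cmpxchg(remote_folded_ctr, o, n) != o);
	}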
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
+++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
@@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
_pcp_protect_return(xchg_relaxed, pcp, val)
#define this_cpu_cmpxchg_1(pcp, o, n) \
- _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+ _pcp_protect_return(cmpxchg, pcp, o, n)
#define this_cpu_cmpxchg_2(pcp, o, n) \
- _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+ _pcp_protect_return(cmpxchg, pcp, o, n)
#define this_cpu_cmpxchg_4(pcp, o, n) \
- _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+ _pcp_protect_return(cmpxchg, pcp, o, n)
#define this_cpu_cmpxchg_8(pcp, o, n) \
+ _pcp_protect_return(cmpxchg, pcp, o, n)
+
+#define this_cpu_cmpxchg_local_1(pcp, o, n) \
_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_2(pcp, o, n) \
+ _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_4(pcp, o, n) \
+ _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_8(pcp, o, n) \
+ _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+
#ifdef __KVM_NVHE_HYPERVISOR__
extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);
* [PATCH v3 03/11] this_cpu_cmpxchg: loongarch: switch this_cpu_cmpxchg to locked, add _local function
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
Marcelo Tosatti
The goal is to have vmstat_shepherd transfer from
per-CPU counters to global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.
Following the kernel convention for cmpxchg/cmpxchg_local,
change LoongArch's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/loongarch/include/asm/percpu.h
+++ linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
@@ -150,6 +150,16 @@ static inline unsigned long __percpu_xch
}
/* this_cpu_cmpxchg */
+#define _protect_cmpxchg(pcp, o, n) \
+({ \
+ typeof(*raw_cpu_ptr(&(pcp))) __ret; \
+ preempt_disable_notrace(); \
+ __ret = cmpxchg(raw_cpu_ptr(&(pcp)), o, n); \
+ preempt_enable_notrace(); \
+ __ret; \
+})
+
+/* this_cpu_cmpxchg_local */
#define _protect_cmpxchg_local(pcp, o, n) \
({ \
typeof(*raw_cpu_ptr(&(pcp))) __ret; \
@@ -222,10 +232,15 @@ do { \
#define this_cpu_xchg_4(pcp, val) _percpu_xchg(pcp, val)
#define this_cpu_xchg_8(pcp, val) _percpu_xchg(pcp, val)
-#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+
+#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg(ptr, o, n)
#include <asm-generic/percpu.h>
* [PATCH v3 04/11] this_cpu_cmpxchg: S390: switch this_cpu_cmpxchg to locked, add _local function
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
Marcelo Tosatti
The goal is to have vmstat_shepherd transfer from
per-CPU counters to global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.
Following the kernel convention for cmpxchg/cmpxchg_local,
add this_cpu_cmpxchg_local_ helpers to s390.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-vmstat-remote/arch/s390/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/s390/include/asm/percpu.h
+++ linux-vmstat-remote/arch/s390/include/asm/percpu.h
@@ -148,6 +148,11 @@
#define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
#define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+
#define arch_this_cpu_xchg(pcp, nval) \
({ \
typeof(pcp) *ptr__; \
* [PATCH v3 05/11] this_cpu_cmpxchg: x86: switch this_cpu_cmpxchg to locked, add _local function
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
Marcelo Tosatti
The goal is to have vmstat_shepherd transfer from
per-CPU counters to global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.
Following the kernel convention for cmpxchg/cmpxchg_local,
change x86's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-vmstat-remote/arch/x86/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/x86/include/asm/percpu.h
+++ linux-vmstat-remote/arch/x86/include/asm/percpu.h
@@ -197,11 +197,11 @@ do { \
* cmpxchg has no such implied lock semantics as a result it is much
* more efficient for cpu local operations.
*/
-#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval) \
+#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval, lockp) \
({ \
__pcpu_type_##size pco_old__ = __pcpu_cast_##size(_oval); \
__pcpu_type_##size pco_new__ = __pcpu_cast_##size(_nval); \
- asm qual (__pcpu_op2_##size("cmpxchg", "%[nval]", \
+ asm qual (__pcpu_op2_##size(lockp "cmpxchg", "%[nval]", \
__percpu_arg([var])) \
: [oval] "+a" (pco_old__), \
[var] "+m" (_var) \
@@ -279,16 +279,20 @@ do { \
#define raw_cpu_add_return_1(pcp, val) percpu_add_return_op(1, , pcp, val)
#define raw_cpu_add_return_2(pcp, val) percpu_add_return_op(2, , pcp, val)
#define raw_cpu_add_return_4(pcp, val) percpu_add_return_op(4, , pcp, val)
-#define raw_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(1, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(2, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(4, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(1, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(2, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(4, , pcp, oval, nval, "")
#define this_cpu_add_return_1(pcp, val) percpu_add_return_op(1, volatile, pcp, val)
#define this_cpu_add_return_2(pcp, val) percpu_add_return_op(2, volatile, pcp, val)
#define this_cpu_add_return_4(pcp, val) percpu_add_return_op(4, volatile, pcp, val)
-#define this_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(1, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(2, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(4, volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) percpu_cmpxchg_op(1, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) percpu_cmpxchg_op(2, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) percpu_cmpxchg_op(4, volatile, pcp, oval, nval, "")
+
+#define this_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(1, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(2, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(4, volatile, pcp, oval, nval, LOCK_PREFIX)
#ifdef CONFIG_X86_CMPXCHG64
#define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2) \
@@ -319,16 +323,17 @@ do { \
#define raw_cpu_or_8(pcp, val) percpu_to_op(8, , "or", (pcp), val)
#define raw_cpu_add_return_8(pcp, val) percpu_add_return_op(8, , pcp, val)
#define raw_cpu_xchg_8(pcp, nval) raw_percpu_xchg_op(pcp, nval)
-#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(8, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(8, , pcp, oval, nval, "")
-#define this_cpu_read_8(pcp) percpu_from_op(8, volatile, "mov", pcp)
-#define this_cpu_write_8(pcp, val) percpu_to_op(8, volatile, "mov", (pcp), val)
-#define this_cpu_add_8(pcp, val) percpu_add_op(8, volatile, (pcp), val)
-#define this_cpu_and_8(pcp, val) percpu_to_op(8, volatile, "and", (pcp), val)
-#define this_cpu_or_8(pcp, val) percpu_to_op(8, volatile, "or", (pcp), val)
-#define this_cpu_add_return_8(pcp, val) percpu_add_return_op(8, volatile, pcp, val)
-#define this_cpu_xchg_8(pcp, nval) percpu_xchg_op(8, volatile, pcp, nval)
-#define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
+#define this_cpu_read_8(pcp) percpu_from_op(8, volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val) percpu_to_op(8, volatile, "mov", (pcp), val)
+#define this_cpu_add_8(pcp, val) percpu_add_op(8, volatile, (pcp), val)
+#define this_cpu_and_8(pcp, val) percpu_to_op(8, volatile, "and", (pcp), val)
+#define this_cpu_or_8(pcp, val) percpu_to_op(8, volatile, "or", (pcp), val)
+#define this_cpu_add_return_8(pcp, val) percpu_add_return_op(8, volatile, pcp, val)
+#define this_cpu_xchg_8(pcp, nval) percpu_xchg_op(8, volatile, pcp, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) percpu_cmpxchg_op(8, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(8, volatile, pcp, oval, nval, LOCK_PREFIX)
/*
* Pretty complex macro to generate cmpxchg16 instruction. The instruction
* [PATCH v3 06/11] add this_cpu_cmpxchg_local and asm-generic definitions
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
Marcelo Tosatti
The goal is to have vmstat_shepherd transfer from
per-CPU counters to global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.
Add this_cpu_cmpxchg_local_ helpers to asm-generic/percpu.h.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-vmstat-remote/include/asm-generic/percpu.h
===================================================================
--- linux-vmstat-remote.orig/include/asm-generic/percpu.h
+++ linux-vmstat-remote/include/asm-generic/percpu.h
@@ -424,6 +424,23 @@ do { \
this_cpu_generic_cmpxchg(pcp, oval, nval)
#endif
+#ifndef this_cpu_cmpxchg_local_1
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) \
+ this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_2
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) \
+ this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_4
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) \
+ this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_8
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) \
+ this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+
#ifndef this_cpu_cmpxchg_double_1
#define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
Index: linux-vmstat-remote/include/linux/percpu-defs.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/percpu-defs.h
+++ linux-vmstat-remote/include/linux/percpu-defs.h
@@ -513,6 +513,8 @@ do { \
#define this_cpu_xchg(pcp, nval) __pcpu_size_call_return2(this_cpu_xchg_, pcp, nval)
#define this_cpu_cmpxchg(pcp, oval, nval) \
__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
+#define this_cpu_cmpxchg_local(pcp, oval, nval) \
+ __pcpu_size_call_return2(this_cpu_cmpxchg_local_, pcp, oval, nval)
#define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
* [PATCH v3 07/11] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
Peter Xu, Marcelo Tosatti
this_cpu_cmpxchg was changed to an atomic version, which can be
more costly than the non-atomic version.
Switch users of this_cpu_cmpxchg to this_cpu_cmpxchg_local
(which preserves the previous, non-atomic this_cpu_cmpxchg behaviour).
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-vmstat-remote/kernel/fork.c
===================================================================
--- linux-vmstat-remote.orig/kernel/fork.c
+++ linux-vmstat-remote/kernel/fork.c
@@ -203,7 +203,7 @@ static bool try_release_thread_stack_to_
unsigned int i;
for (i = 0; i < NR_CACHED_STACKS; i++) {
- if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) != NULL)
+ if (this_cpu_cmpxchg_local(cached_stacks[i], NULL, vm) != NULL)
continue;
return true;
}
Index: linux-vmstat-remote/kernel/scs.c
===================================================================
--- linux-vmstat-remote.orig/kernel/scs.c
+++ linux-vmstat-remote/kernel/scs.c
@@ -83,7 +83,7 @@ void scs_free(void *s)
*/
for (i = 0; i < NR_CACHED_SCS; i++)
- if (this_cpu_cmpxchg(scs_cache[i], 0, s) == NULL)
+ if (this_cpu_cmpxchg_local(scs_cache[i], 0, s) == NULL)
return;
kasan_unpoison_vmalloc(s, SCS_SIZE, KASAN_VMALLOC_PROT_NORMAL);
* [PATCH v3 08/11] mm/vmstat: switch counter modification to cmpxchg
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
Marcelo Tosatti
In preparation for switching vmstat shepherd to flush
per-CPU counters remotely, switch the __{mod,inc,dec} functions that
modify the counters to use cmpxchg.
To facilitate reviewing, functions are ordered in the file as:
__{mod,inc,dec}_{zone,node}_page_state
#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
{mod,inc,dec}_{zone,node}_page_state
#else
{mod,inc,dec}_{zone,node}_page_state
#endif
This patch defines the __ versions for the
CONFIG_HAVE_CMPXCHG_LOCAL case to be their non-"__" counterparts:
#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
{mod,inc,dec}_{zone,node}_page_state
__{mod,inc,dec}_{zone,node}_page_state = {mod,inc,dec}_{zone,node}_page_state
#else
{mod,inc,dec}_{zone,node}_page_state
__{mod,inc,dec}_{zone,node}_page_state
#endif
To measure the performance difference, the page allocator microbenchmark
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
was used with loops=1000000, on an Intel Core i7-11850H @ 2.50GHz.
For the single_page_alloc_free test, which does
/** Loop to measure **/
for (i = 0; i < rec->loops; i++) {
my_page = alloc_page(gfp_mask);
if (unlikely(my_page == NULL))
return 0;
__free_page(my_page);
}
Unit is cycles.

	Vanilla		Patched		Diff
	115.25		117		1.4%
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
}
}
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/*
+ * If we have cmpxchg_local support then we do not need to incur the overhead
+ * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
+ *
+ * mod_state() modifies the zone counter state through atomic per cpu
+ * operations.
+ *
+ * Overstep mode specifies how overstep should handled:
+ * 0 No overstepping
+ * 1 Overstepping half of threshold
+ * -1 Overstepping minus half of threshold
+ */
+static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
+ long delta, int overstep_mode)
+{
+ struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
+ s8 __percpu *p = pcp->vm_stat_diff + item;
+ long o, n, t, z;
+
+ do {
+ z = 0; /* overflow to zone counters */
+
+ /*
+ * The fetching of the stat_threshold is racy. We may apply
+ * a counter threshold to the wrong the cpu if we get
+ * rescheduled while executing here. However, the next
+ * counter update will apply the threshold again and
+ * therefore bring the counter under the threshold again.
+ *
+ * Most of the time the thresholds are the same anyways
+ * for all cpus in a zone.
+ */
+ t = this_cpu_read(pcp->stat_threshold);
+
+ o = this_cpu_read(*p);
+ n = delta + o;
+
+ if (abs(n) > t) {
+ int os = overstep_mode * (t >> 1);
+
+ /* Overflow must be added to zone counters */
+ z = n + os;
+ n = -os;
+ }
+ } while (this_cpu_cmpxchg(*p, o, n) != o);
+
+ if (z)
+ zone_page_state_add(z, zone, item);
+}
+
+void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+ long delta)
+{
+ mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_zone_page_state);
+
+void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+ long delta)
+{
+ mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_zone_page_state);
+
+void inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+ mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_zone_page_state);
+
+void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+ mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_zone_page_state);
+
+void dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+ mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_zone_page_state);
+
+void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+ mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+ enum node_stat_item item,
+ int delta, int overstep_mode)
+{
+ struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+ s8 __percpu *p = pcp->vm_node_stat_diff + item;
+ long o, n, t, z;
+
+ if (vmstat_item_in_bytes(item)) {
+ /*
+ * Only cgroups use subpage accounting right now; at
+ * the global level, these items still change in
+ * multiples of whole pages. Store them as pages
+ * internally to keep the per-cpu counters compact.
+ */
+ VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
+ delta >>= PAGE_SHIFT;
+ }
+
+ do {
+ z = 0; /* overflow to node counters */
+
+ /*
+ * The fetching of the stat_threshold is racy. We may apply
+ * a counter threshold to the wrong the cpu if we get
+ * rescheduled while executing here. However, the next
+ * counter update will apply the threshold again and
+ * therefore bring the counter under the threshold again.
+ *
+ * Most of the time the thresholds are the same anyways
+ * for all cpus in a node.
+ */
+ t = this_cpu_read(pcp->stat_threshold);
+
+ o = this_cpu_read(*p);
+ n = delta + o;
+
+ if (abs(n) > t) {
+ int os = overstep_mode * (t >> 1);
+
+ /* Overflow must be added to node counters */
+ z = n + os;
+ n = -os;
+ }
+ } while (this_cpu_cmpxchg(*p, o, n) != o);
+
+ if (z)
+ node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+ long delta)
+{
+ mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+ long delta)
+{
+ mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+ mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+ mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+ mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+ mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+
+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+ mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+#else
/*
* For use when we know that interrupts are disabled,
* or when we know that preemption is disabled and that
@@ -541,149 +723,6 @@ void __dec_node_page_state(struct page *
}
EXPORT_SYMBOL(__dec_node_page_state);
-#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
-/*
- * If we have cmpxchg_local support then we do not need to incur the overhead
- * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
- *
- * mod_state() modifies the zone counter state through atomic per cpu
- * operations.
- *
- * Overstep mode specifies how overstep should handled:
- * 0 No overstepping
- * 1 Overstepping half of threshold
- * -1 Overstepping minus half of threshold
-*/
-static inline void mod_zone_state(struct zone *zone,
- enum zone_stat_item item, long delta, int overstep_mode)
-{
- struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
- s8 __percpu *p = pcp->vm_stat_diff + item;
- long o, n, t, z;
-
- do {
- z = 0; /* overflow to zone counters */
-
- /*
- * The fetching of the stat_threshold is racy. We may apply
- * a counter threshold to the wrong the cpu if we get
- * rescheduled while executing here. However, the next
- * counter update will apply the threshold again and
- * therefore bring the counter under the threshold again.
- *
- * Most of the time the thresholds are the same anyways
- * for all cpus in a zone.
- */
- t = this_cpu_read(pcp->stat_threshold);
-
- o = this_cpu_read(*p);
- n = delta + o;
-
- if (abs(n) > t) {
- int os = overstep_mode * (t >> 1) ;
-
- /* Overflow must be added to zone counters */
- z = n + os;
- n = -os;
- }
- } while (this_cpu_cmpxchg(*p, o, n) != o);
-
- if (z)
- zone_page_state_add(z, zone, item);
-}
-
-void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
- long delta)
-{
- mod_zone_state(zone, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_zone_page_state);
-
-void inc_zone_page_state(struct page *page, enum zone_stat_item item)
-{
- mod_zone_state(page_zone(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_zone_page_state);
-
-void dec_zone_page_state(struct page *page, enum zone_stat_item item)
-{
- mod_zone_state(page_zone(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_zone_page_state);
-
-static inline void mod_node_state(struct pglist_data *pgdat,
- enum node_stat_item item, int delta, int overstep_mode)
-{
- struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
- s8 __percpu *p = pcp->vm_node_stat_diff + item;
- long o, n, t, z;
-
- if (vmstat_item_in_bytes(item)) {
- /*
- * Only cgroups use subpage accounting right now; at
- * the global level, these items still change in
- * multiples of whole pages. Store them as pages
- * internally to keep the per-cpu counters compact.
- */
- VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
- delta >>= PAGE_SHIFT;
- }
-
- do {
- z = 0; /* overflow to node counters */
-
- /*
- * The fetching of the stat_threshold is racy. We may apply
- * a counter threshold to the wrong the cpu if we get
- * rescheduled while executing here. However, the next
- * counter update will apply the threshold again and
- * therefore bring the counter under the threshold again.
- *
- * Most of the time the thresholds are the same anyways
- * for all cpus in a node.
- */
- t = this_cpu_read(pcp->stat_threshold);
-
- o = this_cpu_read(*p);
- n = delta + o;
-
- if (abs(n) > t) {
- int os = overstep_mode * (t >> 1) ;
-
- /* Overflow must be added to node counters */
- z = n + os;
- n = -os;
- }
- } while (this_cpu_cmpxchg(*p, o, n) != o);
-
- if (z)
- node_page_state_add(z, pgdat, item);
-}
-
-void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
- long delta)
-{
- mod_node_state(pgdat, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_node_page_state);
-
-void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
-{
- mod_node_state(pgdat, item, 1, 1);
-}
-
-void inc_node_page_state(struct page *page, enum node_stat_item item)
-{
- mod_node_state(page_pgdat(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_node_page_state);
-
-void dec_node_page_state(struct page *page, enum node_stat_item item)
-{
- mod_node_state(page_pgdat(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_node_page_state);
-#else
/*
* Use interrupt disable to serialize counter updates
*/
Index: linux-vmstat-remote/mm/page_alloc.c
===================================================================
--- linux-vmstat-remote.orig/mm/page_alloc.c
+++ linux-vmstat-remote/mm/page_alloc.c
@@ -8608,9 +8608,6 @@ static int page_alloc_cpu_dead(unsigned
/*
* Zero the differential counters of the dead processor
* so that the vm statistics are consistent.
- *
- * This is only okay since the processor is dead and cannot
- * race with what we are doing.
*/
cpu_vm_stats_fold(cpu);
* [PATCH v3 09/11] mm/vmstat: use xchg in cpu_vm_stats_fold
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
Marcelo Tosatti
In preparation for switching vmstat shepherd to flush
per-CPU counters remotely, use xchg instead of a
pair of read/write instructions.
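In essence, the change is the following (simplified; only the
vm_stat_diff case is shown, the NUMA-event and node counters in the
diff below follow the same pattern):

	/* before: only safe while the target CPU cannot update the counter */
	v = pzstats->vm_stat_diff[i];
	pzstats->vm_stat_diff[i] = 0;	/* an update in between would be lost */

	/* after: safe against a concurrent this_cpu_cmpxchg on the target
	 * CPU, which is what the remote fold in the next patch relies on */
	v = xchg(&pzstats->vm_stat_diff[i], 0);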
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -883,7 +883,7 @@ static int refresh_cpu_vm_stats(void)
}
/*
- * Fold the data for an offline cpu into the global array.
+ * Fold the data for a cpu into the global array.
* There cannot be any access by the offline cpu and therefore
* synchronization is simplified.
*/
@@ -904,8 +904,7 @@ void cpu_vm_stats_fold(int cpu)
if (pzstats->vm_stat_diff[i]) {
int v;
- v = pzstats->vm_stat_diff[i];
- pzstats->vm_stat_diff[i] = 0;
+ v = xchg(&pzstats->vm_stat_diff[i], 0);
atomic_long_add(v, &zone->vm_stat[i]);
global_zone_diff[i] += v;
}
@@ -915,8 +914,7 @@ void cpu_vm_stats_fold(int cpu)
if (pzstats->vm_numa_event[i]) {
unsigned long v;
- v = pzstats->vm_numa_event[i];
- pzstats->vm_numa_event[i] = 0;
+ v = xchg(&pzstats->vm_numa_event[i], 0);
zone_numa_event_add(v, zone, i);
}
}
@@ -932,8 +930,7 @@ void cpu_vm_stats_fold(int cpu)
if (p->vm_node_stat_diff[i]) {
int v;
- v = p->vm_node_stat_diff[i];
- p->vm_node_stat_diff[i] = 0;
+ v = xchg(&p->vm_node_stat_diff[i], 0);
atomic_long_add(v, &pgdat->vm_stat[i]);
global_node_diff[i] += v;
}
* [PATCH v3 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
Marcelo Tosatti
Now that the counters are modified via cmpxchg both CPU-locally
(via the account functions) and remotely (via cpu_vm_stats_fold),
it is possible to switch vmstat_shepherd to perform the per-CPU
vmstat folding remotely.
This fixes the following two problems:
1. A customer provided evidence indicating that the idle tick
was stopped, yet CPU-specific vmstat counters remained
populated. One can only assume quiet_vmstat() was not
invoked on return to the idle loop. If I understand
correctly, I suspect this divergence might erroneously
prevent a reclaim attempt by kswapd. If the number of
zone-specific free pages is below the per-cpu drift
value, then zone_page_state_snapshot() is used to
compute a more accurate view of the aforementioned
statistic. Thus any task blocked on the NUMA-node-specific
pfmemalloc_wait queue will be unable to make
significant progress via direct reclaim unless it is
killed after being woken up by kswapd
(see throttle_direct_reclaim()).
2. With a SCHED_FIFO task that busy loops on a given CPU,
and the kworker for that CPU at SCHED_OTHER priority,
queuing work to sync the per-CPU vmstats will either cause
that work never to execute, or stalld (i.e. the stall daemon)
will boost the kworker priority, which causes a latency
violation.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -2004,6 +2004,23 @@ static void vmstat_shepherd(struct work_
static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
+static void vmstat_shepherd(struct work_struct *w)
+{
+ int cpu;
+
+ cpus_read_lock();
+ for_each_online_cpu(cpu) {
+ cpu_vm_stats_fold(cpu);
+ cond_resched();
+ }
+ cpus_read_unlock();
+
+ schedule_delayed_work(&shepherd,
+ round_jiffies_relative(sysctl_stat_interval));
+}
+#else
static void vmstat_shepherd(struct work_struct *w)
{
int cpu;
@@ -2023,6 +2040,7 @@ static void vmstat_shepherd(struct work_
schedule_delayed_work(&shepherd,
round_jiffies_relative(sysctl_stat_interval));
}
+#endif
static void __init start_shepherd_timer(void)
{
* [PATCH v3 11/11] mm/vmstat: refresh stats remotely instead of via work item
From: Marcelo Tosatti @ 2023-03-03 19:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
Marcelo Tosatti
Refresh per-CPU stats remotely, instead of queueing work items,
for the stat_refresh procfs method.
This fixes a sosreport hang (sosreport uses vmstat_refresh) when a
SCHED_FIFO process is busy looping on a CPU.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -1860,11 +1860,21 @@ static DEFINE_PER_CPU(struct delayed_wor
int sysctl_stat_interval __read_mostly = HZ;
#ifdef CONFIG_PROC_FS
+
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+static int refresh_all_vm_stats(void);
+#else
static void refresh_vm_stats(struct work_struct *work)
{
refresh_cpu_vm_stats();
}
+static int refresh_all_vm_stats(void)
+{
+ return schedule_on_each_cpu(refresh_vm_stats);
+}
+#endif
+
int vmstat_refresh(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
@@ -1886,7 +1896,7 @@ int vmstat_refresh(struct ctl_table *tab
* transiently negative values, report an error here if any of
* the stats is negative, so we know to go looking for imbalance.
*/
- err = schedule_on_each_cpu(refresh_vm_stats);
+ err = refresh_all_vm_stats();
if (err)
return err;
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
@@ -2006,7 +2016,7 @@ static DECLARE_DEFERRABLE_WORK(shepherd,
#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
-static void vmstat_shepherd(struct work_struct *w)
+static int refresh_all_vm_stats(void)
{
int cpu;
@@ -2016,7 +2026,12 @@ static void vmstat_shepherd(struct work_
cond_resched();
}
cpus_read_unlock();
+ return 0;
+}
+static void vmstat_shepherd(struct work_struct *w)
+{
+ refresh_all_vm_stats();
schedule_delayed_work(&shepherd,
round_jiffies_relative(sysctl_stat_interval));
}