linux-mm.kvack.org archive mirror
* [PATCH 0/4] memcg: cleanup the memcg stats interfaces
@ 2025-11-10 23:20 Shakeel Butt
  2025-11-10 23:20 ` [PATCH 1/4] memcg: use mod_node_page_state to update stats Shakeel Butt
                   ` (8 more replies)
  0 siblings, 9 replies; 49+ messages in thread
From: Shakeel Butt @ 2025-11-10 23:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

The memcg stats are safe against irq (and nmi) context and thus do not
require disabling irqs. However, for some stats which are also maintained
at the node level, the memcg code uses the irq-unsafe interface and thus
requires users to still disable irqs or to use interfaces which explicitly
disable irqs. Let's move the memcg code to the irq-safe node level stats
function, which is already optimized for architectures with
HAVE_CMPXCHG_LOCAL (all major ones), so there will not be any performance
penalty for its usage.
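
For illustration, the caller-visible difference is roughly the following
(a minimal sketch, not code from this series; pgdat, nr and the stat item
are placeholders):

	/* before: the irq-unsafe helper forced callers to disable irqs */
	unsigned long flags;

	local_irq_save(flags);
	__mod_node_page_state(pgdat, NR_FILE_PAGES, nr);
	local_irq_restore(flags);

	/*
	 * after: the irq-safe helper can be called directly; on
	 * HAVE_CMPXCHG_LOCAL it uses a local cmpxchg instead of
	 * disabling irqs, so there is no extra cost
	 */
	mod_node_page_state(pgdat, NR_FILE_PAGES, nr);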

Shakeel Butt (4):
  memcg: use mod_node_page_state to update stats
  memcg: remove __mod_lruvec_kmem_state
  memcg: remove __mod_lruvec_state
  memcg: remove __lruvec_stat_mod_folio

 include/linux/memcontrol.h | 28 ++++------------------
 include/linux/mm_inline.h  |  2 +-
 include/linux/vmstat.h     | 48 ++------------------------------------
 mm/filemap.c               | 20 ++++++++--------
 mm/huge_memory.c           |  4 ++--
 mm/khugepaged.c            |  8 +++----
 mm/memcontrol.c            | 20 ++++++++--------
 mm/migrate.c               | 20 ++++++++--------
 mm/page-writeback.c        |  2 +-
 mm/rmap.c                  |  4 ++--
 mm/shmem.c                 |  6 ++---
 mm/vmscan.c                |  4 ++--
 mm/workingset.c            |  2 +-
 13 files changed, 53 insertions(+), 115 deletions(-)

-- 
2.47.3



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2025-11-10 23:20 [PATCH 0/4] memcg: cleanup the memcg stats interfaces Shakeel Butt
@ 2025-11-10 23:20 ` Shakeel Butt
  2025-11-11  1:39   ` Harry Yoo
                     ` (2 more replies)
  2025-11-10 23:20 ` [PATCH 2/4] memcg: remove __mod_lruvec_kmem_state Shakeel Butt
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 49+ messages in thread
From: Shakeel Butt @ 2025-11-10 23:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

The memcg stats are safe against irq (and nmi) context and thus do not
require disabling irqs. However, some code paths for memcg stats also
update the node level stats through the irq-unsafe interface and thus
require the users to disable irqs. The node level stats, on
architectures with HAVE_CMPXCHG_LOCAL (all major ones), have an
interface which does not require irq disabling. Let's move the memcg
stats code to start using that interface for the node level stats.
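
On HAVE_CMPXCHG_LOCAL architectures the irq-safe variant boils down to
folding the delta into the per-cpu diff with a local cmpxchg loop, so no
irq disabling is needed even if an interrupt updates the same counter.
A paraphrased sketch (not the actual mm/vmstat.c code, which also spills
to the global node counter once the per-cpu threshold is exceeded):

	static void sketch_mod_node_state(struct pglist_data *pgdat,
					  enum node_stat_item item, long delta)
	{
		struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
		s8 __percpu *p = pcp->vm_node_stat_diff + item;
		long old, new;

		do {
			/* re-read and retry if an interrupt raced with us */
			old = this_cpu_read(*p);
			new = old + delta;
		} while (this_cpu_cmpxchg(*p, old, new) != old);
	}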

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/memcontrol.h | 2 +-
 include/linux/vmstat.h     | 4 ++--
 mm/memcontrol.c            | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8c0f15e5978f..f82fac2fd988 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1408,7 +1408,7 @@ static inline void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 {
 	struct page *page = virt_to_head_page(p);
 
-	__mod_node_page_state(page_pgdat(page), idx, val);
+	mod_node_page_state(page_pgdat(page), idx, val);
 }
 
 static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index c287998908bf..11a37aaa4dd9 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -557,7 +557,7 @@ static inline void mod_lruvec_page_state(struct page *page,
 static inline void __mod_lruvec_state(struct lruvec *lruvec,
 				      enum node_stat_item idx, int val)
 {
-	__mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
+	mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
 }
 
 static inline void mod_lruvec_state(struct lruvec *lruvec,
@@ -569,7 +569,7 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
 static inline void __lruvec_stat_mod_folio(struct folio *folio,
 					 enum node_stat_item idx, int val)
 {
-	__mod_node_page_state(folio_pgdat(folio), idx, val);
+	mod_node_page_state(folio_pgdat(folio), idx, val);
 }
 
 static inline void lruvec_stat_mod_folio(struct folio *folio,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 025da46d9959..f4b8a6414ed3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -770,7 +770,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val)
 {
 	/* Update node */
-	__mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
+	mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
 
 	/* Update memcg and lruvec */
 	if (!mem_cgroup_disabled())
@@ -789,7 +789,7 @@ void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
 	/* Untracked pages have no memcg, no lruvec. Update only the node */
 	if (!memcg) {
 		rcu_read_unlock();
-		__mod_node_page_state(pgdat, idx, val);
+		mod_node_page_state(pgdat, idx, val);
 		return;
 	}
 
@@ -815,7 +815,7 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
 	 * vmstats to keep it correct for the root memcg.
 	 */
 	if (!memcg) {
-		__mod_node_page_state(pgdat, idx, val);
+		mod_node_page_state(pgdat, idx, val);
 	} else {
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
 		__mod_lruvec_state(lruvec, idx, val);
-- 
2.47.3



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 2/4] memcg: remove __mod_lruvec_kmem_state
  2025-11-10 23:20 [PATCH 0/4] memcg: cleanup the memcg stats interfaces Shakeel Butt
  2025-11-10 23:20 ` [PATCH 1/4] memcg: use mod_node_page_state to update stats Shakeel Butt
@ 2025-11-10 23:20 ` Shakeel Butt
  2025-11-11  1:46   ` Harry Yoo
                     ` (2 more replies)
  2025-11-10 23:20 ` [PATCH 3/4] memcg: remove __mod_lruvec_state Shakeel Butt
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 49+ messages in thread
From: Shakeel Butt @ 2025-11-10 23:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

The __mod_lruvec_kmem_state is already safe against irqs, so there is no
need to have a separate interface (i.e. mod_lruvec_kmem_state) which
wraps calls to it with irq disabling and reenabling. Let's rename
__mod_lruvec_kmem_state to mod_lruvec_kmem_state.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/memcontrol.h | 28 +++++-----------------------
 mm/memcontrol.c            |  2 +-
 mm/workingset.c            |  2 +-
 3 files changed, 7 insertions(+), 25 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f82fac2fd988..1384a9d305e1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -957,17 +957,7 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
 void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
 
-void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
-
-static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
-					 int val)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__mod_lruvec_kmem_state(p, idx, val);
-	local_irq_restore(flags);
-}
+void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
 
 void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 			unsigned long count);
@@ -1403,14 +1393,6 @@ static inline void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
 {
 }
 
-static inline void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
-					   int val)
-{
-	struct page *page = virt_to_head_page(p);
-
-	mod_node_page_state(page_pgdat(page), idx, val);
-}
-
 static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 					 int val)
 {
@@ -1470,14 +1452,14 @@ struct slabobj_ext {
 #endif
 } __aligned(8);
 
-static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
+static inline void inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
 {
-	__mod_lruvec_kmem_state(p, idx, 1);
+	mod_lruvec_kmem_state(p, idx, 1);
 }
 
-static inline void __dec_lruvec_kmem_state(void *p, enum node_stat_item idx)
+static inline void dec_lruvec_kmem_state(void *p, enum node_stat_item idx)
 {
-	__mod_lruvec_kmem_state(p, idx, -1);
+	mod_lruvec_kmem_state(p, idx, -1);
 }
 
 static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4b8a6414ed3..3a59d3ee92a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -799,7 +799,7 @@ void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
 }
 EXPORT_SYMBOL(__lruvec_stat_mod_folio);
 
-void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
+void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
 {
 	pg_data_t *pgdat = page_pgdat(virt_to_page(p));
 	struct mem_cgroup *memcg;
diff --git a/mm/workingset.c b/mm/workingset.c
index d32dc2e02a61..892f6fe94ea9 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -749,7 +749,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	if (WARN_ON_ONCE(node->count != node->nr_values))
 		goto out_invalid;
 	xa_delete_node(node, workingset_update_node);
-	__inc_lruvec_kmem_state(node, WORKINGSET_NODERECLAIM);
+	inc_lruvec_kmem_state(node, WORKINGSET_NODERECLAIM);
 
 out_invalid:
 	xa_unlock_irq(&mapping->i_pages);
-- 
2.47.3



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 3/4] memcg: remove __mod_lruvec_state
  2025-11-10 23:20 [PATCH 0/4] memcg: cleanup the memcg stats interfaces Shakeel Butt
  2025-11-10 23:20 ` [PATCH 1/4] memcg: use mod_node_page_state to update stats Shakeel Butt
  2025-11-10 23:20 ` [PATCH 2/4] memcg: remove __mod_lruvec_kmem_state Shakeel Butt
@ 2025-11-10 23:20 ` Shakeel Butt
  2025-11-11  5:21   ` Harry Yoo
  2025-11-11 18:58   ` Roman Gushchin
  2025-11-10 23:20 ` [PATCH 4/4] memcg: remove __lruvec_stat_mod_folio Shakeel Butt
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 49+ messages in thread
From: Shakeel Butt @ 2025-11-10 23:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

The __mod_lruvec_state is already safe against irqs, so there is no
need to have a separate interface (i.e. mod_lruvec_state) which
wraps calls to it with irq disabling and reenabling. Let's rename
__mod_lruvec_state to mod_lruvec_state.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/mm_inline.h |  2 +-
 include/linux/vmstat.h    | 18 +-----------------
 mm/memcontrol.c           |  8 ++++----
 mm/migrate.c              | 20 ++++++++++----------
 mm/vmscan.c               |  4 ++--
 5 files changed, 18 insertions(+), 34 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 795b255abf65..d7b963255012 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -44,7 +44,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 	lockdep_assert_held(&lruvec->lru_lock);
 	WARN_ON_ONCE(nr_pages != (int)nr_pages);
 
-	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
+	mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
 	__mod_zone_page_state(&pgdat->node_zones[zid],
 				NR_ZONE_LRU_BASE + lru, nr_pages);
 }
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 11a37aaa4dd9..4eb7753e6e5c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -520,19 +520,9 @@ static inline const char *vm_event_name(enum vm_event_item item)
 
 #ifdef CONFIG_MEMCG
 
-void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+void mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val);
 
-static inline void mod_lruvec_state(struct lruvec *lruvec,
-				    enum node_stat_item idx, int val)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__mod_lruvec_state(lruvec, idx, val);
-	local_irq_restore(flags);
-}
-
 void __lruvec_stat_mod_folio(struct folio *folio,
 			     enum node_stat_item idx, int val);
 
@@ -554,12 +544,6 @@ static inline void mod_lruvec_page_state(struct page *page,
 
 #else
 
-static inline void __mod_lruvec_state(struct lruvec *lruvec,
-				      enum node_stat_item idx, int val)
-{
-	mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
-}
-
 static inline void mod_lruvec_state(struct lruvec *lruvec,
 				    enum node_stat_item idx, int val)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a59d3ee92a7..c31074e5852b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -757,7 +757,7 @@ static void mod_memcg_lruvec_state(struct lruvec *lruvec,
 }
 
 /**
- * __mod_lruvec_state - update lruvec memory statistics
+ * mod_lruvec_state - update lruvec memory statistics
  * @lruvec: the lruvec
  * @idx: the stat item
  * @val: delta to add to the counter, can be negative
@@ -766,7 +766,7 @@ static void mod_memcg_lruvec_state(struct lruvec *lruvec,
  * function updates the all three counters that are affected by a
  * change of state at this level: per-node, per-cgroup, per-lruvec.
  */
-void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+void mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val)
 {
 	/* Update node */
@@ -794,7 +794,7 @@ void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
 	}
 
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_lruvec_state(lruvec, idx, val);
+	mod_lruvec_state(lruvec, idx, val);
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL(__lruvec_stat_mod_folio);
@@ -818,7 +818,7 @@ void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
 		mod_node_page_state(pgdat, idx, val);
 	} else {
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		__mod_lruvec_state(lruvec, idx, val);
+		mod_lruvec_state(lruvec, idx, val);
 	}
 	rcu_read_unlock();
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 567dfae4d9f8..be00c3c82f3a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -675,27 +675,27 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
 		new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
 
-		__mod_lruvec_state(old_lruvec, NR_FILE_PAGES, -nr);
-		__mod_lruvec_state(new_lruvec, NR_FILE_PAGES, nr);
+		mod_lruvec_state(old_lruvec, NR_FILE_PAGES, -nr);
+		mod_lruvec_state(new_lruvec, NR_FILE_PAGES, nr);
 		if (folio_test_swapbacked(folio) && !folio_test_swapcache(folio)) {
-			__mod_lruvec_state(old_lruvec, NR_SHMEM, -nr);
-			__mod_lruvec_state(new_lruvec, NR_SHMEM, nr);
+			mod_lruvec_state(old_lruvec, NR_SHMEM, -nr);
+			mod_lruvec_state(new_lruvec, NR_SHMEM, nr);
 
 			if (folio_test_pmd_mappable(folio)) {
-				__mod_lruvec_state(old_lruvec, NR_SHMEM_THPS, -nr);
-				__mod_lruvec_state(new_lruvec, NR_SHMEM_THPS, nr);
+				mod_lruvec_state(old_lruvec, NR_SHMEM_THPS, -nr);
+				mod_lruvec_state(new_lruvec, NR_SHMEM_THPS, nr);
 			}
 		}
 #ifdef CONFIG_SWAP
 		if (folio_test_swapcache(folio)) {
-			__mod_lruvec_state(old_lruvec, NR_SWAPCACHE, -nr);
-			__mod_lruvec_state(new_lruvec, NR_SWAPCACHE, nr);
+			mod_lruvec_state(old_lruvec, NR_SWAPCACHE, -nr);
+			mod_lruvec_state(new_lruvec, NR_SWAPCACHE, nr);
 		}
 #endif
 		if (dirty && mapping_can_writeback(mapping)) {
-			__mod_lruvec_state(old_lruvec, NR_FILE_DIRTY, -nr);
+			mod_lruvec_state(old_lruvec, NR_FILE_DIRTY, -nr);
 			__mod_zone_page_state(oldzone, NR_ZONE_WRITE_PENDING, -nr);
-			__mod_lruvec_state(new_lruvec, NR_FILE_DIRTY, nr);
+			mod_lruvec_state(new_lruvec, NR_FILE_DIRTY, nr);
 			__mod_zone_page_state(newzone, NR_ZONE_WRITE_PENDING, nr);
 		}
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba760072830b..b3231bdde4e6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2019,7 +2019,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	spin_lock_irq(&lruvec->lru_lock);
 	move_folios_to_lru(lruvec, &folio_list);
 
-	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
+	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
 					stat.nr_demoted);
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
@@ -4745,7 +4745,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 		reset_batch_size(walk);
 	}
 
-	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
+	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
 					stat.nr_demoted);
 
 	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
-- 
2.47.3



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 4/4] memcg: remove __lruvec_stat_mod_folio
  2025-11-10 23:20 [PATCH 0/4] memcg: cleanup the memcg stats interfaces Shakeel Butt
                   ` (2 preceding siblings ...)
  2025-11-10 23:20 ` [PATCH 3/4] memcg: remove __mod_lruvec_state Shakeel Butt
@ 2025-11-10 23:20 ` Shakeel Butt
  2025-11-11  5:41   ` Harry Yoo
  2025-11-11 18:59   ` Roman Gushchin
  2025-11-11  0:59 ` [PATCH 0/4] memcg: cleanup the memcg stats interfaces Harry Yoo
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 49+ messages in thread
From: Shakeel Butt @ 2025-11-10 23:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

The __lruvec_stat_mod_folio is already safe against irqs, so there is no
need to have a separate interface (i.e. lruvec_stat_mod_folio) which
wraps calls to it with irq disabling and reenabling. Let's rename
__lruvec_stat_mod_folio to lruvec_stat_mod_folio.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/vmstat.h | 30 +-----------------------------
 mm/filemap.c           | 20 ++++++++++----------
 mm/huge_memory.c       |  4 ++--
 mm/khugepaged.c        |  8 ++++----
 mm/memcontrol.c        |  4 ++--
 mm/page-writeback.c    |  2 +-
 mm/rmap.c              |  4 ++--
 mm/shmem.c             |  6 +++---
 8 files changed, 25 insertions(+), 53 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 4eb7753e6e5c..3398a345bda8 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -523,19 +523,9 @@ static inline const char *vm_event_name(enum vm_event_item item)
 void mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val);
 
-void __lruvec_stat_mod_folio(struct folio *folio,
+void lruvec_stat_mod_folio(struct folio *folio,
 			     enum node_stat_item idx, int val);
 
-static inline void lruvec_stat_mod_folio(struct folio *folio,
-					 enum node_stat_item idx, int val)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__lruvec_stat_mod_folio(folio, idx, val);
-	local_irq_restore(flags);
-}
-
 static inline void mod_lruvec_page_state(struct page *page,
 					 enum node_stat_item idx, int val)
 {
@@ -550,12 +540,6 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
 	mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
 }
 
-static inline void __lruvec_stat_mod_folio(struct folio *folio,
-					 enum node_stat_item idx, int val)
-{
-	mod_node_page_state(folio_pgdat(folio), idx, val);
-}
-
 static inline void lruvec_stat_mod_folio(struct folio *folio,
 					 enum node_stat_item idx, int val)
 {
@@ -570,18 +554,6 @@ static inline void mod_lruvec_page_state(struct page *page,
 
 #endif /* CONFIG_MEMCG */
 
-static inline void __lruvec_stat_add_folio(struct folio *folio,
-					   enum node_stat_item idx)
-{
-	__lruvec_stat_mod_folio(folio, idx, folio_nr_pages(folio));
-}
-
-static inline void __lruvec_stat_sub_folio(struct folio *folio,
-					   enum node_stat_item idx)
-{
-	__lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
-}
-
 static inline void lruvec_stat_add_folio(struct folio *folio,
 					 enum node_stat_item idx)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 63eb163af99c..9a52fb3ba093 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -182,13 +182,13 @@ static void filemap_unaccount_folio(struct address_space *mapping,
 
 	nr = folio_nr_pages(folio);
 
-	__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
+	lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
 	if (folio_test_swapbacked(folio)) {
-		__lruvec_stat_mod_folio(folio, NR_SHMEM, -nr);
+		lruvec_stat_mod_folio(folio, NR_SHMEM, -nr);
 		if (folio_test_pmd_mappable(folio))
-			__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
+			lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
 	} else if (folio_test_pmd_mappable(folio)) {
-		__lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
+		lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
 		filemap_nr_thps_dec(mapping);
 	}
 	if (test_bit(AS_KERNEL_FILE, &folio->mapping->flags))
@@ -831,13 +831,13 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
 	old->mapping = NULL;
 	/* hugetlb pages do not participate in page cache accounting. */
 	if (!folio_test_hugetlb(old))
-		__lruvec_stat_sub_folio(old, NR_FILE_PAGES);
+		lruvec_stat_sub_folio(old, NR_FILE_PAGES);
 	if (!folio_test_hugetlb(new))
-		__lruvec_stat_add_folio(new, NR_FILE_PAGES);
+		lruvec_stat_add_folio(new, NR_FILE_PAGES);
 	if (folio_test_swapbacked(old))
-		__lruvec_stat_sub_folio(old, NR_SHMEM);
+		lruvec_stat_sub_folio(old, NR_SHMEM);
 	if (folio_test_swapbacked(new))
-		__lruvec_stat_add_folio(new, NR_SHMEM);
+		lruvec_stat_add_folio(new, NR_SHMEM);
 	xas_unlock_irq(&xas);
 	if (free_folio)
 		free_folio(old);
@@ -920,9 +920,9 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 
 		/* hugetlb pages do not participate in page cache accounting */
 		if (!huge) {
-			__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
+			lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
 			if (folio_test_pmd_mappable(folio))
-				__lruvec_stat_mod_folio(folio,
+				lruvec_stat_mod_folio(folio,
 						NR_FILE_THPS, nr);
 		}
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 949250932bb4..943099eae8d5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3866,10 +3866,10 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			if (folio_test_pmd_mappable(folio) &&
 			    new_order < HPAGE_PMD_ORDER) {
 				if (folio_test_swapbacked(folio)) {
-					__lruvec_stat_mod_folio(folio,
+					lruvec_stat_mod_folio(folio,
 							NR_SHMEM_THPS, -nr);
 				} else {
-					__lruvec_stat_mod_folio(folio,
+					lruvec_stat_mod_folio(folio,
 							NR_FILE_THPS, -nr);
 					filemap_nr_thps_dec(mapping);
 				}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1a08673b0d8b..2a460664a67d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2174,14 +2174,14 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	}
 
 	if (is_shmem)
-		__lruvec_stat_mod_folio(new_folio, NR_SHMEM_THPS, HPAGE_PMD_NR);
+		lruvec_stat_mod_folio(new_folio, NR_SHMEM_THPS, HPAGE_PMD_NR);
 	else
-		__lruvec_stat_mod_folio(new_folio, NR_FILE_THPS, HPAGE_PMD_NR);
+		lruvec_stat_mod_folio(new_folio, NR_FILE_THPS, HPAGE_PMD_NR);
 
 	if (nr_none) {
-		__lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, nr_none);
+		lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, nr_none);
 		/* nr_none is always 0 for non-shmem. */
-		__lruvec_stat_mod_folio(new_folio, NR_SHMEM, nr_none);
+		lruvec_stat_mod_folio(new_folio, NR_SHMEM, nr_none);
 	}
 
 	/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c31074e5852b..7f074d72dabc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -777,7 +777,7 @@ void mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 		mod_memcg_lruvec_state(lruvec, idx, val);
 }
 
-void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
+void lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
 			     int val)
 {
 	struct mem_cgroup *memcg;
@@ -797,7 +797,7 @@ void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
 	mod_lruvec_state(lruvec, idx, val);
 	rcu_read_unlock();
 }
-EXPORT_SYMBOL(__lruvec_stat_mod_folio);
+EXPORT_SYMBOL(lruvec_stat_mod_folio);
 
 void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
 {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a124ab6a205d..ccdeb0e84d39 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2652,7 +2652,7 @@ static void folio_account_dirtied(struct folio *folio,
 		inode_attach_wb(inode, folio);
 		wb = inode_to_wb(inode);
 
-		__lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, nr);
+		lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, nr);
 		__zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
 		__node_stat_mod_folio(folio, NR_DIRTIED, nr);
 		wb_stat_mod(wb, WB_RECLAIMABLE, nr);
diff --git a/mm/rmap.c b/mm/rmap.c
index 60c3cd70b6ea..1b3a3c7b0aeb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1212,12 +1212,12 @@ static void __folio_mod_stat(struct folio *folio, int nr, int nr_pmdmapped)
 
 	if (nr) {
 		idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
-		__lruvec_stat_mod_folio(folio, idx, nr);
+		lruvec_stat_mod_folio(folio, idx, nr);
 	}
 	if (nr_pmdmapped) {
 		if (folio_test_anon(folio)) {
 			idx = NR_ANON_THPS;
-			__lruvec_stat_mod_folio(folio, idx, nr_pmdmapped);
+			lruvec_stat_mod_folio(folio, idx, nr_pmdmapped);
 		} else {
 			/* NR_*_PMDMAPPED are not maintained per-memcg */
 			idx = folio_test_swapbacked(folio) ?
diff --git a/mm/shmem.c b/mm/shmem.c
index c3ed2dcd17f8..4fba8a597256 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -882,9 +882,9 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
 static void shmem_update_stats(struct folio *folio, int nr_pages)
 {
 	if (folio_test_pmd_mappable(folio))
-		__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr_pages);
-	__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
-	__lruvec_stat_mod_folio(folio, NR_SHMEM, nr_pages);
+		lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr_pages);
+	lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+	lruvec_stat_mod_folio(folio, NR_SHMEM, nr_pages);
 }
 
 /*
-- 
2.47.3



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-10 23:20 [PATCH 0/4] memcg: cleanup the memcg stats interfaces Shakeel Butt
                   ` (3 preceding siblings ...)
  2025-11-10 23:20 ` [PATCH 4/4] memcg: remove __lruvec_stat_mod_folio Shakeel Butt
@ 2025-11-11  0:59 ` Harry Yoo
  2025-11-11  2:23   ` Qi Zheng
  2025-11-11  8:36 ` Qi Zheng
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Harry Yoo @ 2025-11-11  0:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

On Mon, Nov 10, 2025 at 03:20:04PM -0800, Shakeel Butt wrote:
> The memcg stats are safe against irq (and nmi) context and thus does not
> require disabling irqs. However for some stats which are also maintained
> at node level, it is using irq unsafe interface and thus requiring the
> users to still disables irqs or use interfaces which explicitly disables
> irqs. Let's move memcg code to use irq safe node level stats function
> which is already optimized for architectures with HAVE_CMPXCHG_LOCAL
> (all major ones), so there will not be any performance penalty for its
> usage.

Are you or Qi planning a follow-up that converts spin_lock_irq() to
spin_lock() in places where IRQs were disabled just to update vmstat?

Qi's zombie memcg series will depend on that work, I guess..
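
For illustration, the kind of conversion meant here, with placeholder
lock and stat names, and assuming nothing else in the critical section
needs irqs disabled:

	/* before: irqs disabled around the whole section */
	spin_lock_irq(&lruvec->lru_lock);
	__mod_lruvec_state(lruvec, NR_SHMEM, nr);
	spin_unlock_irq(&lruvec->lru_lock);

	/* after: the stat helper is irq safe, so the plain lock is enough */
	spin_lock(&lruvec->lru_lock);
	mod_lruvec_state(lruvec, NR_SHMEM, nr);
	spin_unlock(&lruvec->lru_lock);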

-- 
Cheers,
Harry / Hyeonggon

> Shakeel Butt (4):
>   memcg: use mod_node_page_state to update stats
>   memcg: remove __mod_lruvec_kmem_state
>   memcg: remove __mod_lruvec_state
>   memcg: remove __lruvec_stat_mod_folio
> 
>  include/linux/memcontrol.h | 28 ++++------------------
>  include/linux/mm_inline.h  |  2 +-
>  include/linux/vmstat.h     | 48 ++------------------------------------
>  mm/filemap.c               | 20 ++++++++--------
>  mm/huge_memory.c           |  4 ++--
>  mm/khugepaged.c            |  8 +++----
>  mm/memcontrol.c            | 20 ++++++++--------
>  mm/migrate.c               | 20 ++++++++--------
>  mm/page-writeback.c        |  2 +-
>  mm/rmap.c                  |  4 ++--
>  mm/shmem.c                 |  6 ++---
>  mm/vmscan.c                |  4 ++--
>  mm/workingset.c            |  2 +-
>  13 files changed, 53 insertions(+), 115 deletions(-)
> 
> -- 
> 2.47.3
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2025-11-10 23:20 ` [PATCH 1/4] memcg: use mod_node_page_state to update stats Shakeel Butt
@ 2025-11-11  1:39   ` Harry Yoo
  2025-11-11 18:58   ` Roman Gushchin
  2026-01-29 13:05   ` Dev Jain
  2 siblings, 0 replies; 49+ messages in thread
From: Harry Yoo @ 2025-11-11  1:39 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

On Mon, Nov 10, 2025 at 03:20:05PM -0800, Shakeel Butt wrote:
> The memcg stats are safe against irq (and nmi) context and thus does not
> require disabling irqs. However some code paths for memcg stats also
> update the node level stats and use irq unsafe interface and thus
> require the users to disable irqs. However node level stats, on
> architectures with HAVE_CMPXCHG_LOCAL (all major ones), has interface
> which does not require irq disabling. Let's move memcg stats code to
> start using that interface for node level stats.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/4] memcg: remove __mod_lruvec_kmem_state
  2025-11-10 23:20 ` [PATCH 2/4] memcg: remove __mod_lruvec_kmem_state Shakeel Butt
@ 2025-11-11  1:46   ` Harry Yoo
  2025-11-11  8:23   ` Qi Zheng
  2025-11-11 18:58   ` Roman Gushchin
  2 siblings, 0 replies; 49+ messages in thread
From: Harry Yoo @ 2025-11-11  1:46 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

On Mon, Nov 10, 2025 at 03:20:06PM -0800, Shakeel Butt wrote:
> The __mod_lruvec_kmem_state is already safe against irqs, so there is no
> need to have a separate interface (i.e. mod_lruvec_kmem_state) which
> wraps calls to it with irq disabling and reenabling. Let's rename
> __mod_lruvec_kmem_state to mod_lruvec_kmem_state.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11  0:59 ` [PATCH 0/4] memcg: cleanup the memcg stats interfaces Harry Yoo
@ 2025-11-11  2:23   ` Qi Zheng
  2025-11-11  2:39     ` Shakeel Butt
  0 siblings, 1 reply; 49+ messages in thread
From: Qi Zheng @ 2025-11-11  2:23 UTC (permalink / raw)
  To: Harry Yoo, Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Vlastimil Babka, linux-mm, cgroups, linux-kernel,
	Meta kernel team

Hi,

On 11/11/25 8:59 AM, Harry Yoo wrote:
> On Mon, Nov 10, 2025 at 03:20:04PM -0800, Shakeel Butt wrote:
>> The memcg stats are safe against irq (and nmi) context and thus does not
>> require disabling irqs. However for some stats which are also maintained
>> at node level, it is using irq unsafe interface and thus requiring the
>> users to still disables irqs or use interfaces which explicitly disables
>> irqs. Let's move memcg code to use irq safe node level stats function
>> which is already optimized for architectures with HAVE_CMPXCHG_LOCAL
>> (all major ones), so there will not be any performance penalty for its
>> usage.

Good job. Thanks!

> 
> Are you or Qi planning a follow-up that converts spin_lock_irq() to
> spin_lock() in places where they disabled IRQs was just to update vmstat?

Perhaps this change could be implemented together in [PATCH 1/4]?

Of course, it's also reasonable to make it a separate patch. If we
choose this method, I’m fine with either me or Shakeel doing it.

> 
> Qi's zombie memcg series will depends on that work I guess..

Yes, and there are other places that also need to be converted, such as
__folio_migrate_mapping().

Thanks,
Qi

> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11  2:23   ` Qi Zheng
@ 2025-11-11  2:39     ` Shakeel Butt
  2025-11-11  2:48       ` Qi Zheng
  0 siblings, 1 reply; 49+ messages in thread
From: Shakeel Butt @ 2025-11-11  2:39 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Harry Yoo, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

On Tue, Nov 11, 2025 at 10:23:15AM +0800, Qi Zheng wrote:
> Hi,
> 
[...]
> > 
> > Are you or Qi planning a follow-up that converts spin_lock_irq() to
> > spin_lock() in places where they disabled IRQs was just to update vmstat?
> 
> Perhaps this change could be implemented together in [PATCH 1/4]?
> 
> Of course, it's also reasonable to make it a separate patch. If we
> choose this method, I’m fine with either me or Shakeel doing it.
> 

Let's do it separately as I wanted to keep the memcg related changes
self-contained.

Qi, can you please take a stab at that?

> > 
> > Qi's zombie memcg series will depends on that work I guess..
> 
> Yes, and there are other places that also need to be converted, such as
> __folio_migrate_mapping().

I see __mod_zone_page_state() usage in __folio_migrate_mapping() and
using the same reasoning we can convert it to use mod_zone_page_state().
Where else do you need to do these conversions (other than
__folio_migrate_mapping)?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11  2:39     ` Shakeel Butt
@ 2025-11-11  2:48       ` Qi Zheng
  2025-11-11  3:00         ` Shakeel Butt
  2025-11-11  3:05         ` Harry Yoo
  0 siblings, 2 replies; 49+ messages in thread
From: Qi Zheng @ 2025-11-11  2:48 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Harry Yoo, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

Hi Shakeel,

On 11/11/25 10:39 AM, Shakeel Butt wrote:
> On Tue, Nov 11, 2025 at 10:23:15AM +0800, Qi Zheng wrote:
>> Hi,
>>
> [...]
>>>
>>> Are you or Qi planning a follow-up that converts spin_lock_irq() to
>>> spin_lock() in places where they disabled IRQs was just to update vmstat?
>>
>> Perhaps this change could be implemented together in [PATCH 1/4]?
>>
>> Of course, it's also reasonable to make it a separate patch. If we
>> choose this method, I’m fine with either me or Shakeel doing it.
>>
> 
> Let's do it separately as I wanted to keep the memcg related changes
> self-contained.

OK.

> 
> Qi, can you please take a stab at that?

Sure, I will do it.

> 
>>>
>>> Qi's zombie memcg series will depends on that work I guess..
>>
>> Yes, and there are other places that also need to be converted, such as
>> __folio_migrate_mapping().
> 
> I see __mod_zone_page_state() usage in __folio_migrate_mapping() and
> using the same reasoning we can convert it to use mod_zone_page_state().
> Where else do you need to do these conversions (other than
> __folio_migrate_mapping)?

I mean converting these places to use spin_lock() instead of
spin_lock_irq().

Thanks,
Qi



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11  2:48       ` Qi Zheng
@ 2025-11-11  3:00         ` Shakeel Butt
  2025-11-11  3:07           ` Qi Zheng
  2025-11-11  3:05         ` Harry Yoo
  1 sibling, 1 reply; 49+ messages in thread
From: Shakeel Butt @ 2025-11-11  3:00 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Harry Yoo, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

On Tue, Nov 11, 2025 at 10:48:18AM +0800, Qi Zheng wrote:
> Hi Shakeel,
> 
> On 11/11/25 10:39 AM, Shakeel Butt wrote:
> > On Tue, Nov 11, 2025 at 10:23:15AM +0800, Qi Zheng wrote:
> > > Hi,
> > > 
> > [...]
> > > > 
> > > > Are you or Qi planning a follow-up that converts spin_lock_irq() to
> > > > spin_lock() in places where they disabled IRQs was just to update vmstat?
> > > 
> > > Perhaps this change could be implemented together in [PATCH 1/4]?
> > > 
> > > Of course, it's also reasonable to make it a separate patch. If we
> > > choose this method, I’m fine with either me or Shakeel doing it.
> > > 
> > 
> > Let's do it separately as I wanted to keep the memcg related changes
> > self-contained.
> 
> OK.
> 
> > 
> > Qi, can you please take a stab at that?
> 
> Sure, I will do it.
> 
> > 
> > > > 
> > > > Qi's zombie memcg series will depends on that work I guess..
> > > 
> > > Yes, and there are other places that also need to be converted, such as
> > > __folio_migrate_mapping().
> > 
> > I see __mod_zone_page_state() usage in __folio_migrate_mapping() and
> > using the same reasoning we can convert it to use mod_zone_page_state().
> > Where else do you need to do these conversions (other than
> > __folio_migrate_mapping)?
> 
> I mean converting these places to use spin_lock() instead of
> spin_lock_irq().

For only stats, right?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11  2:48       ` Qi Zheng
  2025-11-11  3:00         ` Shakeel Butt
@ 2025-11-11  3:05         ` Harry Yoo
  2025-11-11  8:01           ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 49+ messages in thread
From: Harry Yoo @ 2025-11-11  3:05 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Shakeel Butt, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team, Sebastian Andrzej Siewior,
	Clark Williams, linux-rt-devel

On Tue, Nov 11, 2025 at 10:48:18AM +0800, Qi Zheng wrote:
> Hi Shakeel,
> 
> On 11/11/25 10:39 AM, Shakeel Butt wrote:
> > On Tue, Nov 11, 2025 at 10:23:15AM +0800, Qi Zheng wrote:
> > > Hi,
> > > 
> > [...]
> > > > 
> > > > Are you or Qi planning a follow-up that converts spin_lock_irq() to
> > > > spin_lock() in places where they disabled IRQs was just to update vmstat?
> > > 
> > > Perhaps this change could be implemented together in [PATCH 1/4]?
> > > 
> > > Of course, it's also reasonable to make it a separate patch. If we
> > > choose this method, I’m fine with either me or Shakeel doing it.
> > > 
> > 
> > Let's do it separately as I wanted to keep the memcg related changes
> > self-contained.
> 
> OK.

Agreed.

> > Qi, can you please take a stab at that?
> 
> Sure, I will do it.

I'll be more than happy to review that ;)

> > > > Qi's zombie memcg series will depends on that work I guess..
> > > 
> > > Yes, and there are other places that also need to be converted, such as
> > > __folio_migrate_mapping().
> > 
> > I see __mod_zone_page_state() usage in __folio_migrate_mapping() and
> > using the same reasoning we can convert it to use mod_zone_page_state().
> > Where else do you need to do these conversions (other than
> > __folio_migrate_mapping)?
> 
> I mean converting these places to use spin_lock() instead of
> spin_lock_irq().

Just one thing I noticed while looking at __folio_migrate_mapping()...

- xas_lock_irq() -> xas_unlock() -> local_irq_enable()
- swap_cluster_get_and_lock_irq() -> swap_cluster_unlock() -> local_irq_enable()

is wrong because spin_lock_irq() doesn't disable IRQs on PREEMPT_RT.

Not 100% sure if it would be benign or lead to actual bugs that need
to be fixed in -stable kernels.

Cc'ing RT folks again :)

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11  3:00         ` Shakeel Butt
@ 2025-11-11  3:07           ` Qi Zheng
  2025-11-11  3:18             ` Harry Yoo
  0 siblings, 1 reply; 49+ messages in thread
From: Qi Zheng @ 2025-11-11  3:07 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Harry Yoo, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team



On 11/11/25 11:00 AM, Shakeel Butt wrote:
> On Tue, Nov 11, 2025 at 10:48:18AM +0800, Qi Zheng wrote:
>> Hi Shakeel,
>>
>> On 11/11/25 10:39 AM, Shakeel Butt wrote:
>>> On Tue, Nov 11, 2025 at 10:23:15AM +0800, Qi Zheng wrote:
>>>> Hi,
>>>>
>>> [...]
>>>>>
>>>>> Are you or Qi planning a follow-up that converts spin_lock_irq() to
>>>>> spin_lock() in places where they disabled IRQs was just to update vmstat?
>>>>
>>>> Perhaps this change could be implemented together in [PATCH 1/4]?
>>>>
>>>> Of course, it's also reasonable to make it a separate patch. If we
>>>> choose this method, I’m fine with either me or Shakeel doing it.
>>>>
>>>
>>> Let's do it separately as I wanted to keep the memcg related changes
>>> self-contained.
>>
>> OK.
>>
>>>
>>> Qi, can you please take a stab at that?
>>
>> Sure, I will do it.
>>
>>>
>>>>>
>>>>> Qi's zombie memcg series will depends on that work I guess..
>>>>
>>>> Yes, and there are other places that also need to be converted, such as
>>>> __folio_migrate_mapping().
>>>
>>> I see __mod_zone_page_state() usage in __folio_migrate_mapping() and
>>> using the same reasoning we can convert it to use mod_zone_page_state().
>>> Where else do you need to do these conversions (other than
>>> __folio_migrate_mapping)?
>>
>> I mean converting these places to use spin_lock() instead of
>> spin_lock_irq().
> 
> For only stats, right?

Right, for those places where IRQs were disabled just to update
vmstat.




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11  3:07           ` Qi Zheng
@ 2025-11-11  3:18             ` Harry Yoo
  2025-11-11  3:29               ` Qi Zheng
  0 siblings, 1 reply; 49+ messages in thread
From: Harry Yoo @ 2025-11-11  3:18 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Shakeel Butt, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

On Tue, Nov 11, 2025 at 11:07:09AM +0800, Qi Zheng wrote:
> 
> 
> On 11/11/25 11:00 AM, Shakeel Butt wrote:
> > On Tue, Nov 11, 2025 at 10:48:18AM +0800, Qi Zheng wrote:
> > > Hi Shakeel,
> > > 
> > > On 11/11/25 10:39 AM, Shakeel Butt wrote:
> > > > On Tue, Nov 11, 2025 at 10:23:15AM +0800, Qi Zheng wrote:
> > > > > Hi,
> > > > > 
> > > > [...]
> > > > > > 
> > > > > > Are you or Qi planning a follow-up that converts spin_lock_irq() to
> > > > > > spin_lock() in places where they disabled IRQs was just to update vmstat?
> > > > > 
> > > > > Perhaps this change could be implemented together in [PATCH 1/4]?
> > > > > 
> > > > > Of course, it's also reasonable to make it a separate patch. If we
> > > > > choose this method, I’m fine with either me or Shakeel doing it.
> > > > > 
> > > > 
> > > > Let's do it separately as I wanted to keep the memcg related changes
> > > > self-contained.
> > > 
> > > OK.
> > > 
> > > > 
> > > > Qi, can you please take a stab at that?
> > > 
> > > Sure, I will do it.
> > > 
> > > > 
> > > > > > 
> > > > > > Qi's zombie memcg series will depends on that work I guess..
> > > > > 
> > > > > Yes, and there are other places that also need to be converted, such as
> > > > > __folio_migrate_mapping().
> > > > 
> > > > I see __mod_zone_page_state() usage in __folio_migrate_mapping() and
> > > > using the same reasoning we can convert it to use mod_zone_page_state().
> > > > Where else do you need to do these conversions (other than
> > > > __folio_migrate_mapping)?
> > > 
> > > I mean converting these places to use spin_lock() instead of
> > > spin_lock_irq().
> > 
> > For only stats, right?
> 
> Right, for thoes places where they disabled IRQs was just to update
> vmstat.

...Or if IRQs were disabled for other reasons as well, we can still move
the vmstat update code outside the IRQ-disabled region.
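
For example (sketch with placeholder names):

	spin_lock_irq(&lruvec->lru_lock);
	/* ... work that really needs irqs disabled ... */
	spin_unlock_irq(&lruvec->lru_lock);

	/* stat update hoisted out, mod_lruvec_state() is irq safe */
	mod_lruvec_state(lruvec, NR_FILE_PAGES, nr);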

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11  3:18             ` Harry Yoo
@ 2025-11-11  3:29               ` Qi Zheng
  0 siblings, 0 replies; 49+ messages in thread
From: Qi Zheng @ 2025-11-11  3:29 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Shakeel Butt, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team



On 11/11/25 11:18 AM, Harry Yoo wrote:
> On Tue, Nov 11, 2025 at 11:07:09AM +0800, Qi Zheng wrote:
>>
>>
>> On 11/11/25 11:00 AM, Shakeel Butt wrote:
>>> On Tue, Nov 11, 2025 at 10:48:18AM +0800, Qi Zheng wrote:
>>>> Hi Shakeel,
>>>>
>>>> On 11/11/25 10:39 AM, Shakeel Butt wrote:
>>>>> On Tue, Nov 11, 2025 at 10:23:15AM +0800, Qi Zheng wrote:
>>>>>> Hi,
>>>>>>
>>>>> [...]
>>>>>>>
>>>>>>> Are you or Qi planning a follow-up that converts spin_lock_irq() to
>>>>>>> spin_lock() in places where they disabled IRQs was just to update vmstat?
>>>>>>
>>>>>> Perhaps this change could be implemented together in [PATCH 1/4]?
>>>>>>
>>>>>> Of course, it's also reasonable to make it a separate patch. If we
>>>>>> choose this method, I’m fine with either me or Shakeel doing it.
>>>>>>
>>>>>
>>>>> Let's do it separately as I wanted to keep the memcg related changes
>>>>> self-contained.
>>>>
>>>> OK.
>>>>
>>>>>
>>>>> Qi, can you please take a stab at that?
>>>>
>>>> Sure, I will do it.
>>>>
>>>>>
>>>>>>>
>>>>>>> Qi's zombie memcg series will depends on that work I guess..
>>>>>>
>>>>>> Yes, and there are other places that also need to be converted, such as
>>>>>> __folio_migrate_mapping().
>>>>>
>>>>> I see __mod_zone_page_state() usage in __folio_migrate_mapping() and
>>>>> using the same reasoning we can convert it to use mod_zone_page_state().
>>>>> Where else do you need to do these conversions (other than
>>>>> __folio_migrate_mapping)?
>>>>
>>>> I mean converting these places to use spin_lock() instead of
>>>> spin_lock_irq().
>>>
>>> For only stats, right?
>>
>> Right, for thoes places where they disabled IRQs was just to update
>> vmstat.
> 
> ...Or if they disabled IRQs for other reasons as well, we can still move
> vmstat update code outside the IRQ disabled region.

Ok, I will take a closer look.

> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/4] memcg: remove __mod_lruvec_state
  2025-11-10 23:20 ` [PATCH 3/4] memcg: remove __mod_lruvec_state Shakeel Butt
@ 2025-11-11  5:21   ` Harry Yoo
  2025-11-11 18:58   ` Roman Gushchin
  1 sibling, 0 replies; 49+ messages in thread
From: Harry Yoo @ 2025-11-11  5:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

On Mon, Nov 10, 2025 at 03:20:07PM -0800, Shakeel Butt wrote:
> The __mod_lruvec_state is already safe against irqs, so there is no
> need to have a separate interface (i.e. mod_lruvec_state) which
> wraps calls to it with irq disabling and reenabling. Let's rename
> __mod_lruvec_state to mod_lruvec_state.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/4] memcg: remove __lruvec_stat_mod_folio
  2025-11-10 23:20 ` [PATCH 4/4] memcg: remove __lruvec_stat_mod_folio Shakeel Butt
@ 2025-11-11  5:41   ` Harry Yoo
  2025-11-11 18:59   ` Roman Gushchin
  1 sibling, 0 replies; 49+ messages in thread
From: Harry Yoo @ 2025-11-11  5:41 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

On Mon, Nov 10, 2025 at 03:20:08PM -0800, Shakeel Butt wrote:
> The __lruvec_stat_mod_folio is already safe against irqs, so there is no
> need to have a separate interface (i.e. lruvec_stat_mod_folio) which
> wraps calls to it with irq disabling and reenabling. Let's rename
> __lruvec_stat_mod_folio to lruvec_stat_mod_folio.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11  3:05         ` Harry Yoo
@ 2025-11-11  8:01           ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 49+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-11-11  8:01 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Qi Zheng, Shakeel Butt, Andrew Morton, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Vlastimil Babka,
	linux-mm, cgroups, linux-kernel, Meta kernel team,
	Clark Williams, linux-rt-devel

> > I mean converting these places to use spin_lock() instead of
> > spin_lock_irq().
> 
> Just one thing I noticed while looking at __folio_migrate_mapping()...
> 
> - xas_lock_irq() -> xas_unlock() -> local_irq_enable()
> - swap_cluster_get_and_lock_irq() -> swap_cluster_unlock() -> local_irq_enable()
> 
> is wrong because spin_lock_irq() doesn't disable IRQ on PREEMPT_RT.
> 
> Not 100% sure if it would be benign or lead to actual bugs that need
> to be fixed in -stable kernels.
> 
> Cc'ing RT folks again :)

The tail in __folio_migrate_mapping() after xas_unlock()/
swap_cluster_unlock() is limited to __mod_lruvec_state()-based stats
updates. There is a preempt_disable_nested() in __mod_zone_page_state()
to ensure that the update happens on the same CPU and is not preempted.
On PREEMPT_RT there should be no in-IRQ updates of these counters.
The IRQ enable at the end does nothing. There might be CPU migration
between the individual stats updates.
If it remains like this, it is fine. Please don't advertise ;)
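
Roughly, the relevant part of the irq-unsafe per-cpu update looks like
this (paraphrased, not the exact mm/vmstat.c code):

	static void sketch_mod_zone_diff(struct zone *zone,
					 enum zone_stat_item item, long delta)
	{
		struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
		s8 __percpu *p = pcp->vm_stat_diff + item;
		long x, t;

		/*
		 * On PREEMPT_RT irqs are not disabled by the callers, so
		 * this keeps the read-modify-write on one CPU.
		 */
		preempt_disable_nested();
		x = delta + __this_cpu_read(*p);
		t = __this_cpu_read(pcp->stat_threshold);
		if (unlikely(abs(x) > t)) {
			zone_page_state_add(x, zone, item);
			x = 0;
		}
		__this_cpu_write(*p, x);
		preempt_enable_nested();
	}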

Sebastian


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/4] memcg: remove __mod_lruvec_kmem_state
  2025-11-10 23:20 ` [PATCH 2/4] memcg: remove __mod_lruvec_kmem_state Shakeel Butt
  2025-11-11  1:46   ` Harry Yoo
@ 2025-11-11  8:23   ` Qi Zheng
  2025-11-11 18:58   ` Roman Gushchin
  2 siblings, 0 replies; 49+ messages in thread
From: Qi Zheng @ 2025-11-11  8:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Vlastimil Babka, linux-mm, cgroups, linux-kernel,
	Meta kernel team, Andrew Morton

Hi Shakeel,

On 11/11/25 7:20 AM, Shakeel Butt wrote:
> The __mod_lruvec_kmem_state is already safe against irqs, so there is no
> need to have a separate interface (i.e. mod_lruvec_kmem_state) which
> wraps calls to it with irq disabling and reenabling. Let's rename
> __mod_lruvec_kmem_state to mod_lruvec_kmem_state.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>

One nit below:

> ---
>   include/linux/memcontrol.h | 28 +++++-----------------------
>   mm/memcontrol.c            |  2 +-
>   mm/workingset.c            |  2 +-
>   3 files changed, 7 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f82fac2fd988..1384a9d305e1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -957,17 +957,7 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>   void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
>   void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
>   
> -void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
> -
> -static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
> -					 int val)
> -{
> -	unsigned long flags;
> -
> -	local_irq_save(flags);
> -	__mod_lruvec_kmem_state(p, idx, val);
> -	local_irq_restore(flags);
> -}
> +void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
>   
>   void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>   			unsigned long count);
> @@ -1403,14 +1393,6 @@ static inline void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
>   {
>   }
>   
> -static inline void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
> -					   int val)
> -{
> -	struct page *page = virt_to_head_page(p);
> -
> -	mod_node_page_state(page_pgdat(page), idx, val);
> -}
> -
>   static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
>   					 int val)
>   {
> @@ -1470,14 +1452,14 @@ struct slabobj_ext {
>   #endif
>   } __aligned(8);
>   
> -static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
> +static inline void inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
>   {
> -	__mod_lruvec_kmem_state(p, idx, 1);
> +	mod_lruvec_kmem_state(p, idx, 1);
>   }

The inc_lruvec_kmem_state() has only one user.

>   
> -static inline void __dec_lruvec_kmem_state(void *p, enum node_stat_item idx)
> +static inline void dec_lruvec_kmem_state(void *p, enum node_stat_item idx)
>   {
> -	__mod_lruvec_kmem_state(p, idx, -1);
> +	mod_lruvec_kmem_state(p, idx, -1);
>   }

The dec_lruvec_kmem_state() has no users.

Not sure whether inc_lruvec_kmem_state() and dec_lruvec_kmem_state()
also need to be removed.

Thanks,
Qi

>   
>   static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f4b8a6414ed3..3a59d3ee92a7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -799,7 +799,7 @@ void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
>   }
>   EXPORT_SYMBOL(__lruvec_stat_mod_folio);
>   
> -void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
> +void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
>   {
>   	pg_data_t *pgdat = page_pgdat(virt_to_page(p));
>   	struct mem_cgroup *memcg;
> diff --git a/mm/workingset.c b/mm/workingset.c
> index d32dc2e02a61..892f6fe94ea9 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -749,7 +749,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>   	if (WARN_ON_ONCE(node->count != node->nr_values))
>   		goto out_invalid;
>   	xa_delete_node(node, workingset_update_node);
> -	__inc_lruvec_kmem_state(node, WORKINGSET_NODERECLAIM);
> +	inc_lruvec_kmem_state(node, WORKINGSET_NODERECLAIM);
>   
>   out_invalid:
>   	xa_unlock_irq(&mapping->i_pages);



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-10 23:20 [PATCH 0/4] memcg: cleanup the memcg stats interfaces Shakeel Butt
                   ` (4 preceding siblings ...)
  2025-11-11  0:59 ` [PATCH 0/4] memcg: cleanup the memcg stats interfaces Harry Yoo
@ 2025-11-11  8:36 ` Qi Zheng
  2025-11-11 16:45   ` Shakeel Butt
  2025-11-11  9:54 ` Vlastimil Babka
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Qi Zheng @ 2025-11-11  8:36 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Vlastimil Babka, linux-mm, cgroups, linux-kernel,
	Meta kernel team, Andrew Morton

Hi Shakeel,

On 11/11/25 7:20 AM, Shakeel Butt wrote:
> The memcg stats are safe against irq (and nmi) context and thus does not
> require disabling irqs. However for some stats which are also maintained
> at node level, it is using irq unsafe interface and thus requiring the
> users to still disables irqs or use interfaces which explicitly disables
> irqs. Let's move memcg code to use irq safe node level stats function
> which is already optimized for architectures with HAVE_CMPXCHG_LOCAL
> (all major ones), so there will not be any performance penalty for its
> usage.

Generally, places that call __mod_lruvec_state() also call
__mod_zone_page_state(), which likewise has a corresponding optimized
version (mod_zone_page_state()). It seems worth cleaning that up as
well, so that the IRQ-disabling sections that exist only to update
vmstat can be removed.
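
For illustration, the caller pattern I have in mind looks roughly like
this (hypothetical functions; the stat items are only examples):

	/* today: irqs are disabled only so the two "__" stat updates are safe */
	static void example_update(struct lruvec *lruvec, struct zone *zone, int nr)
	{
		unsigned long flags;

		local_irq_save(flags);
		__mod_lruvec_state(lruvec, NR_FILE_PAGES, nr);
		__mod_zone_page_state(zone, NR_ZONE_ACTIVE_FILE, nr);
		local_irq_restore(flags);
	}

	/* with both cleanups: the irq save/restore can go away entirely */
	static void example_update_cleaned(struct lruvec *lruvec, struct zone *zone, int nr)
	{
		mod_lruvec_state(lruvec, NR_FILE_PAGES, nr);
		mod_zone_page_state(zone, NR_ZONE_ACTIVE_FILE, nr);
	}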

Thanks,
Qi

> 
> Shakeel Butt (4):
>    memcg: use mod_node_page_state to update stats
>    memcg: remove __mod_lruvec_kmem_state
>    memcg: remove __mod_lruvec_state
>    memcg: remove __lruvec_stat_mod_folio
> 
>   include/linux/memcontrol.h | 28 ++++------------------
>   include/linux/mm_inline.h  |  2 +-
>   include/linux/vmstat.h     | 48 ++------------------------------------
>   mm/filemap.c               | 20 ++++++++--------
>   mm/huge_memory.c           |  4 ++--
>   mm/khugepaged.c            |  8 +++----
>   mm/memcontrol.c            | 20 ++++++++--------
>   mm/migrate.c               | 20 ++++++++--------
>   mm/page-writeback.c        |  2 +-
>   mm/rmap.c                  |  4 ++--
>   mm/shmem.c                 |  6 ++---
>   mm/vmscan.c                |  4 ++--
>   mm/workingset.c            |  2 +-
>   13 files changed, 53 insertions(+), 115 deletions(-)
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-10 23:20 [PATCH 0/4] memcg: cleanup the memcg stats interfaces Shakeel Butt
                   ` (5 preceding siblings ...)
  2025-11-11  8:36 ` Qi Zheng
@ 2025-11-11  9:54 ` Vlastimil Babka
  2025-11-11 19:01 ` Roman Gushchin
  2025-11-15 19:27 ` Shakeel Butt
  8 siblings, 0 replies; 49+ messages in thread
From: Vlastimil Babka @ 2025-11-11  9:54 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Qi Zheng, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On 11/11/25 00:20, Shakeel Butt wrote:
> The memcg stats are safe against irq (and nmi) context and thus does not
> require disabling irqs. However for some stats which are also maintained
> at node level, it is using irq unsafe interface and thus requiring the
> users to still disables irqs or use interfaces which explicitly disables
> irqs. Let's move memcg code to use irq safe node level stats function
> which is already optimized for architectures with HAVE_CMPXCHG_LOCAL
> (all major ones), so there will not be any performance penalty for its
> usage.
> 
> Shakeel Butt (4):
>   memcg: use mod_node_page_state to update stats
>   memcg: remove __mod_lruvec_kmem_state
>   memcg: remove __mod_lruvec_state
>   memcg: remove __lruvec_stat_mod_folio

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> 
>  include/linux/memcontrol.h | 28 ++++------------------
>  include/linux/mm_inline.h  |  2 +-
>  include/linux/vmstat.h     | 48 ++------------------------------------
>  mm/filemap.c               | 20 ++++++++--------
>  mm/huge_memory.c           |  4 ++--
>  mm/khugepaged.c            |  8 +++----
>  mm/memcontrol.c            | 20 ++++++++--------
>  mm/migrate.c               | 20 ++++++++--------
>  mm/page-writeback.c        |  2 +-
>  mm/rmap.c                  |  4 ++--
>  mm/shmem.c                 |  6 ++---
>  mm/vmscan.c                |  4 ++--
>  mm/workingset.c            |  2 +-
>  13 files changed, 53 insertions(+), 115 deletions(-)
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11  8:36 ` Qi Zheng
@ 2025-11-11 16:45   ` Shakeel Butt
  2025-11-12  2:11     ` Qi Zheng
  0 siblings, 1 reply; 49+ messages in thread
From: Shakeel Butt @ 2025-11-11 16:45 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Vlastimil Babka, linux-mm, cgroups, linux-kernel,
	Meta kernel team, Andrew Morton

On Tue, Nov 11, 2025 at 04:36:14PM +0800, Qi Zheng wrote:
> Hi Shakeel,
> 
> On 11/11/25 7:20 AM, Shakeel Butt wrote:
> > The memcg stats are safe against irq (and nmi) context and thus does not
> > require disabling irqs. However for some stats which are also maintained
> > at node level, it is using irq unsafe interface and thus requiring the
> > users to still disables irqs or use interfaces which explicitly disables
> > irqs. Let's move memcg code to use irq safe node level stats function
> > which is already optimized for architectures with HAVE_CMPXCHG_LOCAL
> > (all major ones), so there will not be any performance penalty for its
> > usage.
> 
> Generally, places that call __mod_lruvec_state() also call
> __mod_zone_page_state(), and it also has the corresponding optimized
> version (mod_zone_page_state()). It seems necessary to clean that up
> as well, so that those disabling-IRQs that are only used for updating
> vmstat can be removed.

I agree, please take a stab at that.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2025-11-10 23:20 ` [PATCH 1/4] memcg: use mod_node_page_state to update stats Shakeel Butt
  2025-11-11  1:39   ` Harry Yoo
@ 2025-11-11 18:58   ` Roman Gushchin
  2026-01-29 13:05   ` Dev Jain
  2 siblings, 0 replies; 49+ messages in thread
From: Roman Gushchin @ 2025-11-11 18:58 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

Shakeel Butt <shakeel.butt@linux.dev> writes:

> The memcg stats are safe against irq (and nmi) context and thus does not
> require disabling irqs. However some code paths for memcg stats also
> update the node level stats and use irq unsafe interface and thus
> require the users to disable irqs. However node level stats, on
> architectures with HAVE_CMPXCHG_LOCAL (all major ones), has interface
> which does not require irq disabling. Let's move memcg stats code to
> start using that interface for node level stats.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/4] memcg: remove __mod_lruvec_kmem_state
  2025-11-10 23:20 ` [PATCH 2/4] memcg: remove __mod_lruvec_kmem_state Shakeel Butt
  2025-11-11  1:46   ` Harry Yoo
  2025-11-11  8:23   ` Qi Zheng
@ 2025-11-11 18:58   ` Roman Gushchin
  2 siblings, 0 replies; 49+ messages in thread
From: Roman Gushchin @ 2025-11-11 18:58 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

Shakeel Butt <shakeel.butt@linux.dev> writes:

> The __mod_lruvec_kmem_state is already safe against irqs, so there is no
> need to have a separate interface (i.e. mod_lruvec_kmem_state) which
> wraps calls to it with irq disabling and reenabling. Let's rename
> __mod_lruvec_kmem_state to mod_lruvec_kmem_state.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/4] memcg: remove __mod_lruvec_state
  2025-11-10 23:20 ` [PATCH 3/4] memcg: remove __mod_lruvec_state Shakeel Butt
  2025-11-11  5:21   ` Harry Yoo
@ 2025-11-11 18:58   ` Roman Gushchin
  1 sibling, 0 replies; 49+ messages in thread
From: Roman Gushchin @ 2025-11-11 18:58 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

Shakeel Butt <shakeel.butt@linux.dev> writes:

> The __mod_lruvec_state is already safe against irqs, so there is no
> need to have a separate interface (i.e. mod_lruvec_state) which
> wraps calls to it with irq disabling and reenabling. Let's rename
> __mod_lruvec_state to mod_lruvec_state.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/4] memcg: remove __lruvec_stat_mod_folio
  2025-11-10 23:20 ` [PATCH 4/4] memcg: remove __lruvec_stat_mod_folio Shakeel Butt
  2025-11-11  5:41   ` Harry Yoo
@ 2025-11-11 18:59   ` Roman Gushchin
  1 sibling, 0 replies; 49+ messages in thread
From: Roman Gushchin @ 2025-11-11 18:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

Shakeel Butt <shakeel.butt@linux.dev> writes:

> The __lruvec_stat_mod_folio is already safe against irqs, so there is no
> need to have a separate interface (i.e. lruvec_stat_mod_folio) which
> wraps calls to it with irq disabling and reenabling. Let's rename
> __lruvec_stat_mod_folio to lruvec_stat_mod_folio.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-10 23:20 [PATCH 0/4] memcg: cleanup the memcg stats interfaces Shakeel Butt
                   ` (6 preceding siblings ...)
  2025-11-11  9:54 ` Vlastimil Babka
@ 2025-11-11 19:01 ` Roman Gushchin
  2025-11-11 19:34   ` Shakeel Butt
  2025-11-15 19:27 ` Shakeel Butt
  8 siblings, 1 reply; 49+ messages in thread
From: Roman Gushchin @ 2025-11-11 19:01 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

Shakeel Butt <shakeel.butt@linux.dev> writes:

> The memcg stats are safe against irq (and nmi) context and thus does not
> require disabling irqs. However for some stats which are also maintained
> at node level, it is using irq unsafe interface and thus requiring the
> users to still disables irqs or use interfaces which explicitly disables
> irqs. Let's move memcg code to use irq safe node level stats function
> which is already optimized for architectures with HAVE_CMPXCHG_LOCAL
> (all major ones), so there will not be any performance penalty for its
> usage.

Do you have any production data for this, or is it theory-based?

In general I feel we need a benchmark focused on memcg stats:
there have been a number of performance improvements and regressions in
this code over the last few years, so a dedicated benchmark would help
with measuring them.

Nice cleanup btw, thanks!


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11 19:01 ` Roman Gushchin
@ 2025-11-11 19:34   ` Shakeel Butt
  0 siblings, 0 replies; 49+ messages in thread
From: Shakeel Butt @ 2025-11-11 19:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

On Tue, Nov 11, 2025 at 11:01:47AM -0800, Roman Gushchin wrote:
> Shakeel Butt <shakeel.butt@linux.dev> writes:
> 
> > The memcg stats are safe against irq (and nmi) context and thus does not
> > require disabling irqs. However for some stats which are also maintained
> > at node level, it is using irq unsafe interface and thus requiring the
> > users to still disables irqs or use interfaces which explicitly disables
> > irqs. Let's move memcg code to use irq safe node level stats function
> > which is already optimized for architectures with HAVE_CMPXCHG_LOCAL
> > (all major ones), so there will not be any performance penalty for its
> > usage.
> 
> Do you have any production data for this or it's theory-based?

At the moment it is theory-based, or more specifically it is based on the
comments on the HAVE_CMPXCHG_LOCAL variants of the stats update functions.

> 
> In general I feel we need a benchmark focused on memcg stats:
> there was a number of performance improvements and regressions in this
> code over last years, so a dedicated benchmark can help with measuring
> them.

Yeah, it makes sense to have a benchmark. Let me see which benchmarks
trigger these code paths a lot. At a high level, these interfaces are
used in reclaim and migration, which are not really that performance
critical. I will try benchmarks with a lot of allocs/frees.

> 
> Nice cleanup btw, thanks!

Thanks.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-11 16:45   ` Shakeel Butt
@ 2025-11-12  2:11     ` Qi Zheng
  0 siblings, 0 replies; 49+ messages in thread
From: Qi Zheng @ 2025-11-12  2:11 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Vlastimil Babka, linux-mm, cgroups, linux-kernel,
	Meta kernel team, Andrew Morton



On 11/12/25 12:45 AM, Shakeel Butt wrote:
> On Tue, Nov 11, 2025 at 04:36:14PM +0800, Qi Zheng wrote:
>> Hi Shakeel,
>>
>> On 11/11/25 7:20 AM, Shakeel Butt wrote:
>>> The memcg stats are safe against irq (and nmi) context and thus does not
>>> require disabling irqs. However for some stats which are also maintained
>>> at node level, it is using irq unsafe interface and thus requiring the
>>> users to still disables irqs or use interfaces which explicitly disables
>>> irqs. Let's move memcg code to use irq safe node level stats function
>>> which is already optimized for architectures with HAVE_CMPXCHG_LOCAL
>>> (all major ones), so there will not be any performance penalty for its
>>> usage.
>>
>> Generally, places that call __mod_lruvec_state() also call
>> __mod_zone_page_state(), and it also has the corresponding optimized
>> version (mod_zone_page_state()). It seems necessary to clean that up
>> as well, so that those disabling-IRQs that are only used for updating
>> vmstat can be removed.
> 
> I agree, please take a stab at that.

OK, will do.




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/4] memcg: cleanup the memcg stats interfaces
  2025-11-10 23:20 [PATCH 0/4] memcg: cleanup the memcg stats interfaces Shakeel Butt
                   ` (7 preceding siblings ...)
  2025-11-11 19:01 ` Roman Gushchin
@ 2025-11-15 19:27 ` Shakeel Butt
  8 siblings, 0 replies; 49+ messages in thread
From: Shakeel Butt @ 2025-11-15 19:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

Hi Andrew, can you please pick up this series? It is ready for wider
testing.

thanks,
Shakeel

On Mon, Nov 10, 2025 at 03:20:04PM -0800, Shakeel Butt wrote:
> The memcg stats are safe against irq (and nmi) context and thus does not
> require disabling irqs. However for some stats which are also maintained
> at node level, it is using irq unsafe interface and thus requiring the
> users to still disables irqs or use interfaces which explicitly disables
> irqs. Let's move memcg code to use irq safe node level stats function
> which is already optimized for architectures with HAVE_CMPXCHG_LOCAL
> (all major ones), so there will not be any performance penalty for its
> usage.
> 
> Shakeel Butt (4):
>   memcg: use mod_node_page_state to update stats
>   memcg: remove __mod_lruvec_kmem_state
>   memcg: remove __mod_lruvec_state
>   memcg: remove __lruvec_stat_mod_folio
> 
>  include/linux/memcontrol.h | 28 ++++------------------
>  include/linux/mm_inline.h  |  2 +-
>  include/linux/vmstat.h     | 48 ++------------------------------------
>  mm/filemap.c               | 20 ++++++++--------
>  mm/huge_memory.c           |  4 ++--
>  mm/khugepaged.c            |  8 +++----
>  mm/memcontrol.c            | 20 ++++++++--------
>  mm/migrate.c               | 20 ++++++++--------
>  mm/page-writeback.c        |  2 +-
>  mm/rmap.c                  |  4 ++--
>  mm/shmem.c                 |  6 ++---
>  mm/vmscan.c                |  4 ++--
>  mm/workingset.c            |  2 +-
>  13 files changed, 53 insertions(+), 115 deletions(-)
> 
> -- 
> 2.47.3
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2025-11-10 23:20 ` [PATCH 1/4] memcg: use mod_node_page_state to update stats Shakeel Butt
  2025-11-11  1:39   ` Harry Yoo
  2025-11-11 18:58   ` Roman Gushchin
@ 2026-01-29 13:05   ` Dev Jain
  2026-02-02  4:26     ` Shakeel Butt
  2 siblings, 1 reply; 49+ messages in thread
From: Dev Jain @ 2026-01-29 13:05 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team


On 11/11/25 4:50 am, Shakeel Butt wrote:
> The memcg stats are safe against irq (and nmi) context and thus does not
> require disabling irqs. However some code paths for memcg stats also
> update the node level stats and use irq unsafe interface and thus
> require the users to disable irqs. However node level stats, on
> architectures with HAVE_CMPXCHG_LOCAL (all major ones), has interface
> which does not require irq disabling. Let's move memcg stats code to
> start using that interface for node level stats.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---

Hello Shakeel,

We are seeing a regression in the micromm/munmap benchmark with this patch, on arm64 -
the benchmark mmaps a lot of memory, memsets it, and measures the time taken
to munmap it. Please see below and check whether my understanding of this patch is correct.

>  include/linux/memcontrol.h | 2 +-
>  include/linux/vmstat.h     | 4 ++--
>  mm/memcontrol.c            | 6 +++---
>  3 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8c0f15e5978f..f82fac2fd988 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1408,7 +1408,7 @@ static inline void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
>  {
>  	struct page *page = virt_to_head_page(p);
>  
> -	__mod_node_page_state(page_pgdat(page), idx, val);
> +	mod_node_page_state(page_pgdat(page), idx, val);
>  }
>  
>  static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index c287998908bf..11a37aaa4dd9 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -557,7 +557,7 @@ static inline void mod_lruvec_page_state(struct page *page,
>  static inline void __mod_lruvec_state(struct lruvec *lruvec,
>  				      enum node_stat_item idx, int val)
>  {
> -	__mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
> +	mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
>  }
>  
>  static inline void mod_lruvec_state(struct lruvec *lruvec,
> @@ -569,7 +569,7 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
>  static inline void __lruvec_stat_mod_folio(struct folio *folio,
>  					 enum node_stat_item idx, int val)
>  {
> -	__mod_node_page_state(folio_pgdat(folio), idx, val);
> +	mod_node_page_state(folio_pgdat(folio), idx, val);
>  }

See folio_remove_rmap_ptes -> __folio_mod_stat -> __lruvec_stat_mod_folio. This path now
has the unconditional overhead of doing this_cpu_try_cmpxchg(). AFAIU the purpose of
this patch was to remove local_irq_save and optimize the update by using a cmpxchg atomic
(coupled with the fact that the caller will have ensured preempt_disable), but
there are code paths which were not doing local_irq_save in the first place, so
those paths regress.
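
To make the cost difference concrete, the two per-cpu fast paths boil
down to something like this (hypothetical helpers, heavily simplified
from __mod_node_page_state()/mod_node_state() in mm/vmstat.c, with the
threshold/overflow handling dropped):

	/* irq-unsafe variant: plain per-cpu read-modify-write; the caller
	 * must prevent re-entry (irqs off, or a path that cannot nest) */
	static inline void node_diff_add_unsafe(s8 __percpu *p, int delta)
	{
		__this_cpu_write(*p, __this_cpu_read(*p) + delta);
	}

	/* irq-safe variant (HAVE_CMPXCHG_LOCAL): a retry loop where every
	 * attempt is an atomic per-cpu cmpxchg */
	static inline void node_diff_add_safe(s8 __percpu *p, int delta)
	{
		s8 o, n;

		o = this_cpu_read(*p);
		do {
			n = o + delta;
		} while (!this_cpu_try_cmpxchg(*p, &o, n));
	}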

>  
>  static inline void lruvec_stat_mod_folio(struct folio *folio,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 025da46d9959..f4b8a6414ed3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -770,7 +770,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  			int val)
>  {
>  	/* Update node */
> -	__mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
> +	mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
>  
>  	/* Update memcg and lruvec */
>  	if (!mem_cgroup_disabled())
> @@ -789,7 +789,7 @@ void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
>  	/* Untracked pages have no memcg, no lruvec. Update only the node */
>  	if (!memcg) {
>  		rcu_read_unlock();
> -		__mod_node_page_state(pgdat, idx, val);
> +		mod_node_page_state(pgdat, idx, val);
>  		return;
>  	}
>  
> @@ -815,7 +815,7 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
>  	 * vmstats to keep it correct for the root memcg.
>  	 */
>  	if (!memcg) {
> -		__mod_node_page_state(pgdat, idx, val);
> +		mod_node_page_state(pgdat, idx, val);
>  	} else {
>  		lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  		__mod_lruvec_state(lruvec, idx, val);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-01-29 13:05   ` Dev Jain
@ 2026-02-02  4:26     ` Shakeel Butt
  2026-02-02  4:48       ` Dev Jain
  0 siblings, 1 reply; 49+ messages in thread
From: Shakeel Butt @ 2026-02-02  4:26 UTC (permalink / raw)
  To: Dev Jain
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team

On Thu, Jan 29, 2026 at 06:35:21PM +0530, Dev Jain wrote:
> 
> On 11/11/25 4:50 am, Shakeel Butt wrote:
> > The memcg stats are safe against irq (and nmi) context and thus does not
> > require disabling irqs. However some code paths for memcg stats also
> > update the node level stats and use irq unsafe interface and thus
> > require the users to disable irqs. However node level stats, on
> > architectures with HAVE_CMPXCHG_LOCAL (all major ones), has interface
> > which does not require irq disabling. Let's move memcg stats code to
> > start using that interface for node level stats.
> >
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > ---
> 
> Hello Shakeel,
> 
> We are seeing a regression in micromm/munmap benchmark with this patch, on arm64 -
> the benchmark mmmaps a lot of memory, memsets it, and measures the time taken
> to munmap. Please see below if my understanding of this patch is correct.

Thanks for the report. Are you seeing a regression in just the benchmark
or in some real workload as well? Also, how much regression are you seeing?
I have a kernel test robot regression report [1] for this patch as well which
reports a 2.6% regression, so it was on the back-burner for now. I will
take a look at this again soon.

[1] https://lore.kernel.org/all/202512101408.af3876df-lkp@intel.com/



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-02  4:26     ` Shakeel Butt
@ 2026-02-02  4:48       ` Dev Jain
  2026-02-02  4:54         ` Shakeel Butt
  0 siblings, 1 reply; 49+ messages in thread
From: Dev Jain @ 2026-02-02  4:48 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team


On 02/02/26 9:56 am, Shakeel Butt wrote:
> On Thu, Jan 29, 2026 at 06:35:21PM +0530, Dev Jain wrote:
>> On 11/11/25 4:50 am, Shakeel Butt wrote:
>>> The memcg stats are safe against irq (and nmi) context and thus does not
>>> require disabling irqs. However some code paths for memcg stats also
>>> update the node level stats and use irq unsafe interface and thus
>>> require the users to disable irqs. However node level stats, on
>>> architectures with HAVE_CMPXCHG_LOCAL (all major ones), has interface
>>> which does not require irq disabling. Let's move memcg stats code to
>>> start using that interface for node level stats.
>>>
>>> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
>>> ---
>> Hello Shakeel,
>>
>> We are seeing a regression in micromm/munmap benchmark with this patch, on arm64 -
>> the benchmark mmmaps a lot of memory, memsets it, and measures the time taken
>> to munmap. Please see below if my understanding of this patch is correct.
> Thanks for the report. Are you seeing regression in just the benchmark
> or some real workload as well? Also how much regression are you seeing?
> I have a kernel rebot regression report [1] for this patch as well which
> says 2.6% regression and thus it was on the back-burner for now. I will
> take look at this again soon.

The munmap regression is ~24%. Haven't observed a regression in any other
benchmark yet.

>
> [1] https://lore.kernel.org/all/202512101408.af3876df-lkp@intel.com/
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-02  4:48       ` Dev Jain
@ 2026-02-02  4:54         ` Shakeel Butt
  2026-02-02  8:53           ` Dev Jain
  0 siblings, 1 reply; 49+ messages in thread
From: Shakeel Butt @ 2026-02-02  4:54 UTC (permalink / raw)
  To: Dev Jain
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team

> > 
> > > 
> > > Hello Shakeel,
> > > 
> > >  We are seeing a regression in micromm/munmap benchmark with this patch, on arm64 -
> > >  the benchmark mmmaps a lot of memory, memsets it, and measures the time taken
> > >  to munmap. Please see below if my understanding of this patch is correct.
> > > 
> >  Thanks for the report. Are you seeing regression in just the benchmark
> >  or some real workload as well? Also how much regression are you seeing?
> >  I have a kernel rebot regression report [1] for this patch as well which
> >  says 2.6% regression and thus it was on the back-burner for now. I will
> >  take look at this again soon.
> > 
> The munmap regression is ~24%. Haven't observed a regression in any other
> benchmark yet.

Please share the code/benchmark which shows such a regression. Also, if you can
share the perf profile, that would be awesome.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-02  4:54         ` Shakeel Butt
@ 2026-02-02  8:53           ` Dev Jain
  2026-02-04 20:38             ` Shakeel Butt
  0 siblings, 1 reply; 49+ messages in thread
From: Dev Jain @ 2026-02-02  8:53 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team


On 02/02/26 10:24 am, Shakeel Butt wrote:
>>>> Hello Shakeel,
>>>>
>>>>  We are seeing a regression in micromm/munmap benchmark with this patch, on arm64 -
>>>>  the benchmark mmmaps a lot of memory, memsets it, and measures the time taken
>>>>  to munmap. Please see below if my understanding of this patch is correct.
>>>>
>>>  Thanks for the report. Are you seeing regression in just the benchmark
>>>  or some real workload as well? Also how much regression are you seeing?
>>>  I have a kernel rebot regression report [1] for this patch as well which
>>>  says 2.6% regression and thus it was on the back-burner for now. I will
>>>  take look at this again soon.
>>>
>> The munmap regression is ~24%. Haven't observed a regression in any other
>> benchmark yet.
> Please share the code/benchmark which shows such regression, also if you can
> share the perf profile, that would be awesome.

https://gitlab.arm.com/tooling/fastpath/-/blob/main/containers/microbench/micromm.c
You can run this with
./micromm 0 munmap 10

I don't have a perf profile; I measured the time taken by the above command,
with and without the patch.
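
In case it helps reproduce things elsewhere, the measured part has
roughly this shape (a from-memory sketch of the idea, not the actual
micromm.c - the size is made up, see the URL above for the real code):

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <time.h>

	int main(void)
	{
		size_t sz = 1UL << 30;
		struct timespec t0, t1;
		char *p;

		p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		memset(p, 1, sz);		/* fault all pages in */

		clock_gettime(CLOCK_MONOTONIC, &t0);
		munmap(p, sz);			/* only the unmap is timed */
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("munmap: %ld ns\n",
		       (t1.tv_sec - t0.tv_sec) * 1000000000L +
		       (t1.tv_nsec - t0.tv_nsec));
		return 0;
	}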



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-02  8:53           ` Dev Jain
@ 2026-02-04 20:38             ` Shakeel Butt
  2026-02-05  5:20               ` Dev Jain
  0 siblings, 1 reply; 49+ messages in thread
From: Shakeel Butt @ 2026-02-04 20:38 UTC (permalink / raw)
  To: Dev Jain
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team

On Mon, Feb 02, 2026 at 02:23:54PM +0530, Dev Jain wrote:
> 
> On 02/02/26 10:24 am, Shakeel Butt wrote:
> >>>> Hello Shakeel,
> >>>>
> >>>>  We are seeing a regression in micromm/munmap benchmark with this patch, on arm64 -
> >>>>  the benchmark mmmaps a lot of memory, memsets it, and measures the time taken
> >>>>  to munmap. Please see below if my understanding of this patch is correct.
> >>>>
> >>>  Thanks for the report. Are you seeing regression in just the benchmark
> >>>  or some real workload as well? Also how much regression are you seeing?
> >>>  I have a kernel rebot regression report [1] for this patch as well which
> >>>  says 2.6% regression and thus it was on the back-burner for now. I will
> >>>  take look at this again soon.
> >>>
> >> The munmap regression is ~24%. Haven't observed a regression in any other
> >> benchmark yet.
> > Please share the code/benchmark which shows such regression, also if you can
> > share the perf profile, that would be awesome.
> 
> https://gitlab.arm.com/tooling/fastpath/-/blob/main/containers/microbench/micromm.c
> You can run this with
> ./micromm 0 munmap 10
> 
> Don't have a perf profile, I measured the time taken by above command, with and
> without the patch.
> 

Hi Dev, can you please try the following patch?


From 40155feca7e7bc846800ab8449735bdb03164d6d Mon Sep 17 00:00:00 2001
From: Shakeel Butt <shakeel.butt@linux.dev>
Date: Wed, 4 Feb 2026 08:46:08 -0800
Subject: [PATCH] vmstat: use preempt disable instead of try_cmpxchg

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/mmzone.h |  2 +-
 mm/vmstat.c            | 58 ++++++++++++++++++------------------------
 2 files changed, 26 insertions(+), 34 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..499cd53efdd6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -776,7 +776,7 @@ struct per_cpu_zonestat {
 
 struct per_cpu_nodestat {
 	s8 stat_threshold;
-	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+	long vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
 };
 
 #endif /* !__GENERATING_BOUNDS.H */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 86b14b0f77b5..0930695597bb 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -377,7 +377,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 				long delta)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long __percpu *p = pcp->vm_node_stat_diff + item;
 	long x;
 	long t;
 
@@ -456,8 +456,8 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	long __percpu *p = pcp->vm_node_stat_diff + item;
+	long v, t;
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
@@ -467,7 +467,7 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		long overstep = t >> 1;
 
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
@@ -512,8 +512,8 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	long __percpu *p = pcp->vm_node_stat_diff + item;
+	long v, t;
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
@@ -523,7 +523,7 @@ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		long overstep = t >> 1;
 
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);
@@ -619,9 +619,8 @@ static inline void mod_node_state(struct pglist_data *pgdat,
        enum node_stat_item item, int delta, int overstep_mode)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	long n, t, z;
-	s8 o;
+	long __percpu *p = pcp->vm_node_stat_diff + item;
+	long o, n, t, z;
 
 	if (vmstat_item_in_bytes(item)) {
 		/*
@@ -634,32 +633,25 @@ static inline void mod_node_state(struct pglist_data *pgdat,
 		delta >>= PAGE_SHIFT;
 	}
 
+	preempt_disable();
+
 	o = this_cpu_read(*p);
-	do {
-		z = 0;  /* overflow to node counters */
+	n = o + delta;
 
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a node.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
+	t = this_cpu_read(pcp->stat_threshold);
+	z = 0;
 
-		n = delta + (long)o;
+	if (abs(n) > t) {
+		int os = overstep_mode * (t >> 1);
 
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
+		/* Overflow must be added to node counters */
+		z = n + os;
+		n = -os;
+	}
 
-			/* Overflow must be added to node counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (!this_cpu_try_cmpxchg(*p, &o, n));
+	this_cpu_add(*p, n - o);
+
+	preempt_enable();
 
 	if (z)
 		node_page_state_add(z, pgdat, item);
@@ -866,7 +858,7 @@ static bool refresh_cpu_vm_stats(bool do_pagesets)
 		struct per_cpu_nodestat __percpu *p = pgdat->per_cpu_nodestats;
 
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
-			int v;
+			long v;
 
 			v = this_cpu_xchg(p->vm_node_stat_diff[i], 0);
 			if (v) {
@@ -929,7 +921,7 @@ void cpu_vm_stats_fold(int cpu)
 
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 			if (p->vm_node_stat_diff[i]) {
-				int v;
+				long v;
 
 				v = p->vm_node_stat_diff[i];
 				p->vm_node_stat_diff[i] = 0;
-- 
2.47.3



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-04 20:38             ` Shakeel Butt
@ 2026-02-05  5:20               ` Dev Jain
  2026-02-05  5:45                 ` Harry Yoo
  0 siblings, 1 reply; 49+ messages in thread
From: Dev Jain @ 2026-02-05  5:20 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Harry Yoo, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team


On 05/02/26 2:08 am, Shakeel Butt wrote:
> On Mon, Feb 02, 2026 at 02:23:54PM +0530, Dev Jain wrote:
>> On 02/02/26 10:24 am, Shakeel Butt wrote:
>>>>>> Hello Shakeel,
>>>>>>
>>>>>>  We are seeing a regression in micromm/munmap benchmark with this patch, on arm64 -
>>>>>>  the benchmark mmmaps a lot of memory, memsets it, and measures the time taken
>>>>>>  to munmap. Please see below if my understanding of this patch is correct.
>>>>>>
>>>>>  Thanks for the report. Are you seeing regression in just the benchmark
>>>>>  or some real workload as well? Also how much regression are you seeing?
>>>>>  I have a kernel rebot regression report [1] for this patch as well which
>>>>>  says 2.6% regression and thus it was on the back-burner for now. I will
>>>>>  take look at this again soon.
>>>>>
>>>> The munmap regression is ~24%. Haven't observed a regression in any other
>>>> benchmark yet.
>>> Please share the code/benchmark which shows such regression, also if you can
>>> share the perf profile, that would be awesome.
>> https://gitlab.arm.com/tooling/fastpath/-/blob/main/containers/microbench/micromm.c
>> You can run this with
>> ./micromm 0 munmap 10
>>
>> Don't have a perf profile, I measured the time taken by above command, with and
>> without the patch.
>>
> Hi Dev, can you please try the following patch?
>
>
> From 40155feca7e7bc846800ab8449735bdb03164d6d Mon Sep 17 00:00:00 2001
> From: Shakeel Butt <shakeel.butt@linux.dev>
> Date: Wed, 4 Feb 2026 08:46:08 -0800
> Subject: [PATCH] vmstat: use preempt disable instead of try_cmpxchg
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  include/linux/mmzone.h |  2 +-
>  mm/vmstat.c            | 58 ++++++++++++++++++------------------------
>  2 files changed, 26 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 3e51190a55e4..499cd53efdd6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -776,7 +776,7 @@ struct per_cpu_zonestat {
>  
>  struct per_cpu_nodestat {
>  	s8 stat_threshold;
> -	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
> +	long vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
>  };
>  
>  #endif /* !__GENERATING_BOUNDS.H */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 86b14b0f77b5..0930695597bb 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -377,7 +377,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
>  				long delta)
>  {
>  	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> -	s8 __percpu *p = pcp->vm_node_stat_diff + item;
> +	long __percpu *p = pcp->vm_node_stat_diff + item;
>  	long x;
>  	long t;
>  
> @@ -456,8 +456,8 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>  void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>  {
>  	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> -	s8 __percpu *p = pcp->vm_node_stat_diff + item;
> -	s8 v, t;
> +	long __percpu *p = pcp->vm_node_stat_diff + item;
> +	long v, t;
>  
>  	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
>  
> @@ -467,7 +467,7 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>  	v = __this_cpu_inc_return(*p);
>  	t = __this_cpu_read(pcp->stat_threshold);
>  	if (unlikely(v > t)) {
> -		s8 overstep = t >> 1;
> +		long overstep = t >> 1;
>  
>  		node_page_state_add(v + overstep, pgdat, item);
>  		__this_cpu_write(*p, -overstep);
> @@ -512,8 +512,8 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>  void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>  {
>  	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> -	s8 __percpu *p = pcp->vm_node_stat_diff + item;
> -	s8 v, t;
> +	long __percpu *p = pcp->vm_node_stat_diff + item;
> +	long v, t;
>  
>  	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
>  
> @@ -523,7 +523,7 @@ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>  	v = __this_cpu_dec_return(*p);
>  	t = __this_cpu_read(pcp->stat_threshold);
>  	if (unlikely(v < - t)) {
> -		s8 overstep = t >> 1;
> +		long overstep = t >> 1;
>  
>  		node_page_state_add(v - overstep, pgdat, item);
>  		__this_cpu_write(*p, overstep);
> @@ -619,9 +619,8 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>         enum node_stat_item item, int delta, int overstep_mode)
>  {
>  	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> -	s8 __percpu *p = pcp->vm_node_stat_diff + item;
> -	long n, t, z;
> -	s8 o;
> +	long __percpu *p = pcp->vm_node_stat_diff + item;
> +	long o, n, t, z;
>  
>  	if (vmstat_item_in_bytes(item)) {
>  		/*
> @@ -634,32 +633,25 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>  		delta >>= PAGE_SHIFT;
>  	}
>  
> +	preempt_disable();
> +
>  	o = this_cpu_read(*p);
> -	do {
> -		z = 0;  /* overflow to node counters */
> +	n = o + delta;
>  
> -		/*
> -		 * The fetching of the stat_threshold is racy. We may apply
> -		 * a counter threshold to the wrong the cpu if we get
> -		 * rescheduled while executing here. However, the next
> -		 * counter update will apply the threshold again and
> -		 * therefore bring the counter under the threshold again.
> -		 *
> -		 * Most of the time the thresholds are the same anyways
> -		 * for all cpus in a node.
> -		 */
> -		t = this_cpu_read(pcp->stat_threshold);
> +	t = this_cpu_read(pcp->stat_threshold);
> +	z = 0;
>  
> -		n = delta + (long)o;
> +	if (abs(n) > t) {
> +		int os = overstep_mode * (t >> 1);
>  
> -		if (abs(n) > t) {
> -			int os = overstep_mode * (t >> 1) ;
> +		/* Overflow must be added to node counters */
> +		z = n + os;
> +		n = -os;
> +	}
>  
> -			/* Overflow must be added to node counters */
> -			z = n + os;
> -			n = -os;
> -		}
> -	} while (!this_cpu_try_cmpxchg(*p, &o, n));
> +	this_cpu_add(*p, n - o);
> +
> +	preempt_enable();
>  
>  	if (z)
>  		node_page_state_add(z, pgdat, item);
> @@ -866,7 +858,7 @@ static bool refresh_cpu_vm_stats(bool do_pagesets)
>  		struct per_cpu_nodestat __percpu *p = pgdat->per_cpu_nodestats;
>  
>  		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> -			int v;
> +			long v;
>  
>  			v = this_cpu_xchg(p->vm_node_stat_diff[i], 0);
>  			if (v) {
> @@ -929,7 +921,7 @@ void cpu_vm_stats_fold(int cpu)
>  
>  		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>  			if (p->vm_node_stat_diff[i]) {
> -				int v;
> +				long v;
>  
>  				v = p->vm_node_stat_diff[i];
>  				p->vm_node_stat_diff[i] = 0;

Thanks for looking into this.

But this doesn't solve it :( preempt_disable() contains a compiler barrier;
probably that's why.

Also, can you confirm whether my analysis of the regression was correct?
Because if it was, then this diff looks wrong - AFAIU preempt_disable()
won't stop an irq handler from interrupting the execution, so this
will introduce a bug for code paths running in irq context.




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-05  5:20               ` Dev Jain
@ 2026-02-05  5:45                 ` Harry Yoo
  2026-02-05  5:58                   ` Shakeel Butt
  0 siblings, 1 reply; 49+ messages in thread
From: Harry Yoo @ 2026-02-05  5:45 UTC (permalink / raw)
  To: Dev Jain
  Cc: Shakeel Butt, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team

On Thu, Feb 05, 2026 at 10:50:06AM +0530, Dev Jain wrote:
> 
> On 05/02/26 2:08 am, Shakeel Butt wrote:
> > On Mon, Feb 02, 2026 at 02:23:54PM +0530, Dev Jain wrote:
> >> On 02/02/26 10:24 am, Shakeel Butt wrote:
> >>>>>> Hello Shakeel,
> >>>>>>
> >>>>>>  We are seeing a regression in micromm/munmap benchmark with this patch, on arm64 -
> >>>>>>  the benchmark mmmaps a lot of memory, memsets it, and measures the time taken
> >>>>>>  to munmap. Please see below if my understanding of this patch is correct.
> >>>>>>
> >>>>>  Thanks for the report. Are you seeing regression in just the benchmark
> >>>>>  or some real workload as well? Also how much regression are you seeing?
> >>>>>  I have a kernel rebot regression report [1] for this patch as well which
> >>>>>  says 2.6% regression and thus it was on the back-burner for now. I will
> >>>>>  take look at this again soon.
> >>>>>
> >>>> The munmap regression is ~24%. Haven't observed a regression in any other
> >>>> benchmark yet.
> >>> Please share the code/benchmark which shows such regression, also if you can
> >>> share the perf profile, that would be awesome.
> >> https://gitlab.arm.com/tooling/fastpath/-/blob/main/containers/microbench/micromm.c
> >> You can run this with
> >> ./micromm 0 munmap 10
> >>
> >> Don't have a perf profile, I measured the time taken by above command, with and
> >> without the patch.
> >>
> > Hi Dev, can you please try the following patch?
> >
> >
> > From 40155feca7e7bc846800ab8449735bdb03164d6d Mon Sep 17 00:00:00 2001
> > From: Shakeel Butt <shakeel.butt@linux.dev>
> > Date: Wed, 4 Feb 2026 08:46:08 -0800
> > Subject: [PATCH] vmstat: use preempt disable instead of try_cmpxchg
> >
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > ---

[...snip...]

> 
> Thanks for looking into this.
> 
> But this doesn't solve it :( preempt_disable() contains a compiler barrier,
> probably that's why.

I think it doesn't solve the regression because of how arm64
implements this_cpu_add_8() and this_cpu_try_cmpxchg_8().

On arm64, IIUC both this_cpu_try_cmpxchg_8() and this_cpu_add_8() are
implemented using LL/SC instructions or LSE atomics (if supported).

See:
- this_cpu_add_8()
  -> __percpu_add_case_64
     (which is generated from PERCPU_OP)

- this_cpu_try_cmpxchg_8()
  -> __cpu_fallback_try_cmpxchg(..., this_cpu_cmpxchg_8)
  -> this_cpu_cmpxchg_8()
  -> cmpxchg_relaxed()
  -> raw_cmpxchg_relaxed()
  -> arch_cmpxchg_relaxed()
  -> __cmpxchg_wrapper()
  -> __cmpxchg_case_64()
  -> __lse_ll_sc_body(_cmpxchg_case_64, ...)

> Also can you confirm whether my analysis of the regression was correct?
> Because if it was, then this diff looks wrong - AFAIU preempt_disable()
> won't stop an irq handler from interrupting the execution, so this
> will introduce a bug for code paths running in irq context.

I was worried about the correctness too, but this_cpu_add() is safe
against IRQs and so the stat will be _eventually_ consistent?

Ofc it's so confusing! Maybe I'm the one confused.

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-05  5:45                 ` Harry Yoo
@ 2026-02-05  5:58                   ` Shakeel Butt
  2026-02-10  7:38                     ` Dev Jain
  2026-02-12  1:31                     ` Shakeel Butt
  0 siblings, 2 replies; 49+ messages in thread
From: Shakeel Butt @ 2026-02-05  5:58 UTC (permalink / raw)
  To: Harry Yoo, Dev Jain
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team

> 
> On Thu, Feb 05, 2026 at 10:50:06AM +0530, Dev Jain wrote:
> 
> > 
> > On 05/02/26 2:08 am, Shakeel Butt wrote:
> >  On Mon, Feb 02, 2026 at 02:23:54PM +0530, Dev Jain wrote:
> >  On 02/02/26 10:24 am, Shakeel Butt wrote:
> >  Hello Shakeel,
> > 
> >  We are seeing a regression in micromm/munmap benchmark with this patch, on arm64 -
> >  the benchmark mmmaps a lot of memory, memsets it, and measures the time taken
> >  to munmap. Please see below if my understanding of this patch is correct.
> > 
> >  Thanks for the report. Are you seeing regression in just the benchmark
> >  or some real workload as well? Also how much regression are you seeing?
> >  I have a kernel rebot regression report [1] for this patch as well which
> >  says 2.6% regression and thus it was on the back-burner for now. I will
> >  take look at this again soon.
> > 
> >  The munmap regression is ~24%. Haven't observed a regression in any other
> >  benchmark yet.
> >  Please share the code/benchmark which shows such regression, also if you can
> >  share the perf profile, that would be awesome.
> >  https://gitlab.arm.com/tooling/fastpath/-/blob/main/containers/microbench/micromm.c
> >  You can run this with
> >  ./micromm 0 munmap 10
> > 
> >  Don't have a perf profile, I measured the time taken by above command, with and
> >  without the patch.
> > 
> >  Hi Dev, can you please try the following patch?
> > 
> >  From 40155feca7e7bc846800ab8449735bdb03164d6d Mon Sep 17 00:00:00 2001
> >  From: Shakeel Butt <shakeel.butt@linux.dev>
> >  Date: Wed, 4 Feb 2026 08:46:08 -0800
> >  Subject: [PATCH] vmstat: use preempt disable instead of try_cmpxchg
> > 
> >  Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> >  ---
> > 
> [...snip...]
> 
> > 
> > Thanks for looking into this.
> >  
> >  But this doesn't solve it :( preempt_disable() contains a compiler barrier,
> >  probably that's why.
> > 
> I think the reason why it doesn't solve the regression is because of how
> arm64 implements this_cpu_add_8() and this_cpu_try_cmpxchg_8().
> 
> On arm64, IIUC both this_cpu_try_cmpxchg_8() and this_cpu_add_8() are
> implemented using LL/SC instructions or LSE atomics (if supported).
> 
> See:
> - this_cpu_add_8()
>  -> __percpu_add_case_64
>  (which is generated from PERCPU_OP)
> 
> - this_cpu_try_cmpxchg_8()
>  -> __cpu_fallback_try_cmpxchg(..., this_cpu_cmpxchg_8)
>  -> this_cpu_cmpxchg_8()
>  -> cmpxchg_relaxed()
>  -> raw_cmpxchg_relaxed()
>  -> arch_cmpxchg_relaxed()
>  -> __cmpxchg_wrapper()
>  -> __cmpxchg_case_64()
>  -> __lse_ll_sc_body(_cmpxchg_case_64, ...)
> 

Oh, so it is an arm64-specific issue. I tested on an x86-64 machine and it
solves the little regression it had before. So, on arm64 all this_cpu ops,
i.e. the ones without the double underscore, use LL/SC instructions.

Need more thought on this. 

> > 
> > Also can you confirm whether my analysis of the regression was correct?
> >  Because if it was, then this diff looks wrong - AFAIU preempt_disable()
> >  won't stop an irq handler from interrupting the execution, so this
> >  will introduce a bug for code paths running in irq context.
> > 
> I was worried about the correctness too, but this_cpu_add() is safe
> against IRQs and so the stat will be _eventually_ consistent?
> 
> Ofc it's so confusing! Maybe I'm the one confused.

Yeah, there is no issue with the proposed patch as it makes the function
re-entrancy safe.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-05  5:58                   ` Shakeel Butt
@ 2026-02-10  7:38                     ` Dev Jain
  2026-02-10 16:29                       ` Shakeel Butt
  2026-02-12  1:31                     ` Shakeel Butt
  1 sibling, 1 reply; 49+ messages in thread
From: Dev Jain @ 2026-02-10  7:38 UTC (permalink / raw)
  To: Shakeel Butt, Harry Yoo
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team


On 05/02/26 11:28 am, Shakeel Butt wrote:
>> On Thu, Feb 05, 2026 at 10:50:06AM +0530, Dev Jain wrote:
>>
>>> On 05/02/26 2:08 am, Shakeel Butt wrote:
>>>  On Mon, Feb 02, 2026 at 02:23:54PM +0530, Dev Jain wrote:
>>>  On 02/02/26 10:24 am, Shakeel Butt wrote:
>>>  Hello Shakeel,
>>>
>>>  We are seeing a regression in micromm/munmap benchmark with this patch, on arm64 -
>>>  the benchmark mmmaps a lot of memory, memsets it, and measures the time taken
>>>  to munmap. Please see below if my understanding of this patch is correct.
>>>
>>>  Thanks for the report. Are you seeing regression in just the benchmark
>>>  or some real workload as well? Also how much regression are you seeing?
>>>  I have a kernel rebot regression report [1] for this patch as well which
>>>  says 2.6% regression and thus it was on the back-burner for now. I will
>>>  take look at this again soon.
>>>
>>>  The munmap regression is ~24%. Haven't observed a regression in any other
>>>  benchmark yet.
>>>  Please share the code/benchmark which shows such regression, also if you can
>>>  share the perf profile, that would be awesome.
>>>  https://gitlab.arm.com/tooling/fastpath/-/blob/main/containers/microbench/micromm.c
>>>  You can run this with
>>>  ./micromm 0 munmap 10
>>>
>>>  Don't have a perf profile, I measured the time taken by above command, with and
>>>  without the patch.
>>>
>>>  Hi Dev, can you please try the following patch?
>>>
>>>  From 40155feca7e7bc846800ab8449735bdb03164d6d Mon Sep 17 00:00:00 2001
>>>  From: Shakeel Butt <shakeel.butt@linux.dev>
>>>  Date: Wed, 4 Feb 2026 08:46:08 -0800
>>>  Subject: [PATCH] vmstat: use preempt disable instead of try_cmpxchg
>>>
>>>  Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>  ---
>>>
>> [...snip...]
>>
>>> Thanks for looking into this.
>>>  
>>>  But this doesn't solve it :( preempt_disable() contains a compiler barrier,
>>>  probably that's why.
>>>
>> I think the reason why it doesn't solve the regression is because of how
>> arm64 implements this_cpu_add_8() and this_cpu_try_cmpxchg_8().
>>
>> On arm64, IIUC both this_cpu_try_cmpxchg_8() and this_cpu_add_8() are
>> implemented using LL/SC instructions or LSE atomics (if supported).
>>
>> See:
>> - this_cpu_add_8()
>>  -> __percpu_add_case_64
>>  (which is generated from PERCPU_OP)
>>
>> - this_cpu_try_cmpxchg_8()
>>  -> __cpu_fallback_try_cmpxchg(..., this_cpu_cmpxchg_8)
>>  -> this_cpu_cmpxchg_8()
>>  -> cmpxchg_relaxed()
>>  -> raw_cmpxchg_relaxed()
>>  -> arch_cmpxchg_relaxed()
>>  -> __cmpxchg_wrapper()
>>  -> __cmpxchg_case_64()
>>  -> __lse_ll_sc_body(_cmpxchg_case_64, ...)
>>
> Oh so it is arm64 specific issue. I tested on x86-64 machine and it solves
> the little regression it had before. So, on arm64 all this_cpu_ops i.e. without
> double underscore, uses LL/SC instructions. 
>
> Need more thought on this. 
>
>>> Also can you confirm whether my analysis of the regression was correct?
>>>  Because if it was, then this diff looks wrong - AFAIU preempt_disable()
>>>  won't stop an irq handler from interrupting the execution, so this
>>>  will introduce a bug for code paths running in irq context.
>>>
>> I was worried about the correctness too, but this_cpu_add() is safe
>> against IRQs and so the stat will be _eventually_ consistent?
>>
>> Ofc it's so confusing! Maybe I'm the one confused.
> Yeah there is no issue with proposed patch as it is making the function
> re-entrant safe.

Ah yes, this_cpu_add() does the addition in one shot, without a separate
read-modify-write sequence.

I am still puzzled whether the original patch was a bug fix or an optimization.
The patch description says that node stat updates used an irq-unsafe interface.
Therefore, we had foo() calling __foo() nested within local_irq_save/restore. But
there were code paths which called __foo() directly - so your patch fixes a bug,
right (in which case we should have a Fixes tag)? The patch ensures that
mod_node_page_state() is used, which, depending on HAVE_CMPXCHG_LOCAL, either
disables irqs or uses cmpxchg - making the interface irq safe.
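
For reference, my condensed reading of how mm/vmstat.c picks between the
two (zone side and threshold details omitted, so not the literal code):

	#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
	void mod_node_page_state(struct pglist_data *pgdat,
				 enum node_stat_item item, long delta)
	{
		/* this_cpu_try_cmpxchg() loop, no irq toggling needed */
		mod_node_state(pgdat, item, delta, 0);
	}
	#else
	void mod_node_page_state(struct pglist_data *pgdat,
				 enum node_stat_item item, long delta)
	{
		unsigned long flags;

		local_irq_save(flags);
		__mod_node_page_state(pgdat, item, delta);
		local_irq_restore(flags);
	}
	#endif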



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-10  7:38                     ` Dev Jain
@ 2026-02-10 16:29                       ` Shakeel Butt
  2026-02-11  7:37                         ` Dev Jain
  0 siblings, 1 reply; 49+ messages in thread
From: Shakeel Butt @ 2026-02-10 16:29 UTC (permalink / raw)
  To: Dev Jain
  Cc: Harry Yoo, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team

On Tue, Feb 10, 2026 at 01:08:49PM +0530, Dev Jain wrote:
[...]
> 
> >>
> > Oh so it is arm64 specific issue. I tested on x86-64 machine and it solves
> > the little regression it had before. So, on arm64 all this_cpu_ops i.e. without
> > double underscore, uses LL/SC instructions. 
> >
> > Need more thought on this. 
> >
> >>> Also can you confirm whether my analysis of the regression was correct?
> >>>  Because if it was, then this diff looks wrong - AFAIU preempt_disable()
> >>>  won't stop an irq handler from interrupting the execution, so this
> >>>  will introduce a bug for code paths running in irq context.
> >>>
> >> I was worried about the correctness too, but this_cpu_add() is safe
> >> against IRQs and so the stat will be _eventually_ consistent?
> >>
> >> Ofc it's so confusing! Maybe I'm the one confused.
> > Yeah there is no issue with proposed patch as it is making the function
> > re-entrant safe.
> 
> Ah yes, this_cpu_add() does the addition in one shot without read-modify-write.
> 
> I am still puzzled whether the original patch was a bug fix or an optimization.

The original patch was a cleanup patch. The memcg stats update functions
were already irq/nmi safe without disabling irqs, and that patch did the
same for the numa stats. Though it seems like that is causing a regression
on arm64, as this_cpu* ops are expensive there.

> The patch description says that node stat updation uses irq unsafe interface.
> Therefore, we had foo() calling __foo() nested with local_irq_save/restore. But
> there were code paths which directly called __foo() - so, your patch fixes a bug right

No, those places were already disabling irqs and should be fine.

I am working on adding batched stats update functionality in the hope
that it will fix the regression.
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-10 16:29                       ` Shakeel Butt
@ 2026-02-11  7:37                         ` Dev Jain
  2026-02-11  8:53                           ` Harry Yoo
  0 siblings, 1 reply; 49+ messages in thread
From: Dev Jain @ 2026-02-11  7:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Harry Yoo, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team


On 10/02/26 9:59 pm, Shakeel Butt wrote:
> On Tue, Feb 10, 2026 at 01:08:49PM +0530, Dev Jain wrote:
> [...]
>>> Oh so it is arm64 specific issue. I tested on x86-64 machine and it solves
>>> the little regression it had before. So, on arm64 all this_cpu_ops i.e. without
>>> double underscore, uses LL/SC instructions. 
>>>
>>> Need more thought on this. 
>>>
>>>>> Also can you confirm whether my analysis of the regression was correct?
>>>>>  Because if it was, then this diff looks wrong - AFAIU preempt_disable()
>>>>>  won't stop an irq handler from interrupting the execution, so this
>>>>>  will introduce a bug for code paths running in irq context.
>>>>>
>>>> I was worried about the correctness too, but this_cpu_add() is safe
>>>> against IRQs and so the stat will be _eventually_ consistent?
>>>>
>>>> Ofc it's so confusing! Maybe I'm the one confused.
>>> Yeah there is no issue with proposed patch as it is making the function
>>> re-entrant safe.
>> Ah yes, this_cpu_add() does the addition in one shot without read-modify-write.
>>
>> I am still puzzled whether the original patch was a bug fix or an optimization.
> The original patch was a cleanup patch. The memcg stats update functions
> were already irq/nmi safe without disabling irqs and that patch did the
> same for the numa stats. Though it seems like that is causing regression
> for arm64 as this_cpu* ops are expensive on arm64. 
>
>> The patch description says that node stat updation uses irq unsafe interface.
>> Therefore, we had foo() calling __foo() nested with local_irq_save/restore. But
>> there were code paths which directly called __foo() - so, your patch fixes a bug right
> No, those places were already disabling irqs and should be fine.

Please correct me if I am missing something here. Simply putting an
if (!irqs_disabled()) -> dump_stack() in __lruvec_stat_mod_folio, before
calling __mod_node_page_state, reveals:

[ 6.486375] Call trace:
[ 6.486376] show_stack+0x20/0x38 (C)
[ 6.486379] dump_stack_lvl+0x74/0x90
[ 6.486382] dump_stack+0x18/0x28
[ 6.486383] __lruvec_stat_mod_folio+0x160/0x180
[ 6.486385] folio_add_file_rmap_ptes+0x128/0x480
[ 6.486388] set_pte_range+0xe8/0x320
[ 6.486389] finish_fault+0x260/0x508
[ 6.486390] do_fault+0x2d0/0x598
[ 6.486391] __handle_mm_fault+0x398/0xb60
[ 6.486393] handle_mm_fault+0x15c/0x298
[ 6.486394] __get_user_pages+0x204/0xb88
[ 6.486395] populate_vma_page_range+0xbc/0x1b8
[ 6.486396] __mm_populate+0xcc/0x1e0
[ 6.486397] __arm64_sys_mlockall+0x1d4/0x1f8
[ 6.486398] invoke_syscall+0x50/0x120
[ 6.486399] el0_svc_common.constprop.0+0x48/0xf0
[ 6.486400] do_el0_svc+0x24/0x38
[ 6.486400] el0_svc+0x34/0xf0
[ 6.486402] el0t_64_sync_handler+0xa0/0xe8
[ 6.486404] el0t_64_sync+0x198/0x1a0

Indeed finish_fault() takes a PTL spin lock without irq disablement.
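
(For completeness, the instrumentation mentioned above is just a
debug-only hack along these lines - illustrative only, not something I
intend to post:)

void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
			     int val)
{
	/* Debug-only check: report callers that reach the node stat
	 * update with irqs still enabled. */
	if (!irqs_disabled())
		dump_stack();

	__mod_node_page_state(folio_pgdat(folio), idx, val);
	/* (the memcg half of the update is elided here) */
}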

>
> I am working on adding batched stats update functionality in the hope
> that will fix the regression.

Thanks! FYI, I have zeroed in on preempt_disable() as the culprit. Dropping it
from _pcp_protect_return solves the regression. Unlike x86, arm64 does a
preempt_disable when doing this_cpu_*. On a cursory look this seems unnecessary -
since we do preempt_enable() immediately after reading the pointer, CPU migration
is possible anyway, so there is nothing to be gained by reading the pcpu pointer
with preemption disabled. I am investigating whether we can simply drop this in
general.
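
For reference, the arm64 helper in question is roughly the following
(paraphrased from memory; see arch/arm64/include/asm/percpu.h for the
real macro):

#define _pcp_protect_return(op, pcp, args...)				\
({									\
	typeof(pcp) __retval;						\
	/* Keep the task on this CPU so the percpu address computed	\
	 * by raw_cpu_ptr() matches the CPU the op executes on.	\
	 */								\
	preempt_disable_notrace();					\
	__retval = (typeof(pcp))op(raw_cpu_ptr(&(pcp)), ##args);	\
	preempt_enable_notrace();					\
	__retval;							\
})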



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-11  7:37                         ` Dev Jain
@ 2026-02-11  8:53                           ` Harry Yoo
  2026-02-11  9:24                             ` Shakeel Butt
  2026-02-12  5:14                             ` Dev Jain
  0 siblings, 2 replies; 49+ messages in thread
From: Harry Yoo @ 2026-02-11  8:53 UTC (permalink / raw)
  To: Dev Jain
  Cc: Shakeel Butt, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team

On Wed, Feb 11, 2026 at 01:07:40PM +0530, Dev Jain wrote:
> 
> On 10/02/26 9:59 pm, Shakeel Butt wrote:
> > On Tue, Feb 10, 2026 at 01:08:49PM +0530, Dev Jain wrote:
> > [...]
> >>> Oh so it is arm64 specific issue. I tested on x86-64 machine and it solves
> >>> the little regression it had before. So, on arm64 all this_cpu_ops i.e. without
> >>> double underscore, uses LL/SC instructions. 
> >>>
> >>> Need more thought on this. 
> >>>
> >>>>> Also can you confirm whether my analysis of the regression was correct?
> >>>>>  Because if it was, then this diff looks wrong - AFAIU preempt_disable()
> >>>>>  won't stop an irq handler from interrupting the execution, so this
> >>>>>  will introduce a bug for code paths running in irq context.
> >>>>>
> >>>> I was worried about the correctness too, but this_cpu_add() is safe
> >>>> against IRQs and so the stat will be _eventually_ consistent?
> >>>>
> >>>> Ofc it's so confusing! Maybe I'm the one confused.
> >>> Yeah there is no issue with proposed patch as it is making the function
> >>> re-entrant safe.
> >> Ah yes, this_cpu_add() does the addition in one shot without read-modify-write.
> >>
> >> I am still puzzled whether the original patch was a bug fix or an optimization.
> > The original patch was a cleanup patch. The memcg stats update functions
> > were already irq/nmi safe without disabling irqs and that patch did the
> > same for the numa stats. Though it seems like that is causing regression
> > for arm64 as this_cpu* ops are expensive on arm64. 
> >
> >> The patch description says that node stat updation uses irq unsafe interface.
> >> Therefore, we had foo() calling __foo() nested with local_irq_save/restore. But
> >> there were code paths which directly called __foo() - so, your patch fixes a bug right
> > No, those places were already disabling irqs and should be fine.
> 
> Please correct me if I am missing something here. Simply putting an
> if (!irqs_disabled()) -> dump_stack() in __lruvec_stat_mod_folio, before
> calling __mod_node_page_state, reveals:
> 
> [ 6.486375] Call trace:
> [ 6.486376] show_stack+0x20/0x38 (C)
> [ 6.486379] dump_stack_lvl+0x74/0x90
> [ 6.486382] dump_stack+0x18/0x28
> [ 6.486383] __lruvec_stat_mod_folio+0x160/0x180
> [ 6.486385] folio_add_file_rmap_ptes+0x128/0x480
> [ 6.486388] set_pte_range+0xe8/0x320
> [ 6.486389] finish_fault+0x260/0x508
> [ 6.486390] do_fault+0x2d0/0x598
> [ 6.486391] __handle_mm_fault+0x398/0xb60
> [ 6.486393] handle_mm_fault+0x15c/0x298
> [ 6.486394] __get_user_pages+0x204/0xb88
> [ 6.486395] populate_vma_page_range+0xbc/0x1b8
> [ 6.486396] __mm_populate+0xcc/0x1e0
> [ 6.486397] __arm64_sys_mlockall+0x1d4/0x1f8
> [ 6.486398] invoke_syscall+0x50/0x120
> [ 6.486399] el0_svc_common.constprop.0+0x48/0xf0
> [ 6.486400] do_el0_svc+0x24/0x38
> [ 6.486400] el0_svc+0x34/0xf0
> [ 6.486402] el0t_64_sync_handler+0xa0/0xe8
> [ 6.486404] el0t_64_sync+0x198/0x1a0
> 
> Indeed finish_fault() takes a PTL spin lock without irq disablement.

That indeed looks incorrect to me.
I was assuming __foo() is always called with IRQs disabled!

> > I am working on adding batched stats update functionality in the hope
> > that will fix the regression.
> 
> Thanks! FYI, I have zeroed in the issue on to preempt_disable(). Dropping this
> from _pcpu_protect_return solves the regression.

That's interesting, why is the cost of preempt disable/enable so high?

> Unlike x86, arm64 does a preempt_disable
> when doing this_cpu_*. On a cursory look it seems like this is unnecessary - since we
> are doing preempt_enable() immediately after reading the pointer, CPU migration is
> possible anyways, so there is nothing to be gained by reading pcpu pointer with
> preemption disabled. I am investigating whether we can simply drop this in general.

Let me quote an old email from Mark Rutland [1]:
> We also thought that initially, but there's a sbutle race that can
> occur, and so we added code to disable preemption in commit:
> 
>   f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe")
> 
> The problem on arm64 is that our atomics take a single base register,
> and we have to generate the percpu address with separate instructions
> from the atomic itself. That means we can get preempted between address
> generation and the atomic, which is problematic for sequences like:
> 
> 	// Thread-A			// Thread-B
> 
> 	this_cpu_add(var)
> 					local_irq_disable(flags)
> 					...
> 					v = __this_cpu_read(var);
> 					v = some_function(v);
> 					__this_cpu_write(var, v);
> 					...
> 					local_irq_restore(flags)
> 
> ... which can unexpectedly race as:
> 
> 
> 	// Thread-A			// Thread-B
> 
> 	< generate CPU X addr >
> 	< preempted >
> 
> 					< scheduled on CPU X >
> 					local_irq_disable(flags);
> 					v = __this_cpu_read(var);
> 
> 	< scheduled on CPU Y >
> 	< add to CPU X's var >
> 					v = some_function(v);
> 					__this_cpu_write(var, v);
> 					local_irq_restore(flags);
> 
> ... and hence we lose an update to a percpu variable.

... so, removing preempt disable _in general_ is probably not a good idea.

[1] https://lore.kernel.org/all/20190311164837.GD24275@lakrids.cambridge.arm.com

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-11  8:53                           ` Harry Yoo
@ 2026-02-11  9:24                             ` Shakeel Butt
  2026-02-11 10:14                               ` Harry Yoo
  2026-02-12  5:16                               ` Dev Jain
  2026-02-12  5:14                             ` Dev Jain
  1 sibling, 2 replies; 49+ messages in thread
From: Shakeel Butt @ 2026-02-11  9:24 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Dev Jain, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team

On Wed, Feb 11, 2026 at 05:53:38PM +0900, Harry Yoo wrote:
> On Wed, Feb 11, 2026 at 01:07:40PM +0530, Dev Jain wrote:
> > 
> > On 10/02/26 9:59 pm, Shakeel Butt wrote:
> > > On Tue, Feb 10, 2026 at 01:08:49PM +0530, Dev Jain wrote:
> > > [...]
> > >>> Oh so it is arm64 specific issue. I tested on x86-64 machine and it solves
> > >>> the little regression it had before. So, on arm64 all this_cpu_ops i.e. without
> > >>> double underscore, uses LL/SC instructions. 
> > >>>
> > >>> Need more thought on this. 
> > >>>
> > >>>>> Also can you confirm whether my analysis of the regression was correct?
> > >>>>>  Because if it was, then this diff looks wrong - AFAIU preempt_disable()
> > >>>>>  won't stop an irq handler from interrupting the execution, so this
> > >>>>>  will introduce a bug for code paths running in irq context.
> > >>>>>
> > >>>> I was worried about the correctness too, but this_cpu_add() is safe
> > >>>> against IRQs and so the stat will be _eventually_ consistent?
> > >>>>
> > >>>> Ofc it's so confusing! Maybe I'm the one confused.
> > >>> Yeah there is no issue with proposed patch as it is making the function
> > >>> re-entrant safe.
> > >> Ah yes, this_cpu_add() does the addition in one shot without read-modify-write.
> > >>
> > >> I am still puzzled whether the original patch was a bug fix or an optimization.
> > > The original patch was a cleanup patch. The memcg stats update functions
> > > were already irq/nmi safe without disabling irqs and that patch did the
> > > same for the numa stats. Though it seems like that is causing regression
> > > for arm64 as this_cpu* ops are expensive on arm64. 
> > >
> > >> The patch description says that node stat updation uses irq unsafe interface.
> > >> Therefore, we had foo() calling __foo() nested with local_irq_save/restore. But
> > >> there were code paths which directly called __foo() - so, your patch fixes a bug right
> > > No, those places were already disabling irqs and should be fine.
> > 
> > Please correct me if I am missing something here. Simply putting an
> > if (!irqs_disabled()) -> dump_stack() in __lruvec_stat_mod_folio, before
> > calling __mod_node_page_state, reveals:
> > 
> > [ 6.486375] Call trace:
> > [ 6.486376] show_stack+0x20/0x38 (C)
> > [ 6.486379] dump_stack_lvl+0x74/0x90
> > [ 6.486382] dump_stack+0x18/0x28
> > [ 6.486383] __lruvec_stat_mod_folio+0x160/0x180
> > [ 6.486385] folio_add_file_rmap_ptes+0x128/0x480
> > [ 6.486388] set_pte_range+0xe8/0x320
> > [ 6.486389] finish_fault+0x260/0x508
> > [ 6.486390] do_fault+0x2d0/0x598
> > [ 6.486391] __handle_mm_fault+0x398/0xb60
> > [ 6.486393] handle_mm_fault+0x15c/0x298
> > [ 6.486394] __get_user_pages+0x204/0xb88
> > [ 6.486395] populate_vma_page_range+0xbc/0x1b8
> > [ 6.486396] __mm_populate+0xcc/0x1e0
> > [ 6.486397] __arm64_sys_mlockall+0x1d4/0x1f8
> > [ 6.486398] invoke_syscall+0x50/0x120
> > [ 6.486399] el0_svc_common.constprop.0+0x48/0xf0
> > [ 6.486400] do_el0_svc+0x24/0x38
> > [ 6.486400] el0_svc+0x34/0xf0
> > [ 6.486402] el0t_64_sync_handler+0xa0/0xe8
> > [ 6.486404] el0t_64_sync+0x198/0x1a0
> > 
> > Indeed finish_fault() takes a PTL spin lock without irq disablement.
> 
> That indeed looks incorrect to me.
> I was assuming __foo() is always called with IRQs disabled!

Not necessarily. Stats which never get updated in IRQ context can be
updated using __foo() with just preemption disabled.
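
(Illustrative fragment only - 'pgdat', 'idx' and 'nr' below are
placeholders, but the split is roughly:)

static void stat_update_examples(struct pglist_data *pgdat,
				 enum node_stat_item idx, long nr)
{
	unsigned long flags;

	/* Counter that is also modified from irq context: use the
	 * irq-safe mod_node_page_state(), or disable irqs around the
	 * __ variant as below. */
	local_irq_save(flags);
	__mod_node_page_state(pgdat, idx, nr);
	local_irq_restore(flags);

	/* Counter that is never modified from irq context: disabled
	 * preemption (e.g. running under a non-irq spinlock) is enough
	 * for the __ variant. */
	preempt_disable();
	__mod_node_page_state(pgdat, idx, nr);
	preempt_enable();
}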

> 
> > > I am working on adding batched stats update functionality in the hope
> > > that will fix the regression.
> > 
> > Thanks! FYI, I have zeroed in the issue on to preempt_disable(). Dropping this
> > from _pcpu_protect_return solves the regression.
> 
> That's interesting, why is the cost of preempt disable/enable so high?
> 

What made you (Dev) so convinced that preempt_disable is that expensive?

> > Unlike x86, arm64 does a preempt_disable
> > when doing this_cpu_*. On a cursory look it seems like this is unnecessary - since we
> > are doing preempt_enable() immediately after reading the pointer, CPU migration is
> > possible anyways, so there is nothing to be gained by reading pcpu pointer with
> > preemption disabled. I am investigating whether we can simply drop this in general.
> 
[...]
> 
> ... so, removing preempt disable _in general_ is probably not a good idea.
> 

Yup, I agree here.

> [1] https://lore.kernel.org/all/20190311164837.GD24275@lakrids.cambridge.arm.com
> 
> -- 
> Cheers,
> Harry / Hyeonggon
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-11  9:24                             ` Shakeel Butt
@ 2026-02-11 10:14                               ` Harry Yoo
  2026-02-12  5:16                               ` Dev Jain
  1 sibling, 0 replies; 49+ messages in thread
From: Harry Yoo @ 2026-02-11 10:14 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Dev Jain, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team

On Wed, Feb 11, 2026 at 01:24:35AM -0800, Shakeel Butt wrote:
> On Wed, Feb 11, 2026 at 05:53:38PM +0900, Harry Yoo wrote:
> > On Wed, Feb 11, 2026 at 01:07:40PM +0530, Dev Jain wrote:
> > > 
> > > On 10/02/26 9:59 pm, Shakeel Butt wrote:
> > > > On Tue, Feb 10, 2026 at 01:08:49PM +0530, Dev Jain wrote:
> > > > [...]
> > > >>> Oh so it is arm64 specific issue. I tested on x86-64 machine and it solves
> > > >>> the little regression it had before. So, on arm64 all this_cpu_ops i.e. without
> > > >>> double underscore, uses LL/SC instructions. 
> > > >>>
> > > >>> Need more thought on this. 
> > > >>>
> > > >>>>> Also can you confirm whether my analysis of the regression was correct?
> > > >>>>>  Because if it was, then this diff looks wrong - AFAIU preempt_disable()
> > > >>>>>  won't stop an irq handler from interrupting the execution, so this
> > > >>>>>  will introduce a bug for code paths running in irq context.
> > > >>>>>
> > > >>>> I was worried about the correctness too, but this_cpu_add() is safe
> > > >>>> against IRQs and so the stat will be _eventually_ consistent?
> > > >>>>
> > > >>>> Ofc it's so confusing! Maybe I'm the one confused.
> > > >>> Yeah there is no issue with proposed patch as it is making the function
> > > >>> re-entrant safe.
> > > >> Ah yes, this_cpu_add() does the addition in one shot without read-modify-write.
> > > >>
> > > >> I am still puzzled whether the original patch was a bug fix or an optimization.
> > > > The original patch was a cleanup patch. The memcg stats update functions
> > > > were already irq/nmi safe without disabling irqs and that patch did the
> > > > same for the numa stats. Though it seems like that is causing regression
> > > > for arm64 as this_cpu* ops are expensive on arm64. 
> > > >
> > > >> The patch description says that node stat updation uses irq unsafe interface.
> > > >> Therefore, we had foo() calling __foo() nested with local_irq_save/restore. But
> > > >> there were code paths which directly called __foo() - so, your patch fixes a bug right
> > > > No, those places were already disabling irqs and should be fine.
> > > 
> > > Please correct me if I am missing something here. Simply putting an
> > > if (!irqs_disabled()) -> dump_stack() in __lruvec_stat_mod_folio, before
> > > calling __mod_node_page_state, reveals:
> > > 
> > > [ 6.486375] Call trace:
> > > [ 6.486376] show_stack+0x20/0x38 (C)
> > > [ 6.486379] dump_stack_lvl+0x74/0x90
> > > [ 6.486382] dump_stack+0x18/0x28
> > > [ 6.486383] __lruvec_stat_mod_folio+0x160/0x180
> > > [ 6.486385] folio_add_file_rmap_ptes+0x128/0x480
> > > [ 6.486388] set_pte_range+0xe8/0x320
> > > [ 6.486389] finish_fault+0x260/0x508
> > > [ 6.486390] do_fault+0x2d0/0x598
> > > [ 6.486391] __handle_mm_fault+0x398/0xb60
> > > [ 6.486393] handle_mm_fault+0x15c/0x298
> > > [ 6.486394] __get_user_pages+0x204/0xb88
> > > [ 6.486395] populate_vma_page_range+0xbc/0x1b8
> > > [ 6.486396] __mm_populate+0xcc/0x1e0
> > > [ 6.486397] __arm64_sys_mlockall+0x1d4/0x1f8
> > > [ 6.486398] invoke_syscall+0x50/0x120
> > > [ 6.486399] el0_svc_common.constprop.0+0x48/0xf0
> > > [ 6.486400] do_el0_svc+0x24/0x38
> > > [ 6.486400] el0_svc+0x34/0xf0
> > > [ 6.486402] el0t_64_sync_handler+0xa0/0xe8
> > > [ 6.486404] el0t_64_sync+0x198/0x1a0
> > > 
> > > Indeed finish_fault() takes a PTL spin lock without irq disablement.
> > 
> > That indeed looks incorrect to me.
> > I was assuming __foo() is always called with IRQs disabled!
> 
> Not necessarily. For stats which never get updated in IRQ context, can
> be updated using __foo() with just premption disabled.

Ah, thanks. I was missing that aspect.

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-05  5:58                   ` Shakeel Butt
  2026-02-10  7:38                     ` Dev Jain
@ 2026-02-12  1:31                     ` Shakeel Butt
  1 sibling, 0 replies; 49+ messages in thread
From: Shakeel Butt @ 2026-02-12  1:31 UTC (permalink / raw)
  To: Harry Yoo, Dev Jain
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team, shy828301, cl

+Yang Shi and Christoph Lameter

On Thu, Feb 05, 2026 at 05:58:44AM +0000, Shakeel Butt wrote:
> > 
> > On Thu, Feb 05, 2026 at 10:50:06AM +0530, Dev Jain wrote:
> > 
> > > 
> > > On 05/02/26 2:08 am, Shakeel Butt wrote:
> > >  On Mon, Feb 02, 2026 at 02:23:54PM +0530, Dev Jain wrote:
> > >  On 02/02/26 10:24 am, Shakeel Butt wrote:
> > >  Hello Shakeel,
> > > 
> > >  We are seeing a regression in micromm/munmap benchmark with this patch, on arm64 -
> > >  the benchmark mmmaps a lot of memory, memsets it, and measures the time taken
> > >  to munmap. Please see below if my understanding of this patch is correct.
> > > 
> > >  Thanks for the report. Are you seeing regression in just the benchmark
> > >  or some real workload as well? Also how much regression are you seeing?
> > >  I have a kernel rebot regression report [1] for this patch as well which
> > >  says 2.6% regression and thus it was on the back-burner for now. I will
> > >  take look at this again soon.
> > > 
> > >  The munmap regression is ~24%. Haven't observed a regression in any other
> > >  benchmark yet.
> > >  Please share the code/benchmark which shows such regression, also if you can
> > >  share the perf profile, that would be awesome.
> > >  https://gitlab.arm.com/tooling/fastpath/-/blob/main/containers/microbench/micromm.c
> > >  You can run this with
> > >  ./micromm 0 munmap 10
> > > 
> > >  Don't have a perf profile, I measured the time taken by above command, with and
> > >  without the patch.
> > > 
> > >  Hi Dev, can you please try the following patch?
> > > 
> > >  From 40155feca7e7bc846800ab8449735bdb03164d6d Mon Sep 17 00:00:00 2001
> > >  From: Shakeel Butt <shakeel.butt@linux.dev>
> > >  Date: Wed, 4 Feb 2026 08:46:08 -0800
> > >  Subject: [PATCH] vmstat: use preempt disable instead of try_cmpxchg
> > > 
> > >  Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > >  ---
> > > 
> > [...snip...]
> > 
> > > 
> > > Thanks for looking into this.
> > >  
> > >  But this doesn't solve it :( preempt_disable() contains a compiler barrier,
> > >  probably that's why.
> > > 
> > I think the reason why it doesn't solve the regression is because of how
> > arm64 implements this_cpu_add_8() and this_cpu_try_cmpxchg_8().
> > 
> > On arm64, IIUC both this_cpu_try_cmpxchg_8() and this_cpu_add_8() are
> > implemented using LL/SC instructions or LSE atomics (if supported).
> > 
> > See:
> > - this_cpu_add_8()
> >  -> __percpu_add_case_64
> >  (which is generated from PERCPU_OP)
> > 
> > - this_cpu_try_cmpxchg_8()
> >  -> __cpu_fallback_try_cmpxchg(..., this_cpu_cmpxchg_8)
> >  -> this_cpu_cmpxchg_8()
> >  -> cmpxchg_relaxed()
> >  -> raw_cmpxchg_relaxed()
> >  -> arch_cmpxchg_relaxed()
> >  -> __cmpxchg_wrapper()
> >  -> __cmpxchg_case_64()
> >  -> __lse_ll_sc_body(_cmpxchg_case_64, ...)
> > 
> 
> Oh so it is arm64 specific issue. I tested on x86-64 machine and it solves
> the little regression it had before. So, on arm64 all this_cpu_ops i.e. without
> double underscore, uses LL/SC instructions. 
> 
> Need more thought on this. 
>

It seems like Yang Shi is looking into improving this_cpu_ops for arm64.

https://lore.kernel.org/CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-11  8:53                           ` Harry Yoo
  2026-02-11  9:24                             ` Shakeel Butt
@ 2026-02-12  5:14                             ` Dev Jain
  1 sibling, 0 replies; 49+ messages in thread
From: Dev Jain @ 2026-02-12  5:14 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Shakeel Butt, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm,
	cgroups, linux-kernel, Meta kernel team


On 11/02/26 2:23 pm, Harry Yoo wrote:
> On Wed, Feb 11, 2026 at 01:07:40PM +0530, Dev Jain wrote:
>> On 10/02/26 9:59 pm, Shakeel Butt wrote:
>>> On Tue, Feb 10, 2026 at 01:08:49PM +0530, Dev Jain wrote:
>>> [...]
>>>>> Oh so it is arm64 specific issue. I tested on x86-64 machine and it solves
>>>>> the little regression it had before. So, on arm64 all this_cpu_ops i.e. without
>>>>> double underscore, uses LL/SC instructions. 
>>>>>
>>>>> Need more thought on this. 
>>>>>
>>>>>>> Also can you confirm whether my analysis of the regression was correct?
>>>>>>>  Because if it was, then this diff looks wrong - AFAIU preempt_disable()
>>>>>>>  won't stop an irq handler from interrupting the execution, so this
>>>>>>>  will introduce a bug for code paths running in irq context.
>>>>>>>
>>>>>> I was worried about the correctness too, but this_cpu_add() is safe
>>>>>> against IRQs and so the stat will be _eventually_ consistent?
>>>>>>
>>>>>> Ofc it's so confusing! Maybe I'm the one confused.
>>>>> Yeah there is no issue with proposed patch as it is making the function
>>>>> re-entrant safe.
>>>> Ah yes, this_cpu_add() does the addition in one shot without read-modify-write.
>>>>
>>>> I am still puzzled whether the original patch was a bug fix or an optimization.
>>> The original patch was a cleanup patch. The memcg stats update functions
>>> were already irq/nmi safe without disabling irqs and that patch did the
>>> same for the numa stats. Though it seems like that is causing regression
>>> for arm64 as this_cpu* ops are expensive on arm64. 
>>>
>>>> The patch description says that node stat updation uses irq unsafe interface.
>>>> Therefore, we had foo() calling __foo() nested with local_irq_save/restore. But
>>>> there were code paths which directly called __foo() - so, your patch fixes a bug right
>>> No, those places were already disabling irqs and should be fine.
>> Please correct me if I am missing something here. Simply putting an
>> if (!irqs_disabled()) -> dump_stack() in __lruvec_stat_mod_folio, before
>> calling __mod_node_page_state, reveals:
>>
>> [ 6.486375] Call trace:
>> [ 6.486376] show_stack+0x20/0x38 (C)
>> [ 6.486379] dump_stack_lvl+0x74/0x90
>> [ 6.486382] dump_stack+0x18/0x28
>> [ 6.486383] __lruvec_stat_mod_folio+0x160/0x180
>> [ 6.486385] folio_add_file_rmap_ptes+0x128/0x480
>> [ 6.486388] set_pte_range+0xe8/0x320
>> [ 6.486389] finish_fault+0x260/0x508
>> [ 6.486390] do_fault+0x2d0/0x598
>> [ 6.486391] __handle_mm_fault+0x398/0xb60
>> [ 6.486393] handle_mm_fault+0x15c/0x298
>> [ 6.486394] __get_user_pages+0x204/0xb88
>> [ 6.486395] populate_vma_page_range+0xbc/0x1b8
>> [ 6.486396] __mm_populate+0xcc/0x1e0
>> [ 6.486397] __arm64_sys_mlockall+0x1d4/0x1f8
>> [ 6.486398] invoke_syscall+0x50/0x120
>> [ 6.486399] el0_svc_common.constprop.0+0x48/0xf0
>> [ 6.486400] do_el0_svc+0x24/0x38
>> [ 6.486400] el0_svc+0x34/0xf0
>> [ 6.486402] el0t_64_sync_handler+0xa0/0xe8
>> [ 6.486404] el0t_64_sync+0x198/0x1a0
>>
>> Indeed finish_fault() takes a PTL spin lock without irq disablement.
> That indeed looks incorrect to me.
> I was assuming __foo() is always called with IRQs disabled!
>
>>> I am working on adding batched stats update functionality in the hope
>>> that will fix the regression.
>> Thanks! FYI, I have zeroed in the issue on to preempt_disable(). Dropping this
>> from _pcpu_protect_return solves the regression.
> That's interesting, why is the cost of preempt disable/enable so high?
>
>> Unlike x86, arm64 does a preempt_disable
>> when doing this_cpu_*. On a cursory look it seems like this is unnecessary - since we
>> are doing preempt_enable() immediately after reading the pointer, CPU migration is
>> possible anyways, so there is nothing to be gained by reading pcpu pointer with
>> preemption disabled. I am investigating whether we can simply drop this in general.
> Let me quote an old email from Mark Rutland [1]:
>> We also thought that initially, but there's a sbutle race that can
>> occur, and so we added code to disable preemption in commit:
>>
>>   f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe")
>>
>> The problem on arm64 is that our atomics take a single base register,
>> and we have to generate the percpu address with separate instructions
>> from the atomic itself. That means we can get preempted between address
>> generation and the atomic, which is problematic for sequences like:
>>
>> 	// Thread-A			// Thread-B
>>
>> 	this_cpu_add(var)
>> 					local_irq_disable(flags)
>> 					...
>> 					v = __this_cpu_read(var);
>> 					v = some_function(v);
>> 					__this_cpu_write(var, v);
>> 					...
>> 					local_irq_restore(flags)
>>
>> ... which can unexpectedly race as:
>>
>>
>> 	// Thread-A			// Thread-B
>>
>> 	< generate CPU X addr >
>> 	< preempted >
>>
>> 					< scheduled on CPU X >
>> 					local_irq_disable(flags);
>> 					v = __this_cpu_read(var);
>>
>> 	< scheduled on CPU Y >
>> 	< add to CPU X's var >
>> 					v = some_function(v);
>> 					__this_cpu_write(var, v);
>> 					local_irq_restore(flags);
>>
>> ... and hence we lose an update to a percpu variable.
> ... so, removing preempt disable _in general_ is probably not a good idea.
>
> [1] https://lore.kernel.org/all/20190311164837.GD24275@lakrids.cambridge.arm.com

Thanks for the link!



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
  2026-02-11  9:24                             ` Shakeel Butt
  2026-02-11 10:14                               ` Harry Yoo
@ 2026-02-12  5:16                               ` Dev Jain
  1 sibling, 0 replies; 49+ messages in thread
From: Dev Jain @ 2026-02-12  5:16 UTC (permalink / raw)
  To: Shakeel Butt, Harry Yoo
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Vlastimil Babka, linux-mm, cgroups,
	linux-kernel, Meta kernel team


On 11/02/26 2:54 pm, Shakeel Butt wrote:
> On Wed, Feb 11, 2026 at 05:53:38PM +0900, Harry Yoo wrote:
>> On Wed, Feb 11, 2026 at 01:07:40PM +0530, Dev Jain wrote:
>>> On 10/02/26 9:59 pm, Shakeel Butt wrote:
>>>> On Tue, Feb 10, 2026 at 01:08:49PM +0530, Dev Jain wrote:
>>>> [...]
>>>>>> Oh so it is arm64 specific issue. I tested on x86-64 machine and it solves
>>>>>> the little regression it had before. So, on arm64 all this_cpu_ops i.e. without
>>>>>> double underscore, uses LL/SC instructions. 
>>>>>>
>>>>>> Need more thought on this. 
>>>>>>
>>>>>>>> Also can you confirm whether my analysis of the regression was correct?
>>>>>>>>  Because if it was, then this diff looks wrong - AFAIU preempt_disable()
>>>>>>>>  won't stop an irq handler from interrupting the execution, so this
>>>>>>>>  will introduce a bug for code paths running in irq context.
>>>>>>>>
>>>>>>> I was worried about the correctness too, but this_cpu_add() is safe
>>>>>>> against IRQs and so the stat will be _eventually_ consistent?
>>>>>>>
>>>>>>> Ofc it's so confusing! Maybe I'm the one confused.
>>>>>> Yeah there is no issue with proposed patch as it is making the function
>>>>>> re-entrant safe.
>>>>> Ah yes, this_cpu_add() does the addition in one shot without read-modify-write.
>>>>>
>>>>> I am still puzzled whether the original patch was a bug fix or an optimization.
>>>> The original patch was a cleanup patch. The memcg stats update functions
>>>> were already irq/nmi safe without disabling irqs and that patch did the
>>>> same for the numa stats. Though it seems like that is causing regression
>>>> for arm64 as this_cpu* ops are expensive on arm64. 
>>>>
>>>>> The patch description says that node stat updation uses irq unsafe interface.
>>>>> Therefore, we had foo() calling __foo() nested with local_irq_save/restore. But
>>>>> there were code paths which directly called __foo() - so, your patch fixes a bug right
>>>> No, those places were already disabling irqs and should be fine.
>>> Please correct me if I am missing something here. Simply putting an
>>> if (!irqs_disabled()) -> dump_stack() in __lruvec_stat_mod_folio, before
>>> calling __mod_node_page_state, reveals:
>>>
>>> [ 6.486375] Call trace:
>>> [ 6.486376] show_stack+0x20/0x38 (C)
>>> [ 6.486379] dump_stack_lvl+0x74/0x90
>>> [ 6.486382] dump_stack+0x18/0x28
>>> [ 6.486383] __lruvec_stat_mod_folio+0x160/0x180
>>> [ 6.486385] folio_add_file_rmap_ptes+0x128/0x480
>>> [ 6.486388] set_pte_range+0xe8/0x320
>>> [ 6.486389] finish_fault+0x260/0x508
>>> [ 6.486390] do_fault+0x2d0/0x598
>>> [ 6.486391] __handle_mm_fault+0x398/0xb60
>>> [ 6.486393] handle_mm_fault+0x15c/0x298
>>> [ 6.486394] __get_user_pages+0x204/0xb88
>>> [ 6.486395] populate_vma_page_range+0xbc/0x1b8
>>> [ 6.486396] __mm_populate+0xcc/0x1e0
>>> [ 6.486397] __arm64_sys_mlockall+0x1d4/0x1f8
>>> [ 6.486398] invoke_syscall+0x50/0x120
>>> [ 6.486399] el0_svc_common.constprop.0+0x48/0xf0
>>> [ 6.486400] do_el0_svc+0x24/0x38
>>> [ 6.486400] el0_svc+0x34/0xf0
>>> [ 6.486402] el0t_64_sync_handler+0xa0/0xe8
>>> [ 6.486404] el0t_64_sync+0x198/0x1a0
>>>
>>> Indeed finish_fault() takes a PTL spin lock without irq disablement.
>> That indeed looks incorrect to me.
>> I was assuming __foo() is always called with IRQs disabled!
> Not necessarily. For stats which never get updated in IRQ context, can
> be updated using __foo() with just premption disabled.
>
>>>> I am working on adding batched stats update functionality in the hope
>>>> that will fix the regression.
>>> Thanks! FYI, I have zeroed in the issue on to preempt_disable(). Dropping this
>>> from _pcpu_protect_return solves the regression.
>> That's interesting, why is the cost of preempt disable/enable so high?
>>
> What made you (Dev) so convinced that preempt_disable is that expensive.

As I wrote above, dropping the preempt disable from _pcp_protect_return solved
the regression, so that hints at its cost - although it seems surprising that it
would be this expensive, so I need to investigate : )

>
>>> Unlike x86, arm64 does a preempt_disable
>>> when doing this_cpu_*. On a cursory look it seems like this is unnecessary - since we
>>> are doing preempt_enable() immediately after reading the pointer, CPU migration is
>>> possible anyways, so there is nothing to be gained by reading pcpu pointer with
>>> preemption disabled. I am investigating whether we can simply drop this in general.
> [...]
>> ... so, removing preempt disable _in general_ is probably not a good idea.
>>
> Yup, I agree here.
>
>> [1] https://lore.kernel.org/all/20190311164837.GD24275@lakrids.cambridge.arm.com
>>
>> -- 
>> Cheers,
>> Harry / Hyeonggon
>>


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2026-02-12  5:16 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-10 23:20 [PATCH 0/4] memcg: cleanup the memcg stats interfaces Shakeel Butt
2025-11-10 23:20 ` [PATCH 1/4] memcg: use mod_node_page_state to update stats Shakeel Butt
2025-11-11  1:39   ` Harry Yoo
2025-11-11 18:58   ` Roman Gushchin
2026-01-29 13:05   ` Dev Jain
2026-02-02  4:26     ` Shakeel Butt
2026-02-02  4:48       ` Dev Jain
2026-02-02  4:54         ` Shakeel Butt
2026-02-02  8:53           ` Dev Jain
2026-02-04 20:38             ` Shakeel Butt
2026-02-05  5:20               ` Dev Jain
2026-02-05  5:45                 ` Harry Yoo
2026-02-05  5:58                   ` Shakeel Butt
2026-02-10  7:38                     ` Dev Jain
2026-02-10 16:29                       ` Shakeel Butt
2026-02-11  7:37                         ` Dev Jain
2026-02-11  8:53                           ` Harry Yoo
2026-02-11  9:24                             ` Shakeel Butt
2026-02-11 10:14                               ` Harry Yoo
2026-02-12  5:16                               ` Dev Jain
2026-02-12  5:14                             ` Dev Jain
2026-02-12  1:31                     ` Shakeel Butt
2025-11-10 23:20 ` [PATCH 2/4] memcg: remove __mod_lruvec_kmem_state Shakeel Butt
2025-11-11  1:46   ` Harry Yoo
2025-11-11  8:23   ` Qi Zheng
2025-11-11 18:58   ` Roman Gushchin
2025-11-10 23:20 ` [PATCH 3/4] memcg: remove __mod_lruvec_state Shakeel Butt
2025-11-11  5:21   ` Harry Yoo
2025-11-11 18:58   ` Roman Gushchin
2025-11-10 23:20 ` [PATCH 4/4] memcg: remove __lruvec_stat_mod_folio Shakeel Butt
2025-11-11  5:41   ` Harry Yoo
2025-11-11 18:59   ` Roman Gushchin
2025-11-11  0:59 ` [PATCH 0/4] memcg: cleanup the memcg stats interfaces Harry Yoo
2025-11-11  2:23   ` Qi Zheng
2025-11-11  2:39     ` Shakeel Butt
2025-11-11  2:48       ` Qi Zheng
2025-11-11  3:00         ` Shakeel Butt
2025-11-11  3:07           ` Qi Zheng
2025-11-11  3:18             ` Harry Yoo
2025-11-11  3:29               ` Qi Zheng
2025-11-11  3:05         ` Harry Yoo
2025-11-11  8:01           ` Sebastian Andrzej Siewior
2025-11-11  8:36 ` Qi Zheng
2025-11-11 16:45   ` Shakeel Butt
2025-11-12  2:11     ` Qi Zheng
2025-11-11  9:54 ` Vlastimil Babka
2025-11-11 19:01 ` Roman Gushchin
2025-11-11 19:34   ` Shakeel Butt
2025-11-15 19:27 ` Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox