* [PATCH bpf-next v3 0/6] bpf, mm: Introduce try_alloc_pages()
@ 2024-12-18 3:07 alexei.starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation alexei.starovoitov
` (5 more replies)
0 siblings, 6 replies; 36+ messages in thread
From: alexei.starovoitov @ 2024-12-18 3:07 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Hi All,
The main motivation is to make page and slab allocation reentrant and
to remove bpf_mem_alloc.
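
For context, the intended usage on the bpf side looks roughly like the
sketch below (based on patch 6; error handling trimmed, surrounding code
illustrative):

  struct page *pg;

  pg = try_alloc_pages(nid, 0);	/* never sleeps, may fail in any context */
  if (!pg)
  	return -ENOMEM;
  ...
  free_pages_nolock(pg, 0);	/* also safe from any context */
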
v2->v3:
To address the issues spotted by Sebastian, Vlastimil, Steven:
- Made GFP_TRYLOCK internal to mm/internal.h
try_alloc_pages() and free_pages_nolock() are the only interfaces.
- Since spin_trylock() is not safe in RT from hard IRQ and NMI contexts,
disable such usage in local_trylock_irqsave() and in try_alloc_pages().
In that case free_pages_nolock() falls back to the llist right away.
- Process trylock_free_pages llist when preemptible.
- Check for things like unaccepted memory and order <= 3 early.
- Don't call into __alloc_pages_slowpath() at all.
- Inspired by Vlastimil's struct local_tryirq_lock, adopted the idea in
local_lock_t. The extra 4 bytes in the !RT local_lock_t shouldn't
affect any of the current local_lock_t users. This is patch 3.
- Tested with bpf selftests in RT and !RT and realized how much
more work is necessary on the bpf side to play nice with RT.
That raised the urgency of this work. The alternative is to
convert bpf bits left and right to bpf_mem_alloc.
v2:
https://lore.kernel.org/bpf/20241210023936.46871-1-alexei.starovoitov@gmail.com/
v1->v2:
- Fixed buggy try_alloc_pages_noprof() in PREEMPT_RT. Thanks Peter.
- Optimized all paths by doing spin_trylock_irqsave() first and
checking gfp_flags & __GFP_TRYLOCK only when the trylock fails,
then falling back to spin_lock_irqsave() in the regular mode.
So the new gfp flag does not add performance overhead.
- Patches 2-5 are new. They introduce the lockless/trylock-based
free_pages_nolock() and memcg support, so the allocator is in usable
shape for bpf in patch 6.
v1:
https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/
Alexei Starovoitov (6):
mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
mm, bpf: Introduce free_pages_nolock()
locking/local_lock: Introduce local_trylock_irqsave()
memcg: Use trylock to access memcg stock_lock.
mm, bpf: Use memcg in try_alloc_pages().
bpf: Use try_alloc_pages() to allocate pages for bpf needs.
include/linux/gfp.h | 4 +
include/linux/gfp_types.h | 1 +
include/linux/local_lock.h | 9 ++
include/linux/local_lock_internal.h | 76 ++++++++++++--
include/linux/mm_types.h | 4 +
include/linux/mmzone.h | 3 +
kernel/bpf/syscall.c | 4 +-
mm/internal.h | 2 +
mm/memcontrol.c | 20 +++-
mm/page_alloc.c | 152 +++++++++++++++++++++++++---
10 files changed, 250 insertions(+), 25 deletions(-)
--
2.43.5
* [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2024-12-18 3:07 [PATCH bpf-next v3 0/6] bpf, mm: Introduce try_alloc_pages() alexei.starovoitov
@ 2024-12-18 3:07 ` alexei.starovoitov
2024-12-18 11:32 ` Michal Hocko
2024-12-19 0:10 ` Shakeel Butt
2024-12-18 3:07 ` [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock() alexei.starovoitov
` (4 subsequent siblings)
5 siblings, 2 replies; 36+ messages in thread
From: alexei.starovoitov @ 2024-12-18 3:07 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Tracing BPF programs execute from tracepoints and kprobes where the
running context is unknown, but they need to request additional
memory. The prior workarounds used pre-allocated memory and
BPF-specific freelists to satisfy such allocation requests.
Instead, introduce an internal __GFP_TRYLOCK flag that makes the page
allocator accessible from any context. It relies on the percpu free
list of pages from which rmqueue_pcplist() should be able to pop a
page. If that fails (due to IRQ re-entrancy or the list being
empty) then try_alloc_pages() attempts to spin_trylock zone->lock
and refill the percpu freelist as normal.
A BPF program may execute with IRQs disabled, and zone->lock is a
sleeping lock in RT, so trylock is the only option. In theory we could
introduce a percpu reentrance counter and increment it every time
spin_lock_irqsave(&zone->lock, flags) is used, but we cannot rely
on it. Even if this cpu is not in the page_alloc path,
spin_lock_irqsave() is not safe, since a BPF prog might be called
from a tracepoint where preemption is disabled. So trylock only.
Note, free_pages() and memcg are not taught about __GFP_TRYLOCK yet.
The support comes in the next patches.
This is a first step towards supporting BPF requirements in SLUB
and getting rid of bpf_mem_alloc.
That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
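
Condensed, the zone->lock hunks below boil down to the following
trylock-or-fallback pattern (a sketch; rmqueue_bulk() returns 0 and
rmqueue_buddy() returns NULL on this path):

  if (!spin_trylock_irqsave(&zone->lock, flags)) {
  	/* opportunistic (ALLOC_TRYLOCK) callers fail fast ... */
  	if (unlikely(alloc_flags & ALLOC_TRYLOCK))
  		return NULL;
  	/* ... while regular callers block as before */
  	spin_lock_irqsave(&zone->lock, flags);
  }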
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/gfp.h | 3 ++
include/linux/gfp_types.h | 1 +
mm/internal.h | 2 ++
mm/page_alloc.c | 69 ++++++++++++++++++++++++++++++++++++---
4 files changed, 71 insertions(+), 4 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b0fe9f62d15b..65b8df1db26a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -347,6 +347,9 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
}
#define alloc_page_vma(...) alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
+struct page *try_alloc_pages_noprof(int nid, unsigned int order);
+#define try_alloc_pages(...) alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
+
extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
#define __get_free_pages(...) alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 65db9349f905..65b148ec86eb 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -48,6 +48,7 @@ enum {
___GFP_THISNODE_BIT,
___GFP_ACCOUNT_BIT,
___GFP_ZEROTAGS_BIT,
+ ___GFP_TRYLOCK_BIT,
#ifdef CONFIG_KASAN_HW_TAGS
___GFP_SKIP_ZERO_BIT,
___GFP_SKIP_KASAN_BIT,
diff --git a/mm/internal.h b/mm/internal.h
index cb8d8e8e3ffa..122fce7e1a9e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1175,6 +1175,8 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
#endif
#define ALLOC_HIGHATOMIC 0x200 /* Allows access to MIGRATE_HIGHATOMIC */
#define ALLOC_KSWAPD 0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
+#define __GFP_TRYLOCK ((__force gfp_t)BIT(___GFP_TRYLOCK_BIT))
+#define ALLOC_TRYLOCK 0x1000000 /* Only use spin_trylock in allocation path */
/* Flags that allow allocations below the min watermark. */
#define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1cb4b8c8886d..d23545057b6e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2304,7 +2304,11 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long flags;
int i;
- spin_lock_irqsave(&zone->lock, flags);
+ if (!spin_trylock_irqsave(&zone->lock, flags)) {
+ if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+ return 0;
+ spin_lock_irqsave(&zone->lock, flags);
+ }
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
alloc_flags);
@@ -2904,7 +2908,11 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
do {
page = NULL;
- spin_lock_irqsave(&zone->lock, flags);
+ if (!spin_trylock_irqsave(&zone->lock, flags)) {
+ if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+ return NULL;
+ spin_lock_irqsave(&zone->lock, flags);
+ }
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
if (!page) {
@@ -4001,6 +4009,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
*/
BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_MIN_RESERVE);
BUILD_BUG_ON(__GFP_KSWAPD_RECLAIM != (__force gfp_t) ALLOC_KSWAPD);
+ BUILD_BUG_ON(__GFP_TRYLOCK != (__force gfp_t) ALLOC_TRYLOCK);
/*
* The caller may dip into page reserves a bit more if the caller
@@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
* set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
*/
alloc_flags |= (__force int)
- (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
+ (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));
if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
/*
@@ -4509,7 +4518,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
might_alloc(gfp_mask);
- if (should_fail_alloc_page(gfp_mask, order))
+ if (!(*alloc_flags & ALLOC_TRYLOCK) &&
+ should_fail_alloc_page(gfp_mask, order))
return false;
*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
@@ -7023,3 +7033,54 @@ static bool __free_unaccepted(struct page *page)
}
#endif /* CONFIG_UNACCEPTED_MEMORY */
+
+struct page *try_alloc_pages_noprof(int nid, unsigned int order)
+{
+ gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO |
+ __GFP_NOMEMALLOC | __GFP_TRYLOCK;
+ unsigned int alloc_flags = ALLOC_TRYLOCK;
+ struct alloc_context ac = { };
+ struct page *page;
+
+ /*
+ * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
+ * If spin_trylock() is called from hard IRQ the current task may be
+ * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
+ * task as the owner of another rt_spin_lock which will confuse PI
+ * logic, so return immediately if called from hard IRQ or NMI.
+ *
+ * Note, irqs_disabled() case is ok. This function can be called
+ * from raw_spin_lock_irqsave region.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+ return NULL;
+ if (!pcp_allowed_order(order))
+ return NULL;
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ if (has_unaccepted_memory() && !list_empty(&zone->unaccepted_pages))
+ return NULL;
+#endif
+
+ if (nid == NUMA_NO_NODE)
+ nid = numa_node_id();
+
+ prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
+ &alloc_gfp, &alloc_flags);
+
+ /*
+ * Best effort allocation from percpu free list.
+ * If it's empty attempt to spin_trylock zone->lock.
+ * Do not specify __GFP_KSWAPD_RECLAIM to avoid wakeup_kswapd
+ * that may need to grab a lock.
+ * Do not specify __GFP_ACCOUNT to avoid local_lock.
+ * Do not warn either.
+ */
+ page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
+
+ /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
+
+ trace_mm_page_alloc(page, order, alloc_gfp & ~__GFP_TRYLOCK, ac.migratetype);
+ kmsan_alloc_page(page, order, alloc_gfp);
+ return page;
+}
--
2.43.5
* [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-18 3:07 [PATCH bpf-next v3 0/6] bpf, mm: Introduce try_alloc_pages() alexei.starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation alexei.starovoitov
@ 2024-12-18 3:07 ` alexei.starovoitov
2024-12-18 4:58 ` Yosry Ahmed
2024-12-18 11:32 ` Michal Hocko
2024-12-18 3:07 ` [PATCH bpf-next v3 3/6] locking/local_lock: Introduce local_trylock_irqsave() alexei.starovoitov
` (3 subsequent siblings)
5 siblings, 2 replies; 36+ messages in thread
From: alexei.starovoitov @ 2024-12-18 3:07 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Introduce free_pages_nolock() that can free pages without taking locks.
It relies on trylock and can be called from any context.
Since spin_trylock() cannot be used in RT from hard IRQ or NMI contexts,
it uses a lockless linked list to stash the pages, which will be freed
by a subsequent free_pages() from a good context.
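
Condensed, the free_one_page() hunk below becomes (sketch):

  if (!spin_trylock_irqsave(&zone->lock, flags)) {
  	if (unlikely(fpi_flags & FPI_TRYLOCK)) {
  		/* stash the page; a later free from a good context drains it */
  		add_page_to_zone_llist(zone, page, order);
  		return;
  	}
  	spin_lock_irqsave(&zone->lock, flags);
  }
  /* lock taken: drain zone->trylock_free_pages before freeing this page */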
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/gfp.h | 1 +
include/linux/mm_types.h | 4 ++
include/linux/mmzone.h | 3 ++
mm/page_alloc.c | 79 ++++++++++++++++++++++++++++++++++++----
4 files changed, 79 insertions(+), 8 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 65b8df1db26a..ff9060af6295 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -372,6 +372,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
__get_free_pages((gfp_mask) | GFP_DMA, (order))
extern void __free_pages(struct page *page, unsigned int order);
+extern void free_pages_nolock(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
#define __free_page(page) __free_pages((page), 0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..52547b3e5fd8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,10 @@ struct page {
/* Or, free page */
struct list_head buddy_list;
struct list_head pcp_list;
+ struct {
+ struct llist_node pcp_llist;
+ unsigned int order;
+ };
};
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b36124145a16..1a854e0a9e3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -953,6 +953,9 @@ struct zone {
/* Primarily protects free_area */
spinlock_t lock;
+ /* Pages to be freed when next trylock succeeds */
+ struct llist_head trylock_free_pages;
+
/* Write-intensive fields used by compaction and vmstats. */
CACHELINE_PADDING(_pad2_);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d23545057b6e..10918bfc6734 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -88,6 +88,9 @@ typedef int __bitwise fpi_t;
*/
#define FPI_TO_TAIL ((__force fpi_t)BIT(1))
+/* Free the page without taking locks. Rely on trylock only. */
+#define FPI_TRYLOCK ((__force fpi_t)BIT(2))
+
/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1247,13 +1250,44 @@ static void split_large_buddy(struct zone *zone, struct page *page,
}
}
+static void add_page_to_zone_llist(struct zone *zone, struct page *page,
+ unsigned int order)
+{
+ /* Remember the order */
+ page->order = order;
+ /* Add the page to the free list */
+ llist_add(&page->pcp_llist, &zone->trylock_free_pages);
+}
+
static void free_one_page(struct zone *zone, struct page *page,
unsigned long pfn, unsigned int order,
fpi_t fpi_flags)
{
+ struct llist_head *llhead;
unsigned long flags;
- spin_lock_irqsave(&zone->lock, flags);
+ if (!spin_trylock_irqsave(&zone->lock, flags)) {
+ if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+ add_page_to_zone_llist(zone, page, order);
+ return;
+ }
+ spin_lock_irqsave(&zone->lock, flags);
+ }
+
+ /* The lock succeeded. Process deferred pages. */
+ llhead = &zone->trylock_free_pages;
+ if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {
+ struct llist_node *llnode;
+ struct page *p, *tmp;
+
+ llnode = llist_del_all(llhead);
+ llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
+ unsigned int p_order = p->order;
+
+ split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
+ __count_vm_events(PGFREE, 1 << p_order);
+ }
+ }
split_large_buddy(zone, page, pfn, order, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
@@ -2596,7 +2630,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
struct page *page, int migratetype,
- unsigned int order)
+ unsigned int order, fpi_t fpi_flags)
{
int high, batch;
int pindex;
@@ -2631,6 +2665,14 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
}
if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
pcp->free_count += (1 << order);
+
+ if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+ /*
+ * Do not attempt to take a zone lock. Let pcp->count get
+ * over high mark temporarily.
+ */
+ return;
+ }
high = nr_pcp_high(pcp, zone, batch, free_high);
if (pcp->count >= high) {
free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2645,7 +2687,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
/*
* Free a pcp page
*/
-void free_unref_page(struct page *page, unsigned int order)
+static void __free_unref_page(struct page *page, unsigned int order,
+ fpi_t fpi_flags)
{
unsigned long __maybe_unused UP_flags;
struct per_cpu_pages *pcp;
@@ -2654,7 +2697,7 @@ void free_unref_page(struct page *page, unsigned int order)
int migratetype;
if (!pcp_allowed_order(order)) {
- __free_pages_ok(page, order, FPI_NONE);
+ __free_pages_ok(page, order, fpi_flags);
return;
}
@@ -2671,24 +2714,33 @@ void free_unref_page(struct page *page, unsigned int order)
migratetype = get_pfnblock_migratetype(page, pfn);
if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
+ free_one_page(page_zone(page), page, pfn, order, fpi_flags);
return;
}
migratetype = MIGRATE_MOVABLE;
}
zone = page_zone(page);
+ if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq())) {
+ add_page_to_zone_llist(zone, page, order);
+ return;
+ }
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (pcp) {
- free_unref_page_commit(zone, pcp, page, migratetype, order);
+ free_unref_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
pcp_spin_unlock(pcp);
} else {
- free_one_page(zone, page, pfn, order, FPI_NONE);
+ free_one_page(zone, page, pfn, order, fpi_flags);
}
pcp_trylock_finish(UP_flags);
}
+void free_unref_page(struct page *page, unsigned int order)
+{
+ __free_unref_page(page, order, FPI_NONE);
+}
+
/*
* Free a batch of folios
*/
@@ -2777,7 +2829,7 @@ void free_unref_folios(struct folio_batch *folios)
trace_mm_page_free_batched(&folio->page);
free_unref_page_commit(zone, pcp, &folio->page, migratetype,
- order);
+ order, FPI_NONE);
}
if (pcp) {
@@ -4854,6 +4906,17 @@ void __free_pages(struct page *page, unsigned int order)
}
EXPORT_SYMBOL(__free_pages);
+/*
+ * Can be called while holding raw_spin_lock or from IRQ and NMI,
+ * but only for pages that came from try_alloc_pages():
+ * order <= 3, !folio, etc
+ */
+void free_pages_nolock(struct page *page, unsigned int order)
+{
+ if (put_page_testzero(page))
+ __free_unref_page(page, order, FPI_TRYLOCK);
+}
+
void free_pages(unsigned long addr, unsigned int order)
{
if (addr != 0) {
--
2.43.5
* [PATCH bpf-next v3 3/6] locking/local_lock: Introduce local_trylock_irqsave()
2024-12-18 3:07 [PATCH bpf-next v3 0/6] bpf, mm: Introduce try_alloc_pages() alexei.starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation alexei.starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock() alexei.starovoitov
@ 2024-12-18 3:07 ` alexei.starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock alexei.starovoitov
` (2 subsequent siblings)
5 siblings, 0 replies; 36+ messages in thread
From: alexei.starovoitov @ 2024-12-18 3:07 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
This is inspired by 'struct local_tryirq_lock' in:
https://lore.kernel.org/all/20241112-slub-percpu-caches-v1-5-ddc0bdc27e05@suse.cz/
Similar to local_lock_irqsave(), introduce local_trylock_irqsave().
It uses spin_trylock in PREEMPT_RT when not in hard IRQ and not in NMI,
and instantly fails otherwise.
In !RT it uses a simple 'active' flag that prevents IRQs or NMIs from
reentering the locked region.
Note there is no need to use local_inc for the active flag.
If an IRQ handler grabs the same local_lock after READ_ONCE(lock->active)
has already completed, it has to unlock it before returning.
The same holds for an NMI handler. So there is a strict nesting of scopes.
It's a per-cpu lock, so multiple cpus do not access it in parallel.
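
A typical caller would look like this sketch (see the memcg_stock usage
in patch 4):

  unsigned long flags;

  if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
  	/* opportunistic callers give up, regular callers take the lock */
  	if (gfp_mask & __GFP_TRYLOCK)
  		return ret;
  	local_lock_irqsave(&memcg_stock.stock_lock, flags);
  }
  ...
  local_unlock_irqrestore(&memcg_stock.stock_lock, flags);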
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/local_lock.h | 9 ++++
include/linux/local_lock_internal.h | 76 ++++++++++++++++++++++++++---
2 files changed, 78 insertions(+), 7 deletions(-)
diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 091dc0b6bdfb..84ee560c4f51 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -30,6 +30,15 @@
#define local_lock_irqsave(lock, flags) \
__local_lock_irqsave(lock, flags)
+/**
+ * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
+ * interrupts. Always fails in RT when in_hardirq or NMI.
+ * @lock: The lock variable
+ * @flags: Storage for interrupt flags
+ */
+#define local_trylock_irqsave(lock, flags) \
+ __local_trylock_irqsave(lock, flags)
+
/**
* local_unlock - Release a per CPU local lock
* @lock: The lock variable
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8dd71fbbb6d2..93672127c73d 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -9,6 +9,7 @@
#ifndef CONFIG_PREEMPT_RT
typedef struct {
+ int active;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
struct task_struct *owner;
@@ -22,7 +23,7 @@ typedef struct {
.wait_type_inner = LD_WAIT_CONFIG, \
.lock_type = LD_LOCK_PERCPU, \
}, \
- .owner = NULL,
+ .owner = NULL, .active = 0
static inline void local_lock_acquire(local_lock_t *l)
{
@@ -31,6 +32,13 @@ static inline void local_lock_acquire(local_lock_t *l)
l->owner = current;
}
+static inline void local_trylock_acquire(local_lock_t *l)
+{
+ lock_map_acquire_try(&l->dep_map);
+ DEBUG_LOCKS_WARN_ON(l->owner);
+ l->owner = current;
+}
+
static inline void local_lock_release(local_lock_t *l)
{
DEBUG_LOCKS_WARN_ON(l->owner != current);
@@ -45,6 +53,7 @@ static inline void local_lock_debug_init(local_lock_t *l)
#else /* CONFIG_DEBUG_LOCK_ALLOC */
# define LOCAL_LOCK_DEBUG_INIT(lockname)
static inline void local_lock_acquire(local_lock_t *l) { }
+static inline void local_trylock_acquire(local_lock_t *l) { }
static inline void local_lock_release(local_lock_t *l) { }
static inline void local_lock_debug_init(local_lock_t *l) { }
#endif /* !CONFIG_DEBUG_LOCK_ALLOC */
@@ -60,6 +69,7 @@ do { \
0, LD_WAIT_CONFIG, LD_WAIT_INV, \
LD_LOCK_PERCPU); \
local_lock_debug_init(lock); \
+ (lock)->active = 0; \
} while (0)
#define __spinlock_nested_bh_init(lock) \
@@ -75,37 +85,73 @@ do { \
#define __local_lock(lock) \
do { \
+ local_lock_t *l; \
preempt_disable(); \
- local_lock_acquire(this_cpu_ptr(lock)); \
+ l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 0); \
+ WRITE_ONCE(l->active, 1); \
+ local_lock_acquire(l); \
} while (0)
#define __local_lock_irq(lock) \
do { \
+ local_lock_t *l; \
local_irq_disable(); \
- local_lock_acquire(this_cpu_ptr(lock)); \
+ l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 0); \
+ WRITE_ONCE(l->active, 1); \
+ local_lock_acquire(l); \
} while (0)
#define __local_lock_irqsave(lock, flags) \
do { \
+ local_lock_t *l; \
local_irq_save(flags); \
- local_lock_acquire(this_cpu_ptr(lock)); \
+ l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 0); \
+ WRITE_ONCE(l->active, 1); \
+ local_lock_acquire(l); \
} while (0)
+#define __local_trylock_irqsave(lock, flags) \
+ ({ \
+ local_lock_t *l; \
+ local_irq_save(flags); \
+ l = this_cpu_ptr(lock); \
+ if (READ_ONCE(l->active) == 1) { \
+ local_irq_restore(flags); \
+ l = NULL; \
+ } else { \
+ WRITE_ONCE(l->active, 1); \
+ local_trylock_acquire(l); \
+ } \
+ !!l; \
+ })
+
#define __local_unlock(lock) \
do { \
- local_lock_release(this_cpu_ptr(lock)); \
+ local_lock_t *l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 1); \
+ WRITE_ONCE(l->active, 0); \
+ local_lock_release(l); \
preempt_enable(); \
} while (0)
#define __local_unlock_irq(lock) \
do { \
- local_lock_release(this_cpu_ptr(lock)); \
+ local_lock_t *l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 1); \
+ WRITE_ONCE(l->active, 0); \
+ local_lock_release(l); \
local_irq_enable(); \
} while (0)
#define __local_unlock_irqrestore(lock, flags) \
do { \
- local_lock_release(this_cpu_ptr(lock)); \
+ local_lock_t *l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 1); \
+ WRITE_ONCE(l->active, 0); \
+ local_lock_release(l); \
local_irq_restore(flags); \
} while (0)
@@ -148,6 +194,22 @@ typedef spinlock_t local_lock_t;
__local_lock(lock); \
} while (0)
+#define __local_trylock_irqsave(lock, flags) \
+ ({ \
+ __label__ out; \
+ int ret = 0; \
+ typecheck(unsigned long, flags); \
+ flags = 0; \
+ if (in_nmi() || in_hardirq()) \
+ goto out; \
+ migrate_disable(); \
+ ret = spin_trylock(this_cpu_ptr((lock))); \
+ if (!ret) \
+ migrate_enable(); \
+ out: \
+ ret; \
+ })
+
#define __local_unlock(__lock) \
do { \
spin_unlock(this_cpu_ptr((__lock))); \
--
2.43.5
* [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock.
2024-12-18 3:07 [PATCH bpf-next v3 0/6] bpf, mm: Introduce try_alloc_pages() alexei.starovoitov
` (2 preceding siblings ...)
2024-12-18 3:07 ` [PATCH bpf-next v3 3/6] locking/local_lock: Introduce local_trylock_irqsave() alexei.starovoitov
@ 2024-12-18 3:07 ` alexei.starovoitov
2024-12-18 11:32 ` Michal Hocko
2024-12-18 3:07 ` [PATCH bpf-next v3 5/6] mm, bpf: Use memcg in try_alloc_pages() alexei.starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs alexei.starovoitov
5 siblings, 1 reply; 36+ messages in thread
From: alexei.starovoitov @ 2024-12-18 3:07 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Teach memcg to operate under trylock conditions when
spinning on locks is not allowed.
The end result is that __memcg_kmem_charge_page() and
__memcg_kmem_uncharge_page() are safe to use from
any context in both RT and !RT.
In !RT the NMI handler may fail to trylock stock_lock.
In RT hard IRQ and NMI handlers will not attempt to trylock at all.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
mm/memcontrol.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..f168d223375f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1756,7 +1756,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
*
* returns true if successful, false otherwise.
*/
-static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+ gfp_t gfp_mask)
{
struct memcg_stock_pcp *stock;
unsigned int stock_pages;
@@ -1766,7 +1767,11 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
if (nr_pages > MEMCG_CHARGE_BATCH)
return ret;
- local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+ if (gfp_mask & __GFP_TRYLOCK)
+ return ret;
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ }
stock = this_cpu_ptr(&memcg_stock);
stock_pages = READ_ONCE(stock->nr_pages);
@@ -1851,7 +1856,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
unsigned long flags;
- local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+ /*
+ * In case of unlikely failure to lock percpu stock_lock
+ * uncharge memcg directly.
+ */
+ mem_cgroup_cancel_charge(memcg, nr_pages);
+ return;
+ }
__refill_stock(memcg, nr_pages);
local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}
@@ -2196,7 +2208,7 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned long pflags;
retry:
- if (consume_stock(memcg, nr_pages))
+ if (consume_stock(memcg, nr_pages, gfp_mask))
return 0;
if (!do_memsw_account() ||
--
2.43.5
* [PATCH bpf-next v3 5/6] mm, bpf: Use memcg in try_alloc_pages().
2024-12-18 3:07 [PATCH bpf-next v3 0/6] bpf, mm: Introduce try_alloc_pages() alexei.starovoitov
` (3 preceding siblings ...)
2024-12-18 3:07 ` [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock alexei.starovoitov
@ 2024-12-18 3:07 ` alexei.starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs alexei.starovoitov
5 siblings, 0 replies; 36+ messages in thread
From: alexei.starovoitov @ 2024-12-18 3:07 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Unconditionally use __GFP_ACCOUNT in try_alloc_pages().
The caller is responsible for setting up the memcg correctly,
since all BPF memory accounting is memcg based.
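
For example, the bpf side (patch 6) selects the memcg around the
allocation roughly like this sketch:

  old_memcg = set_active_memcg(memcg);
  pg = try_alloc_pages(nid, 0);	/* charged to the active memcg */
  set_active_memcg(old_memcg);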
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
mm/page_alloc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 10918bfc6734..5d0e56fbb65b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7100,7 +7100,7 @@ static bool __free_unaccepted(struct page *page)
struct page *try_alloc_pages_noprof(int nid, unsigned int order)
{
gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO |
- __GFP_NOMEMALLOC | __GFP_TRYLOCK;
+ __GFP_NOMEMALLOC | __GFP_TRYLOCK | __GFP_ACCOUNT;
unsigned int alloc_flags = ALLOC_TRYLOCK;
struct alloc_context ac = { };
struct page *page;
@@ -7136,13 +7136,17 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
* If it's empty attempt to spin_trylock zone->lock.
* Do not specify __GFP_KSWAPD_RECLAIM to avoid wakeup_kswapd
* that may need to grab a lock.
- * Do not specify __GFP_ACCOUNT to avoid local_lock.
* Do not warn either.
*/
page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
+ if (memcg_kmem_online() && page &&
+ unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
+ free_pages_nolock(page, order);
+ page = NULL;
+ }
trace_mm_page_alloc(page, order, alloc_gfp & ~__GFP_TRYLOCK, ac.migratetype);
kmsan_alloc_page(page, order, alloc_gfp);
return page;
--
2.43.5
* [PATCH bpf-next v3 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs.
2024-12-18 3:07 [PATCH bpf-next v3 0/6] bpf, mm: Introduce try_alloc_pages() alexei.starovoitov
` (4 preceding siblings ...)
2024-12-18 3:07 ` [PATCH bpf-next v3 5/6] mm, bpf: Use memcg in try_alloc_pages() alexei.starovoitov
@ 2024-12-18 3:07 ` alexei.starovoitov
5 siblings, 0 replies; 36+ messages in thread
From: alexei.starovoitov @ 2024-12-18 3:07 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Use try_alloc_pages() and free_pages_nolock() in bpf_map_alloc_pages()
instead of alloc_pages_node() and __free_page().
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/bpf/syscall.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4e88797fdbeb..45099d24909c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -582,14 +582,14 @@ int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
old_memcg = set_active_memcg(memcg);
#endif
for (i = 0; i < nr_pages; i++) {
- pg = alloc_pages_node(nid, gfp | __GFP_ACCOUNT, 0);
+ pg = try_alloc_pages(nid, 0);
if (pg) {
pages[i] = pg;
continue;
}
for (j = 0; j < i; j++)
- __free_page(pages[j]);
+ free_pages_nolock(pages[j], 0);
ret = -ENOMEM;
break;
}
--
2.43.5
* Re: [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-18 3:07 ` [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock() alexei.starovoitov
@ 2024-12-18 4:58 ` Yosry Ahmed
2024-12-18 5:33 ` Alexei Starovoitov
2024-12-18 11:32 ` Michal Hocko
1 sibling, 1 reply; 36+ messages in thread
From: Yosry Ahmed @ 2024-12-18 4:58 UTC (permalink / raw)
To: alexei.starovoitov
Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
houtao1, hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj,
linux-mm, kernel-team
On Tue, Dec 17, 2024 at 7:07 PM <alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce free_pages_nolock() that can free pages without taking locks.
> It relies on trylock and can be called from any context.
> Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> it uses lockless link list to stash the pages which will be freed
> by subsequent free_pages() from good context.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> include/linux/gfp.h | 1 +
> include/linux/mm_types.h | 4 ++
> include/linux/mmzone.h | 3 ++
> mm/page_alloc.c | 79 ++++++++++++++++++++++++++++++++++++----
> 4 files changed, 79 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 65b8df1db26a..ff9060af6295 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -372,6 +372,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
> __get_free_pages((gfp_mask) | GFP_DMA, (order))
>
> extern void __free_pages(struct page *page, unsigned int order);
> +extern void free_pages_nolock(struct page *page, unsigned int order);
> extern void free_pages(unsigned long addr, unsigned int order);
>
> #define __free_page(page) __free_pages((page), 0)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 7361a8f3ab68..52547b3e5fd8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -99,6 +99,10 @@ struct page {
> /* Or, free page */
> struct list_head buddy_list;
> struct list_head pcp_list;
> + struct {
> + struct llist_node pcp_llist;
> + unsigned int order;
> + };
> };
> /* See page-flags.h for PAGE_MAPPING_FLAGS */
> struct address_space *mapping;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b36124145a16..1a854e0a9e3b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -953,6 +953,9 @@ struct zone {
> /* Primarily protects free_area */
> spinlock_t lock;
>
> + /* Pages to be freed when next trylock succeeds */
> + struct llist_head trylock_free_pages;
> +
> /* Write-intensive fields used by compaction and vmstats. */
> CACHELINE_PADDING(_pad2_);
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d23545057b6e..10918bfc6734 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -88,6 +88,9 @@ typedef int __bitwise fpi_t;
> */
> #define FPI_TO_TAIL ((__force fpi_t)BIT(1))
>
> +/* Free the page without taking locks. Rely on trylock only. */
> +#define FPI_TRYLOCK ((__force fpi_t)BIT(2))
> +
The comment above the definition of fpi_t mentions that it's for
non-pcp variants of free_pages(), so I guess that needs to be updated
in this patch.
More importantly, I think the comment states this mainly because the
existing flags won't be properly handled when freeing pages to the
pcplist. The flags will be lost once the pages are added to the
pcplist, and won't be propagated when the pages are eventually freed
to the buddy allocator (e.g. through free_pcppages_bulk()).
So I think we need to at least explicitly check which flags are
allowed when freeing pages to the pcplists or something similar.
> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> static DEFINE_MUTEX(pcp_batch_high_lock);
> #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -1247,13 +1250,44 @@ static void split_large_buddy(struct zone *zone, struct page *page,
> }
> }
>
> +static void add_page_to_zone_llist(struct zone *zone, struct page *page,
> + unsigned int order)
> +{
> + /* Remember the order */
> + page->order = order;
> + /* Add the page to the free list */
> + llist_add(&page->pcp_llist, &zone->trylock_free_pages);
> +}
> +
> static void free_one_page(struct zone *zone, struct page *page,
> unsigned long pfn, unsigned int order,
> fpi_t fpi_flags)
> {
> + struct llist_head *llhead;
> unsigned long flags;
>
> - spin_lock_irqsave(&zone->lock, flags);
> + if (!spin_trylock_irqsave(&zone->lock, flags)) {
> + if (unlikely(fpi_flags & FPI_TRYLOCK)) {
> + add_page_to_zone_llist(zone, page, order);
> + return;
> + }
> + spin_lock_irqsave(&zone->lock, flags);
> + }
> +
> + /* The lock succeeded. Process deferred pages. */
> + llhead = &zone->trylock_free_pages;
> + if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {
> + struct llist_node *llnode;
> + struct page *p, *tmp;
> +
> + llnode = llist_del_all(llhead);
> + llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
> + unsigned int p_order = p->order;
> +
> + split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
> + __count_vm_events(PGFREE, 1 << p_order);
> + }
> + }
> split_large_buddy(zone, page, pfn, order, fpi_flags);
> spin_unlock_irqrestore(&zone->lock, flags);
>
> @@ -2596,7 +2630,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
>
> static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
> struct page *page, int migratetype,
> - unsigned int order)
> + unsigned int order, fpi_t fpi_flags)
> {
> int high, batch;
> int pindex;
> @@ -2631,6 +2665,14 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
> }
> if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
> pcp->free_count += (1 << order);
> +
> + if (unlikely(fpi_flags & FPI_TRYLOCK)) {
> + /*
> + * Do not attempt to take a zone lock. Let pcp->count get
> + * over high mark temporarily.
> + */
> + return;
> + }
> high = nr_pcp_high(pcp, zone, batch, free_high);
> if (pcp->count >= high) {
> free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
> @@ -2645,7 +2687,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
> /*
> * Free a pcp page
> */
> -void free_unref_page(struct page *page, unsigned int order)
> +static void __free_unref_page(struct page *page, unsigned int order,
> + fpi_t fpi_flags)
> {
> unsigned long __maybe_unused UP_flags;
> struct per_cpu_pages *pcp;
> @@ -2654,7 +2697,7 @@ void free_unref_page(struct page *page, unsigned int order)
> int migratetype;
>
> if (!pcp_allowed_order(order)) {
> - __free_pages_ok(page, order, FPI_NONE);
> + __free_pages_ok(page, order, fpi_flags);
> return;
> }
>
> @@ -2671,24 +2714,33 @@ void free_unref_page(struct page *page, unsigned int order)
> migratetype = get_pfnblock_migratetype(page, pfn);
> if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
> if (unlikely(is_migrate_isolate(migratetype))) {
> - free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
> + free_one_page(page_zone(page), page, pfn, order, fpi_flags);
> return;
> }
> migratetype = MIGRATE_MOVABLE;
> }
>
> zone = page_zone(page);
> + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq())) {
> + add_page_to_zone_llist(zone, page, order);
> + return;
> + }
> pcp_trylock_prepare(UP_flags);
> pcp = pcp_spin_trylock(zone->per_cpu_pageset);
> if (pcp) {
> - free_unref_page_commit(zone, pcp, page, migratetype, order);
> + free_unref_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
> pcp_spin_unlock(pcp);
> } else {
> - free_one_page(zone, page, pfn, order, FPI_NONE);
> + free_one_page(zone, page, pfn, order, fpi_flags);
> }
> pcp_trylock_finish(UP_flags);
> }
>
> +void free_unref_page(struct page *page, unsigned int order)
> +{
> + __free_unref_page(page, order, FPI_NONE);
> +}
> +
> /*
> * Free a batch of folios
> */
> @@ -2777,7 +2829,7 @@ void free_unref_folios(struct folio_batch *folios)
>
> trace_mm_page_free_batched(&folio->page);
> free_unref_page_commit(zone, pcp, &folio->page, migratetype,
> - order);
> + order, FPI_NONE);
> }
>
> if (pcp) {
> @@ -4854,6 +4906,17 @@ void __free_pages(struct page *page, unsigned int order)
> }
> EXPORT_SYMBOL(__free_pages);
>
> +/*
> + * Can be called while holding raw_spin_lock or from IRQ and NMI,
> + * but only for pages that came from try_alloc_pages():
> + * order <= 3, !folio, etc
> + */
> +void free_pages_nolock(struct page *page, unsigned int order)
> +{
> + if (put_page_testzero(page))
> + __free_unref_page(page, order, FPI_TRYLOCK);
> +}
> +
> void free_pages(unsigned long addr, unsigned int order)
> {
> if (addr != 0) {
> --
> 2.43.5
>
>
* Re: [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-18 4:58 ` Yosry Ahmed
@ 2024-12-18 5:33 ` Alexei Starovoitov
2024-12-18 5:57 ` Yosry Ahmed
0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2024-12-18 5:33 UTC (permalink / raw)
To: Yosry Ahmed
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
Tejun Heo, linux-mm, Kernel Team
On Tue, Dec 17, 2024 at 8:59 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Dec 17, 2024 at 7:07 PM <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce free_pages_nolock() that can free pages without taking locks.
> > It relies on trylock and can be called from any context.
> > Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> > it uses lockless link list to stash the pages which will be freed
> > by subsequent free_pages() from good context.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> > include/linux/gfp.h | 1 +
> > include/linux/mm_types.h | 4 ++
> > include/linux/mmzone.h | 3 ++
> > mm/page_alloc.c | 79 ++++++++++++++++++++++++++++++++++++----
> > 4 files changed, 79 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 65b8df1db26a..ff9060af6295 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -372,6 +372,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
> > __get_free_pages((gfp_mask) | GFP_DMA, (order))
> >
> > extern void __free_pages(struct page *page, unsigned int order);
> > +extern void free_pages_nolock(struct page *page, unsigned int order);
> > extern void free_pages(unsigned long addr, unsigned int order);
> >
> > #define __free_page(page) __free_pages((page), 0)
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 7361a8f3ab68..52547b3e5fd8 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -99,6 +99,10 @@ struct page {
> > /* Or, free page */
> > struct list_head buddy_list;
> > struct list_head pcp_list;
> > + struct {
> > + struct llist_node pcp_llist;
> > + unsigned int order;
> > + };
> > };
> > /* See page-flags.h for PAGE_MAPPING_FLAGS */
> > struct address_space *mapping;
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index b36124145a16..1a854e0a9e3b 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -953,6 +953,9 @@ struct zone {
> > /* Primarily protects free_area */
> > spinlock_t lock;
> >
> > + /* Pages to be freed when next trylock succeeds */
> > + struct llist_head trylock_free_pages;
> > +
> > /* Write-intensive fields used by compaction and vmstats. */
> > CACHELINE_PADDING(_pad2_);
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d23545057b6e..10918bfc6734 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -88,6 +88,9 @@ typedef int __bitwise fpi_t;
> > */
> > #define FPI_TO_TAIL ((__force fpi_t)BIT(1))
> >
> > +/* Free the page without taking locks. Rely on trylock only. */
> > +#define FPI_TRYLOCK ((__force fpi_t)BIT(2))
> > +
>
> The comment above the definition of fpi_t mentions that it's for
> non-pcp variants of free_pages(), so I guess that needs to be updated
> in this patch.
No. The comment:
/* Free Page Internal flags: for internal, non-pcp variants of free_pages(). */
typedef int __bitwise fpi_t;
is still valid.
Most of what FPI_TRYLOCK does takes effect after the pcp stage is over.
> More importantly, I think the comment states this mainly because the
> existing flags won't be properly handled when freeing pages to the
> pcplist. The flags will be lost once the pages are added to the
> pcplist, and won't be propagated when the pages are eventually freed
> to the buddy allocator (e.g. through free_pcppages_bulk()).
Correct. fpi_t flags have a local effect. Nothing new here.
* Re: [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-18 5:33 ` Alexei Starovoitov
@ 2024-12-18 5:57 ` Yosry Ahmed
2024-12-18 6:37 ` Alexei Starovoitov
0 siblings, 1 reply; 36+ messages in thread
From: Yosry Ahmed @ 2024-12-18 5:57 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
Tejun Heo, linux-mm, Kernel Team
On Tue, Dec 17, 2024 at 9:33 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Dec 17, 2024 at 8:59 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Dec 17, 2024 at 7:07 PM <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Introduce free_pages_nolock() that can free pages without taking locks.
> > > It relies on trylock and can be called from any context.
> > > Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> > > it uses lockless link list to stash the pages which will be freed
> > > by subsequent free_pages() from good context.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> > > include/linux/gfp.h | 1 +
> > > include/linux/mm_types.h | 4 ++
> > > include/linux/mmzone.h | 3 ++
> > > mm/page_alloc.c | 79 ++++++++++++++++++++++++++++++++++++----
> > > 4 files changed, 79 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > index 65b8df1db26a..ff9060af6295 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -372,6 +372,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
> > > __get_free_pages((gfp_mask) | GFP_DMA, (order))
> > >
> > > extern void __free_pages(struct page *page, unsigned int order);
> > > +extern void free_pages_nolock(struct page *page, unsigned int order);
> > > extern void free_pages(unsigned long addr, unsigned int order);
> > >
> > > #define __free_page(page) __free_pages((page), 0)
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 7361a8f3ab68..52547b3e5fd8 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -99,6 +99,10 @@ struct page {
> > > /* Or, free page */
> > > struct list_head buddy_list;
> > > struct list_head pcp_list;
> > > + struct {
> > > + struct llist_node pcp_llist;
> > > + unsigned int order;
> > > + };
> > > };
> > > /* See page-flags.h for PAGE_MAPPING_FLAGS */
> > > struct address_space *mapping;
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index b36124145a16..1a854e0a9e3b 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -953,6 +953,9 @@ struct zone {
> > > /* Primarily protects free_area */
> > > spinlock_t lock;
> > >
> > > + /* Pages to be freed when next trylock succeeds */
> > > + struct llist_head trylock_free_pages;
> > > +
> > > /* Write-intensive fields used by compaction and vmstats. */
> > > CACHELINE_PADDING(_pad2_);
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index d23545057b6e..10918bfc6734 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -88,6 +88,9 @@ typedef int __bitwise fpi_t;
> > > */
> > > #define FPI_TO_TAIL ((__force fpi_t)BIT(1))
> > >
> > > +/* Free the page without taking locks. Rely on trylock only. */
> > > +#define FPI_TRYLOCK ((__force fpi_t)BIT(2))
> > > +
> >
> > The comment above the definition of fpi_t mentions that it's for
> > non-pcp variants of free_pages(), so I guess that needs to be updated
> > in this patch.
>
> No. The comment:
> /* Free Page Internal flags: for internal, non-pcp variants of free_pages(). */
> typedef int __bitwise fpi_t;
>
> is still valid.
> Most of the objective of the FPI_TRYLOCK flag is used after pcp is over.
Right, but the comment says the flags are for non-pcp variants yet we
are passing them now to pcp variants. Not very clear.
>
> > More importantly, I think the comment states this mainly because the
> > existing flags won't be properly handled when freeing pages to the
> > pcplist. The flags will be lost once the pages are added to the
> > pcplist, and won't be propagated when the pages are eventually freed
> > to the buddy allocator (e.g. through free_pcppages_bulk()).
>
> Correct. fpi_t flags have a local effect. Nothing new here.
What I mean is, functions like __free_unref_page() and
free_unref_page_commit() now accept fpi_flags, but any flags other
than FPI_TRYLOCK are essentially ignored, also not very clear.
Anyway, these are just my 2c, I am just passing by and I thought it's
a bit confusing :)
* Re: [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-18 5:57 ` Yosry Ahmed
@ 2024-12-18 6:37 ` Alexei Starovoitov
2024-12-18 6:49 ` Yosry Ahmed
0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2024-12-18 6:37 UTC (permalink / raw)
To: Yosry Ahmed
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
Tejun Heo, linux-mm, Kernel Team
On Tue, Dec 17, 2024 at 9:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> What I mean is, functions like __free_unref_page() and
> free_unref_page_commit() now accept fpi_flags, but any flags other
> than FPI_TRYLOCK are essentially ignored, also not very clear.
They're not ignored. They are just not useful in this context.
The code rules over the comment. If you have a concrete suggestion on
how to improve the comment, please say so.
* Re: [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-18 6:37 ` Alexei Starovoitov
@ 2024-12-18 6:49 ` Yosry Ahmed
2024-12-18 7:25 ` Alexei Starovoitov
0 siblings, 1 reply; 36+ messages in thread
From: Yosry Ahmed @ 2024-12-18 6:49 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
Tejun Heo, linux-mm, Kernel Team
On Tue, Dec 17, 2024 at 10:37 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Dec 17, 2024 at 9:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > What I mean is, functions like __free_unref_page() and
> > free_unref_page_commit() now accept fpi_flags, but any flags other
> > than FPI_TRYLOCK are essentially ignored, also not very clear.
>
> They're not ignored. They are just not useful in this context.
I think they are. For example, if you pass FPI_SKIP_REPORT_NOTIFY to
__free_unref_page(), page_reporting_notify_free() will still be called
when the page is eventually freed to the buddy allocator. Same goes
for FPI_NO_TAIL.
> The code rules over comment. If you have a concrete suggestion on
> how to improve the comment please say so.
What I had in mind is adding a WARN in the pcp freeing functions if
any FPI flag but FPI_TRYLOCK is passed, and/or explicitly calling out
that other flags should not be passed as they have no effect in this
context (whether at the function definition, above the WARN, or at the
flag definitions).
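(Something along the lines of this sketch, e.g. at the top of
free_unref_page_commit():

  VM_WARN_ON_ONCE(fpi_flags & ~FPI_TRYLOCK);
)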
* Re: [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-18 6:49 ` Yosry Ahmed
@ 2024-12-18 7:25 ` Alexei Starovoitov
2024-12-18 7:40 ` Yosry Ahmed
0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2024-12-18 7:25 UTC (permalink / raw)
To: Yosry Ahmed
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
Tejun Heo, linux-mm, Kernel Team
On Tue, Dec 17, 2024 at 10:49 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Dec 17, 2024 at 10:37 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Dec 17, 2024 at 9:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > What I mean is, functions like __free_unref_page() and
> > > free_unref_page_commit() now accept fpi_flags, but any flags other
> > > than FPI_TRYLOCK are essentially ignored, also not very clear.
> >
> > They're not ignored. They are just not useful in this context.
>
> I think they are. For example, if you pass FPI_SKIP_REPORT_NOTIFY to
> __free_unref_page(), page_reporting_notify_free() will still be called
> when the page is eventually freed to the buddy allocator. Same goes
> for FPI_NO_TAIL.
free_pcppages_bulk()->page_reporting_notify_free() will _not_ be called
when FPI_TRYLOCK is specified.
They are internal flags. The callers cannot make try_alloc_pages()
pass these extra flags.
The flags are more or less exclusive.
> > The code rules over comment. If you have a concrete suggestion on
> > how to improve the comment please say so.
>
> What I had in mind is adding a WARN in the pcp freeing functions if
> any FPI flag but FPI_TRYLOCK is passed, and/or explicitly calling out
> that other flags should not be passed as they have no effect in this
> context (whether at the function definition, above the WARN, or at the
> flag definitions).
pcp freeing functions?
In particular?
tbh this sounds like defensive programming...
BUILD_BUG_ON is a good thing when an API is misused,
but a WARN would be wasting run-time cycles on something that should
have been caught during code review.
free_unref_page_commit() is a shallow function when FPI_TRYLOCK is used.
There is no need to propagate fpi_flags further into free_pcppages_bulk
just to issue a WARN.
* Re: [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-18 7:25 ` Alexei Starovoitov
@ 2024-12-18 7:40 ` Yosry Ahmed
0 siblings, 0 replies; 36+ messages in thread
From: Yosry Ahmed @ 2024-12-18 7:40 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
Tejun Heo, linux-mm, Kernel Team
On Tue, Dec 17, 2024 at 11:25 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Dec 17, 2024 at 10:49 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Dec 17, 2024 at 10:37 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Dec 17, 2024 at 9:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > What I mean is, functions like __free_unref_page() and
> > > > free_unref_page_commit() now accept fpi_flags, but any flags other
> > > > than FPI_TRYLOCK are essentially ignored, also not very clear.
> > >
> > > They're not ignored. They are just not useful in this context.
> >
> > I think they are. For example, if you pass FPI_SKIP_REPORT_NOTIFY to
> > __free_unref_page(), page_reporting_notify_free() will still be called
> > when the page is eventually freed to the buddy allocator. Same goes
> > for FPI_NO_TAIL.
>
> free_pcppages_bulk()->page_reporting_notify_free() will _not_ be called
> when FPI_TRYLOCK is specified.
> They are internal flags. The callers cannot make try_alloc_pages()
> pass these extra flags.
> The flags are more or less exclusive.
>
> > > The code rules over comment. If you have a concrete suggestion on
> > > how to improve the comment please say so.
> >
> > What I had in mind is adding a WARN in the pcp freeing functions if
> > any FPI flag but FPI_TRYLOCK is passed, and/or explicitly calling out
> > that other flags should not be passed as they have no effect in this
> > context (whether at the function definition, above the WARN, or at the
> > flag definitions).
>
> pcp freeing functions?
> In particular?
> tbh this sounds like defensive programming...
> BUILD_BUG_ON is a good thing when api is misused,
> but WARN will be wasting run-time cycles on something that should
> have been caught during code review.
> free_unref_page_commit() is a shallow function when FPI_TRYLOCK is used.
> There is no need to propagate fpi_flags further into free_pcppages_bulk
> just to issue a WARN.
I meant adding a WARN in free_unref_page_commit() if any other FPI flags are
used. Anyway, I don't feel strongly about this, as I mentioned earlier.
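For reference, a minimal sketch of that check, assuming free_unref_page_commit()
gains an fpi_flags argument as in patch 2 (placement and exact form are
illustrative only):

	/* Only FPI_TRYLOCK is expected on the pcp free path. */
	VM_WARN_ON_ONCE(fpi_flags & ~FPI_TRYLOCK);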
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2024-12-18 3:07 ` [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation alexei.starovoitov
@ 2024-12-18 11:32 ` Michal Hocko
2024-12-19 0:05 ` Shakeel Butt
2024-12-19 1:18 ` Alexei Starovoitov
2024-12-19 0:10 ` Shakeel Butt
1 sibling, 2 replies; 36+ messages in thread
From: Michal Hocko @ 2024-12-18 11:32 UTC (permalink / raw)
To: alexei.starovoitov
Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
houtao1, hannes, shakeel.butt, willy, tglx, jannh, tj, linux-mm,
kernel-team
I like this proposal better. I am still not convinced that we really
need an internal __GFP_TRYLOCK though.
If we reduce try_alloc_pages to its gfp usage, we are left with the following:
On Tue 17-12-24 19:07:14, alexei.starovoitov@gmail.com wrote:
[...]
> +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> +{
> + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO |
> + __GFP_NOMEMALLOC | __GFP_TRYLOCK;
> + unsigned int alloc_flags = ALLOC_TRYLOCK;
[...]
> + prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> + &alloc_gfp, &alloc_flags);
[...]
> + page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> +
> + /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> +
> + trace_mm_page_alloc(page, order, alloc_gfp & ~__GFP_TRYLOCK, ac.migratetype);
> + kmsan_alloc_page(page, order, alloc_gfp);
[...]
Of those that care about __GFP_TRYLOCK, only kmsan_alloc_page doesn't
have alloc_flags. The others could make the locking decision based on
ALLOC_TRYLOCK.
I am not familiar with kmsan internals and my main question is whether
this specific use case really needs a dedicated reentrant
kmsan_alloc_page rather than relying on a gfp flag to be sufficient.
Currently kmsan_in_runtime bails out early in some contexts. The
associated comment about hooks is not completely clear to me though.
Memory allocation down the road is one of those but it is not really
clear to me whether this is the only one.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-18 3:07 ` [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock() alexei.starovoitov
2024-12-18 4:58 ` Yosry Ahmed
@ 2024-12-18 11:32 ` Michal Hocko
2024-12-19 1:45 ` Alexei Starovoitov
1 sibling, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2024-12-18 11:32 UTC (permalink / raw)
To: alexei.starovoitov
Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
houtao1, hannes, shakeel.butt, willy, tglx, jannh, tj, linux-mm,
kernel-team
On Tue 17-12-24 19:07:15, alexei.starovoitov@gmail.com wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce free_pages_nolock() that can free pages without taking locks.
> It relies on trylock and can be called from any context.
> Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> it uses lockless link list to stash the pages which will be freed
> by subsequent free_pages() from good context.
Yes, this makes sense. Have you tried a simpler implementation that
would just queue on the lockless link list unconditionally? That would
certainly reduce the complexity. Essentially something similar to what we
do in vfree_atomic (well, except the queue_work which is likely too
heavy for the use case and potentially not reentrant).
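For comparison, the unconditional-deferral pattern being referred to would look
roughly like this (a sketch only; the per-cpu list, the pcp_llist field and the
helper names are assumptions, with vfree_atomic() in mm/vmalloc.c as the
existing in-tree example of the idea):

	static DEFINE_PER_CPU(struct llist_head, deferred_pages);

	static void defer_free(struct page *page)
	{
		/* always stash the page, never try to take zone locks here */
		llist_add(&page->pcp_llist, this_cpu_ptr(&deferred_pages));
	}

	static void drain_deferred(void)
	{
		/* called later from a context that may take the locks */
		struct llist_node *n = llist_del_all(this_cpu_ptr(&deferred_pages));

		while (n) {
			struct page *page = llist_entry(n, struct page, pcp_llist);

			n = n->next;
			__free_pages(page, 0);	/* the order would have to be stashed too */
		}
	}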
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock.
2024-12-18 3:07 ` [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock alexei.starovoitov
@ 2024-12-18 11:32 ` Michal Hocko
2024-12-19 1:53 ` Alexei Starovoitov
0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2024-12-18 11:32 UTC (permalink / raw)
To: alexei.starovoitov
Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
houtao1, hannes, shakeel.butt, willy, tglx, jannh, tj, linux-mm,
kernel-team
On Tue 17-12-24 19:07:17, alexei.starovoitov@gmail.com wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Teach memcg to operate under trylock conditions when
> spinning locks cannot be used.
Can we make this trylock unconditional? I hope I am not really missing
anything important, but the local_lock is just IRQ disabling on !RT.
For RT this is more involved but does it make sense to spin/sleep on the
cache if we can go ahead and charge directly to counters? I mean doesn't
this defeat the purpose of the cache in the first place?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2024-12-18 11:32 ` Michal Hocko
@ 2024-12-19 0:05 ` Shakeel Butt
2024-12-19 7:18 ` Michal Hocko
2024-12-19 1:18 ` Alexei Starovoitov
1 sibling, 1 reply; 36+ messages in thread
From: Shakeel Butt @ 2024-12-19 0:05 UTC (permalink / raw)
To: Michal Hocko
Cc: alexei.starovoitov, bpf, andrii, memxor, akpm, peterz, vbabka,
bigeasy, rostedt, houtao1, hannes, willy, tglx, jannh, tj,
linux-mm, kernel-team
On Wed, Dec 18, 2024 at 12:32:20PM +0100, Michal Hocko wrote:
> I like this proposal better. I am still not convinced that we really
> need internal __GFP_TRYLOCK though.
>
> If we reduce try_alloc_pages to the gfp usage we are at the following
>
> On Tue 17-12-24 19:07:14, alexei.starovoitov@gmail.com wrote:
> [...]
> > +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> > +{
> > + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO |
> > + __GFP_NOMEMALLOC | __GFP_TRYLOCK;
> > + unsigned int alloc_flags = ALLOC_TRYLOCK;
> [...]
> > + prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> > + &alloc_gfp, &alloc_flags);
> [...]
> > + page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> > +
> > + /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> > +
> > + trace_mm_page_alloc(page, order, alloc_gfp & ~__GFP_TRYLOCK, ac.migratetype);
> > + kmsan_alloc_page(page, order, alloc_gfp);
> [...]
>
> From those that care about __GFP_TRYLOCK only kmsan_alloc_page doesn't
> have alloc_flags. Those could make the locking decision based on
> ALLOC_TRYLOCK.
>
> I am not familiar with kmsan internals and my main question is whether
> this specific usecase really needs a dedicated reentrant
> kmsan_alloc_page rather than rely on gfp flag to be sufficient.
> Currently kmsan_in_runtime bails out early in some contexts. The
> associated comment about hooks is not completely clear to me though.
> Memory allocation down the road is one of those but it is not really
> clear to me whether this is the only one.
Is the suggestion to just introduce and use ALLOC_TRYLOCK without the
need for __GFP_TRYLOCK?
Regarding KMSAN, the __GFP_ZERO would bypass it. Maybe add a comment to
explain that.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2024-12-18 3:07 ` [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation alexei.starovoitov
2024-12-18 11:32 ` Michal Hocko
@ 2024-12-19 0:10 ` Shakeel Butt
2024-12-19 1:39 ` Alexei Starovoitov
1 sibling, 1 reply; 36+ messages in thread
From: Shakeel Butt @ 2024-12-19 0:10 UTC (permalink / raw)
To: alexei.starovoitov
Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
houtao1, hannes, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On Tue, Dec 17, 2024 at 07:07:14PM -0800, alexei.starovoitov@gmail.com wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
[...]
> +
> +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> +{
> + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO |
> + __GFP_NOMEMALLOC | __GFP_TRYLOCK;
I think the above needs a comment to be more clear. Basically why zero,
nomemalloc and no warn? Otherwise this looks good. I don't have a strong
opinion on __GFP_TRYLOCK and maybe just ALLOC_TRYLOCK would be good
enough.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2024-12-18 11:32 ` Michal Hocko
2024-12-19 0:05 ` Shakeel Butt
@ 2024-12-19 1:18 ` Alexei Starovoitov
2024-12-19 7:13 ` Michal Hocko
1 sibling, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2024-12-19 1:18 UTC (permalink / raw)
To: Michal Hocko
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Wed, Dec 18, 2024 at 3:32 AM Michal Hocko <mhocko@suse.com> wrote:
>
> I like this proposal better. I am still not convinced that we really
> need internal __GFP_TRYLOCK though.
>
> If we reduce try_alloc_pages to the gfp usage we are at the following
>
> On Tue 17-12-24 19:07:14, alexei.starovoitov@gmail.com wrote:
> [...]
> > +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> > +{
> > + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO |
> > + __GFP_NOMEMALLOC | __GFP_TRYLOCK;
> > + unsigned int alloc_flags = ALLOC_TRYLOCK;
> [...]
> > + prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> > + &alloc_gfp, &alloc_flags);
> [...]
> > + page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> > +
> > + /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> > +
> > + trace_mm_page_alloc(page, order, alloc_gfp & ~__GFP_TRYLOCK, ac.migratetype);
> > + kmsan_alloc_page(page, order, alloc_gfp);
> [...]
>
> From those that care about __GFP_TRYLOCK only kmsan_alloc_page doesn't
> have alloc_flags. Those could make the locking decision based on
> ALLOC_TRYLOCK.
__GFP_TRYLOCK here sets a baseline and is used in patch 4 by the inner
bits of memcg's consume_stock() logic when called from
try_alloc_pages() in patch 5.
We cannot pass alloc_flags into it.
Just too much overhead.
__memcg_kmem_charge_page()
-> obj_cgroup_charge_pages()
-> try_charge_memcg()
-> consume_stock()
all of them would need an extra 'u32 alloc_flags'.
That is too high a cost just to avoid ___GFP_TRYLOCK_BIT in gfp_types.h.
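That is, every signature on that path would have to grow another argument,
roughly like this (hypothetical diff, shown only to illustrate the churn):

	-static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
	-			    unsigned int nr_pages)
	+static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
	+			    unsigned int alloc_flags, unsigned int nr_pages)

and the same for __memcg_kmem_charge_page(), obj_cgroup_charge_pages() and
consume_stock(), just to carry one bit of information that the gfp mask can
already express.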
> I am not familiar with kmsan internals and my main question is whether
> this specific usecase really needs a dedicated reentrant
> kmsan_alloc_page rather than rely on gfp flag to be sufficient.
> Currently kmsan_in_runtime bails out early in some contexts. The
> associated comment about hooks is not completely clear to me though.
> Memory allocation down the road is one of those but it is not really
> clear to me whether this is the only one.
As I mentioned in the giant v2 thread, I'm not touching kasan/kmsan
in this patch set, since that needs its own review
from experts in those bits,
but when it happens gfp & __GFP_TRYLOCK would be the way
to adjust whatever is necessary in kasan/kmsan internals.
As Shakeel mentioned, currently kmsan_alloc_page() is gutted,
since I'm using __GFP_ZERO unconditionally here.
We don't even get to the kmsan_in_runtime() check.
For bpf use cases __GFP_ZERO and __GFP_ACCOUNT are pretty much
mandatory. When there is a 2nd user of this try_alloc_pages()
API we can consider making flags for these two,
and at that time a full analysis of kmsan reentrance will be necessary.
It works in this patch because of GFP_ZERO.
So __GFP_TRYLOCK is needed in many cases:
- to make decisions in consume_stock()
- in the future in kasan/kmsan
- and in slab kmalloc. There I'm going to introduce try_kmalloc()
(or kmalloc_nolock(), naming is hard) that will use this
internal __GFP_TRYLOCK flag to avoid locks, and when it gets
to new_slab()->allocate_slab()->alloc_slab_page(),
the latter will use try_alloc_pages() instead of alloc_pages().
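A rough sketch of that last step (entirely hypothetical; neither try_kmalloc()
nor such a hunk exists yet, and the real slub callee may differ):

	/* inside alloc_slab_page(), hypothetical dispatch */
	if (flags & __GFP_TRYLOCK)
		page = try_alloc_pages(node, order);
	else
		page = alloc_pages_node(node, flags, order);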
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2024-12-19 0:10 ` Shakeel Butt
@ 2024-12-19 1:39 ` Alexei Starovoitov
0 siblings, 0 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2024-12-19 1:39 UTC (permalink / raw)
To: Shakeel Butt
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Michal Hocko,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Wed, Dec 18, 2024 at 4:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Dec 17, 2024 at 07:07:14PM -0800, alexei.starovoitov@gmail.com wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> [...]
> > +
> > +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> > +{
> > + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO |
> > + __GFP_NOMEMALLOC | __GFP_TRYLOCK;
>
> I think the above needs a comment to be more clear. Basically why zero,
> nomemalloc and no warn? Otherwise this looks good. I don't have a strong
> opinion on __GFP_TRYLOCK and maybe just ALLOC_TRYLOCK would be good
> enough.
__GFP_NOWARN is what bpf uses almost everywhere. There is never a reason
to warn. Also, a warn means printk(), which is unsafe from various places.
We cannot really use printk_deferred_enter() either; the running context is unknown.
__GFP_ZERO is to make sure that call to kmsan_alloc_page() is safe
and it's necessary for bpf anyway.
Initially __GFP_NOMEMALLOC was added to avoid ALLOC_HIGHATOMIC paths
when __alloc_pages_slowpath() was still there in try_alloc_pages().
Later the slowpath was removed, but I left __GFP_NOMEMALLOC
as a self-explanatory statement that try_alloc_pages()
doesn't want to deplete reserves.
I'll add a comment in the next version.
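Something along these lines, i.e. the flags from patch 1 with the rationale
above folded in (the comment wording is a sketch, not the final version):

	gfp_t alloc_gfp = __GFP_NOWARN |	/* never a reason to warn; printk() is unsafe here */
			  __GFP_ZERO |		/* bpf needs zeroed pages; also keeps kmsan_alloc_page() trivial */
			  __GFP_NOMEMALLOC |	/* don't deplete the atomic reserves */
			  __GFP_TRYLOCK;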
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-18 11:32 ` Michal Hocko
@ 2024-12-19 1:45 ` Alexei Starovoitov
2024-12-19 7:03 ` Michal Hocko
0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2024-12-19 1:45 UTC (permalink / raw)
To: Michal Hocko
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Wed, Dec 18, 2024 at 3:32 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 17-12-24 19:07:15, alexei.starovoitov@gmail.com wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce free_pages_nolock() that can free pages without taking locks.
> > It relies on trylock and can be called from any context.
> > Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> > it uses lockless link list to stash the pages which will be freed
> > by subsequent free_pages() from good context.
>
> Yes, this makes sense. Have you tried a simpler implementation that
> would just queue on the lockless link list unconditionally? That would
> certainly reduce the complexity. Essentially something similar that we
> do in vfree_atomic (well, except the queue_work which is likely too
> heavy for the usecase and potentialy not reentrant).
We cannot use the llist approach unconditionally.
One of the ways bpf maps are used is non-stop alloc/free.
We cannot delay the free part. When memory is freed it's better for it to
be available for kernel and bpf use right away.
The llist is the last resort.
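So the overall shape of free_pages_nolock() is roughly (a simplified sketch of
patch 2; the helper and field names are placeholders, not the literal code):

	void free_pages_nolock(struct page *page, unsigned int order)
	{
		/* Fast path: trylock the pcp/zone locks and free right away. */
		if (free_pages_trylock(page, order))
			return;

		/*
		 * Last resort: stash the page on the zone's lockless list.
		 * A later free from a good context drains it back into the
		 * buddy allocator.
		 */
		llist_add(&page->pcp_llist, &page_zone(page)->trylock_free_pages);
	}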
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock.
2024-12-18 11:32 ` Michal Hocko
@ 2024-12-19 1:53 ` Alexei Starovoitov
2024-12-19 7:08 ` Michal Hocko
0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2024-12-19 1:53 UTC (permalink / raw)
To: Michal Hocko
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Wed, Dec 18, 2024 at 3:32 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 17-12-24 19:07:17, alexei.starovoitov@gmail.com wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Teach memcg to operate under trylock conditions when
> > spinning locks cannot be used.
>
> Can we make this trylock unconditional? I hope I am not really missing
> anything important but if the local_lock is just IRQ disabling on !RT.
> For RT this is more involved but does it make sense to spin/sleep on the
> cache if we can go ahead and charge directly to counters? I mean doesn't
> this defeat the purpose of the cache in the first place?
memcg folks please correct me.
My understanding is that memcg_stock is a batching mechanism.
Not really a cache. If we keep charging the counters directly
the performance will suffer due to atomic operations and hierarchy walk.
Hence I had to make sure consume_stock() is functioning as designed
and falls back when unlucky.
In the !RT case the unlucky part can only happen in_nmi, which should be
very rare.
In RT the unlucky part is in_hardirq or in_nmi, since spin_trylock
doesn't really work in those conditions as Steven explained.
Sebastian mentioned that he's working on removing preempt_disable()
from tracepoints. That should help bpf greatly too.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-19 1:45 ` Alexei Starovoitov
@ 2024-12-19 7:03 ` Michal Hocko
2024-12-20 0:42 ` Alexei Starovoitov
0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2024-12-19 7:03 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Wed 18-12-24 17:45:20, Alexei Starovoitov wrote:
> On Wed, Dec 18, 2024 at 3:32 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 17-12-24 19:07:15, alexei.starovoitov@gmail.com wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Introduce free_pages_nolock() that can free pages without taking locks.
> > > It relies on trylock and can be called from any context.
> > > Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> > > it uses lockless link list to stash the pages which will be freed
> > > by subsequent free_pages() from good context.
> >
> > Yes, this makes sense. Have you tried a simpler implementation that
> > would just queue on the lockless link list unconditionally? That would
> > certainly reduce the complexity. Essentially something similar to what we
> > do in vfree_atomic (well, except the queue_work which is likely too
> > heavy for the use case and potentially not reentrant).
>
> We cannot use llist approach unconditionally.
> One of the ways bpf maps are used is non-stop alloc/free.
> We cannot delay the free part. When memory is free it's better to
> be available for kernel and bpf uses right away.
> llist is the last resort.
This is an important detail that should be mentioned in the changelog.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock.
2024-12-19 1:53 ` Alexei Starovoitov
@ 2024-12-19 7:08 ` Michal Hocko
2024-12-19 7:27 ` Michal Hocko
0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2024-12-19 7:08 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Wed 18-12-24 17:53:50, Alexei Starovoitov wrote:
> On Wed, Dec 18, 2024 at 3:32 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 17-12-24 19:07:17, alexei.starovoitov@gmail.com wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Teach memcg to operate under trylock conditions when
> > > spinning locks cannot be used.
> >
> > Can we make this trylock unconditional? I hope I am not really missing
> > anything important but if the local_lock is just IRQ disabling on !RT.
> > For RT this is more involved but does it make sense to spin/sleep on the
> > cache if we can go ahead and charge directly to counters? I mean doesn't
> > this defeat the purpose of the cache in the first place?
>
> memcg folks please correct me.
> My understanding is that memcg_stock is a batching mechanism.
Yes, it is an optimization to avoid charging the page_counter directly,
which involves atomic operations and scales with the depth of the
hierarchy. So yes, it is a batching mechanism to optimize the common case
where the same memcg charges several allocations on the same cpu.
> Not really a cache. If we keep charging the counters directly
> the performance will suffer due to atomic operations and hierarchy walk.
> Hence I had to make sure consume_stock() is functioning as designed
> and fallback when unlucky.
> In !RT case the unlucky part can only happen in_nmi which should be
> very rare.
Right
> In RT the unlucky part is in_hardirq or in_nmi, since spin_trylock
> doesn't really work in those conditions as Steven explained.
Right
All that being said, the message I wanted to get through is that atomic
(NOWAIT) charges could be truly reentrant if the stock local lock uses
trylock. We do not need a dedicated gfp flag for that now.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2024-12-19 1:18 ` Alexei Starovoitov
@ 2024-12-19 7:13 ` Michal Hocko
2024-12-20 0:41 ` Alexei Starovoitov
0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2024-12-19 7:13 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Wed 18-12-24 17:18:51, Alexei Starovoitov wrote:
> On Wed, Dec 18, 2024 at 3:32 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > I like this proposal better. I am still not convinced that we really
> > need internal __GFP_TRYLOCK though.
> >
> > If we reduce try_alloc_pages to the gfp usage we are at the following
> >
> > On Tue 17-12-24 19:07:14, alexei.starovoitov@gmail.com wrote:
> > [...]
> > > +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> > > +{
> > > + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO |
> > > + __GFP_NOMEMALLOC | __GFP_TRYLOCK;
> > > + unsigned int alloc_flags = ALLOC_TRYLOCK;
> > [...]
> > > + prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> > > + &alloc_gfp, &alloc_flags);
> > [...]
> > > + page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> > > +
> > > + /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> > > +
> > > + trace_mm_page_alloc(page, order, alloc_gfp & ~__GFP_TRYLOCK, ac.migratetype);
> > > + kmsan_alloc_page(page, order, alloc_gfp);
> > [...]
> >
> > From those that care about __GFP_TRYLOCK only kmsan_alloc_page doesn't
> > have alloc_flags. Those could make the locking decision based on
> > ALLOC_TRYLOCK.
>
> __GFP_TRYLOCK here sets a baseline and is used in patch 4 by inner
> bits of memcg's consume_stock() logic while called from
> try_alloc_pages() in patch 5.
Yes, I have addressed that part in a reply. In short I believe we can
achieve reentrancy for NOWAIT/ATOMIC charges without a dedicated gfp
flag.
[...]
> > I am not familiar with kmsan internals and my main question is whether
> > this specific usecase really needs a dedicated reentrant
> > kmsan_alloc_page rather than rely on gfp flag to be sufficient.
> > Currently kmsan_in_runtime bails out early in some contexts. The
> > associated comment about hooks is not completely clear to me though.
> > Memory allocation down the road is one of those but it is not really
> > clear to me whether this is the only one.
>
> As I mentioned in giant v2 thread I'm not touching kasan/kmsan
> in this patch set, since it needs its own eyes
> from experts in those bits,
> but when it happens gfp & __GFP_TRYLOCK would be the way
> to adjust whatever is necessary in kasan/kmsan internals.
>
> As Shakeel mentioned, currently kmsan_alloc_page() is gutted,
> since I'm using __GFP_ZERO unconditionally here.
> We don't even get to kmsan_in_runtime() check.
I have missed that part! That means that you can drop kmsan_alloc_page
altogether, no?
[...]
> - and in slab kmalloc. There I'm going to introduce try_kmalloc()
> (or kmalloc_nolock(), naming is hard) that will use this
> internal __GFP_TRYLOCK flag to avoid locks and when it gets
> to new_slab()->allocate_slab()->alloc_slab_page()
> the latter will use try_alloc_pages() instead of alloc_pages().
I cannot really comment on the slab side of things. All I am saying is
that we should _try_ to avoid __GFP_TRYLOCK if possible/feasible. It
seems that the page allocator can do without that. Maybe the slab side can
as well.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2024-12-19 0:05 ` Shakeel Butt
@ 2024-12-19 7:18 ` Michal Hocko
0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2024-12-19 7:18 UTC (permalink / raw)
To: Shakeel Butt
Cc: alexei.starovoitov, bpf, andrii, memxor, akpm, peterz, vbabka,
bigeasy, rostedt, houtao1, hannes, willy, tglx, jannh, tj,
linux-mm, kernel-team
On Wed 18-12-24 16:05:25, Shakeel Butt wrote:
> On Wed, Dec 18, 2024 at 12:32:20PM +0100, Michal Hocko wrote:
> > I like this proposal better. I am still not convinced that we really
> > need internal __GFP_TRYLOCK though.
> >
> > If we reduce try_alloc_pages to the gfp usage we are at the following
> >
> > On Tue 17-12-24 19:07:14, alexei.starovoitov@gmail.com wrote:
> > [...]
> > > +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> > > +{
> > > + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO |
> > > + __GFP_NOMEMALLOC | __GFP_TRYLOCK;
> > > + unsigned int alloc_flags = ALLOC_TRYLOCK;
> > [...]
> > > + prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> > > + &alloc_gfp, &alloc_flags);
> > [...]
> > > + page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> > > +
> > > + /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> > > +
> > > + trace_mm_page_alloc(page, order, alloc_gfp & ~__GFP_TRYLOCK, ac.migratetype);
> > > + kmsan_alloc_page(page, order, alloc_gfp);
> > [...]
> >
> > From those that care about __GFP_TRYLOCK only kmsan_alloc_page doesn't
> > have alloc_flags. Those could make the locking decision based on
> > ALLOC_TRYLOCK.
> >
> > I am not familiar with kmsan internals and my main question is whether
> > this specific usecase really needs a dedicated reentrant
> > kmsan_alloc_page rather than rely on gfp flag to be sufficient.
> > Currently kmsan_in_runtime bails out early in some contexts. The
> > associated comment about hooks is not completely clear to me though.
> > Memory allocation down the road is one of those but it is not really
> > clear to me whether this is the only one.
>
> Is the suggestion that just introduce and use ALLOC_TRYLOCK without the
> need of __GFP_TRYLOCK?
Exactly! Because ALLOC_$FOO is a strictly internal allocator flag that
cannot leak out to external users by design. __GFP_TRYLOCK in this
implementation tries to achieve the same by hiding it, which would work,
but it is both ugly and likely unnecessary.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock.
2024-12-19 7:08 ` Michal Hocko
@ 2024-12-19 7:27 ` Michal Hocko
2024-12-19 7:52 ` Michal Hocko
0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2024-12-19 7:27 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Thu 19-12-24 08:08:44, Michal Hocko wrote:
> All that being said, the message I wanted to get through is that atomic
> (NOWAIT) charges could be truly reentrant if the stock local lock uses
> trylock. We do not need a dedicated gfp flag for that now.
And I want to add: not only can we achieve that, I also think this is
desirable, because for !RT this is no functional change and for RT
it makes more sense to simply do a deterministic (albeit more costly)
page_counter update than to spin on a lock to use the batch (or learn
that the batch cannot be used).
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock.
2024-12-19 7:27 ` Michal Hocko
@ 2024-12-19 7:52 ` Michal Hocko
2024-12-20 0:39 ` Alexei Starovoitov
0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2024-12-19 7:52 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Thu 19-12-24 08:27:06, Michal Hocko wrote:
> On Thu 19-12-24 08:08:44, Michal Hocko wrote:
> > All that being said, the message I wanted to get through is that atomic
> > (NOWAIT) charges could be truly reentrant if the stock local lock uses
> > trylock. We do not need a dedicated gfp flag for that now.
>
> And I want to add. Not only we can achieve that, I also think this is
> desirable because for !RT this will be no functional change and for RT
> it makes more sense to simply do deterministic (albeit more costly
> page_counter update) than spin over a lock to use the batch (or learn
> the batch cannot be used).
So effectively this, on top of yours:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f168d223375f..29a831f6109c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1768,7 +1768,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
return ret;
if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
- if (gfp_mask & __GFP_TRYLOCK)
+ if (!gfpflags_allow_blocking(gfp_mask))
return ret;
local_lock_irqsave(&memcg_stock.stock_lock, flags);
}
@@ -2211,6 +2211,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (consume_stock(memcg, nr_pages, gfp_mask))
return 0;
+ if (!gfpflags_allow_blocking(gfp_mask))
+ batch = nr_pages;
+
if (!do_memsw_account() ||
page_counter_try_charge(&memcg->memsw, batch, &counter)) {
if (page_counter_try_charge(&memcg->memory, batch, &counter))
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock.
2024-12-19 7:52 ` Michal Hocko
@ 2024-12-20 0:39 ` Alexei Starovoitov
2024-12-20 8:24 ` Michal Hocko
0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2024-12-20 0:39 UTC (permalink / raw)
To: Michal Hocko
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Wed, Dec 18, 2024 at 11:52 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 19-12-24 08:27:06, Michal Hocko wrote:
> > On Thu 19-12-24 08:08:44, Michal Hocko wrote:
> > > All that being said, the message I wanted to get through is that atomic
> > > (NOWAIT) charges could be truly reentrant if the stock local lock uses
> > > trylock. We do not need a dedicated gfp flag for that now.
> >
> > And I want to add. Not only we can achieve that, I also think this is
> > desirable because for !RT this will be no functional change and for RT
> > it makes more sense to simply do deterministic (albeit more costly
> > page_counter update) than spin over a lock to use the batch (or learn
> > the batch cannot be used).
>
> So effectively this on top of yours
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f168d223375f..29a831f6109c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1768,7 +1768,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> return ret;
>
> if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> - if (gfp_mask & __GFP_TRYLOCK)
> + if (!gfpflags_allow_blocking(gfp_mask))
> return ret;
> local_lock_irqsave(&memcg_stock.stock_lock, flags);
I don't quite understand such a strong desire to avoid the new GFP flag
especially when it's in mm/internal.h. There are lots of bits left.
It's not like PF_* flags, which are limited, but fine,
let's try to avoid GFP_TRYLOCK_BIT.
You're correct that in !RT the above will work, but in RT
spin_trylock vs spin_lock might cause spurious direct page_counter
charge instead of batching. It's still correct and unlikely to
cause performance issues, so probably fine, but in other
places like slub.c gfpflags_allow_blocking() is too coarse.
All of GFP_NOWAIT would fall into such a 'trylock' category,
more slub bits would be trylock-ing and potentially returning ENOMEM
for existing GFP_NOWAIT users, which is not great.
I think we can do better, though it's a bit odd to indicate
trylock gfp mode by _absence_ of flags instead of presence
of __GFP_TRYLOCK bit.
How about the following:
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ff9060af6295..f06131d5234f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,6 +39,17 @@ static inline bool gfpflags_allow_blocking(const
gfp_t gfp_flags)
return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
}
+static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
+{
+ /*
+ * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
+ * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
+ * All GFP_* flags including GFP_NOWAIT use one or both flags.
+ * try_alloc_pages() is the only API that doesn't specify either flag.
+ */
+ return !!(gfp_flags & __GFP_RECLAIM);
+}
+
#ifdef CONFIG_HIGHMEM
#define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
#else
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f168d223375f..545d345c22de 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1768,7 +1768,7 @@ static bool consume_stock(struct mem_cgroup
*memcg, unsigned int nr_pages,
return ret;
if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
- if (gfp_mask & __GFP_TRYLOCK)
+ if (!gfpflags_allow_spinning(gfp_mask))
return ret;
local_lock_irqsave(&memcg_stock.stock_lock, flags);
}
If that's acceptable then such an approach will work for
my slub.c reentrance changes too.
GFP_NOWAIT users will not be affected.
The slub trylock mode will only be for my upcoming try_kmalloc()
that won't use either gfp_*_reclaim flag.
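Concretely, the helper would classify callers like this (illustrative only):

	bool a = gfpflags_allow_spinning(GFP_KERNEL);	/* true: __GFP_DIRECT_RECLAIM is set */
	bool b = gfpflags_allow_spinning(GFP_NOWAIT);	/* true: __GFP_KSWAPD_RECLAIM is set */
	bool c = gfpflags_allow_spinning(__GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC);
							/* false: the try_alloc_pages() mask, trylock only */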
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2024-12-19 7:13 ` Michal Hocko
@ 2024-12-20 0:41 ` Alexei Starovoitov
0 siblings, 0 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2024-12-20 0:41 UTC (permalink / raw)
To: Michal Hocko
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Wed, Dec 18, 2024 at 11:14 PM Michal Hocko <mhocko@suse.com> wrote:
>
> > As I mentioned in giant v2 thread I'm not touching kasan/kmsan
> > in this patch set, since it needs its own eyes
> > from experts in those bits,
> > but when it happens gfp & __GFP_TRYLOCK would be the way
> > to adjust whatever is necessary in kasan/kmsan internals.
> >
> > As Shakeel mentioned, currently kmsan_alloc_page() is gutted,
> > since I'm using __GFP_ZERO unconditionally here.
> > We don't even get to kmsan_in_runtime() check.
>
> I have missed that part! That means that you can drop kmsan_alloc_page
> altogether no?
I think it's still necessary.
kmsan_alloc_page() still needs to do the zero-ing.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock()
2024-12-19 7:03 ` Michal Hocko
@ 2024-12-20 0:42 ` Alexei Starovoitov
0 siblings, 0 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2024-12-20 0:42 UTC (permalink / raw)
To: Michal Hocko
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Wed, Dec 18, 2024 at 11:03 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 18-12-24 17:45:20, Alexei Starovoitov wrote:
> > On Wed, Dec 18, 2024 at 3:32 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 17-12-24 19:07:15, alexei.starovoitov@gmail.com wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > Introduce free_pages_nolock() that can free pages without taking locks.
> > > > It relies on trylock and can be called from any context.
> > > > Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> > > > it uses lockless link list to stash the pages which will be freed
> > > > by subsequent free_pages() from good context.
> > >
> > > Yes, this makes sense. Have you tried a simpler implementation that
> > > would just queue on the lockless link list unconditionally? That would
> > > certainly reduce the complexity. Essentially something similar to what we
> > > do in vfree_atomic (well, except the queue_work which is likely too
> > > heavy for the use case and potentially not reentrant).
> >
> > We cannot use llist approach unconditionally.
> > One of the ways bpf maps are used is non-stop alloc/free.
> > We cannot delay the free part. When memory is free it's better to
> > be available for kernel and bpf uses right away.
> > llist is the last resort.
>
> This is an important detail that should be mentioned in the changelog.
yeah. The commit log is a bit too short.
Will expand in the next version.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock.
2024-12-20 0:39 ` Alexei Starovoitov
@ 2024-12-20 8:24 ` Michal Hocko
2024-12-20 16:10 ` Alexei Starovoitov
0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2024-12-20 8:24 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Thu 19-12-24 16:39:43, Alexei Starovoitov wrote:
> On Wed, Dec 18, 2024 at 11:52 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 19-12-24 08:27:06, Michal Hocko wrote:
> > > On Thu 19-12-24 08:08:44, Michal Hocko wrote:
> > > > All that being said, the message I wanted to get through is that atomic
> > > > (NOWAIT) charges could be truly reentrant if the stock local lock uses
> > > > trylock. We do not need a dedicated gfp flag for that now.
> > >
> > > And I want to add. Not only we can achieve that, I also think this is
> > > desirable because for !RT this will be no functional change and for RT
> > > it makes more sense to simply do deterministic (albeit more costly
> > > page_counter update) than spin over a lock to use the batch (or learn
> > > the batch cannot be used).
> >
> > So effectively this on top of yours
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f168d223375f..29a831f6109c 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1768,7 +1768,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> > return ret;
> >
> > if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> > - if (gfp_mask & __GFP_TRYLOCK)
> > + if (!gfpflags_allow_blocking(gfp_mask))
> > return ret;
> > local_lock_irqsave(&memcg_stock.stock_lock, flags);
>
> I don't quite understand such a strong desire to avoid the new GFP flag
> especially when it's in mm/internal.h. There are lots of bits left.
> It's not like PF_* flags that are limited, but fine
> let's try to avoid GFP_TRYLOCK_BIT.
Because historically this has proven to be a bad idea that usually
backfires. As I've said in another email, I care much less now that
this is mostly internal (one can still do that but would need to try
hard). But still, if we _can_ avoid it and the code stays generally
_sensible_, then let's not introduce a new flag.
[...]
> How about the following:
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index ff9060af6295..f06131d5234f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -39,6 +39,17 @@ static inline bool gfpflags_allow_blocking(const
> gfp_t gfp_flags)
> return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
> }
>
> +static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
> +{
> + /*
> > + * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
> + * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
> + * All GFP_* flags including GFP_NOWAIT use one or both flags.
> + * try_alloc_pages() is the only API that doesn't specify either flag.
I wouldn't be surprised if we had other allocations like that. git grep
is generally not very helpful as many/most allocations use a gfp argument
of some sort. I would slightly reword this to be more explicit:
/*
* This is stronger than GFP_NOWAIT or GFP_ATOMIC because
* those are guaranteed to never block on a sleeping lock.
* Here we are enforcing that the allocation doesn't ever spin
* on any locks (i.e. only trylocks). There is no high-level
* GFP_$FOO flag for this; use try_alloc_pages() as the
* regular page allocator doesn't fully support this
* allocation mode.
> + */
> + return !!(gfp_flags & __GFP_RECLAIM);
> +}
> +
> #ifdef CONFIG_HIGHMEM
> #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
> #else
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f168d223375f..545d345c22de 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1768,7 +1768,7 @@ static bool consume_stock(struct mem_cgroup
> *memcg, unsigned int nr_pages,
> return ret;
>
> if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> - if (gfp_mask & __GFP_TRYLOCK)
> + if (!gfpflags_allow_spinning(gfp_mask))
> return ret;
> local_lock_irqsave(&memcg_stock.stock_lock, flags);
> }
>
> If that's acceptable then such an approach will work for
> my slub.c reentrance changes too.
It certainly is acceptable for me. Do not forget to add another hunk to
avoid charging the full batch in this case.
Thanks!
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock.
2024-12-20 8:24 ` Michal Hocko
@ 2024-12-20 16:10 ` Alexei Starovoitov
2024-12-20 19:45 ` Shakeel Butt
0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2024-12-20 16:10 UTC (permalink / raw)
To: Michal Hocko
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
Kernel Team
On Fri, Dec 20, 2024 at 12:24 AM Michal Hocko <mhocko@suse.com> wrote:
>
> > +static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
> > +{
> > + /*
> > > + * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
> > + * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
> > + * All GFP_* flags including GFP_NOWAIT use one or both flags.
> > + * try_alloc_pages() is the only API that doesn't specify either flag.
>
> I wouldn't be surprised if we had other allocations like that. git grep
> is generally not very helpful as many/most allocations use gfp argument
> of a sort. I would slightly reword this to be more explicit.
> /*
> * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
> * those are guaranteed to never block on a sleeping lock.
> * Here we are enforcing that the allocation doesn't ever spin
> * on any locks (i.e. only trylocks). There is no high-level
> * GFP_$FOO flag for this; use try_alloc_pages() as the
> * regular page allocator doesn't fully support this
> * allocation mode.
Makes sense. I like this new wording. Will incorporate.
> > + */
> > + return !!(gfp_flags & __GFP_RECLAIM);
> > +}
> > +
> > #ifdef CONFIG_HIGHMEM
> > #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
> > #else
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f168d223375f..545d345c22de 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1768,7 +1768,7 @@ static bool consume_stock(struct mem_cgroup
> > *memcg, unsigned int nr_pages,
> > return ret;
> >
> > if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> > - if (gfp_mask & __GFP_TRYLOCK)
> > + if (!gfpflags_allow_spinning(gfp_mask))
> > return ret;
> > local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > }
> >
> > If that's acceptable then such an approach will work for
> > my slub.c reentrance changes too.
>
> It certainly is acceptable for me.
Great.
> Do not forget to add another hunk to
> avoid charging the full batch in this case.
Well. It looks like you spotted an existing bug?
Instead of
+ if (!gfpflags_allow_blocking(gfp_mask))
+ batch = nr_pages;
it should be unconditional:
+ batch = nr_pages;
after consume_stock() returns false.
Consider:
stock_pages = READ_ONCE(stock->nr_pages);
if (memcg == READ_ONCE(stock->cached) && stock_pages >= nr_pages) {
stock_pages == 10
nr_pages == 20
so after consume_stock() returns false
the batch will stay == MEMCG_CHARGE_BATCH == 64
and
page_counter_try_charge(&memcg->memsw, batch,...
will charge too much?
and the bug was there for a long time.
Johannes,
looks like it's mostly your code?
Pls help us out.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock.
2024-12-20 16:10 ` Alexei Starovoitov
@ 2024-12-20 19:45 ` Shakeel Butt
2024-12-21 7:20 ` Michal Hocko
0 siblings, 1 reply; 36+ messages in thread
From: Shakeel Butt @ 2024-12-20 19:45 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Michal Hocko, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
On Fri, Dec 20, 2024 at 08:10:47AM -0800, Alexei Starovoitov wrote:
> On Fri, Dec 20, 2024 at 12:24 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > > +static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
> > > +{
> > > + /*
> > > + * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
> > > + * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
> > > + * All GFP_* flags including GFP_NOWAIT use one or both flags.
> > > + * try_alloc_pages() is the only API that doesn't specify either flag.
> >
> > I wouldn't be surprised if we had other allocations like that. git grep
> > is generally not very helpful as many/most allocations use gfp argument
> > of a sort. I would slightly reword this to be more explicit.
> > /*
> > * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
> > * those are guaranteed to never block on a sleeping lock.
> > * Here we are enforcing that the allocation doesn't ever spin
> > * on any locks (i.e. only trylocks). There is no high-level
> > * GFP_$FOO flag for this; use try_alloc_pages() as the
> > * regular page allocator doesn't fully support this
> > * allocation mode.
>
> Makes sense. I like this new wording. Will incorporate.
>
> > > + */
> > > + return !!(gfp_flags & __GFP_RECLAIM);
> > > +}
> > > +
> > > #ifdef CONFIG_HIGHMEM
> > > #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
> > > #else
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index f168d223375f..545d345c22de 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1768,7 +1768,7 @@ static bool consume_stock(struct mem_cgroup
> > > *memcg, unsigned int nr_pages,
> > > return ret;
> > >
> > > if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> > > - if (gfp_mask & __GFP_TRYLOCK)
> > > + if (!gfpflags_allow_spinning(gfp_mask))
> > > return ret;
> > > local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > > }
> > >
> > > If that's acceptable then such an approach will work for
> > > my slub.c reentrance changes too.
> >
> > It certainly is acceptable for me.
>
> Great.
>
> > Do not forget to add another hunk to
> > avoid charging the full batch in this case.
>
> Well. It looks like you spotted the existing bug ?
>
> Instead of
> + if (!gfpflags_allow_blocking(gfp_mask))
> + batch = nr_pages;
>
> it should be unconditional:
>
> + batch = nr_pages;
>
> after consume_stock() returns false.
>
> Consider:
> stock_pages = READ_ONCE(stock->nr_pages);
> if (memcg == READ_ONCE(stock->cached) && stock_pages >= nr_pages) {
>
> stock_pages == 10
> nr_pages == 20
>
> so after consume_stock() returns false
> the batch will stay == MEMCG_CHARGE_BATCH == 64
> and
> page_counter_try_charge(&memcg->memsw, batch,...
>
> will charge too much ?
>
> and the bug was there for a long time.
>
> Johannes,
>
> looks like it's mostly your code?
>
> Pls help us out.
I think the code is fine, as the overcharge amount will be refilled into
the stock (the old one will be flushed).
if (!gfpflags_allow_spinning(gfp_mask))
batch = nr_pages;
The above code will just avoid the refill and the flushing of the older stock.
Maybe Michal's suggestion is due to that reason.
BTW after the done_restock tag in try_charge_memcg(), we will need another
gfpflags_allow_spinning() check to avoid schedule_work() and
mem_cgroup_handle_over_high(). Maybe simply return early in that case
without checking the high marks.
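A sketch of that suggestion (the hunk is hypothetical and its exact placement
inside try_charge_memcg() is approximate):

	done_restock:
		if (batch > nr_pages)
			refill_stock(memcg, batch - nr_pages);
		/*
		 * Hypothetical early exit for callers that cannot spin:
		 * skip the high-watermark checks, schedule_work() and
		 * mem_cgroup_handle_over_high() below.
		 */
		if (!gfpflags_allow_spinning(gfp_mask))
			return 0;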
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock.
2024-12-20 19:45 ` Shakeel Butt
@ 2024-12-21 7:20 ` Michal Hocko
0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2024-12-21 7:20 UTC (permalink / raw)
To: Shakeel Butt
Cc: Alexei Starovoitov, bpf, Andrii Nakryiko,
Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
Vlastimil Babka, Sebastian Sewior, Steven Rostedt, Hou Tao,
Johannes Weiner, Matthew Wilcox, Thomas Gleixner, Jann Horn,
Tejun Heo, linux-mm, Kernel Team
On Fri 20-12-24 11:45:47, Shakeel Butt wrote:
[...]
> I think the code is fine as the overcharge amount will be refilled into
> the stock (old one will be flushed).
>
> if (!gfpflags_allow_spinning(gfp_mask))
> batch = nr_pages;
>
> The above code will just avoid the refill and flushing the older stock.
> Maybe Michal's suggestion is due to that reason.
Exactly. This surely begs for a comment to explain that.
> BTW after the done_restock tag in try_charge_memcg(), we will need another
> gfpflags_allow_spinning() check to avoid schedule_work() and
> mem_cgroup_handle_over_high(). Maybe simply return early in that case
> without checking the high marks.
Right you are.
Btw. I will be mostly offline until Jan 6.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2024-12-21 7:20 UTC | newest]
Thread overview: 36+ messages
2024-12-18 3:07 [PATCH bpf-next v3 0/6] bpf, mm: Introduce try_alloc_pages() alexei.starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation alexei.starovoitov
2024-12-18 11:32 ` Michal Hocko
2024-12-19 0:05 ` Shakeel Butt
2024-12-19 7:18 ` Michal Hocko
2024-12-19 1:18 ` Alexei Starovoitov
2024-12-19 7:13 ` Michal Hocko
2024-12-20 0:41 ` Alexei Starovoitov
2024-12-19 0:10 ` Shakeel Butt
2024-12-19 1:39 ` Alexei Starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 2/6] mm, bpf: Introduce free_pages_nolock() alexei.starovoitov
2024-12-18 4:58 ` Yosry Ahmed
2024-12-18 5:33 ` Alexei Starovoitov
2024-12-18 5:57 ` Yosry Ahmed
2024-12-18 6:37 ` Alexei Starovoitov
2024-12-18 6:49 ` Yosry Ahmed
2024-12-18 7:25 ` Alexei Starovoitov
2024-12-18 7:40 ` Yosry Ahmed
2024-12-18 11:32 ` Michal Hocko
2024-12-19 1:45 ` Alexei Starovoitov
2024-12-19 7:03 ` Michal Hocko
2024-12-20 0:42 ` Alexei Starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 3/6] locking/local_lock: Introduce local_trylock_irqsave() alexei.starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 4/6] memcg: Use trylock to access memcg stock_lock alexei.starovoitov
2024-12-18 11:32 ` Michal Hocko
2024-12-19 1:53 ` Alexei Starovoitov
2024-12-19 7:08 ` Michal Hocko
2024-12-19 7:27 ` Michal Hocko
2024-12-19 7:52 ` Michal Hocko
2024-12-20 0:39 ` Alexei Starovoitov
2024-12-20 8:24 ` Michal Hocko
2024-12-20 16:10 ` Alexei Starovoitov
2024-12-20 19:45 ` Shakeel Butt
2024-12-21 7:20 ` Michal Hocko
2024-12-18 3:07 ` [PATCH bpf-next v3 5/6] mm, bpf: Use memcg in try_alloc_pages() alexei.starovoitov
2024-12-18 3:07 ` [PATCH bpf-next v3 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs alexei.starovoitov