linux-mm.kvack.org archive mirror
* [PATCH bpf-next v2 0/6] bpf, mm: Introduce __GFP_TRYLOCK
@ 2024-12-10  2:39 Alexei Starovoitov
  2024-12-10  2:39 ` [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation Alexei Starovoitov
                   ` (5 more replies)
  0 siblings, 6 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10  2:39 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Hi All,

This is a more complete patch set that introduces __GFP_TRYLOCK
for opportunistic page allocation and lockless page freeing.
It's usable for bpf as-is.
The main motivation is to remove bpf_mem_alloc and make
page and slab allocation reentrant.
This patch set is a first step. Once try_alloc_pages() is available,
new_slab() can be converted to it, followed by the rest of kmalloc/slab_alloc.

I started hacking kmalloc() to replace bpf_mem_alloc() completely,
but ___slab_alloc() is quite complex to convert to trylock,
mainly the deactivate_slab part. It cannot fail, but when only trylock
is available I'm running out of ideas.
So far I'm thinking of limiting it to:
- USE_LOCKLESS_FAST_PATH
  Which would mean that we would need to keep bpf_mem_alloc only for RT :(
- slab->flags & __CMPXCHG_DOUBLE, because the various debug options cannot work in
  trylock mode. The bit-based slab_lock() cannot be made to work with trylock either.
- simple kasan poison/unpoison, since kasan_kmalloc and kasan_slab_free are
  too fancy with their own locks.

v1->v2:
- fixed buggy try_alloc_pages_noprof() in PREEMPT_RT. Thanks Peter.
- optimized all paths by doing spin_trylock_irqsave() first
  and only then checking for gfp_flags & __GFP_TRYLOCK,
  falling back to spin_lock_irqsave() in regular mode,
  so the new gfp flag does not add performance overhead
  (see the sketch after this list).
- patches 2-5 are new. They introduce a lockless/trylock-based free_pages_nolock()
  and memcg support, so the series is in usable shape for bpf in patch 6.
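
For reference, the allocation side now follows this ordering (a minimal
sketch of the pattern; the exact hunks are in patch 1):

	if (!spin_trylock_irqsave(&zone->lock, flags)) {
		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
			return NULL;	/* opportunistic caller gives up */
		spin_lock_irqsave(&zone->lock, flags);	/* regular mode blocks */
	}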

v1:
https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/

Alexei Starovoitov (6):
  mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  mm, bpf: Introduce free_pages_nolock()
  locking/local_lock: Introduce local_trylock_irqsave()
  memcg: Add __GFP_TRYLOCK support.
  mm, bpf: Use __GFP_ACCOUNT in try_alloc_pages().
  bpf: Use try_alloc_pages() to allocate pages for bpf needs.

 include/linux/gfp.h                 | 25 ++++++++
 include/linux/gfp_types.h           |  3 +
 include/linux/local_lock.h          |  9 +++
 include/linux/local_lock_internal.h | 23 +++++++
 include/linux/mm_types.h            |  4 ++
 include/linux/mmzone.h              |  3 +
 include/trace/events/mmflags.h      |  1 +
 kernel/bpf/syscall.c                |  4 +-
 mm/fail_page_alloc.c                |  6 ++
 mm/internal.h                       |  1 +
 mm/memcontrol.c                     | 21 +++++--
 mm/page_alloc.c                     | 94 +++++++++++++++++++++++++----
 tools/perf/builtin-kmem.c           |  1 +
 13 files changed, 177 insertions(+), 18 deletions(-)

-- 
2.43.5



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10  2:39 [PATCH bpf-next v2 0/6] bpf, mm: Introduce __GFP_TRYLOCK Alexei Starovoitov
@ 2024-12-10  2:39 ` Alexei Starovoitov
  2024-12-10  5:31   ` Matthew Wilcox
                     ` (2 more replies)
  2024-12-10  2:39 ` [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10  2:39 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Tracing BPF programs execute from tracepoints and kprobes where running
context is unknown, but they need to request additional memory.
The prior workarounds were using pre-allocated memory and BPF specific
freelists to satisfy such allocation requests. Instead, introduce
__GFP_TRYLOCK flag that makes page allocator accessible from any context.
It relies on the percpu free list of pages from which rmqueue_pcplist()
should be able to pop a page. If that fails (due to IRQ re-entrancy or the
list being empty), try_alloc_pages() attempts to spin_trylock zone->lock
and refill the percpu freelist as usual.
A BPF program may execute with IRQs disabled, and zone->lock is a sleeping
lock in RT, so trylock is the only option.
In theory we could introduce a percpu reentrance counter and increment it
every time spin_lock_irqsave(&zone->lock, flags) is used,
but we cannot rely on it. Even if this cpu is not in the page_alloc path,
spin_lock_irqsave() is not safe, since the BPF prog might be called
from a tracepoint where preemption is disabled. So trylock only.

Note, free_page and memcg are not taught about __GFP_TRYLOCK yet.
The support comes in the next patches.

This is a first step towards supporting BPF requirements in SLUB
and getting rid of bpf_mem_alloc.
That goal was discussed at LSFMM: https://lwn.net/Articles/974138/

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h            | 25 +++++++++++++++++++++++++
 include/linux/gfp_types.h      |  3 +++
 include/trace/events/mmflags.h |  1 +
 mm/fail_page_alloc.c           |  6 ++++++
 mm/internal.h                  |  1 +
 mm/page_alloc.c                | 17 ++++++++++++++---
 tools/perf/builtin-kmem.c      |  1 +
 7 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b0fe9f62d15b..f68daa9c997b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -347,6 +347,31 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
 }
 #define alloc_page_vma(...)			alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
 
+static inline struct page *try_alloc_pages_noprof(int nid, unsigned int order)
+{
+	/*
+	 * If spin_locks are not held and interrupts are enabled, use normal
+	 * path. BPF progs run under rcu_read_lock(), so in PREEMPT_RT
+	 * rcu_preempt_depth() will be >= 1 and will use trylock path.
+	 */
+	if (preemptible() && !rcu_preempt_depth())
+		return alloc_pages_node_noprof(nid,
+					       GFP_NOWAIT | __GFP_ZERO,
+					       order);
+	/*
+	 * Best effort allocation from percpu free list.
+	 * If it's empty attempt to spin_trylock zone->lock.
+	 * Do not specify __GFP_KSWAPD_RECLAIM to avoid wakeup_kswapd
+	 * that may need to grab a lock.
+	 * Do not specify __GFP_ACCOUNT to avoid local_lock.
+	 * Do not warn either.
+	 */
+	return alloc_pages_node_noprof(nid,
+				       __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
+				       order);
+}
+#define try_alloc_pages(...)			alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
+
 extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
 #define __get_free_pages(...)			alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
 
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 65db9349f905..72b385a7888d 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -48,6 +48,7 @@ enum {
 	___GFP_THISNODE_BIT,
 	___GFP_ACCOUNT_BIT,
 	___GFP_ZEROTAGS_BIT,
+	___GFP_TRYLOCK_BIT,
 #ifdef CONFIG_KASAN_HW_TAGS
 	___GFP_SKIP_ZERO_BIT,
 	___GFP_SKIP_KASAN_BIT,
@@ -86,6 +87,7 @@ enum {
 #define ___GFP_THISNODE		BIT(___GFP_THISNODE_BIT)
 #define ___GFP_ACCOUNT		BIT(___GFP_ACCOUNT_BIT)
 #define ___GFP_ZEROTAGS		BIT(___GFP_ZEROTAGS_BIT)
+#define ___GFP_TRYLOCK		BIT(___GFP_TRYLOCK_BIT)
 #ifdef CONFIG_KASAN_HW_TAGS
 #define ___GFP_SKIP_ZERO	BIT(___GFP_SKIP_ZERO_BIT)
 #define ___GFP_SKIP_KASAN	BIT(___GFP_SKIP_KASAN_BIT)
@@ -293,6 +295,7 @@ enum {
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
 #define __GFP_ZEROTAGS	((__force gfp_t)___GFP_ZEROTAGS)
+#define __GFP_TRYLOCK	((__force gfp_t)___GFP_TRYLOCK)
 #define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)
 #define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN)
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index bb8a59c6caa2..592c93ee5f35 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -50,6 +50,7 @@
 	gfpflag_string(__GFP_RECLAIM),		\
 	gfpflag_string(__GFP_DIRECT_RECLAIM),	\
 	gfpflag_string(__GFP_KSWAPD_RECLAIM),	\
+	gfpflag_string(__GFP_TRYLOCK),		\
 	gfpflag_string(__GFP_ZEROTAGS)
 
 #ifdef CONFIG_KASAN_HW_TAGS
diff --git a/mm/fail_page_alloc.c b/mm/fail_page_alloc.c
index 7647096170e9..b3b297d67909 100644
--- a/mm/fail_page_alloc.c
+++ b/mm/fail_page_alloc.c
@@ -31,6 +31,12 @@ bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 		return false;
 	if (gfp_mask & __GFP_NOFAIL)
 		return false;
+	if (gfp_mask & __GFP_TRYLOCK)
+		/*
+		 * Internals of should_fail_ex() are not compatible
+		 * with trylock concept.
+		 */
+		return false;
 	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
 		return false;
 	if (fail_page_alloc.ignore_gfp_reclaim &&
diff --git a/mm/internal.h b/mm/internal.h
index cb8d8e8e3ffa..c082b8fa1d71 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1175,6 +1175,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #endif
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
+#define ALLOC_TRYLOCK		0x1000000 /* Only use spin_trylock in allocation path */
 
 /* Flags that allow allocations below the min watermark. */
 #define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1cb4b8c8886d..d511e68903c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2304,7 +2304,11 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+			return 0;
+		spin_lock_irqsave(&zone->lock, flags);
+	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 								alloc_flags);
@@ -2904,7 +2908,11 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 
 	do {
 		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
+		if (!spin_trylock_irqsave(&zone->lock, flags)) {
+			if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+				return NULL;
+			spin_lock_irqsave(&zone->lock, flags);
+		}
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
@@ -4001,6 +4009,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 	 */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_MIN_RESERVE);
 	BUILD_BUG_ON(__GFP_KSWAPD_RECLAIM != (__force gfp_t) ALLOC_KSWAPD);
+	BUILD_BUG_ON(__GFP_TRYLOCK != (__force gfp_t) ALLOC_TRYLOCK);
 
 	/*
 	 * The caller may dip into page reserves a bit more if the caller
@@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
 	 */
 	alloc_flags |= (__force int)
-		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
+		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));
 
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		/*
@@ -4509,6 +4518,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 
 	might_alloc(gfp_mask);
 
+	*alloc_flags |= (__force int) (gfp_mask & __GFP_TRYLOCK);
+
 	if (should_fail_alloc_page(gfp_mask, order))
 		return false;
 
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 4d8d94146f8d..1f7f4269fa10 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -682,6 +682,7 @@ static const struct {
 	{ "__GFP_RECLAIM",		"R" },
 	{ "__GFP_DIRECT_RECLAIM",	"DR" },
 	{ "__GFP_KSWAPD_RECLAIM",	"KR" },
+	{ "__GFP_TRYLOCK",		"TL" },
 };
 
 static size_t max_gfp_len;
-- 
2.43.5



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock()
  2024-12-10  2:39 [PATCH bpf-next v2 0/6] bpf, mm: Introduce __GFP_TRYLOCK Alexei Starovoitov
  2024-12-10  2:39 ` [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation Alexei Starovoitov
@ 2024-12-10  2:39 ` Alexei Starovoitov
  2024-12-10  8:35   ` Sebastian Andrzej Siewior
  2024-12-11 10:11   ` Vlastimil Babka
  2024-12-10  2:39 ` [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10  2:39 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce free_pages_nolock() that can free a page without taking locks.
It relies on trylock only and can be called from any context.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h      |  1 +
 include/linux/mm_types.h |  4 +++
 include/linux/mmzone.h   |  3 ++
 mm/page_alloc.c          | 72 +++++++++++++++++++++++++++++++++++-----
 4 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f68daa9c997b..dcae733ed006 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -394,6 +394,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
 	__get_free_pages((gfp_mask) | GFP_DMA, (order))
 
 extern void __free_pages(struct page *page, unsigned int order);
+extern void free_pages_nolock(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
 
 #define __free_page(page) __free_pages((page), 0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..52547b3e5fd8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,10 @@ struct page {
 				/* Or, free page */
 				struct list_head buddy_list;
 				struct list_head pcp_list;
+				struct {
+					struct llist_node pcp_llist;
+					unsigned int order;
+				};
 			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b36124145a16..1a854e0a9e3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -953,6 +953,9 @@ struct zone {
 	/* Primarily protects free_area */
 	spinlock_t		lock;
 
+	/* Pages to be freed when next trylock succeeds */
+	struct llist_head	trylock_free_pages;
+
 	/* Write-intensive fields used by compaction and vmstats. */
 	CACHELINE_PADDING(_pad2_);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d511e68903c6..a969a62ec0c3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -88,6 +88,9 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_TO_TAIL		((__force fpi_t)BIT(1))
 
+/* Free the page without taking locks. Rely on trylock only. */
+#define FPI_TRYLOCK		((__force fpi_t)BIT(2))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1251,9 +1254,33 @@ static void free_one_page(struct zone *zone, struct page *page,
 			  unsigned long pfn, unsigned int order,
 			  fpi_t fpi_flags)
 {
+	struct llist_head *llhead;
 	unsigned long flags;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+			/* Remember the order */
+			page->order = order;
+			/* Add the page to the free list */
+			llist_add(&page->pcp_llist, &zone->trylock_free_pages);
+			return;
+		}
+		spin_lock_irqsave(&zone->lock, flags);
+	}
+
+	/* The lock succeeded. Process deferred pages. */
+	llhead = &zone->trylock_free_pages;
+	if (unlikely(!llist_empty(llhead))) {
+		struct llist_node *llnode;
+		struct page *p, *tmp;
+
+		llnode = llist_del_all(llhead);
+		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
+			unsigned int p_order = p->order;
+			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
+			__count_vm_events(PGFREE, 1 << p_order);
+		}
+	}
 	split_large_buddy(zone, page, pfn, order, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
@@ -2596,7 +2623,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 				   struct page *page, int migratetype,
-				   unsigned int order)
+				   unsigned int order, fpi_t fpi_flags)
 {
 	int high, batch;
 	int pindex;
@@ -2631,6 +2658,14 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	}
 	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
 		pcp->free_count += (1 << order);
+
+	if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+		/*
+		 * Do not attempt to take a zone lock. Let pcp->count get
+		 * over high mark temporarily.
+		 */
+		return;
+	}
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2645,7 +2680,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 /*
  * Free a pcp page
  */
-void free_unref_page(struct page *page, unsigned int order)
+static void __free_unref_page(struct page *page, unsigned int order,
+			      fpi_t fpi_flags)
 {
 	unsigned long __maybe_unused UP_flags;
 	struct per_cpu_pages *pcp;
@@ -2654,7 +2690,7 @@ void free_unref_page(struct page *page, unsigned int order)
 	int migratetype;
 
 	if (!pcp_allowed_order(order)) {
-		__free_pages_ok(page, order, FPI_NONE);
+		__free_pages_ok(page, order, fpi_flags);
 		return;
 	}
 
@@ -2671,7 +2707,7 @@ void free_unref_page(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
+			free_one_page(page_zone(page), page, pfn, order, fpi_flags);
 			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
@@ -2681,14 +2717,19 @@ void free_unref_page(struct page *page, unsigned int order)
 	pcp_trylock_prepare(UP_flags);
 	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
-		free_unref_page_commit(zone, pcp, page, migratetype, order);
+		free_unref_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
 		pcp_spin_unlock(pcp);
 	} else {
-		free_one_page(zone, page, pfn, order, FPI_NONE);
+		free_one_page(zone, page, pfn, order, fpi_flags);
 	}
 	pcp_trylock_finish(UP_flags);
 }
 
+void free_unref_page(struct page *page, unsigned int order)
+{
+	__free_unref_page(page, order, FPI_NONE);
+}
+
 /*
  * Free a batch of folios
  */
@@ -2777,7 +2818,7 @@ void free_unref_folios(struct folio_batch *folios)
 
 		trace_mm_page_free_batched(&folio->page);
 		free_unref_page_commit(zone, pcp, &folio->page, migratetype,
-				order);
+				       order, FPI_NONE);
 	}
 
 	if (pcp) {
@@ -4855,6 +4896,21 @@ void __free_pages(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL(__free_pages);
 
+/* Can be called while holding raw_spin_lock or from IRQ. RCU must be watching. */
+void free_pages_nolock(struct page *page, unsigned int order)
+{
+	int head = PageHead(page);
+	struct alloc_tag *tag = pgalloc_tag_get(page);
+
+	if (put_page_testzero(page)) {
+		__free_unref_page(page, order, FPI_TRYLOCK);
+	} else if (!head) {
+		pgalloc_tag_sub_pages(tag, (1 << order) - 1);
+		while (order-- > 0)
+			__free_unref_page(page + (1 << order), order, FPI_TRYLOCK);
+	}
+}
+
 void free_pages(unsigned long addr, unsigned int order)
 {
 	if (addr != 0) {
-- 
2.43.5



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave()
  2024-12-10  2:39 [PATCH bpf-next v2 0/6] bpf, mm: Introduce __GFP_TRYLOCK Alexei Starovoitov
  2024-12-10  2:39 ` [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation Alexei Starovoitov
  2024-12-10  2:39 ` [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
@ 2024-12-10  2:39 ` Alexei Starovoitov
  2024-12-11 10:53   ` Vlastimil Babka
  2024-12-12 15:15   ` Sebastian Andrzej Siewior
  2024-12-10  2:39 ` [PATCH bpf-next v2 4/6] memcg: Add __GFP_TRYLOCK support Alexei Starovoitov
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10  2:39 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Similar to local_lock_irqsave() introduce local_trylock_irqsave().
It uses spin_trylock in PREEMPT_RT and always succeeds when !RT.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/local_lock.h          |  9 +++++++++
 include/linux/local_lock_internal.h | 23 +++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 091dc0b6bdfb..6880c29b8b98 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -30,6 +30,15 @@
 #define local_lock_irqsave(lock, flags)				\
 	__local_lock_irqsave(lock, flags)
 
+/**
+ * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
+ *			   interrupts. Always succeeds in !PREEMPT_RT.
+ * @lock:	The lock variable
+ * @flags:	Storage for interrupt flags
+ */
+#define local_trylock_irqsave(lock, flags)			\
+	__local_trylock_irqsave(lock, flags)
+
 /**
  * local_unlock - Release a per CPU local lock
  * @lock:	The lock variable
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8dd71fbbb6d2..2c0f8a49c2d0 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -31,6 +31,13 @@ static inline void local_lock_acquire(local_lock_t *l)
 	l->owner = current;
 }
 
+static inline void local_trylock_acquire(local_lock_t *l)
+{
+	lock_map_acquire_try(&l->dep_map);
+	DEBUG_LOCKS_WARN_ON(l->owner);
+	l->owner = current;
+}
+
 static inline void local_lock_release(local_lock_t *l)
 {
 	DEBUG_LOCKS_WARN_ON(l->owner != current);
@@ -45,6 +52,7 @@ static inline void local_lock_debug_init(local_lock_t *l)
 #else /* CONFIG_DEBUG_LOCK_ALLOC */
 # define LOCAL_LOCK_DEBUG_INIT(lockname)
 static inline void local_lock_acquire(local_lock_t *l) { }
+static inline void local_trylock_acquire(local_lock_t *l) { }
 static inline void local_lock_release(local_lock_t *l) { }
 static inline void local_lock_debug_init(local_lock_t *l) { }
 #endif /* !CONFIG_DEBUG_LOCK_ALLOC */
@@ -91,6 +99,13 @@ do {								\
 		local_lock_acquire(this_cpu_ptr(lock));		\
 	} while (0)
 
+#define __local_trylock_irqsave(lock, flags)			\
+	({							\
+		local_irq_save(flags);				\
+		local_trylock_acquire(this_cpu_ptr(lock));	\
+		1;						\
+	})
+
 #define __local_unlock(lock)					\
 	do {							\
 		local_lock_release(this_cpu_ptr(lock));		\
@@ -148,6 +163,14 @@ typedef spinlock_t local_lock_t;
 		__local_lock(lock);				\
 	} while (0)
 
+#define __local_trylock_irqsave(lock, flags)			\
+	({							\
+		typecheck(unsigned long, flags);		\
+		flags = 0;					\
+		migrate_disable();				\
+		spin_trylock(this_cpu_ptr((lock)));		\
+	})
+
 #define __local_unlock(__lock)					\
 	do {							\
 		spin_unlock(this_cpu_ptr((__lock)));		\
-- 
2.43.5



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH bpf-next v2 4/6] memcg: Add __GFP_TRYLOCK support.
  2024-12-10  2:39 [PATCH bpf-next v2 0/6] bpf, mm: Introduce __GFP_TRYLOCK Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2024-12-10  2:39 ` [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
@ 2024-12-10  2:39 ` Alexei Starovoitov
  2024-12-11 23:47   ` kernel test robot
  2024-12-10  2:39 ` [PATCH bpf-next v2 5/6] mm, bpf: Use __GFP_ACCOUNT in try_alloc_pages() Alexei Starovoitov
  2024-12-10  2:39 ` [PATCH bpf-next v2 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
  5 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10  2:39 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Teach memcg to operate under __GFP_TRYLOCK conditions when
spinning locks cannot be used.
The end result is __memcg_kmem_charge_page() and
__memcg_kmem_uncharge_page() become lockless.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 mm/memcontrol.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..459f35f15819 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1756,7 +1756,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
  *
  * returns true if successful, false otherwise.
  */
-static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+			  gfp_t gfp_mask)
 {
 	struct memcg_stock_pcp *stock;
 	unsigned int stock_pages;
@@ -1766,7 +1767,11 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (nr_pages > MEMCG_CHARGE_BATCH)
 		return ret;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+		if (gfp_mask & __GFP_TRYLOCK)
+			return ret;
+		local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	}
 
 	stock = this_cpu_ptr(&memcg_stock);
 	stock_pages = READ_ONCE(stock->nr_pages);
@@ -1851,7 +1856,15 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	unsigned long flags;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+		/*
+		 * In !RT local_trylock_irqsave() always succeeds.
+		 * In case of unlikely failure to lock percpu stock_lock in RT
+		 * uncharge memcg directly.
+		 */
+		mem_cgroup_cancel_charge(memcg, nr_pages);
+		return;
+	}
 	__refill_stock(memcg, nr_pages);
 	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
@@ -2196,7 +2209,7 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long pflags;
 
 retry:
-	if (consume_stock(memcg, nr_pages))
+	if (consume_stock(memcg, nr_pages, gfp_mask))
 		return 0;
 
 	if (!do_memsw_account() ||
-- 
2.43.5



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH bpf-next v2 5/6] mm, bpf: Use __GFP_ACCOUNT in try_alloc_pages().
  2024-12-10  2:39 [PATCH bpf-next v2 0/6] bpf, mm: Introduce __GFP_TRYLOCK Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2024-12-10  2:39 ` [PATCH bpf-next v2 4/6] memcg: Add __GFP_TRYLOCK support Alexei Starovoitov
@ 2024-12-10  2:39 ` Alexei Starovoitov
  2024-12-11 12:05   ` Vlastimil Babka
  2024-12-10  2:39 ` [PATCH bpf-next v2 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
  5 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10  2:39 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Unconditionally use __GFP_ACCOUNT in try_alloc_pages().
The caller is responsible for setting up memcg correctly.
All BPF memory accounting is memcg based.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h | 5 ++---
 mm/page_alloc.c     | 5 ++++-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index dcae733ed006..820c4938c9cd 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -356,18 +356,17 @@ static inline struct page *try_alloc_pages_noprof(int nid, unsigned int order)
 	 */
 	if (preemptible() && !rcu_preempt_depth())
 		return alloc_pages_node_noprof(nid,
-					       GFP_NOWAIT | __GFP_ZERO,
+					       GFP_NOWAIT | __GFP_ZERO | __GFP_ACCOUNT,
 					       order);
 	/*
 	 * Best effort allocation from percpu free list.
 	 * If it's empty attempt to spin_trylock zone->lock.
 	 * Do not specify __GFP_KSWAPD_RECLAIM to avoid wakeup_kswapd
 	 * that may need to grab a lock.
-	 * Do not specify __GFP_ACCOUNT to avoid local_lock.
 	 * Do not warn either.
 	 */
 	return alloc_pages_node_noprof(nid,
-				       __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
+				       __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT,
 				       order);
 }
 #define try_alloc_pages(...)			alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a969a62ec0c3..1fada16b8a14 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4818,7 +4818,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
 out:
 	if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT) && page &&
 	    unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
-		__free_pages(page, order);
+		if (unlikely(gfp & __GFP_TRYLOCK))
+			free_pages_nolock(page, order);
+		else
+			__free_pages(page, order);
 		page = NULL;
 	}
 
-- 
2.43.5



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH bpf-next v2 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs.
  2024-12-10  2:39 [PATCH bpf-next v2 0/6] bpf, mm: Introduce __GFP_TRYLOCK Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2024-12-10  2:39 ` [PATCH bpf-next v2 5/6] mm, bpf: Use __GFP_ACCOUNT in try_alloc_pages() Alexei Starovoitov
@ 2024-12-10  2:39 ` Alexei Starovoitov
  5 siblings, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10  2:39 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Use try_alloc_pages() and free_pages_nolock().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/syscall.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5684e8ce132d..70589208b545 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -582,14 +582,14 @@ int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
 	old_memcg = set_active_memcg(memcg);
 #endif
 	for (i = 0; i < nr_pages; i++) {
-		pg = alloc_pages_node(nid, gfp | __GFP_ACCOUNT, 0);
+		pg = try_alloc_pages(nid, 0);
 
 		if (pg) {
 			pages[i] = pg;
 			continue;
 		}
 		for (j = 0; j < i; j++)
-			__free_page(pages[j]);
+			free_pages_nolock(pages[j], 0);
 		ret = -ENOMEM;
 		break;
 	}
-- 
2.43.5



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10  2:39 ` [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation Alexei Starovoitov
@ 2024-12-10  5:31   ` Matthew Wilcox
  2024-12-10  9:05     ` Michal Hocko
  2024-12-10 21:42     ` Alexei Starovoitov
  2024-12-10  9:01   ` Sebastian Andrzej Siewior
  2024-12-10 18:39   ` Vlastimil Babka
  2 siblings, 2 replies; 50+ messages in thread
From: Matthew Wilcox @ 2024-12-10  5:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
	houtao1, hannes, shakeel.butt, mhocko, tglx, tj, linux-mm,
	kernel-team

On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote:
> +	if (preemptible() && !rcu_preempt_depth())
> +		return alloc_pages_node_noprof(nid,
> +					       GFP_NOWAIT | __GFP_ZERO,
> +					       order);
> +	return alloc_pages_node_noprof(nid,
> +				       __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
> +				       order);

[...]

> @@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
>  	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
>  	 */
>  	alloc_flags |= (__force int)
> -		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> +		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));

It's not quite clear to me that we need __GFP_TRYLOCK to implement this.
I was originally wondering if this wasn't a memalloc_nolock_save() /
memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore),
but I wonder if we can simply do:

	if (!preemptible() || rcu_preempt_depth())
		alloc_flags |= ALLOC_TRYLOCK;


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock()
  2024-12-10  2:39 ` [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
@ 2024-12-10  8:35   ` Sebastian Andrzej Siewior
  2024-12-10 22:49     ` Alexei Starovoitov
  2024-12-11 10:11   ` Vlastimil Babka
  1 sibling, 1 reply; 50+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-10  8:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, tj, linux-mm,
	kernel-team

On 2024-12-09 18:39:32 [-0800], Alexei Starovoitov wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d511e68903c6..a969a62ec0c3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1251,9 +1254,33 @@ static void free_one_page(struct zone *zone, struct page *page,
>  			  unsigned long pfn, unsigned int order,
>  			  fpi_t fpi_flags)
>  {
> +	struct llist_head *llhead;
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&zone->lock, flags);
> +	if (!spin_trylock_irqsave(&zone->lock, flags)) {
> +		if (unlikely(fpi_flags & FPI_TRYLOCK)) {
> +			/* Remember the order */
> +			page->order = order;
> +			/* Add the page to the free list */
> +			llist_add(&page->pcp_llist, &zone->trylock_free_pages);
> +			return;
> +		}
> +		spin_lock_irqsave(&zone->lock, flags);
> +	}
> +
> +	/* The lock succeeded. Process deferred pages. */
> +	llhead = &zone->trylock_free_pages;
> +	if (unlikely(!llist_empty(llhead))) {
> +		struct llist_node *llnode;
> +		struct page *p, *tmp;
> +
> +		llnode = llist_del_all(llhead);

Do you really need to turn the list around?

> +		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
> +			unsigned int p_order = p->order;
> +			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
> +			__count_vm_events(PGFREE, 1 << p_order);
> +		}

We had something like that (returning memory in IRQ/ irq-off) in RT tree
and we got rid of it before posting the needed bits to mm.

If we really intend to do something like this, could we please process
this list in an explicitly locked section? I mean not in a try-lock
fashion which might have originated in an IRQ-off region on PREEMPT_RT
but in an explicitly locked section which would remain preemptible. This
would also avoid the locking problem down the road when
shuffle_pick_tail() invokes get_random_u64() which in turn acquires a
spinlock_t.

> +	}
>  	split_large_buddy(zone, page, pfn, order, fpi_flags);
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  

Sebastian


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10  2:39 ` [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation Alexei Starovoitov
  2024-12-10  5:31   ` Matthew Wilcox
@ 2024-12-10  9:01   ` Sebastian Andrzej Siewior
  2024-12-10 21:53     ` Alexei Starovoitov
  2024-12-10 18:39   ` Vlastimil Babka
  2 siblings, 1 reply; 50+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-10  9:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, tj, linux-mm,
	kernel-team

On 2024-12-09 18:39:31 [-0800], Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs execute from tracepoints and kprobes where running
> context is unknown, but they need to request additional memory.
> The prior workarounds were using pre-allocated memory and BPF specific
> freelists to satisfy such allocation requests. Instead, introduce
> __GFP_TRYLOCK flag that makes page allocator accessible from any context.
> It relies on percpu free list of pages that rmqueue_pcplist() should be
> able to pop the page from. If it fails (due to IRQ re-entrancy or list
> being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
> and refill percpu freelist as normal.
> BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
> so trylock is the only option.

The __GFP_TRYLOCK flag looks reasonable given the challenges for BPF
where it is not known how much memory will be needed and what the
calling context is. I hope it does not spread across the kernel to places
where people do GFP_ATOMIC allocations in preempt-disabled/IRQ-off regions
on PREEMPT_RT and then, once they learn that this does not work, add this
flag to the mix to make it work without spending some time on reworking it.

Side note: I am in the process of hopefully getting rid of the
preempt_disable() from trace points. What remains then is attaching BPF
programs to any code/ function with a raw_spinlock_t and I am not yet
sure what to do here.

Sebastian


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10  5:31   ` Matthew Wilcox
@ 2024-12-10  9:05     ` Michal Hocko
  2024-12-10 20:25       ` Shakeel Butt
  2024-12-10 22:06       ` Alexei Starovoitov
  2024-12-10 21:42     ` Alexei Starovoitov
  1 sibling, 2 replies; 50+ messages in thread
From: Michal Hocko @ 2024-12-10  9:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Matthew Wilcox
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
	houtao1, hannes, shakeel.butt, tglx, tj, linux-mm, kernel-team

On Tue 10-12-24 05:31:30, Matthew Wilcox wrote:
> On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote:
> > +	if (preemptible() && !rcu_preempt_depth())
> > +		return alloc_pages_node_noprof(nid,
> > +					       GFP_NOWAIT | __GFP_ZERO,
> > +					       order);
> > +	return alloc_pages_node_noprof(nid,
> > +				       __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
> > +				       order);
> 
> [...]
> 
> > @@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
> >  	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
> >  	 */
> >  	alloc_flags |= (__force int)
> > -		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> > +		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));
> 
> It's not quite clear to me that we need __GFP_TRYLOCK to implement this.
> I was originally wondering if this wasn't a memalloc_nolock_save() /
> memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore),
> but I wonder if we can simply do:
> 
> 	if (!preemptible() || rcu_preempt_depth())
> 		alloc_flags |= ALLOC_TRYLOCK;

preemptible() is unusable without CONFIG_PREEMPT_COUNT, but I do agree that
__GFP_TRYLOCK is not really a preferred way to go forward, for 3
reasons. 

First, I do not really like the name, as it tells what it does rather than
how it should be used. This is a general pattern of many gfp flags,
unfortunately, and historically it has turned out error prone. If a gfp
flag is really needed then something like __GFP_ANY_CONTEXT should be
used.  Whether the current implementation uses try_lock for
zone->lock or makes other changes is an implementation detail; the user
should have a clear understanding that allocation is allowed from any
context (NMI, IRQ or otherwise atomic contexts).

Is there any reason why GFP_ATOMIC cannot be extended to support new
contexts? This allocation mode is already documented to be usable from
atomic contexts except from NMI and raw_spinlocks. But is it feasible to
extend the current implementation to use only trylock on zone->lock if
called from in_nmi() to reduce unexpected failures on contention for
existing users?
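
Something along these lines, i.e. a rough and untested sketch of the idea
applied to the rmqueue_bulk() hunk from patch 1:

	if (in_nmi()) {
		if (!spin_trylock_irqsave(&zone->lock, flags))
			return 0;
	} else {
		spin_lock_irqsave(&zone->lock, flags);
	}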

Third, do we even want such a strong guarantee in the generic page
allocator path and make it even more complex and harder to maintain? We
already have a precence in form of __alloc_pages_bulk which is a special
case allocator mode living outside of the page allocator path. It seems
that it covers most of your requirements except the fallback to the
regular allocation path AFAICS. Is this something you could piggy back
on?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10  2:39 ` [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation Alexei Starovoitov
  2024-12-10  5:31   ` Matthew Wilcox
  2024-12-10  9:01   ` Sebastian Andrzej Siewior
@ 2024-12-10 18:39   ` Vlastimil Babka
  2024-12-10 22:42     ` Alexei Starovoitov
  2 siblings, 1 reply; 50+ messages in thread
From: Vlastimil Babka @ 2024-12-10 18:39 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
	shakeel.butt, mhocko, willy, tglx, tj, linux-mm, kernel-team

On 12/10/24 03:39, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs execute from tracepoints and kprobes where running
> context is unknown, but they need to request additional memory.
> The prior workarounds were using pre-allocated memory and BPF specific
> freelists to satisfy such allocation requests. Instead, introduce
> __GFP_TRYLOCK flag that makes page allocator accessible from any context.
> It relies on percpu free list of pages that rmqueue_pcplist() should be
> able to pop the page from. If it fails (due to IRQ re-entrancy or list
> being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
> and refill percpu freelist as normal.
> BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
> so trylock is the only option.
> In theory we can introduce percpu reentrance counter and increment it
> every time spin_lock_irqsave(&zone->lock, flags) is used,
> but we cannot rely on it. Even if this cpu is not in page_alloc path
> the spin_lock_irqsave() is not safe, since BPF prog might be called
> from tracepoint where preemption is disabled. So trylock only.
> 
> Note, free_page and memcg are not taught about __GFP_TRYLOCK yet.
> The support comes in the next patches.
> 
> This is a first step towards supporting BPF requirements in SLUB
> and getting rid of bpf_mem_alloc.
> That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

I think there might be more non-try spin_locks reachable from page allocations:

- in reserve_highatomic_pageblock() which I think is reachable unless this
is limited to order-0
- try_to_accept_memory_one()
- as part of post_alloc_hook() in set_page_owner(), stack depot might do
raw_spin_lock_irqsave(), is that one ok?

hope I didn't miss anything else especially in those other debugging hooks
(KASAN etc)



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10  9:05     ` Michal Hocko
@ 2024-12-10 20:25       ` Shakeel Butt
  2024-12-11 10:08         ` Michal Hocko
  2024-12-10 22:06       ` Alexei Starovoitov
  1 sibling, 1 reply; 50+ messages in thread
From: Shakeel Butt @ 2024-12-10 20:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Matthew Wilcox, bpf, andrii, memxor, akpm,
	peterz, vbabka, bigeasy, rostedt, houtao1, hannes, tglx, tj,
	linux-mm, kernel-team

On Tue, Dec 10, 2024 at 10:05:22AM +0100, Michal Hocko wrote:
> On Tue 10-12-24 05:31:30, Matthew Wilcox wrote:
> > On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote:
> > > +	if (preemptible() && !rcu_preempt_depth())
> > > +		return alloc_pages_node_noprof(nid,
> > > +					       GFP_NOWAIT | __GFP_ZERO,
> > > +					       order);
> > > +	return alloc_pages_node_noprof(nid,
> > > +				       __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
> > > +				       order);
> > 
> > [...]
> > 
> > > @@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
> > >  	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
> > >  	 */
> > >  	alloc_flags |= (__force int)
> > > -		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> > > +		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));
> > 
> > It's not quite clear to me that we need __GFP_TRYLOCK to implement this.
> > I was originally wondering if this wasn't a memalloc_nolock_save() /
> > memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore),
> > but I wonder if we can simply do:
> > 
> > 	if (!preemptible() || rcu_preempt_depth())
> > 		alloc_flags |= ALLOC_TRYLOCK;
> 
> preemptible is unusable without CONFIG_PREEMPT_COUNT but I do agree that
> __GFP_TRYLOCK is not really a preferred way to go forward. For 3
> reasons. 
> 
> First I do not really like the name as it tells what it does rather than
> how it should be used. This is a general pattern of many gfp flags
> unfortunately and historically it has turned out error prone. If a gfp
> flag is really needed then something like __GFP_ANY_CONTEXT should be
> used.  If the current implementation requires to use try_lock for
> zone->lock or other changes is not an implementation detail but the user
> should have a clear understanding that allocation is allowed from any
> context (NMI, IRQ or otherwise atomic contexts).
> 
> Is there any reason why GFP_ATOMIC cannot be extended to support new

GFP_ATOMIC has access to memory reserves. I see GFP_NOWAIT as a better fit,
and if someone wants access to the reserves they can use __GFP_HIGH with
GFP_NOWAIT.
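
For reference, with the current gfp_types.h definitions GFP_ATOMIC is
__GFP_HIGH | __GFP_KSWAPD_RECLAIM, so such a caller would do roughly:

	/* NOWAIT allocation with explicit access to reserves */
	struct page *page = alloc_pages(GFP_NOWAIT | __GFP_HIGH, 0);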

> contexts? This allocation mode is already documented to be usable from
> atomic contexts except from NMI and raw_spinlocks. But is it feasible to
> extend the current implementation to use only trylock on zone->lock if
> called from in_nmi() to reduce unexpected failures on contention for
> existing users?

I think this is the question we (MM folks) need to answer, not the
users.

> 
> Third, do we even want such a strong guarantee in the generic page
> allocator path and make it even more complex and harder to maintain?

I think the alternative would be a higher maintenance cost, i.e. everyone
creating their own layer/solution/caching over the page allocator, which I
think we agree we want to avoid (Vlastimil's LSFMM talk).

> We
> already have a precedent in the form of __alloc_pages_bulk which is a special
> case allocator mode living outside of the page allocator path. It seems
> that it covers most of your requirements except the fallback to the
> regular allocation path AFAICS. Is this something you could piggy back
> on?

BPF already has bpf_mem_alloc(), and IIUC this series is an effort to
unify and have a single solution.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10  5:31   ` Matthew Wilcox
  2024-12-10  9:05     ` Michal Hocko
@ 2024-12-10 21:42     ` Alexei Starovoitov
  1 sibling, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10 21:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
	Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
	Michal Hocko, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Mon, Dec 9, 2024 at 9:31 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote:
> > +     if (preemptible() && !rcu_preempt_depth())
> > +             return alloc_pages_node_noprof(nid,
> > +                                            GFP_NOWAIT | __GFP_ZERO,
> > +                                            order);
> > +     return alloc_pages_node_noprof(nid,
> > +                                    __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
> > +                                    order);
>
> [...]
>
> > @@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
> >        * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
> >        */
> >       alloc_flags |= (__force int)
> > -             (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> > +             (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));
>
> It's not quite clear to me that we need __GFP_TRYLOCK to implement this.
> I was originally wondering if this wasn't a memalloc_nolock_save() /
> memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore),

Interesting idea. It could be useful to pass extra flags into the free_page
path, since it doesn't take flags today, and I'm adding free_pages_nolock()
in patch 2 just to pass fpi_t fpi_flags around.

A memalloc_nofs_save()-like approach makes the most sense when there are
multiple allocations and the code path attempts to be generic.
For the bpf use case it's probably overkill.
I guess we might end up with both __GFP_TRYLOCK and
a memalloc_nolock_save() that clears many flags.
Note it would need to clear __GFP_KSWAPD_RECLAIM, which is not safe
when a raw spin lock is held.
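
If we went that route, I'd expect it to mirror memalloc_nofs_save() in
include/linux/sched/mm.h, roughly like this (a hypothetical sketch; the
PF_MEMALLOC_NOLOCK task flag does not exist today):

	/* Hypothetical sketch modeled on memalloc_nofs_save();
	 * PF_MEMALLOC_NOLOCK is an assumed, not-yet-existing task flag.
	 */
	static inline unsigned int memalloc_nolock_save(void)
	{
		unsigned int flags = current->flags & PF_MEMALLOC_NOLOCK;

		current->flags |= PF_MEMALLOC_NOLOCK;
		return flags;
	}

	static inline void memalloc_nolock_restore(unsigned int flags)
	{
		current->flags = (current->flags & ~PF_MEMALLOC_NOLOCK) | flags;
	}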

> but I wonder if we can simply do:
>
>         if (!preemptible() || rcu_preempt_depth())
>                 alloc_flags |= ALLOC_TRYLOCK;

I don't think we can do that.
It will penalize existing GFP_ATOMIC/NOWAIT users.
kmalloc from RCU CS with GFP_NOWAIT is fine today.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10  9:01   ` Sebastian Andrzej Siewior
@ 2024-12-10 21:53     ` Alexei Starovoitov
  2024-12-11  8:38       ` Vlastimil Babka
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10 21:53 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Tue, Dec 10, 2024 at 1:01 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2024-12-09 18:39:31 [-0800], Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Tracing BPF programs execute from tracepoints and kprobes where running
> > context is unknown, but they need to request additional memory.
> > The prior workarounds were using pre-allocated memory and BPF specific
> > freelists to satisfy such allocation requests. Instead, introduce
> > __GFP_TRYLOCK flag that makes page allocator accessible from any context.
> > It relies on percpu free list of pages that rmqueue_pcplist() should be
> > able to pop the page from. If it fails (due to IRQ re-entrancy or list
> > being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
> > and refill percpu freelist as normal.
> > BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
> > so trylock is the only option.
>
> The __GFP_TRYLOCK flag looks reasonable given the challenges for BPF
> where it is not known how much memory will be needed and what the
> calling context is.

Exactly.

> I hope it does not spread across the kernel where
> people do ATOMIC in preempt/ IRQ-off on PREEMPT_RT and then once they
> learn that this does not work, add this flag to the mix to make it work
> without spending some time on reworking it.

We can call it __GFP_BPF to discourage any other usage,
but that seems like an odd "solution" to a code review problem.
If people start using __GFP_TRYLOCK just to shut up lockdep splats
they will soon realize that it's an _opportunistic_ allocator.
bpf doesn't need more than a page and a single page will likely
be found in the percpu free page pool, so this opportunistic approach
will work most of the time for bpf, but unlikely for others
that need order >= PAGE_ALLOC_COSTLY_ORDER (3).

> Side note: I am in the process of hopefully getting rid of the
> preempt_disable() from trace points. What remains then is attaching BPF
> programs to any code/ function with a raw_spinlock_t and I am not yet
> sure what to do here.

That won't help the bpf core.
There are tracepoints that are called after preemption is disabled.
The worst is trace_contention_begin(), and people have good reasons
to attach a bpf prog there to collect contention stats.
In such a case the bpf prog has no idea what kind of spin_lock is contending.
Preemption and/or irqs might have been disabled before getting to
that tracepoint. So removing preempt_disable from the tracepoint
handling logic doesn't help the bpf core. It's a good thing to do anyway.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10  9:05     ` Michal Hocko
  2024-12-10 20:25       ` Shakeel Butt
@ 2024-12-10 22:06       ` Alexei Starovoitov
  2024-12-11 10:19         ` Michal Hocko
  2024-12-12 15:07         ` Sebastian Sewior
  1 sibling, 2 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10 22:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
	Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Tue, Dec 10, 2024 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 10-12-24 05:31:30, Matthew Wilcox wrote:
> > On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote:
> > > +   if (preemptible() && !rcu_preempt_depth())
> > > +           return alloc_pages_node_noprof(nid,
> > > +                                          GFP_NOWAIT | __GFP_ZERO,
> > > +                                          order);
> > > +   return alloc_pages_node_noprof(nid,
> > > +                                  __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
> > > +                                  order);
> >
> > [...]
> >
> > > @@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
> > >      * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
> > >      */
> > >     alloc_flags |= (__force int)
> > > -           (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> > > +           (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));
> >
> > It's not quite clear to me that we need __GFP_TRYLOCK to implement this.
> > I was originally wondering if this wasn't a memalloc_nolock_save() /
> > memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore),
> > but I wonder if we can simply do:
> >
> >       if (!preemptible() || rcu_preempt_depth())
> >               alloc_flags |= ALLOC_TRYLOCK;
>
> preemptible is unusable without CONFIG_PREEMPT_COUNT but I do agree that
> __GFP_TRYLOCK is not really a preferred way to go forward. For 3
> reasons.
>
> First I do not really like the name as it tells what it does rather than
> how it should be used. This is a general pattern of many gfp flags
> unfortunately and historically it has turned out error prone. If a gfp
> flag is really needed then something like __GFP_ANY_CONTEXT should be
> used.  If the current implementation requires to use try_lock for
> zone->lock or other changes is not an implementation detail but the user
> should have a clear understanding that allocation is allowed from any
> context (NMI, IRQ or otherwise atomic contexts).

__GFP_ANY_CONTEXT would make sense if we wanted to make it available
for all kernel users. In this case I agree with Sebastian.
This is a bpf-specific feature, since bpf doesn't know its context.
All other kernel users should pick GFP_KERNEL or ATOMIC or NOWAIT.
Exposing GFP_ANY_CONTEXT to all may lead to sloppy code in drivers
and elsewhere.

> Is there any reason why GFP_ATOMIC cannot be extended to support new
> contexts? This allocation mode is already documented to be usable from
> atomic contexts except from NMI and raw_spinlocks. But is it feasible to
> extend the current implementation to use only trylock on zone->lock if
> called from in_nmi() to reduce unexpected failures on contention for
> existing users?

No. in_nmi() doesn't help. It's the lack of reentrancy in the slab and page
allocators that is the issue.
The page allocator might grab zone->lock. In !RT it will disable irqs.
In RT it will stay sleepable. Both paths will be calling other
kernel code, including tracepoints, potential kprobes, etc.,
and a bpf prog may be attached somewhere.
If it calls alloc_page() it may deadlock on zone->lock.
The pcpu lock is thankfully a trylock already.
So the !irqs_disabled() part of preemptible() guarantees that
zone->lock won't deadlock in !RT.
And the rcu_preempt_depth() case just steers bpf into the trylock-only path in RT,
since there is no way to tell whether it's safe to call the
sleepable spin_lock(&zone->lock).
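
To spell out the deadlock scenario in call-chain form (illustrative only;
the exact tracepoint/kprobe location is hypothetical):

  alloc_pages(GFP_ATOMIC)
    spin_lock_irqsave(&zone->lock, flags)       <- zone->lock held
      <some tracepoint or kprobe fires inside the locked region>
        bpf_prog_run(prog)                      <- tracing bpf prog attached there
          alloc_page(GFP_ATOMIC)
            spin_lock_irqsave(&zone->lock, flags)  <- deadlock in !RT

With the trylock-only path the inner attempt fails gracefully instead.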

>
> Third, do we even want such a strong guarantee in the generic page
> allocator path and make it even more complex and harder to maintain?

I'm happy to add myself as R: or M: for trylock bits,
since that will be a fundamental building block for bpf.

> We
> already have a precedent in the form of __alloc_pages_bulk, which is a special
> case allocator mode living outside of the page allocator path. It seems
> that it covers most of your requirements except the fallback to the
> regular allocation path AFAICS. Is this something you could piggy back
> on?

__alloc_pages_bulk() has all the same issues. It takes locks.
Also it doesn't support GFP_ACCOUNT which is a show stopper.
All bpf allocations are going through memcg.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10 18:39   ` Vlastimil Babka
@ 2024-12-10 22:42     ` Alexei Starovoitov
  2024-12-11  8:48       ` Vlastimil Babka
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10 22:42 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Tue, Dec 10, 2024 at 10:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 03:39, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Tracing BPF programs execute from tracepoints and kprobes where running
> > context is unknown, but they need to request additional memory.
> > The prior workarounds were using pre-allocated memory and BPF specific
> > freelists to satisfy such allocation requests. Instead, introduce
> > __GFP_TRYLOCK flag that makes page allocator accessible from any context.
> > It relies on percpu free list of pages that rmqueue_pcplist() should be
> > able to pop the page from. If it fails (due to IRQ re-entrancy or list
> > being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
> > and refill percpu freelist as normal.
> > BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
> > so trylock is the only option.
> > In theory we can introduce percpu reentrance counter and increment it
> > every time spin_lock_irqsave(&zone->lock, flags) is used,
> > but we cannot rely on it. Even if this cpu is not in page_alloc path
> > the spin_lock_irqsave() is not safe, since BPF prog might be called
> > from tracepoint where preemption is disabled. So trylock only.
> >
> > Note, free_page and memcg are not taught about __GFP_TRYLOCK yet.
> > The support comes in the next patches.
> >
> > This is a first step towards supporting BPF requirements in SLUB
> > and getting rid of bpf_mem_alloc.
> > That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> I think there might be more non-try spin_locks reachable from page allocations:
>
> - in reserve_highatomic_pageblock() which I think is reachable unless this
> is limited to order-0

Good point. I missed this bit:
   if (order > 0)
     alloc_flags |= ALLOC_HIGHATOMIC;

In the bpf use case it will be called with order == 0 only,
but it's better to foolproof it.
I will switch to:
__GFP_NOMEMALLOC | __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT
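
For reference, the trylock branch of try_alloc_pages_noprof() would then
look roughly like this (just a sketch of the flag combination above;
__GFP_ACCOUNT only gets added in patch 5):

        return alloc_pages_node_noprof(nid,
                                       __GFP_NOMEMALLOC | __GFP_TRYLOCK |
                                       __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT,
                                       order);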


> - try_to_accept_memory_one()

when I studied the code it looked to me that there should be no
unaccepted_pages.
I think you're saying that there could be unaccepted memory
from the previous allocation and trylock attempt just got unlucky
to reach that path?
What do you think of the following:
-               cond_accept_memory(zone, order);
+               cond_accept_memory(zone, order, alloc_flags);

                /*
                 * Detect whether the number of free pages is below high
@@ -7024,7 +7024,8 @@ static inline bool has_unaccepted_memory(void)
        return static_branch_unlikely(&zones_with_unaccepted_pages);
 }

-static bool cond_accept_memory(struct zone *zone, unsigned int order)
+static bool cond_accept_memory(struct zone *zone, unsigned int order,
+                              unsigned int alloc_flags)
 {
        long to_accept;
        bool ret = false;
@@ -7032,6 +7033,9 @@ static bool cond_accept_memory(struct zone
*zone, unsigned int order)
        if (!has_unaccepted_memory())
                return false;

+       if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+               return false;
+

or is there a better approach?

Reading from current->flags the way Matthew proposed?

> - as part of post_alloc_hook() in set_page_owner(), stack depot might do
> raw_spin_lock_irqsave(), is that one ok?

Well, I looked at the stack depot and was tempted to add trylock
handling there, but it looked to be a bit dodgy in general and
I figured it should be done separately from this set.
Like:
        if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
                page = alloc_pages(gfp_nested_mask(alloc_flags),
followed by:
        if (in_nmi()) {
                /* We can never allocate in NMI context. */
                WARN_ON_ONCE(can_alloc);

That WARN is too late. If we were in_nmi() and called alloc_pages()
the kernel might be misbehaving already.

>
> hope I didn't miss anything else especially in those other debugging hooks
> (KASAN etc)

I looked through them and could be missing something, of course.
kasan usage in alloc_page path seems fine.
But for slab I found the kasan_quarantine logic, which needs special treatment.
Other slab debugging bits pose issues too.
The rough idea is to do kmalloc_nolock() / kfree_nolock() that
don't call into any pre/post hooks (including slab_free_hook,
slab_pre_alloc_hook).
kmalloc_nolock() will pretty much call __slab_alloc_node() directly
and do basic kasan poison stuff that needs no locks.

I will be going over all the paths again, of course.

Thanks for the reviews so far!


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock()
  2024-12-10  8:35   ` Sebastian Andrzej Siewior
@ 2024-12-10 22:49     ` Alexei Starovoitov
  2024-12-12 14:44       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-10 22:49 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Tue, Dec 10, 2024 at 12:35 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2024-12-09 18:39:32 [-0800], Alexei Starovoitov wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d511e68903c6..a969a62ec0c3 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1251,9 +1254,33 @@ static void free_one_page(struct zone *zone, struct page *page,
> >                         unsigned long pfn, unsigned int order,
> >                         fpi_t fpi_flags)
> >  {
> > +     struct llist_head *llhead;
> >       unsigned long flags;
> >
> > -     spin_lock_irqsave(&zone->lock, flags);
> > +     if (!spin_trylock_irqsave(&zone->lock, flags)) {
> > +             if (unlikely(fpi_flags & FPI_TRYLOCK)) {
> > +                     /* Remember the order */
> > +                     page->order = order;
> > +                     /* Add the page to the free list */
> > +                     llist_add(&page->pcp_llist, &zone->trylock_free_pages);
> > +                     return;
> > +             }
> > +             spin_lock_irqsave(&zone->lock, flags);
> > +     }
> > +
> > +     /* The lock succeeded. Process deferred pages. */
> > +     llhead = &zone->trylock_free_pages;
> > +     if (unlikely(!llist_empty(llhead))) {
> > +             struct llist_node *llnode;
> > +             struct page *p, *tmp;
> > +
> > +             llnode = llist_del_all(llhead);
>
> Do you really need to turn the list around?

I didn't think LIFO vs FIFO would make a difference.
Why spend time rotating it?

> > +             llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
> > +                     unsigned int p_order = p->order;
> > +                     split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
> > +                     __count_vm_events(PGFREE, 1 << p_order);
> > +             }
>
> We had something like that (returning memory in IRQ/ irq-off) in RT tree
> and we got rid of it before posting the needed bits to mm.
>
> If we really intend to do something like this, could we please process
> this list in an explicitly locked section? I mean not in a try-lock
> fashion which might have originated in an IRQ-off region on PREEMPT_RT
> but in an explicit locked section which would remain preemptible. This
> would also avoid the locking problem down the road when
> shuffle_pick_tail() invokes get_random_u64() which in turn acquires a
> spinlock_t.

I see. So the concern is that, though spin_lock_irqsave(&zone->lock)
is sleepable in RT, the bpf prog might have been called in a context
where preemption is disabled, and doing split_large_buddy() for many
pages might take too much time?
How about kicking irq_work then? The callback runs in a kthread on RT.
We can irq_work_queue() right after llist_add().

Or we can process only N pages at a time in this loop and
llist_add() the leftover back onto zone->trylock_free_pages.
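
A rough sketch of the "N pages at a time" variant (untested; it assumes the
zone->trylock_free_pages llist and the page->order/page->pcp_llist fields
from patch 2, the helper name and budget value are made up, and it is meant
to run with zone->lock held like the hunk above):

static void drain_trylock_free_pages(struct zone *zone, fpi_t fpi_flags)
{
	struct llist_head *llhead = &zone->trylock_free_pages;
	struct llist_node *llnode;
	int budget = 64;	/* arbitrary cap */

	/* zone->lock serializes the deleters; leftover entries simply
	 * stay on the llist and are drained by the next lock holder.
	 */
	while (budget-- > 0 && (llnode = llist_del_first(llhead))) {
		struct page *p = llist_entry(llnode, struct page, pcp_llist);
		unsigned int p_order = p->order;

		split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
		__count_vm_events(PGFREE, 1 << p_order);
	}
}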


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10 21:53     ` Alexei Starovoitov
@ 2024-12-11  8:38       ` Vlastimil Babka
  2024-12-12  2:14         ` Alexei Starovoitov
  0 siblings, 1 reply; 50+ messages in thread
From: Vlastimil Babka @ 2024-12-11  8:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Sebastian Andrzej Siewior
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Hou Tao, Johannes Weiner,
	shakeel.butt, Michal Hocko, Matthew Wilcox, Thomas Gleixner,
	Tejun Heo, linux-mm, Kernel Team

On 12/10/24 22:53, Alexei Starovoitov wrote:
> On Tue, Dec 10, 2024 at 1:01 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>>
>> On 2024-12-09 18:39:31 [-0800], Alexei Starovoitov wrote:
>> > From: Alexei Starovoitov <ast@kernel.org>
>> >
>> > Tracing BPF programs execute from tracepoints and kprobes where running
>> > context is unknown, but they need to request additional memory.
>> > The prior workarounds were using pre-allocated memory and BPF specific
>> > freelists to satisfy such allocation requests. Instead, introduce
>> > __GFP_TRYLOCK flag that makes page allocator accessible from any context.
>> > It relies on percpu free list of pages that rmqueue_pcplist() should be
>> > able to pop the page from. If it fails (due to IRQ re-entrancy or list
>> > being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
>> > and refill percpu freelist as normal.
>> > BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
>> > so trylock is the only option.
>>
>> The __GFP_TRYLOCK flag looks reasonable given the challenges for BPF
>> where it is not known how much memory will be needed and what the
>> calling context is.
> 
> Exactly.
> 
>> I hope it does not spread across the kernel where
>> people do ATOMIC in preempt/ IRQ-off on PREEMPT_RT and then once they
>> learn that this does not work, add this flag to the mix to make it work
>> without spending some time on reworking it.
> 
> We can call it __GFP_BPF to discourage any other usage,
> but that seems like an odd "solution" to code review problem.

Could we perhaps not expose the flag to public headers at all, and keep it
only as an internal detail of try_alloc_pages_noprof()?




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10 22:42     ` Alexei Starovoitov
@ 2024-12-11  8:48       ` Vlastimil Babka
  0 siblings, 0 replies; 50+ messages in thread
From: Vlastimil Babka @ 2024-12-11  8:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On 12/10/24 23:42, Alexei Starovoitov wrote:
> On Tue, Dec 10, 2024 at 10:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 12/10/24 03:39, Alexei Starovoitov wrote:
>> > From: Alexei Starovoitov <ast@kernel.org>
>> >
>> > Tracing BPF programs execute from tracepoints and kprobes where running
>> > context is unknown, but they need to request additional memory.
>> > The prior workarounds were using pre-allocated memory and BPF specific
>> > freelists to satisfy such allocation requests. Instead, introduce
>> > __GFP_TRYLOCK flag that makes page allocator accessible from any context.
>> > It relies on percpu free list of pages that rmqueue_pcplist() should be
>> > able to pop the page from. If it fails (due to IRQ re-entrancy or list
>> > being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
>> > and refill percpu freelist as normal.
>> > BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
>> > so trylock is the only option.
>> > In theory we can introduce percpu reentrance counter and increment it
>> > every time spin_lock_irqsave(&zone->lock, flags) is used,
>> > but we cannot rely on it. Even if this cpu is not in page_alloc path
>> > the spin_lock_irqsave() is not safe, since BPF prog might be called
>> > from tracepoint where preemption is disabled. So trylock only.
>> >
>> > Note, free_page and memcg are not taught about __GFP_TRYLOCK yet.
>> > The support comes in the next patches.
>> >
>> > This is a first step towards supporting BPF requirements in SLUB
>> > and getting rid of bpf_mem_alloc.
>> > That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
>> >
>> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>
>> I think there might be more non-try spin_locks reachable from page allocations:
>>
>> - in reserve_highatomic_pageblock() which I think is reachable unless this
>> is limited to order-0
> 
> Good point. I missed this bit:
>    if (order > 0)
>      alloc_flags |= ALLOC_HIGHATOMIC;
> 
> In the bpf use case it will be called with order == 0 only,
> but it's better to foolproof it.
> I will switch to:
> __GFP_NOMEMALLOC | __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT

Should work, yeah.

>> - try_to_accept_memory_one()
> 
> when I studied the code it looked to me that there should be no
> unaccepted_pages.
> I think you're saying that there could be unaccepted memory
> from the previous allocation and trylock attempt just got unlucky
> to reach that path?

The unaccepted memory can exist after boot in confidential computing guests
to speed up their boot, and then is accepted (encrypted for the guest) on
demand.

> What do you think of the following:
> -               cond_accept_memory(zone, order);
> +               cond_accept_memory(zone, order, alloc_flags);
> 
>                 /*
>                  * Detect whether the number of free pages is below high
> @@ -7024,7 +7024,8 @@ static inline bool has_unaccepted_memory(void)
>         return static_branch_unlikely(&zones_with_unaccepted_pages);
>  }
> 
> -static bool cond_accept_memory(struct zone *zone, unsigned int order)
> +static bool cond_accept_memory(struct zone *zone, unsigned int order,
> +                              unsigned int alloc_flags)
>  {
>         long to_accept;
>         bool ret = false;
> @@ -7032,6 +7033,9 @@ static bool cond_accept_memory(struct zone
> *zone, unsigned int order)
>         if (!has_unaccepted_memory())
>                 return false;
> 
> +       if (unlikely(alloc_flags & ALLOC_TRYLOCK))
> +               return false;
> +

Yeah that should be the straightforward fix.

>>
>> hope I didn't miss anything else especially in those other debugging hooks
>> (KASAN etc)
> 
> I looked through them and could be missing something, of course.
> kasan usage in alloc_page path seems fine.
> But for slab I found kasan_quarantine logic which needs a special treatment.
> Other slab debugging bits pose issues too.
> The rough idea is to do kmalloc_nolock() / kfree_nolock() that
> don't call into any pre/post hooks (including slab_free_hook,
> slab_pre_alloc_hook).
> kmalloc_nolock() will pretty much call __slab_alloc_node() directly
> and do basic kasan poison stuff that needs no locks.

You'll probably want the memcg handling though? Alloc tagging would be good
too. Both IIRC propagate the gfp flags when trying to allocate their
metadata and can deal with failure, so maybe it will work in a straightforward
way. Slab debugging will likely have to be limited, but it's also not
handled in the pre/post hooks but deeper.

> I will be going over all the paths again, of course.
> 
> Thanks for the reviews so far!



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10 20:25       ` Shakeel Butt
@ 2024-12-11 10:08         ` Michal Hocko
  0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2024-12-11 10:08 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexei Starovoitov, Matthew Wilcox, bpf, andrii, memxor, akpm,
	peterz, vbabka, bigeasy, rostedt, houtao1, hannes, tglx, tj,
	linux-mm, kernel-team

On Tue 10-12-24 12:25:04, Shakeel Butt wrote:
> On Tue, Dec 10, 2024 at 10:05:22AM +0100, Michal Hocko wrote:
> > On Tue 10-12-24 05:31:30, Matthew Wilcox wrote:
> > > On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote:
> > > > +	if (preemptible() && !rcu_preempt_depth())
> > > > +		return alloc_pages_node_noprof(nid,
> > > > +					       GFP_NOWAIT | __GFP_ZERO,
> > > > +					       order);
> > > > +	return alloc_pages_node_noprof(nid,
> > > > +				       __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
> > > > +				       order);
> > > 
> > > [...]
> > > 
> > > > @@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
> > > >  	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
> > > >  	 */
> > > >  	alloc_flags |= (__force int)
> > > > -		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> > > > +		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));
> > > 
> > > It's not quite clear to me that we need __GFP_TRYLOCK to implement this.
> > > I was originally wondering if this wasn't a memalloc_nolock_save() /
> > > memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore),
> > > but I wonder if we can simply do:
> > > 
> > > 	if (!preemptible() || rcu_preempt_depth())
> > > 		alloc_flags |= ALLOC_TRYLOCK;
> > 
> > preemptible is unusable without CONFIG_PREEMPT_COUNT but I do agree that
> > __GFP_TRYLOCK is not really a preferred way to go forward. For 3
> > reasons. 
> > 
> > First I do not really like the name as it tells what it does rather than
> > how it should be used. This is a general pattern of many gfp flags,
> > unfortunately, and historically it has turned out error prone. If a gfp
> > flag is really needed then something like __GFP_ANY_CONTEXT should be
> > used. Whether the current implementation requires try_lock for
> > zone->lock or other changes is an implementation detail; the user
> > should have a clear understanding that the allocation is allowed from any
> > context (NMI, IRQ or otherwise atomic contexts).
> > 
> > Is there any reason why GFP_ATOMIC cannot be extended to support new
> 
> GFP_ATOMIC has access to memory reserves. I see GFP_NOWAIT as a better fit,
> and if someone wants access to the reserves they can use __GFP_HIGH with
> GFP_NOWAIT.

Right. The problem with GFP_NOWAIT is that it is very often used as an
opportunistic allocation attempt before a more costly fallback. Failing
those just because of contention on the zone lock (or other internal locks)
seems too aggressive.
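
I.e. the common pattern being referred to is roughly (illustrative):

	p = kmalloc(size, GFP_NOWAIT);		/* opportunistic, may fail */
	if (!p)
		p = kmalloc(size, GFP_KERNEL);	/* costly fallback, may sleep */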

> > Third, do we even want such a strong guarantee in the generic page
> > allocator path and make it even more complex and harder to maintain?
> 
> I think the alternative would be a higher maintenance cost, i.e. everyone
> creating their own layer/solution/caching over the page allocator, which I
> think we agree we want to avoid (Vlastimil's LSFMM talk).

Yes, I do agree that we do not want to grow special case allocators. I
was merely interested in an option to reuse existing bulk allocator for
this new purpose.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock()
  2024-12-10  2:39 ` [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
  2024-12-10  8:35   ` Sebastian Andrzej Siewior
@ 2024-12-11 10:11   ` Vlastimil Babka
  2024-12-12  1:43     ` Alexei Starovoitov
  1 sibling, 1 reply; 50+ messages in thread
From: Vlastimil Babka @ 2024-12-11 10:11 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
	shakeel.butt, mhocko, willy, tglx, tj, linux-mm, kernel-team

On 12/10/24 03:39, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce free_pages_nolock() that can free a page without taking locks.
> It relies on trylock only and can be called from any context.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

<snip>

> +/* Can be called while holding raw_spin_lock or from IRQ. RCU must be watching. */
> +void free_pages_nolock(struct page *page, unsigned int order)

What does "RCU must be watching" mean and why?

> +{
> +	int head = PageHead(page);
> +	struct alloc_tag *tag = pgalloc_tag_get(page);
> +
> +	if (put_page_testzero(page)) {
> +		__free_unref_page(page, order, FPI_TRYLOCK);
> +	} else if (!head) {
> +		pgalloc_tag_sub_pages(tag, (1 << order) - 1);
> +		while (order-- > 0)
> +			__free_unref_page(page + (1 << order), order, FPI_TRYLOCK);

Not your fault for trying to support everything __free_pages did,
specifically order > 0 pages that are not compound and thus needing this
extra !head branch. We'd love to get rid of that eventually in
__free_pages(), but at least I think we don't need to support it in a new
API and instead any users switching to it should know it's not supported.
I suppose BPF doesn't need that, right?
Maybe if the function was taking a folio instead of page, it would be the
best as that has to be order-0 or compound by definition. It also wouldn't
need the order parameter. What do you think, Matthew?

> +	}
> +}
> +
>  void free_pages(unsigned long addr, unsigned int order)
>  {
>  	if (addr != 0) {



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10 22:06       ` Alexei Starovoitov
@ 2024-12-11 10:19         ` Michal Hocko
  2024-12-12 15:07         ` Sebastian Sewior
  1 sibling, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2024-12-11 10:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Matthew Wilcox, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
	Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Tue 10-12-24 14:06:32, Alexei Starovoitov wrote:
> On Tue, Dec 10, 2024 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 10-12-24 05:31:30, Matthew Wilcox wrote:
> > > On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote:
> > > > +   if (preemptible() && !rcu_preempt_depth())
> > > > +           return alloc_pages_node_noprof(nid,
> > > > +                                          GFP_NOWAIT | __GFP_ZERO,
> > > > +                                          order);
> > > > +   return alloc_pages_node_noprof(nid,
> > > > +                                  __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
> > > > +                                  order);
> > >
> > > [...]
> > >
> > > > @@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
> > > >      * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
> > > >      */
> > > >     alloc_flags |= (__force int)
> > > > -           (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> > > > +           (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));
> > >
> > > It's not quite clear to me that we need __GFP_TRYLOCK to implement this.
> > > I was originally wondering if this wasn't a memalloc_nolock_save() /
> > > memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore),
> > > but I wonder if we can simply do:
> > >
> > >       if (!preemptible() || rcu_preempt_depth())
> > >               alloc_flags |= ALLOC_TRYLOCK;
> >
> > preemptible is unusable without CONFIG_PREEMPT_COUNT but I do agree that
> > __GFP_TRYLOCK is not really a preferred way to go forward. For 3
> > reasons.
> >
> > First I do not really like the name as it tells what it does rather than
> > how it should be used. This is a general pattern of many gfp flags,
> > unfortunately, and historically it has turned out error prone. If a gfp
> > flag is really needed then something like __GFP_ANY_CONTEXT should be
> > used. Whether the current implementation requires try_lock for
> > zone->lock or other changes is an implementation detail; the user
> > should have a clear understanding that the allocation is allowed from any
> > context (NMI, IRQ or otherwise atomic contexts).
> 
> __GFP_ANY_CONTEXT would make sense if we wanted to make it available
> for all kernel users. In this case I agree with Sebastian.
> This is a bpf-specific feature, since bpf doesn't know the context.
> All other kernel users should pick GFP_KERNEL or ATOMIC or NOWAIT.
> Exposing GFP_ANY_CONTEXT to all may lead to sloppy code in drivers
> and elsewhere.

I do not think we want a single-user special allocation mode. Not only is
there no way to enforce that this remains a BPF-only feature, it is also
not really a good idea to have a single-user feature in the
allocator.

> > Is there any reason why GFP_ATOMIC cannot be extended to support new
> > contexts? This allocation mode is already documented to be usable from
> > atomic contexts except from NMI and raw_spinlocks. But is it feasible to
> > extend the current implementation to use only trylock on zone->lock if
> > called from in_nmi() to reduce unexpected failures on contention for
> > existing users?
> 
> No. in_nmi() doesn't help. It's the lack of reentrancy in the slab and page
> allocators that is the issue.
> The page allocator might grab zone->lock. In !RT it will disable irqs.
> In RT it will stay sleepable. Both paths will be calling other
> kernel code, including tracepoints, potential kprobes, etc.,
> and a bpf prog may be attached somewhere.
> If it calls alloc_page() it may deadlock on zone->lock.
> The pcpu lock is thankfully a trylock already.
> So the !irqs_disabled() part of preemptible() guarantees that
> zone->lock won't deadlock in !RT.
> And the rcu_preempt_depth() case just steers bpf into the trylock-only path in RT,
> since there is no way to tell whether it's safe to call the
> sleepable spin_lock(&zone->lock).

OK I see!

> > We
> > already have a precedent in the form of __alloc_pages_bulk, which is a special
> > case allocator mode living outside of the page allocator path. It seems
> > that it covers most of your requirements except the fallback to the
> > regular allocation path AFAICS. Is this something you could piggy back
> > on?
> 
> __alloc_pages_bulk() has all the same issues. It takes locks.
> Also it doesn't support GFP_ACCOUNT which is a show stopper.
> All bpf allocations are going through memcg.

OK, this requirement was not clear until I reached the later patches in
the series (just now).
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave()
  2024-12-10  2:39 ` [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
@ 2024-12-11 10:53   ` Vlastimil Babka
  2024-12-11 11:55     ` Vlastimil Babka
  2024-12-12 15:15   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 50+ messages in thread
From: Vlastimil Babka @ 2024-12-11 10:53 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
	shakeel.butt, mhocko, willy, tglx, tj, linux-mm, kernel-team

On 12/10/24 03:39, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Similar to local_lock_irqsave() introduce local_trylock_irqsave().
> It uses spin_trylock in PREEMPT_RT and always succeeds when !RT.

Hmm but is that correct to always succeed? If we're in an nmi, we might be
preempting an existing local_(try)lock_irqsave() critical section because
disabling irqs doesn't stop NMI's, right?

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/local_lock.h          |  9 +++++++++
>  include/linux/local_lock_internal.h | 23 +++++++++++++++++++++++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
> index 091dc0b6bdfb..6880c29b8b98 100644
> --- a/include/linux/local_lock.h
> +++ b/include/linux/local_lock.h
> @@ -30,6 +30,15 @@
>  #define local_lock_irqsave(lock, flags)				\
>  	__local_lock_irqsave(lock, flags)
>  
> +/**
> + * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
> + *			   interrupts. Always succeeds in !PREEMPT_RT.
> + * @lock:	The lock variable
> + * @flags:	Storage for interrupt flags
> + */
> +#define local_trylock_irqsave(lock, flags)			\
> +	__local_trylock_irqsave(lock, flags)
> +
>  /**
>   * local_unlock - Release a per CPU local lock
>   * @lock:	The lock variable
> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> index 8dd71fbbb6d2..2c0f8a49c2d0 100644
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -31,6 +31,13 @@ static inline void local_lock_acquire(local_lock_t *l)
>  	l->owner = current;
>  }
>  
> +static inline void local_trylock_acquire(local_lock_t *l)
> +{
> +	lock_map_acquire_try(&l->dep_map);
> +	DEBUG_LOCKS_WARN_ON(l->owner);
> +	l->owner = current;
> +}
> +
>  static inline void local_lock_release(local_lock_t *l)
>  {
>  	DEBUG_LOCKS_WARN_ON(l->owner != current);
> @@ -45,6 +52,7 @@ static inline void local_lock_debug_init(local_lock_t *l)
>  #else /* CONFIG_DEBUG_LOCK_ALLOC */
>  # define LOCAL_LOCK_DEBUG_INIT(lockname)
>  static inline void local_lock_acquire(local_lock_t *l) { }
> +static inline void local_trylock_acquire(local_lock_t *l) { }
>  static inline void local_lock_release(local_lock_t *l) { }
>  static inline void local_lock_debug_init(local_lock_t *l) { }
>  #endif /* !CONFIG_DEBUG_LOCK_ALLOC */
> @@ -91,6 +99,13 @@ do {								\
>  		local_lock_acquire(this_cpu_ptr(lock));		\
>  	} while (0)
>  
> +#define __local_trylock_irqsave(lock, flags)			\
> +	({							\
> +		local_irq_save(flags);				\
> +		local_trylock_acquire(this_cpu_ptr(lock));	\
> +		1;						\
> +	})
> +
>  #define __local_unlock(lock)					\
>  	do {							\
>  		local_lock_release(this_cpu_ptr(lock));		\
> @@ -148,6 +163,14 @@ typedef spinlock_t local_lock_t;
>  		__local_lock(lock);				\
>  	} while (0)
>  
> +#define __local_trylock_irqsave(lock, flags)			\
> +	({							\
> +		typecheck(unsigned long, flags);		\
> +		flags = 0;					\
> +		migrate_disable();				\
> +		spin_trylock(this_cpu_ptr((__lock)));		\
> +	})
> +
>  #define __local_unlock(__lock)					\
>  	do {							\
>  		spin_unlock(this_cpu_ptr((__lock)));		\



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave()
  2024-12-11 10:53   ` Vlastimil Babka
@ 2024-12-11 11:55     ` Vlastimil Babka
  2024-12-12  2:49       ` Alexei Starovoitov
  0 siblings, 1 reply; 50+ messages in thread
From: Vlastimil Babka @ 2024-12-11 11:55 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
	shakeel.butt, mhocko, willy, tglx, tj, linux-mm, kernel-team,
	Jann Horn

On 12/11/24 11:53, Vlastimil Babka wrote:
> On 12/10/24 03:39, Alexei Starovoitov wrote:
>> From: Alexei Starovoitov <ast@kernel.org>
>> 
>> Similar to local_lock_irqsave() introduce local_trylock_irqsave().
>> It uses spin_trylock in PREEMPT_RT and always succeeds when !RT.
> 
> Hmm but is that correct to always succeed? If we're in an nmi, we might be
> preempting an existing local_(try)lock_irqsave() critical section because
> disabling irqs doesn't stop NMI's, right?

So unless I'm missing something, it would need to be a new kind of local
lock to support this trylock operation on !RT? Perhaps based on the same
principle of a simple active/locked flag that I tried in my sheaves RFC? [1]
There could also be the advantage that if all (potentially) irq contexts
(not just NMI) used trylock, it would be sufficient to disable preemption
and not interrupts, which is cheaper.
The RT variant could work as you proposed here, that was wrong in my RFC as
you already pointed out when we discussed v1 of this series.

[1]
https://lore.kernel.org/all/20241112-slub-percpu-caches-v1-5-ddc0bdc27e05@suse.cz/

>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> ---
>>  include/linux/local_lock.h          |  9 +++++++++
>>  include/linux/local_lock_internal.h | 23 +++++++++++++++++++++++
>>  2 files changed, 32 insertions(+)
>> 
>> diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
>> index 091dc0b6bdfb..6880c29b8b98 100644
>> --- a/include/linux/local_lock.h
>> +++ b/include/linux/local_lock.h
>> @@ -30,6 +30,15 @@
>>  #define local_lock_irqsave(lock, flags)				\
>>  	__local_lock_irqsave(lock, flags)
>>  
>> +/**
>> + * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
>> + *			   interrupts. Always succeeds in !PREEMPT_RT.
>> + * @lock:	The lock variable
>> + * @flags:	Storage for interrupt flags
>> + */
>> +#define local_trylock_irqsave(lock, flags)			\
>> +	__local_trylock_irqsave(lock, flags)
>> +
>>  /**
>>   * local_unlock - Release a per CPU local lock
>>   * @lock:	The lock variable
>> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
>> index 8dd71fbbb6d2..2c0f8a49c2d0 100644
>> --- a/include/linux/local_lock_internal.h
>> +++ b/include/linux/local_lock_internal.h
>> @@ -31,6 +31,13 @@ static inline void local_lock_acquire(local_lock_t *l)
>>  	l->owner = current;
>>  }
>>  
>> +static inline void local_trylock_acquire(local_lock_t *l)
>> +{
>> +	lock_map_acquire_try(&l->dep_map);
>> +	DEBUG_LOCKS_WARN_ON(l->owner);
>> +	l->owner = current;
>> +}
>> +
>>  static inline void local_lock_release(local_lock_t *l)
>>  {
>>  	DEBUG_LOCKS_WARN_ON(l->owner != current);
>> @@ -45,6 +52,7 @@ static inline void local_lock_debug_init(local_lock_t *l)
>>  #else /* CONFIG_DEBUG_LOCK_ALLOC */
>>  # define LOCAL_LOCK_DEBUG_INIT(lockname)
>>  static inline void local_lock_acquire(local_lock_t *l) { }
>> +static inline void local_trylock_acquire(local_lock_t *l) { }
>>  static inline void local_lock_release(local_lock_t *l) { }
>>  static inline void local_lock_debug_init(local_lock_t *l) { }
>>  #endif /* !CONFIG_DEBUG_LOCK_ALLOC */
>> @@ -91,6 +99,13 @@ do {								\
>>  		local_lock_acquire(this_cpu_ptr(lock));		\
>>  	} while (0)
>>  
>> +#define __local_trylock_irqsave(lock, flags)			\
>> +	({							\
>> +		local_irq_save(flags);				\
>> +		local_trylock_acquire(this_cpu_ptr(lock));	\
>> +		1;						\
>> +	})
>> +
>>  #define __local_unlock(lock)					\
>>  	do {							\
>>  		local_lock_release(this_cpu_ptr(lock));		\
>> @@ -148,6 +163,14 @@ typedef spinlock_t local_lock_t;
>>  		__local_lock(lock);				\
>>  	} while (0)
>>  
>> +#define __local_trylock_irqsave(lock, flags)			\
>> +	({							\
>> +		typecheck(unsigned long, flags);		\
>> +		flags = 0;					\
>> +		migrate_disable();				\
>> +		spin_trylock(this_cpu_ptr((__lock)));		\
>> +	})
>> +
>>  #define __local_unlock(__lock)					\
>>  	do {							\
>>  		spin_unlock(this_cpu_ptr((__lock)));		\
> 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 5/6] mm, bpf: Use __GFP_ACCOUNT in try_alloc_pages().
  2024-12-10  2:39 ` [PATCH bpf-next v2 5/6] mm, bpf: Use __GFP_ACCOUNT in try_alloc_pages() Alexei Starovoitov
@ 2024-12-11 12:05   ` Vlastimil Babka
  2024-12-12  2:54     ` Alexei Starovoitov
  0 siblings, 1 reply; 50+ messages in thread
From: Vlastimil Babka @ 2024-12-11 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
	shakeel.butt, mhocko, willy, tglx, tj, linux-mm, kernel-team

On 12/10/24 03:39, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Unconditionally use __GFP_ACCOUNT in try_alloc_pages().
> The caller is responsible to setup memcg correctly.
> All BPF memory accounting is memcg based.

Hm I guess if there was later another user besides bpf that didn't want
memcg for some reason, a variant of the function (or a bool param, to avoid
passing full gfp param which would be thrown away besides __GFP_ACCOUNT)
could be easily created.

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/gfp.h | 5 ++---
>  mm/page_alloc.c     | 5 ++++-
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index dcae733ed006..820c4938c9cd 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -356,18 +356,17 @@ static inline struct page *try_alloc_pages_noprof(int nid, unsigned int order)
>  	 */
>  	if (preemptible() && !rcu_preempt_depth())
>  		return alloc_pages_node_noprof(nid,
> -					       GFP_NOWAIT | __GFP_ZERO,
> +					       GFP_NOWAIT | __GFP_ZERO | __GFP_ACCOUNT,
>  					       order);
>  	/*
>  	 * Best effort allocation from percpu free list.
>  	 * If it's empty attempt to spin_trylock zone->lock.
>  	 * Do not specify __GFP_KSWAPD_RECLAIM to avoid wakeup_kswapd
>  	 * that may need to grab a lock.
> -	 * Do not specify __GFP_ACCOUNT to avoid local_lock.
>  	 * Do not warn either.
>  	 */
>  	return alloc_pages_node_noprof(nid,
> -				       __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
> +				       __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT,
>  				       order);
>  }
>  #define try_alloc_pages(...)			alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a969a62ec0c3..1fada16b8a14 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4818,7 +4818,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>  out:
>  	if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT) && page &&
>  	    unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
> -		__free_pages(page, order);
> +		if (unlikely(gfp & __GFP_TRYLOCK))
> +			free_pages_nolock(page, order);
> +		else
> +			__free_pages(page, order);
>  		page = NULL;
>  	}
>  



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 4/6] memcg: Add __GFP_TRYLOCK support.
  2024-12-10  2:39 ` [PATCH bpf-next v2 4/6] memcg: Add __GFP_TRYLOCK support Alexei Starovoitov
@ 2024-12-11 23:47   ` kernel test robot
  0 siblings, 0 replies; 50+ messages in thread
From: kernel test robot @ 2024-12-11 23:47 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: oe-kbuild-all, andrii, memxor, akpm, peterz, vbabka, bigeasy,
	rostedt, houtao1, hannes, shakeel.butt, mhocko, willy, tglx, tj,
	linux-mm, kernel-team

Hi Alexei,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Alexei-Starovoitov/mm-bpf-Introduce-__GFP_TRYLOCK-for-opportunistic-page-allocation/20241210-104250
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20241210023936.46871-5-alexei.starovoitov%40gmail.com
patch subject: [PATCH bpf-next v2 4/6] memcg: Add __GFP_TRYLOCK support.
config: x86_64-buildonly-randconfig-001-20241210 (https://download.01.org/0day-ci/archive/20241212/202412120745.nnianRgw-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241212/202412120745.nnianRgw-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412120745.nnianRgw-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/memcontrol.c:1761: warning: Function parameter or struct member 'gfp_mask' not described in 'consume_stock'


vim +1761 mm/memcontrol.c

cdec2e4265dfa0 KAMEZAWA Hiroyuki         2009-12-15  1743  
56751146238723 Sebastian Andrzej Siewior 2022-03-22  1744  static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock);
bf4f059954dcb2 Roman Gushchin            2020-08-06  1745  static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
bf4f059954dcb2 Roman Gushchin            2020-08-06  1746  				     struct mem_cgroup *root_memcg);
bf4f059954dcb2 Roman Gushchin            2020-08-06  1747  
a0956d54492eb7 Suleiman Souhlal          2012-12-18  1748  /**
a0956d54492eb7 Suleiman Souhlal          2012-12-18  1749   * consume_stock: Try to consume stocked charge on this cpu.
a0956d54492eb7 Suleiman Souhlal          2012-12-18  1750   * @memcg: memcg to consume from.
a0956d54492eb7 Suleiman Souhlal          2012-12-18  1751   * @nr_pages: how many pages to charge.
a0956d54492eb7 Suleiman Souhlal          2012-12-18  1752   *
a0956d54492eb7 Suleiman Souhlal          2012-12-18  1753   * The charges will only happen if @memcg matches the current cpu's memcg
a0956d54492eb7 Suleiman Souhlal          2012-12-18  1754   * stock, and at least @nr_pages are available in that stock.  Failure to
a0956d54492eb7 Suleiman Souhlal          2012-12-18  1755   * service an allocation will refill the stock.
a0956d54492eb7 Suleiman Souhlal          2012-12-18  1756   *
a0956d54492eb7 Suleiman Souhlal          2012-12-18  1757   * returns true if successful, false otherwise.
cdec2e4265dfa0 KAMEZAWA Hiroyuki         2009-12-15  1758   */
c889f3c0bc9a18 Alexei Starovoitov        2024-12-09  1759  static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
c889f3c0bc9a18 Alexei Starovoitov        2024-12-09  1760  			  gfp_t gfp_mask)
cdec2e4265dfa0 KAMEZAWA Hiroyuki         2009-12-15 @1761  {
cdec2e4265dfa0 KAMEZAWA Hiroyuki         2009-12-15  1762  	struct memcg_stock_pcp *stock;
1872b3bcd5874b Breno Leitao              2024-05-01  1763  	unsigned int stock_pages;
db2ba40c277dc5 Johannes Weiner           2016-09-19  1764  	unsigned long flags;
3e32cb2e0a12b6 Johannes Weiner           2014-12-10  1765  	bool ret = false;
cdec2e4265dfa0 KAMEZAWA Hiroyuki         2009-12-15  1766  
a983b5ebee5720 Johannes Weiner           2018-01-31  1767  	if (nr_pages > MEMCG_CHARGE_BATCH)
3e32cb2e0a12b6 Johannes Weiner           2014-12-10  1768  		return ret;
a0956d54492eb7 Suleiman Souhlal          2012-12-18  1769  
c889f3c0bc9a18 Alexei Starovoitov        2024-12-09  1770  	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
c889f3c0bc9a18 Alexei Starovoitov        2024-12-09  1771  		if (gfp_mask & __GFP_TRYLOCK)
c889f3c0bc9a18 Alexei Starovoitov        2024-12-09  1772  			return ret;
56751146238723 Sebastian Andrzej Siewior 2022-03-22  1773  		local_lock_irqsave(&memcg_stock.stock_lock, flags);
c889f3c0bc9a18 Alexei Starovoitov        2024-12-09  1774  	}
db2ba40c277dc5 Johannes Weiner           2016-09-19  1775  
db2ba40c277dc5 Johannes Weiner           2016-09-19  1776  	stock = this_cpu_ptr(&memcg_stock);
1872b3bcd5874b Breno Leitao              2024-05-01  1777  	stock_pages = READ_ONCE(stock->nr_pages);
1872b3bcd5874b Breno Leitao              2024-05-01  1778  	if (memcg == READ_ONCE(stock->cached) && stock_pages >= nr_pages) {
1872b3bcd5874b Breno Leitao              2024-05-01  1779  		WRITE_ONCE(stock->nr_pages, stock_pages - nr_pages);
3e32cb2e0a12b6 Johannes Weiner           2014-12-10  1780  		ret = true;
3e32cb2e0a12b6 Johannes Weiner           2014-12-10  1781  	}
db2ba40c277dc5 Johannes Weiner           2016-09-19  1782  
56751146238723 Sebastian Andrzej Siewior 2022-03-22  1783  	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
db2ba40c277dc5 Johannes Weiner           2016-09-19  1784  
cdec2e4265dfa0 KAMEZAWA Hiroyuki         2009-12-15  1785  	return ret;
cdec2e4265dfa0 KAMEZAWA Hiroyuki         2009-12-15  1786  }
cdec2e4265dfa0 KAMEZAWA Hiroyuki         2009-12-15  1787  
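
For reference, the warning is only about the new gfp_mask parameter missing
from the consume_stock() kernel-doc above; a minimal fix would add one line
to that comment block, e.g. (wording is a guess):

 * @nr_pages: how many pages to charge.
 * @gfp_mask: allocation flags; with __GFP_TRYLOCK a failed trylock on
 *            memcg_stock.stock_lock returns false instead of falling
 *            back to local_lock_irqsave().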

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock()
  2024-12-11 10:11   ` Vlastimil Babka
@ 2024-12-12  1:43     ` Alexei Starovoitov
  0 siblings, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-12  1:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Wed, Dec 11, 2024 at 2:11 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 03:39, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce free_pages_nolock() that can free a page without taking locks.
> > It relies on trylock only and can be called from any context.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> <snip>
>
> > +/* Can be called while holding raw_spin_lock or from IRQ. RCU must be watching. */
> > +void free_pages_nolock(struct page *page, unsigned int order)
>
> What does "RCU must be watching" mean and why?

Meaning that rcu_is_watching() must be true,
because pgalloc_tag_get() relies on RCU.
free_unref_page() and the things it calls don't use RCU directly,
but it calls tracepoints and they need RCU too.

Hence the comment, to say that free_pages_nolock() cannot be
called from literally _any_ context. Some care needs to be taken.
bpf needs RCU, so we don't allow attaching bpf to idle and notrace code,
but, last I heard, the notrace attribute is not 100% reliable on some odd arches.

> > +{
> > +     int head = PageHead(page);
> > +     struct alloc_tag *tag = pgalloc_tag_get(page);
> > +
> > +     if (put_page_testzero(page)) {
> > +             __free_unref_page(page, order, FPI_TRYLOCK);
> > +     } else if (!head) {
> > +             pgalloc_tag_sub_pages(tag, (1 << order) - 1);
> > +             while (order-- > 0)
> > +                     __free_unref_page(page + (1 << order), order, FPI_TRYLOCK);
>
> Not your fault for trying to support everything __free_pages did,
> specifically order > 0 pages that are not compound and thus needing this
> extra !head branch. We'd love to get rid of that eventually in
> __free_pages(), but at least I think we don't need to support it in a new
> API and instead any users switching to it should know it's not supported.

Great. Good to know.

> I suppose BFP doesn't need that, right?

Nope.
Initially I wrote the above bit as:

void free_pages_nolock(struct page *page, unsigned int order)
{
         if (put_page_testzero(page))
                __free_unref_page(page, order, FPI_TRYLOCK);
}

but then I studied Matthew's commit
e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages")

and couldn't convince myself that the race he described
+               get_page(page);
+               free_pages(addr, 3);
+               put_page(page);

will never happen.
It won't happen if bpf is the only user of this api,
but who knows what happens in the future.
So I copy-pasted this part just in case.
Will be happy to drop it.

> Maybe if the function was taking a folio instead of page, it would be the
> best as that has to be order-0 or compound by definition. It also wouldn't
> need the order parameter. What do you think, Matthew?

For bpf there is no use case for folios.
All we need is an order-0 page, and separately,
later, kmalloc_nolock() will call it from new_slab().
And I think new_slab() might need order >= 1.
So that's the reason to have the order parameter.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-11  8:38       ` Vlastimil Babka
@ 2024-12-12  2:14         ` Alexei Starovoitov
  2024-12-12  8:54           ` Vlastimil Babka
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-12  2:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sebastian Andrzej Siewior, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
	Michal Hocko, Matthew Wilcox, Thomas Gleixner, Tejun Heo,
	linux-mm, Kernel Team

On Wed, Dec 11, 2024 at 12:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 22:53, Alexei Starovoitov wrote:
> > On Tue, Dec 10, 2024 at 1:01 AM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> >>
> >> On 2024-12-09 18:39:31 [-0800], Alexei Starovoitov wrote:
> >> > From: Alexei Starovoitov <ast@kernel.org>
> >> >
> >> > Tracing BPF programs execute from tracepoints and kprobes where running
> >> > context is unknown, but they need to request additional memory.
> >> > The prior workarounds were using pre-allocated memory and BPF specific
> >> > freelists to satisfy such allocation requests. Instead, introduce
> >> > __GFP_TRYLOCK flag that makes page allocator accessible from any context.
> >> > It relies on percpu free list of pages that rmqueue_pcplist() should be
> >> > able to pop the page from. If it fails (due to IRQ re-entrancy or list
> >> > being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
> >> > and refill percpu freelist as normal.
> >> > BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
> >> > so trylock is the only option.
> >>
> >> The __GFP_TRYLOCK flag looks reasonable given the challenges for BPF
> >> where it is not known how much memory will be needed and what the
> >> calling context is.
> >
> > Exactly.
> >
> >> I hope it does not spread across the kernel where
> >> people do ATOMIC in preempt/ IRQ-off on PREEMPT_RT and then once they
> >> learn that this does not work, add this flag to the mix to make it work
> >> without spending some time on reworking it.
> >
> > We can call it __GFP_BPF to discourage any other usage,
> > but that seems like an odd "solution" to code review problem.
>
> Could we perhaps not expose the flag to public headers at all, and keep it
> only as an internal detail of try_alloc_pages_noprof()?

Public headers?
To pass an additional bit via gfp flags into alloc_pages,
gfp_types.h has to be touched.

If you mean moving try_alloc_pages() into mm/page_alloc.c and
adding another argument to __alloc_pages_noprof, then it's not pretty.
It has a 'gfp_t gfp' argument. It should be used to pass the intent.

We don't have to add __GFP_TRYLOCK at all if we go with the
memalloc_nolock_save() approach.
So I started looking at it,
but immediately hit trouble with bits.
There are 5 bits left in PF_ and 3 are already used for mm needs.
That doesn't look sustainable long term.
How about we alias the nolock concept with PF_MEMALLOC_PIN?

As far as I could trace, PF_MEMALLOC_PIN clears __GFP_MOVABLE and nothing else.

The same bit plus lack of __GFP_KSWAPD_RECLAIM in gfp flags
would mean nolock mode in alloc_pages,
while PF_MEMALLOC_PIN alone would mean nolock in free_pages
and deeper inside memcg paths and such.

Thoughts? Too hacky?
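
To illustrate the aliasing idea (just a sketch of the proposal above, not
actual code; the helper name is made up):

	/*
	 * "nolock" allocation mode: PF_MEMALLOC_PIN set by the caller plus
	 * no __GFP_KSWAPD_RECLAIM in the gfp mask. In the free and memcg
	 * paths PF_MEMALLOC_PIN alone would signal "nolock".
	 */
	static inline bool alloc_is_nolock(gfp_t gfp)
	{
		return (current->flags & PF_MEMALLOC_PIN) &&
		       !(gfp & __GFP_KSWAPD_RECLAIM);
	}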


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave()
  2024-12-11 11:55     ` Vlastimil Babka
@ 2024-12-12  2:49       ` Alexei Starovoitov
  2024-12-12  9:15         ` Vlastimil Babka
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-12  2:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team, Jann Horn

On Wed, Dec 11, 2024 at 3:55 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/11/24 11:53, Vlastimil Babka wrote:
> > On 12/10/24 03:39, Alexei Starovoitov wrote:
> >> From: Alexei Starovoitov <ast@kernel.org>
> >>
> >> Similar to local_lock_irqsave() introduce local_trylock_irqsave().
> >> It uses spin_trylock in PREEMPT_RT and always succeeds when !RT.
> >
> > Hmm but is that correct to always succeed? If we're in an nmi, we might be
> > preempting an existing local_(try)lock_irqsave() critical section because
> > disabling irqs doesn't stop NMI's, right?
>
> So unless I'm missing something, it would need to be a new kind of local
> lock to support this trylock operation on !RT?

Ohh. Correct. Forgot about nmi interrupting local_lock_irqsave region in !RT.

> Perhaps based on the same
> principle of a simple active/locked flag that I tried in my sheaves RFC? [1]
> There could also be the advantage that if all (potentially) irq contexts
> (not just NMI) used trylock, it would be sufficient to disable preemption
> and not interrupts, which is cheaper.

I don't think that's the case.
pushf was slow on old x86.
According to https://www.agner.org/optimize/instruction_tables.pdf
it's 3 uops on Skylake.
That could be faster than preempt_disable() (incl %gs:addr),
which is 3-4 uops assuming a cache hit.

> The RT variant could work as you proposed here, that was wrong in my RFC as
> you already pointed out when we discussed v1 of this series.
>
> [1]
> https://lore.kernel.org/all/20241112-slub-percpu-caches-v1-5-ddc0bdc27e05@suse.cz/

I like your
+struct local_tryirq_lock
approach, but let's put it in local_lock.h?
And it probably needs local_inc_return() instead of READ/WRITE_ONCE.
With irqs and NMIs it's racy.

In the meantime I think I will fix below:

> >> +#define __local_trylock_irqsave(lock, flags)                        \
> >> +    ({                                                      \
> >> +            local_irq_save(flags);                          \
> >> +            local_trylock_acquire(this_cpu_ptr(lock));      \
> >> +            1;                                              \
> >> +    })

as
#define __local_trylock_irqsave(lock, flags)                    \
        ({                                                      \
                local_irq_save(flags);                          \
                local_trylock_acquire(this_cpu_ptr(lock));      \
                !in_nmi();                                      \
        })

I think that's good enough for memcg patch 4 and
doesn't grow local_lock_t on !RT.

We can introduce

typedef struct {
        int count;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map      dep_map;
        struct task_struct      *owner;
#endif
} local_trylock_t;

and the whole set of local_trylock_lock, local_trylock_unlock,...
But that's quite a bit of code. Feels a bit overkill for memcg patch 4.
At this point it feels that adding 'int count' to existing local_lock_t
is cleaner.
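
Something along these lines, as an untested sketch (a !RT trylock built on
the 'count' field; the unlock name and the lockdep handling are assumptions
for illustration only):

#define __local_trylock_irqsave(lock, flags)			\
	({							\
		local_trylock_t *lt;				\
		bool locked = false;				\
								\
		local_irq_save(flags);				\
		lt = this_cpu_ptr(lock);			\
		/* An nmi that hits between the check and the	\
		 * store below takes and releases the lock	\
		 * before we resume, so this is not a race	\
		 * that can lose an acquisition.		\
		 */						\
		if (lt->count == 0) {				\
			lt->count = 1;				\
			locked = true;				\
		} else {					\
			local_irq_restore(flags);		\
		}						\
		locked;						\
	})

#define __local_trylock_unlock_irqrestore(lock, flags)		\
	do {							\
		this_cpu_ptr(lock)->count = 0;			\
		local_irq_restore(flags);			\
	} while (0)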


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 5/6] mm, bpf: Use __GFP_ACCOUNT in try_alloc_pages().
  2024-12-11 12:05   ` Vlastimil Babka
@ 2024-12-12  2:54     ` Alexei Starovoitov
  0 siblings, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-12  2:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Wed, Dec 11, 2024 at 4:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 03:39, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Unconditionally use __GFP_ACCOUNT in try_alloc_pages().
> > The caller is responsible to setup memcg correctly.
> > All BPF memory accounting is memcg based.
>
> Hm I guess if there was later another user besides bpf that didn't want
> memcg for some reason, a variant of the function (or a bool param, to avoid
> passing full gfp param which would be thrown away besides __GFP_ACCOUNT)
> could be easily created.

Let's cross that bridge when we get there.
tbh the code that doesn't play nice with memcg should be shamed.
I think CONFIG_MEMCG should disappear and become a default.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-12  2:14         ` Alexei Starovoitov
@ 2024-12-12  8:54           ` Vlastimil Babka
  0 siblings, 0 replies; 50+ messages in thread
From: Vlastimil Babka @ 2024-12-12  8:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Sebastian Andrzej Siewior, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt,
	Michal Hocko, Matthew Wilcox, Thomas Gleixner, Tejun Heo,
	linux-mm, Kernel Team

On 12/12/24 03:14, Alexei Starovoitov wrote:
> On Wed, Dec 11, 2024 at 12:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 12/10/24 22:53, Alexei Starovoitov wrote:
>> > On Tue, Dec 10, 2024 at 1:01 AM Sebastian Andrzej Siewior
>> > <bigeasy@linutronix.de> wrote:
>> >>
>> >> On 2024-12-09 18:39:31 [-0800], Alexei Starovoitov wrote:
>> >> > From: Alexei Starovoitov <ast@kernel.org>
>> >> >
>> >> > Tracing BPF programs execute from tracepoints and kprobes where running
>> >> > context is unknown, but they need to request additional memory.
>> >> > The prior workarounds were using pre-allocated memory and BPF specific
>> >> > freelists to satisfy such allocation requests. Instead, introduce
>> >> > __GFP_TRYLOCK flag that makes page allocator accessible from any context.
>> >> > It relies on percpu free list of pages that rmqueue_pcplist() should be
>> >> > able to pop the page from. If it fails (due to IRQ re-entrancy or list
>> >> > being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
>> >> > and refill percpu freelist as normal.
>> >> > BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
>> >> > so trylock is the only option.
>> >>
>> >> The __GFP_TRYLOCK flag looks reasonable given the challenges for BPF
>> >> where it is not known how much memory will be needed and what the
>> >> calling context is.
>> >
>> > Exactly.
>> >
>> >> I hope it does not spread across the kernel where
>> >> people do ATOMIC in preempt/ IRQ-off on PREEMPT_RT and then once they
>> >> learn that this does not work, add this flag to the mix to make it work
>> >> without spending some time on reworking it.
>> >
>> > We can call it __GFP_BPF to discourage any other usage,
>> > but that seems like an odd "solution" to code review problem.
>>
>> Could we perhaps not expose the flag to public headers at all, and keep it
>> only as an internal detail of try_alloc_pages_noprof()?
> 
> public headers?

I mean it could be (with some work) defined only in e.g. mm/internal.h,
which the flag printing code would then need to include.

> To pass additional bit via gfp flags into alloc_pages
> gfp_types.h has to be touched.

Ah right, try_alloc_pages_noprof() would need to move to page_alloc.c
instead of being static inline in the header.

> If you mean moving try_alloc_pages() into mm/page_alloc.c and
> adding another argument to __alloc_pages_noprof then it's not pretty.
> It has a 'gfp_t gfp' argument. It should be used to pass the intent.

__GFP_TRYLOCK could be visible in page_alloc.c to do this, but not outside mm
code.

> We don't have to add GFP_TRYLOCK at all if we go with
> memalloc_nolock_save() approach.

I have doubts about that idea. We recently rejected PF_MEMALLOC_NORECLAIM
because it could lead to allocations nested in that scope failing and they
might not expect it. Scoped trylock would have an even higher chance of failing.

I think here we need to pass the flag as part of gfp flags only within
nested allocations (for metadata or debugging) within the slab/page
allocator itself, which we already mostly do. The harder problem is not
missing any place where it should affect taking a lock, and a PF_ flag won't
help with that (as we can't expect all locking functions to look at it).
Maybe it could help with lockdep helping us find locks that we missed, but
I'm sure lockdep could be made to track the trylock scope even without a PF
flag?

> So I started looking at it,
> but immediately hit trouble with bits.
> There are 5 bits left in PF_ and 3 already used for mm needs.
> That doesn't look sustainable long term.
> How about we alias nolock concept with PF_MEMALLOC_PIN ?
> 
> As far as I could trace PF_MEMALLOC_PIN clears GFP_MOVABLE and nothing else.
> 
> The same bit plus lack of __GFP_KSWAPD_RECLAIM in gfp flags
> would mean nolock mode in alloc_pages,
> while PF_MEMALLOC_PIN alone would mean nolock in free_pages
> and deeper inside memcg paths and such.
> 
> thoughts? too hacky?



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave()
  2024-12-12  2:49       ` Alexei Starovoitov
@ 2024-12-12  9:15         ` Vlastimil Babka
  2024-12-13 14:02           ` Vlastimil Babka
  0 siblings, 1 reply; 50+ messages in thread
From: Vlastimil Babka @ 2024-12-12  9:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team, Jann Horn

On 12/12/24 03:49, Alexei Starovoitov wrote:
> On Wed, Dec 11, 2024 at 3:55 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 12/11/24 11:53, Vlastimil Babka wrote:
>> > On 12/10/24 03:39, Alexei Starovoitov wrote:
>> >> From: Alexei Starovoitov <ast@kernel.org>
>> >>
>> >> Similar to local_lock_irqsave() introduce local_trylock_irqsave().
>> >> It uses spin_trylock in PREEMPT_RT and always succeeds when !RT.
>> >
>> > Hmm but is that correct to always succeed? If we're in an nmi, we might be
>> > preempting an existing local_(try)lock_irqsave() critical section because
>> > disabling irqs doesn't stop NMI's, right?
>>
>> So unless I'm missing something, it would need to be a new kind of local
>> lock to support this trylock operation on !RT?
> 
> Ohh. Correct. Forgot about nmi interrupting local_lock_irqsave region in !RT.
> 
>> Perhaps based on the same
>> principle of a simple active/locked flag that I tried in my sheaves RFC? [1]
>> There could be also the advantage that if all (potentially) irq contexts
>> (not just nmi) used trylock, it would be sufficient to disable preemption
>> and not interrupts, which is cheaper.
> 
> I don't think it's the case.
> pushf was slow on old x86.
> According to https://www.agner.org/optimize/instruction_tables.pdf
> it's 3 uops on skylake.
> That could be faster than preempt_disable (incl %gs:addr)
> which is 3-4 uops assuming cache hit.

I think the costly ones are not pushf, but cli and popf. I did a
microbenchmark in the kernel (Ryzen 2700, not really the latest thing, but
anyway) and IIRC it was twice as slow to do the irqsave/restore as preempt
only.

>> The RT variant could work as you proposed here, that was wrong in my RFC as
>> you already pointed out when we discussed v1 of this series.
>>
>> [1]
>> https://lore.kernel.org/all/20241112-slub-percpu-caches-v1-5-ddc0bdc27e05@suse.cz/
> 
> I like your
> +struct local_tryirq_lock
> approach, but let's put it in local_lock.h ?

Sure, that was a proof of concept so kept it local.

> and it probably needs local_inc_return() instead of READ/WRITE_ONCE.
> With irq and nmis it's racy.

Hm guess you are right, thanks!

> In the meantime I think I will fix below:
> 
>> >> +#define __local_trylock_irqsave(lock, flags)                        \
>> >> +    ({                                                      \
>> >> +            local_irq_save(flags);                          \
>> >> +            local_trylock_acquire(this_cpu_ptr(lock));      \
>> >> +            1;                                              \
>> >> +    })
> 
> as
> #define __local_trylock_irqsave(lock, flags)                    \
>         ({                                                      \
>                 local_irq_save(flags);                          \
>                 local_trylock_acquire(this_cpu_ptr(lock));      \
>                 !in_nmi();                                      \
>         })
> 
> I think that's good enough for memcg patch 4 and
> doesn't grow local_lock_t on !RT.

But that means you'll never succeed in nmi, doesn't that limit the bpf use case?

> We can introduce
> 
> typedef struct {
>         int count;
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>         struct lockdep_map      dep_map;
>         struct task_struct      *owner;
> #endif
> } local_trylock_t;
> 
> and the whole set of local_trylock_lock, local_trylock_unlock,...
> But that's quite a bit of code. Feels a bit overkill for memcg patch 4.

SLUB also uses local_locks so it would be needed there later too.

> At this point it feels that adding 'int count' to existing local_lock_t
> is cleaner.

We have Peter and Thomas in Cc let's see what they think :)


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock()
  2024-12-10 22:49     ` Alexei Starovoitov
@ 2024-12-12 14:44       ` Sebastian Andrzej Siewior
  2024-12-12 19:57         ` Alexei Starovoitov
  0 siblings, 1 reply; 50+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-12 14:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On 2024-12-10 14:49:14 [-0800], Alexei Starovoitov wrote:
> On Tue, Dec 10, 2024 at 12:35 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> >
> > On 2024-12-09 18:39:32 [-0800], Alexei Starovoitov wrote:
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index d511e68903c6..a969a62ec0c3 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1251,9 +1254,33 @@ static void free_one_page(struct zone *zone, struct page *page,
> > >                         unsigned long pfn, unsigned int order,
> > >                         fpi_t fpi_flags)
> > >  {
> > > +     struct llist_head *llhead;
> > >       unsigned long flags;
> > >
> > > -     spin_lock_irqsave(&zone->lock, flags);
> > > +     if (!spin_trylock_irqsave(&zone->lock, flags)) {
> > > +             if (unlikely(fpi_flags & FPI_TRYLOCK)) {
> > > +                     /* Remember the order */
> > > +                     page->order = order;
> > > +                     /* Add the page to the free list */
> > > +                     llist_add(&page->pcp_llist, &zone->trylock_free_pages);
> > > +                     return;
> > > +             }
> > > +             spin_lock_irqsave(&zone->lock, flags);
> > > +     }
> > > +
> > > +     /* The lock succeeded. Process deferred pages. */
> > > +     llhead = &zone->trylock_free_pages;
> > > +     if (unlikely(!llist_empty(llhead))) {
> > > +             struct llist_node *llnode;
> > > +             struct page *p, *tmp;
> > > +
> > > +             llnode = llist_del_all(llhead);
> >
> > Do you really need to turn the list around?
> 
> I didn't think LIFO vs FIFO would make a difference.
> Why spend time rotating it?

I'm sorry. I read llist_reverse_order() in there but it is not there. So
it is all good.

> > > +             llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
> > > +                     unsigned int p_order = p->order;
> > > +                     split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
> > > +                     __count_vm_events(PGFREE, 1 << p_order);
> > > +             }
> >
> > We had something like that (returning memory in IRQ/ irq-off) in RT tree
> > and we got rid of it before posting the needed bits to mm.
> >
> > If we really intend to do something like this, could we please process
> > this list in an explicitly locked section? I mean not in a try-lock
> > fashion which might have originated in an IRQ-off region on PREEMPT_RT
> > but in an explicit locked section which would remain preemptible. This
> > would also avoid the locking problem down the road when
> > shuffle_pick_tail() invokes get_random_u64() which in turn acquires a
> > spinlock_t.
> 
> I see. So the concern is though spin_lock_irqsave(&zone->lock)
> is sleepable in RT, bpf prog might have been called in the context
> where preemption is disabled and do split_large_buddy() for many
> pages might take too much time?
Yes.

> How about kicking irq_work then? The callback is in kthread in RT.
> We can irq_work_queue() right after llist_add().
> 
> Or we can process only N pages at a time in this loop and
> llist_add() leftover back into zone->trylock_free_pages.

It could be simpler to not process the trylock_free_pages list in the
trylock attempt, only in the lock case which is preemptible.

Sebastian


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-10 22:06       ` Alexei Starovoitov
  2024-12-11 10:19         ` Michal Hocko
@ 2024-12-12 15:07         ` Sebastian Sewior
  2024-12-12 15:21           ` Michal Hocko
  1 sibling, 1 reply; 50+ messages in thread
From: Sebastian Sewior @ 2024-12-12 15:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Michal Hocko, Matthew Wilcox, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
	Vlastimil Babka, Steven Rostedt, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On 2024-12-10 14:06:32 [-0800], Alexei Starovoitov wrote:
> > Is there any reason why GFP_ATOMIC cannot be extended to support new
> > contexts? This allocation mode is already documented to be usable from
> > atomic contexts except from NMI and raw_spinlocks. But is it feasible to
> > extend the current implementation to use only trylock on zone->lock if
> > called from in_nmi() to reduce unexpected failures on contention for
> > existing users?
> 
> No. in_nmi() doesn't help. It's the lack of reentrancy in the slab and page
> allocator that is an issue.
> The page allocator might grab the zone lock. In !RT it will disable irqs.
> In RT it will stay sleepable. Both paths will be calling other
> kernel code including tracepoints, potential kprobes, etc
> and bpf prog may be attached somewhere.
> If it calls alloc_page() it may deadlock on zone->lock.
> pcpu lock is thankfully trylock already.
> So !irqs_disabled() part of preemptible() guarantees that
> zone->lock won't deadlock in !RT.
> And the rcu_preempt_depth() case just steers bpf into the trylock-only path in RT.
> Since there is no way to tell whether it's safe to call
> sleepable spin_lock(&zone->lock).

Oh. You don't need to check rcu_preempt_depth() for that. On PREEMPT_RT
rcu_preempt_depth() is incremented with every spin_lock() because we
need an explicit start of an RCU section (the same thing happens with
preempt_disable() spin_lock()). If there is already a RCU section
(rcu_preempt_depth() > 0) you can still try to acquire a spinlock_t and
maybe schedule out/ sleep. That is okay.

But since I see in_nmi(). You can't trylock from NMI on RT. The trylock
part is easy but unlock might need to acquire rt_mutex_base::wait_lock
and worst case is to wake a waiter via wake_up_process().

Sebastian


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave()
  2024-12-10  2:39 ` [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
  2024-12-11 10:53   ` Vlastimil Babka
@ 2024-12-12 15:15   ` Sebastian Andrzej Siewior
  2024-12-12 19:59     ` Alexei Starovoitov
  1 sibling, 1 reply; 50+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-12 15:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, tj, linux-mm,
	kernel-team

> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> index 8dd71fbbb6d2..2c0f8a49c2d0 100644
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -148,6 +163,14 @@ typedef spinlock_t local_lock_t;
>  		__local_lock(lock);				\
>  	} while (0)
>  
> +#define __local_trylock_irqsave(lock, flags)			\
> +	({							\
> +		typecheck(unsigned long, flags);		\
> +		flags = 0;					\
> +		migrate_disable();				\
> +		spin_trylock(this_cpu_ptr((__lock)));		\

You should probably do a migrate_enable() here if the trylock fails.

> +	})
> +
>  #define __local_unlock(__lock)					\
>  	do {							\
>  		spin_unlock(this_cpu_ptr((__lock)));		\

Sebastian


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-12 15:07         ` Sebastian Sewior
@ 2024-12-12 15:21           ` Michal Hocko
  2024-12-12 15:35             ` Sebastian Sewior
  0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2024-12-12 15:21 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Alexei Starovoitov, Matthew Wilcox, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
	Vlastimil Babka, Steven Rostedt, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Thu 12-12-24 16:07:44, Sebastian Sewior wrote:
> But since I see in_nmi(). You can't trylock from NMI on RT. The trylock
> part is easy but unlock might need to acquire rt_mutex_base::wait_lock
> and worst case is to wake a waiter via wake_up_process().

Ohh, I didn't realize that. So try_lock would only be safe on
raw_spin_lock right?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-12 15:21           ` Michal Hocko
@ 2024-12-12 15:35             ` Sebastian Sewior
  2024-12-12 15:48               ` Steven Rostedt
  2024-12-12 21:57               ` Alexei Starovoitov
  0 siblings, 2 replies; 50+ messages in thread
From: Sebastian Sewior @ 2024-12-12 15:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Matthew Wilcox, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
	Vlastimil Babka, Steven Rostedt, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On 2024-12-12 16:21:28 [+0100], Michal Hocko wrote:
> On Thu 12-12-24 16:07:44, Sebastian Sewior wrote:
> > But since I see in_nmi(). You can't trylock from NMI on RT. The trylock
> > part is easy but unlock might need to acquire rt_mutex_base::wait_lock
> > and worst case is to wake a waiter via wake_up_process().
> 
> Ohh, I didn't realize that. So try_lock would only be safe on
> raw_spin_lock right?

If NMI is one of the possible calling contexts, yes.

One thing I am not 100% sure about is how "good" a spinlock_t trylock is
if attempted from hardirq (on PREEMPT_RT). Obtaining the lock and
unlocking is doable. The lock part will assign the "current" task as the
task that owns the lock now. This task is just randomly on the CPU while
the hardirq triggered. The regular spin_lock() will see this random task
as the owner and might PI-boost it. This could work…

Sebastian


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-12 15:35             ` Sebastian Sewior
@ 2024-12-12 15:48               ` Steven Rostedt
  2024-12-12 16:00                 ` Sebastian Sewior
  2024-12-12 21:57               ` Alexei Starovoitov
  1 sibling, 1 reply; 50+ messages in thread
From: Steven Rostedt @ 2024-12-12 15:48 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Michal Hocko, Alexei Starovoitov, Matthew Wilcox, bpf,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Thu, 12 Dec 2024 16:35:06 +0100
Sebastian Sewior <bigeasy@linutronix.de> wrote:

> If NMI is one of the possible calling contexts, yes.
> 
> One thing I am not 100% sure about is how "good" a spinlock_t trylock is
> if attempted from hardirq (on PREEMPT_RT). Obtaining the lock and
> unlocking is doable. The lock part will assign the "current" task as the
> task that owns the lock now. This task is just randomly on the CPU while
> the hardirq triggered. The regular spin_lock() will see this random task
> as the owner and might PI-boost it. This could work…

Looking at the unlock code, it and the slowtrylock() appear to use
raw_spin_lock_irqsave(). Hence it expects that it can be called from
irq disabled context. If it can be used in interrupt disabled context,
I don't see why it wouldn't work in actual interrupt context.

-- Steve


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-12 15:48               ` Steven Rostedt
@ 2024-12-12 16:00                 ` Sebastian Sewior
  2024-12-13 17:44                   ` Steven Rostedt
  0 siblings, 1 reply; 50+ messages in thread
From: Sebastian Sewior @ 2024-12-12 16:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Michal Hocko, Alexei Starovoitov, Matthew Wilcox, bpf,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On 2024-12-12 10:48:09 [-0500], Steven Rostedt wrote:
> On Thu, 12 Dec 2024 16:35:06 +0100
> Sebastian Sewior <bigeasy@linutronix.de> wrote:
> 
> > If NMI is one of the possible calling contexts, yes.
> > 
> > One thing I am not 100% sure about is how "good" a spinlock_t trylock is
> > if attempted from hardirq (on PREEMPT_RT). Obtaining the lock and
> > unlocking is doable. The lock part will assign the "current" task as the
> > task that owns the lock now. This task is just randomly on the CPU while
> > the hardirq triggered. The regular spin_lock() will see this random task
> > as the owner and might PI-boost it. This could work…
> 
> Looking at the unlock code, it and the slowtrylock() appears to use
> raw_spin_lock_irqsave(). Hence it expects that it can be called from
> irq disabled context. If it can be used in interrupt disabled context,
> I don't see why it wouldn't work in actual interrupt context.

trylock records current as owner. This task did not attempt to acquire
the lock.
The lock part might PI-boost it via task_blocks_on_rt_mutex() -
this random task. That task might already be waiting on one lock and now this
gets added to the mix.
This could be okay since a task can own multiple locks, wait on one and
get PI-boosted on any of those at any time.
However, we never used it that way.

The locking of the raw_spinlock_t has irqsave. Correct. But not because
it expects to be called in interrupt-disabled context or an actual
interrupt. It was _irq() but got changed because it is used in the early
init code and would unconditionally enable interrupts which should
remain disabled.

> -- Steve

Sebastian


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock()
  2024-12-12 14:44       ` Sebastian Andrzej Siewior
@ 2024-12-12 19:57         ` Alexei Starovoitov
  0 siblings, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-12 19:57 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Thu, Dec 12, 2024 at 6:44 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2024-12-10 14:49:14 [-0800], Alexei Starovoitov wrote:
> > On Tue, Dec 10, 2024 at 12:35 AM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> > >
> > > On 2024-12-09 18:39:32 [-0800], Alexei Starovoitov wrote:
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index d511e68903c6..a969a62ec0c3 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1251,9 +1254,33 @@ static void free_one_page(struct zone *zone, struct page *page,
> > > >                         unsigned long pfn, unsigned int order,
> > > >                         fpi_t fpi_flags)
> > > >  {
> > > > +     struct llist_head *llhead;
> > > >       unsigned long flags;
> > > >
> > > > -     spin_lock_irqsave(&zone->lock, flags);
> > > > +     if (!spin_trylock_irqsave(&zone->lock, flags)) {
> > > > +             if (unlikely(fpi_flags & FPI_TRYLOCK)) {
> > > > +                     /* Remember the order */
> > > > +                     page->order = order;
> > > > +                     /* Add the page to the free list */
> > > > +                     llist_add(&page->pcp_llist, &zone->trylock_free_pages);
> > > > +                     return;
> > > > +             }
> > > > +             spin_lock_irqsave(&zone->lock, flags);
> > > > +     }
> > > > +
> > > > +     /* The lock succeeded. Process deferred pages. */
> > > > +     llhead = &zone->trylock_free_pages;
> > > > +     if (unlikely(!llist_empty(llhead))) {
> > > > +             struct llist_node *llnode;
> > > > +             struct page *p, *tmp;
> > > > +
> > > > +             llnode = llist_del_all(llhead);
> > >
> > > Do you really need to turn the list around?
> >
> > I didn't think LIFO vs FIFO would make a difference.
> > Why spend time rotating it?
>
> I'm sorry. I read llist_reverse_order() in there but it is not there. So
> it is all good.
>
> > > > +             llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
> > > > +                     unsigned int p_order = p->order;
> > > > +                     split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
> > > > +                     __count_vm_events(PGFREE, 1 << p_order);
> > > > +             }
> > >
> > > We had something like that (returning memory in IRQ/ irq-off) in RT tree
> > > and we got rid of it before posting the needed bits to mm.
> > >
> > > If we really intend to do something like this, could we please process
> > > this list in an explicitly locked section? I mean not in a try-lock
> > > fashion which might have originated in an IRQ-off region on PREEMPT_RT
> > > but in an explicit locked section which would remain preemptible. This
> > > would also avoid the locking problem down the road when
> > > shuffle_pick_tail() invokes get_random_u64() which in turn acquires a
> > > spinlock_t.
> >
> > I see. So the concern is though spin_lock_irqsave(&zone->lock)
> > is sleepable in RT, bpf prog might have been called in the context
> > where preemption is disabled and do split_large_buddy() for many
> > pages might take too much time?
> Yes.
>
> > How about kicking irq_work then? The callback is in kthread in RT.
> > We can irq_work_queue() right after llist_add().
> >
> > Or we can process only N pages at a time in this loop and
> > llist_add() leftover back into zone->trylock_free_pages.
>
> It could be simpler to not process the trylock_free_pages list in the
> trylock attempt, only in the lock case which is preemptible.

Makes sense. Will change to:

        /* The lock succeeded. Process deferred pages. */
        llhead = &zone->trylock_free_pages;
-       if (unlikely(!llist_empty(llhead))) {
+       if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave()
  2024-12-12 15:15   ` Sebastian Andrzej Siewior
@ 2024-12-12 19:59     ` Alexei Starovoitov
  0 siblings, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-12 19:59 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Thu, Dec 12, 2024 at 7:15 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> > diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> > index 8dd71fbbb6d2..2c0f8a49c2d0 100644
> > --- a/include/linux/local_lock_internal.h
> > +++ b/include/linux/local_lock_internal.h
> > @@ -148,6 +163,14 @@ typedef spinlock_t local_lock_t;
> >               __local_lock(lock);                             \
> >       } while (0)
> >
> > +#define __local_trylock_irqsave(lock, flags)                 \
> > +     ({                                                      \
> > +             typecheck(unsigned long, flags);                \
> > +             flags = 0;                                      \
> > +             migrate_disable();                              \
> > +             spin_trylock(this_cpu_ptr((__lock)));           \
>
> You should probably do a migrate_enable() here if the trylock fails.

Great catch. Thanks!
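
i.e. roughly (untested sketch of the fixed hunk; the exact form in the next
revision may differ):

#define __local_trylock_irqsave(lock, flags)			\
	({							\
		int __locked;					\
								\
		typecheck(unsigned long, flags);		\
		flags = 0;					\
		migrate_disable();				\
		__locked = spin_trylock(this_cpu_ptr((lock)));	\
		if (!__locked)					\
			migrate_enable();			\
		__locked;					\
	})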


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-12 15:35             ` Sebastian Sewior
  2024-12-12 15:48               ` Steven Rostedt
@ 2024-12-12 21:57               ` Alexei Starovoitov
  1 sibling, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-12 21:57 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Michal Hocko, Matthew Wilcox, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
	Vlastimil Babka, Steven Rostedt, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Thu, Dec 12, 2024 at 7:35 AM Sebastian Sewior <bigeasy@linutronix.de> wrote:
>
> On 2024-12-12 16:21:28 [+0100], Michal Hocko wrote:
> > On Thu 12-12-24 16:07:44, Sebastian Sewior wrote:
> > > But since I see in_nmi(). You can't trylock from NMI on RT. The trylock
> > > part is easy but unlock might need to acquire rt_mutex_base::wait_lock
> > > and worst case is to wake a waiter via wake_up_process().
> >
> > Ohh, I didn't realize that. So try_lock would only be safe on
> > raw_spin_lock right?
>
> If NMI is one of the possible calling contexts, yes.

Looks like in NMI both trylock and unlock are not safe.

pcp_spin_trylock() calls __rt_spin_trylock() on RT,
which can deadlock inside rt_mutex_slowtrylock().
This part has a potential workaround like:

@@ -102,8 +102,11 @@ static __always_inline int
__rt_spin_trylock(spinlock_t *lock)
 {
        int ret = 1;

-       if (unlikely(!rt_mutex_cmpxchg_acquire(&lock->lock, NULL, current)))
+       if (unlikely(!rt_mutex_cmpxchg_acquire(&lock->lock, NULL, current))) {
+               if (in_nmi())
+                       return 0;
                ret = rt_mutex_slowtrylock(&lock->lock);
+       }

but when there are waiters and cmpxchg in this part fails:
        if (unlikely(!rt_mutex_cmpxchg_release(&lock->lock, current, NULL)))
                rt_mutex_slowunlock(&lock->lock);

will trigger slowunlock, which is impossible to do from NMI.
We can punt it to irq_work with IRQ_WORK_HARD_IRQ to make sure
it runs as soon as the NMI finishes.
Since it's hard irq, debug_rt_mutex_unlock(lock) shouldn't complain.
'current' will stay the same?
Other ideas?


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave()
  2024-12-12  9:15         ` Vlastimil Babka
@ 2024-12-13 14:02           ` Vlastimil Babka
  0 siblings, 0 replies; 50+ messages in thread
From: Vlastimil Babka @ 2024-12-13 14:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
	Johannes Weiner, shakeel.butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team, Jann Horn

On 12/12/24 10:15, Vlastimil Babka wrote:
> On 12/12/24 03:49, Alexei Starovoitov wrote:
>> I like your
>> +struct local_tryirq_lock
>> approach, but let's put it in local_lock.h ?
> 
> Sure, that was a proof of concept so kept it local.
> 
>> and it probably needs local_inc_return() instead of READ/WRITE_ONCE.
>> With irq and nmis it's racy.
> 
> Hm guess you are right, thanks!
> 

Ah I remember now, the idea was that if an irq or nmi interrupts someone
else between checking that active is 0 and setting it to 1, it will finish
the critical section and unlock before the interrupted context can continue,
so the race is harmless.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-12 16:00                 ` Sebastian Sewior
@ 2024-12-13 17:44                   ` Steven Rostedt
  2024-12-13 18:44                     ` Alexei Starovoitov
  0 siblings, 1 reply; 50+ messages in thread
From: Steven Rostedt @ 2024-12-13 17:44 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Michal Hocko, Alexei Starovoitov, Matthew Wilcox, bpf,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Thu, 12 Dec 2024 17:00:09 +0100
Sebastian Sewior <bigeasy@linutronix.de> wrote:

> 
> The locking of the raw_spinlock_t has irqsave. Correct. But not because
> it expects to be called in interrupt-disabled context or an actual
> interrupt. It was _irq() but got changed because it is used in the early
> init code and would unconditionally enable interrupts which should
> remain disabled.
> 

Yep, I understand that. My point was that because it does it this way, it
should also work in hard interrupt context. But it doesn't!

Looking deeper, I do not think this is safe from interrupt context!

I'm looking at the rt_mutex_slowlock_block():


		if (waiter == rt_mutex_top_waiter(lock))
			owner = rt_mutex_owner(lock);
		else
			owner = NULL;
		raw_spin_unlock_irq(&lock->wait_lock);

		if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner))
			rt_mutex_schedule();


If we take an interrupt right after the raw_spin_unlock_irq() and then do a
trylock on an rt_mutex in the interrupt and it gets the lock, the task is
now both blocked on a lock and also holding a lock that's later in the
chain. I'm not sure the PI logic can handle such a case. That is, we have
in the chain of the task:

 lock A (blocked-waiting-for-lock) -> lock B (taken in interrupt)

If another task blocks on B, it will walk the lock chain in reverse order. It will
see that the owner is the task, but the task is blocked on A; the PI logic
assumes that for such a case, the lock order would be:

  B -> A

But this is not the case. I'm not sure what would happen here, but it is
definitely out of scope of the requirements of the PI logic and thus,
trylock must also not be used in hard interrupt context.

-- Steve


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-13 17:44                   ` Steven Rostedt
@ 2024-12-13 18:44                     ` Alexei Starovoitov
  2024-12-13 18:57                       ` Alexei Starovoitov
  2024-12-13 20:09                       ` Steven Rostedt
  0 siblings, 2 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-13 18:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sebastian Sewior, Michal Hocko, Matthew Wilcox, bpf,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Fri, Dec 13, 2024 at 9:43 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 12 Dec 2024 17:00:09 +0100
> Sebastian Sewior <bigeasy@linutronix.de> wrote:
>
> >
> > The lockig of the raw_spinlock_t has irqsave. Correct. But not because
> > it expects to be called in interrupt disabled context or an actual
> > interrupt. It was _irq() but got changed because it is used in the early
> > init code and would unconditionally enable interrupts which should
> > remain disabled.
> >
>
> Yep, I understand that. My point was that because it does it this way, it
> should also work in hard interrupt context. But it doesn't!
>
> Looking deeper, I do not think this is safe from interrupt context!
>
> I'm looking at the rt_mutex_slowlock_block():
>
>
>                 if (waiter == rt_mutex_top_waiter(lock))
>                         owner = rt_mutex_owner(lock);
>                 else
>                         owner = NULL;
>                 raw_spin_unlock_irq(&lock->wait_lock);
>
>                 if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner))
>                         rt_mutex_schedule();
>
>
> If we take an interrupt right after the raw_spin_unlock_irq() and then do a
> trylock on an rt_mutex in the interrupt and it gets the lock. The task is
> now both blocked on a lock and also holding a lock that's later in the
> chain. I'm not sure the PI logic can handle such a case. That is, we have
> in the chain of the task:
>
>  lock A (blocked-waiting-for-lock) -> lock B (taken in interrupt)
>
> If another task blocks on B, it will reverse order the lock logic. It will
> see the owner is the task, but the task is blocked on A, the PI logic
> assumes that for such a case, the lock order would be:
>
>   B -> A
>
> But this is not the case. I'm not sure what would happen here, but it is
> definitely out of scope of the requirements of the PI logic and thus,
> trylock must also not be used in hard interrupt context.

If hard-irq acquired rt_mutex B (spin_lock or spin_trylock doesn't
change the above analysis), the task won't schedule
and it has to release this rt_mutex B before reenabling irq.
The irqrestore without releasing the lock is a bug regardless.

What's the concern then? That PI may see an odd order of locks for this task ?
but it cannot do anything about it anyway, since the task won't schedule.
And before the irq handler is over, B will be released and everything
will look normal again.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-13 18:44                     ` Alexei Starovoitov
@ 2024-12-13 18:57                       ` Alexei Starovoitov
  2024-12-13 20:09                       ` Steven Rostedt
  1 sibling, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-13 18:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sebastian Sewior, Michal Hocko, Matthew Wilcox, bpf,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Fri, Dec 13, 2024 at 10:44 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> If hard-irq acquired rt_mutex B (spin_lock or spin_trylock doesn't
> change the above analysis),

attempting to spin_lock from irq is obviously not safe. scratch that part.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-13 18:44                     ` Alexei Starovoitov
  2024-12-13 18:57                       ` Alexei Starovoitov
@ 2024-12-13 20:09                       ` Steven Rostedt
  2024-12-13 21:00                         ` Steven Rostedt
  1 sibling, 1 reply; 50+ messages in thread
From: Steven Rostedt @ 2024-12-13 20:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Sebastian Sewior, Michal Hocko, Matthew Wilcox, bpf,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Fri, 13 Dec 2024 10:44:26 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > But this is not the case. I'm not sure what would happen here, but it is
> > definitely out of scope of the requirements of the PI logic and thus,
> > trylock must also not be used in hard interrupt context.  
> 
> If hard-irq acquired rt_mutex B (spin_lock or spin_trylock doesn't
> change the above analysis), the task won't schedule
> and it has to release this rt_mutex B before reenabling irq.
> The irqrestore without releasing the lock is a bug regardless.
> 
> What's the concern then? That PI may see an odd order of locks for this task ?
> but it cannot do anything about it anyway, since the task won't schedule.
> And before irq handler is over the B will be released and everything
> will look normal again.

The problem is the chain walk. It could also cause unwanted side effects in RT.

Suppose low priority task 1 has lock A and is running on another CPU, and low
priority task 2 blocks on lock A and then is interrupted right before going
to sleep as being "blocked on", and takes lock B in the interrupt context.
We then have high priority task 3 on another CPU block on B. Task 3 will then
see that the owner of B is blocked (even though it is not blocked on B); it
will boost that task's priority as well as the owner of lock A. The A owner
will get boosted even though it is not the task that is blocking the high
priority task.

My point is that RT is all about deterministic behavior. It would require
a pretty substantial audit of the PI logic to make sure that this doesn't
cause any unexpected results.

My point is, the PI logic was not designed for taking a lock after being
blocked on another lock. It may work, but we had better prove and show all
side effects that can happen in these cases.

-- Steve


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-13 20:09                       ` Steven Rostedt
@ 2024-12-13 21:00                         ` Steven Rostedt
  2024-12-13 22:02                           ` Alexei Starovoitov
  0 siblings, 1 reply; 50+ messages in thread
From: Steven Rostedt @ 2024-12-13 21:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Sebastian Sewior, Michal Hocko, Matthew Wilcox, bpf,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Fri, 13 Dec 2024 15:09:50 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> > What's the concern then? That PI may see an odd order of locks for this task ?
> > but it cannot do anything about it anyway, since the task won't schedule.
> > And before irq handler is over the B will be released and everything
> > will look normal again.  
> 
> The problem is the chain walk. It could also cause unwanted side effects in RT.
> 
> If low priority task 1 has lock A and is running on another CPU and low
> priority task 2 blocks on lock A and then is interrupted right before going
> to sleep as being "blocked on", and takes lock B in the interrupt context.
> We then have high priority task 3 on another CPU block on B which will then
> see that the owner of B is blocked (even though it is not blocked for B), it
> will boost its priority as well as the owner of the lock (A). The A owner
> will get boosted where it is not the task that is blocking the high
> priority task.
> 
> My point is that RT is all about deterministic behavior. It would require
> a pretty substantial audit to the PI logic to make sure that this doesn't
> cause any unexpected results.
> 
> My point is, the PI logic was not designed for taking a lock after being
> blocked on another lock. It may work, but we had better prove and show all
> side effects that can happen in these cases.

When B is released, task 2 will be unboosted, but task 1 will not. That's
because a task is only unboosted when it releases a lock (or a lock times
out, and then the waiter will unboost the chain, but that's not this case).

Task 1 will unboost when it finally releases lock A.

Another issue is that interrupts are not stopped by spin_lock_irq*(), since in
RT spin_lock_irq*() is the same as spin_lock(). As spin_locks() are assumed
to only be taken in threaded irq context, there's no reason to actually disable
interrupts when taking one of these spin_lock_irq*().

That means we could have task 1 trying to take lock B too. If we have a
lock order of:

  A -> B

Where B is the trylock in the irq context, we have:


  CPU 1			CPU 2			CPU 3
  -----			-----			-----
 task 1 takes A
			task 2 blocks on A, gets interrupted, trylock B and succeeds:

 task 1 takes B (blocks)

						High priority task 3 blocks on B

Now we have this in the PI chain:

  Task 3 boosts task 2 but task 2 is blocked on A
  Task 3 then boosts owner of A which is task 1 which is blocked on lock B
  Task 3 then boosts owner of lock B which is task 2
  [ wash, rinse, repeat!!! ]

There is a safety valve in the code that will prevent an infinite loop, and
it will trigger a printk message or it may stop because the priorities are
equal, but still, this is not desirable and may even have other side
effects. If deadlock detection is enabled, this will trigger it.

Again, allowing spin_locks being taken in hard irq context in RT, even with
trylock is going to open a nasty can of worms that will make this less
deterministic and determinism is the entire point of RT. If we allow one
user to have spin_lock_trylock() in hard irq context, we have to allow
anyone to do it.

-- Steve


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
  2024-12-13 21:00                         ` Steven Rostedt
@ 2024-12-13 22:02                           ` Alexei Starovoitov
  0 siblings, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2024-12-13 22:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sebastian Sewior, Michal Hocko, Matthew Wilcox, bpf,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Hou Tao, Johannes Weiner,
	shakeel.butt, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team

On Fri, Dec 13, 2024 at 1:00 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Again, allowing spin_locks being taken in hard irq context in RT, even with
> trylock is going to open a nasty can of worms that will make this less
> deterministic and determinism is the entire point of RT. If we allow one
> user to have spin_lock_trylock() in hard irq context, we have to allow
> anyone to do it.

The boosting mess is a concern indeed, but it looks like it's
happening already.

See netpoll_send_udp() -> netpoll_send_skb() that does
local_irq_save() and then __netpoll_send_skb() ->
HARD_TX_TRYLOCK() -> __netif_tx_trylock() -> spin_trylock()

netconsole is the key mechanism for some hyperscalers.
It's not a niche feature.

Sounds like it's sort of broken, or rather not working correctly
by RT standards, but replacing _xmit_lock with raw_spin_lock
is probably not an option.
So either netpoll needs to be redesigned somehow (since it has to go
printk->netcons->udp->skb->tx_lock->xmit on the wire, otherwise dmesg
messages will be lost), or PI needs to be made aware of this.

local_irq_save() is not an issue per se;
it's printk->netcons that can be called from any context,
including hard irq.

For the purpose of this patch set I think I have to do:
if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_hardirq() || in_nmi()))
  goto out;

Looks like it's safe to call spin_trylock() on RT after
local_irq_save() and/or when irqs_disabled().
It's the hardirq/NMI context that is a problem.
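
Roughly like this (untested sketch; the try_alloc_pages_noprof() signature
and the placement of the check are assumptions for illustration):

struct page *try_alloc_pages_noprof(int nid, unsigned int order)
{
	/*
	 * On PREEMPT_RT a spinlock_t trylock/unlock is not safe from
	 * hard irq or NMI context, so bail out early there and let the
	 * caller fall back to whatever reserves it has.
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_hardirq() || in_nmi()))
		return NULL;

	/* ... the opportunistic (trylock-only) allocation path goes here ... */
	return NULL;
}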


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2024-12-13 22:02 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-10  2:39 [PATCH bpf-next v2 0/6] bpf, mm: Introduce __GFP_TRYLOCK Alexei Starovoitov
2024-12-10  2:39 ` [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation Alexei Starovoitov
2024-12-10  5:31   ` Matthew Wilcox
2024-12-10  9:05     ` Michal Hocko
2024-12-10 20:25       ` Shakeel Butt
2024-12-11 10:08         ` Michal Hocko
2024-12-10 22:06       ` Alexei Starovoitov
2024-12-11 10:19         ` Michal Hocko
2024-12-12 15:07         ` Sebastian Sewior
2024-12-12 15:21           ` Michal Hocko
2024-12-12 15:35             ` Sebastian Sewior
2024-12-12 15:48               ` Steven Rostedt
2024-12-12 16:00                 ` Sebastian Sewior
2024-12-13 17:44                   ` Steven Rostedt
2024-12-13 18:44                     ` Alexei Starovoitov
2024-12-13 18:57                       ` Alexei Starovoitov
2024-12-13 20:09                       ` Steven Rostedt
2024-12-13 21:00                         ` Steven Rostedt
2024-12-13 22:02                           ` Alexei Starovoitov
2024-12-12 21:57               ` Alexei Starovoitov
2024-12-10 21:42     ` Alexei Starovoitov
2024-12-10  9:01   ` Sebastian Andrzej Siewior
2024-12-10 21:53     ` Alexei Starovoitov
2024-12-11  8:38       ` Vlastimil Babka
2024-12-12  2:14         ` Alexei Starovoitov
2024-12-12  8:54           ` Vlastimil Babka
2024-12-10 18:39   ` Vlastimil Babka
2024-12-10 22:42     ` Alexei Starovoitov
2024-12-11  8:48       ` Vlastimil Babka
2024-12-10  2:39 ` [PATCH bpf-next v2 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
2024-12-10  8:35   ` Sebastian Andrzej Siewior
2024-12-10 22:49     ` Alexei Starovoitov
2024-12-12 14:44       ` Sebastian Andrzej Siewior
2024-12-12 19:57         ` Alexei Starovoitov
2024-12-11 10:11   ` Vlastimil Babka
2024-12-12  1:43     ` Alexei Starovoitov
2024-12-10  2:39 ` [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
2024-12-11 10:53   ` Vlastimil Babka
2024-12-11 11:55     ` Vlastimil Babka
2024-12-12  2:49       ` Alexei Starovoitov
2024-12-12  9:15         ` Vlastimil Babka
2024-12-13 14:02           ` Vlastimil Babka
2024-12-12 15:15   ` Sebastian Andrzej Siewior
2024-12-12 19:59     ` Alexei Starovoitov
2024-12-10  2:39 ` [PATCH bpf-next v2 4/6] memcg: Add __GFP_TRYLOCK support Alexei Starovoitov
2024-12-11 23:47   ` kernel test robot
2024-12-10  2:39 ` [PATCH bpf-next v2 5/6] mm, bpf: Use __GFP_ACCOUNT in try_alloc_pages() Alexei Starovoitov
2024-12-11 12:05   ` Vlastimil Babka
2024-12-12  2:54     ` Alexei Starovoitov
2024-12-10  2:39 ` [PATCH bpf-next v2 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox