* [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages()
@ 2025-01-24  3:56 Alexei Starovoitov
  2025-01-24  3:56 ` [PATCH bpf-next v6 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2025-01-24  3:56 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Hi All,

The main motivation is to make the page allocator and slab allocator
reentrant and remove bpf_mem_alloc.

v5->v6:
- Addressed comments from Sebastian, Vlastimil
- New approach for local_lock_t in patch 3. Instead of unconditionally
  increasing the local_lock_t size to 4 bytes, introduce local_trylock_t
  and use _Generic() tricks to manipulate the active field.
- Addressed stackdepot reentrance issues: the alloc part in patch 1 and
  the free part in patch 2.
- Inlined mem_cgroup_cancel_charge() in patch 4 since this helper
  is being removed.
- Added Acks.
- Dropped failslab, kfence, kmemleak patch.
- Improved bpf_map_alloc_pages() in patch 6 a bit to demo intended usage.
  It will be refactored further.
- Considered using __GFP_COMP in try_alloc_pages to simplify
  free_pages_nolock a bit, but then decided to make it work
  for all types of pages, since free_pages_nolock() is used by
  stackdepot and currently it's using non-compound order 2.
  I felt it's best to leave it as-is and make free_pages_nolock()
  support all pages.

v5:
https://lore.kernel.org/all/20250115021746.34691-1-alexei.starovoitov@gmail.com/

v4->v5:
- Fixed patch 1 and 4 commit logs and comments per Michal suggestions.
  Added Acks.
- Added patch 6 to make failslab, kfence, kmemleak compliant
  with trylock mode. It's a prerequisite for reentrant slab patches.

v4:
https://lore.kernel.org/bpf/20250114021922.92609-1-alexei.starovoitov@gmail.com/

v3->v4:
Addressed feedback from Michal and Shakeel:
- GFP_TRYLOCK flag is gone. gfpflags_allow_spinning() is used instead.
- Improved comments and commit logs.

v3:
https://lore.kernel.org/bpf/20241218030720.1602449-1-alexei.starovoitov@gmail.com/

v2->v3:
To address the issues spotted by Sebastian, Vlastimil, Steven:
- Made GFP_TRYLOCK internal to mm/internal.h
  try_alloc_pages() and free_pages_nolock() are the only interfaces.
- Since spin_trylock() is not safe in RT from hard IRQ and NMI,
  disable such usage in lock_trylock and in try_alloc_pages().
  In such cases free_pages_nolock() falls back to llist right away.
- Process trylock_free_pages llist when preemptible.
- Check for things like unaccepted memory and order <= 3 early.
- Don't call into __alloc_pages_slowpath() at all.
- Inspired by Vlastimil's struct local_tryirq_lock, adopted the idea in
  local_lock_t. The extra 4 bytes in !RT in local_lock_t shouldn't
  affect any of the current local_lock_t users. This is patch 3.
- Tested with bpf selftests in RT and !RT and realized how much
  more work is necessary on bpf side to play nice with RT.
  The urgency of this work got higher. The alternative is to
  convert bpf bits left and right to bpf_mem_alloc.

v2:
https://lore.kernel.org/bpf/20241210023936.46871-1-alexei.starovoitov@gmail.com/

v1->v2:
- fixed buggy try_alloc_pages_noprof() in PREEMPT_RT. Thanks Peter.
- optimized all paths by doing spin_trylock_irqsave() first
  and only then checking for gfp_flags & __GFP_TRYLOCK,
  then spin_lock_irqsave() if it's a regular mode.
  So the new gfp flag does not add performance overhead.
- patches 2-5 are new. They introduce lockless and/or trylock free_pages_nolock()
  and memcg support. So it's in usable shape for bpf in patch 6.

v1:
https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/

Alexei Starovoitov (6):
  mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  mm, bpf: Introduce free_pages_nolock()
  locking/local_lock: Introduce local_trylock_t and
    local_trylock_irqsave()
  memcg: Use trylock to access memcg stock_lock.
  mm, bpf: Use memcg in try_alloc_pages().
  bpf: Use try_alloc_pages() to allocate pages for bpf needs.

 include/linux/bpf.h                 |   2 +-
 include/linux/gfp.h                 |  23 ++++
 include/linux/local_lock.h          |   9 ++
 include/linux/local_lock_internal.h |  79 ++++++++++-
 include/linux/mm_types.h            |   4 +
 include/linux/mmzone.h              |   3 +
 kernel/bpf/arena.c                  |   5 +-
 kernel/bpf/syscall.c                |  23 +++-
 lib/stackdepot.c                    |  10 +-
 mm/internal.h                       |   1 +
 mm/memcontrol.c                     |  30 ++++-
 mm/page_alloc.c                     | 200 ++++++++++++++++++++++++++--
 mm/page_owner.c                     |   8 +-
 13 files changed, 365 insertions(+), 32 deletions(-)

-- 
2.43.5




* [PATCH bpf-next v6 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-24  3:56 [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
@ 2025-01-24  3:56 ` Alexei Starovoitov
  2025-01-28 18:09   ` Shakeel Butt
  2025-01-24  3:56 ` [PATCH bpf-next v6 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2025-01-24  3:56 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Tracing BPF programs execute from tracepoints and kprobes where the
running context is unknown, but they need to request additional
memory. The prior workarounds used pre-allocated memory and
BPF-specific freelists to satisfy such allocation requests.
Instead, introduce a gfpflags_allow_spinning() condition that signals
to the allocator that the running context is unknown.
Then rely on the percpu free list of pages to allocate a page.
try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
rmqueue_pcplist() will spin_trylock to grab the page from the percpu
free list. If that fails (due to re-entrancy or the list being empty),
rmqueue_bulk()/rmqueue_buddy() will attempt to
spin_trylock zone->lock and grab the page from there.
spin_trylock() is not safe in PREEMPT_RT when in NMI or in hard IRQ.
Bail out early in such cases.
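
For illustration only, a hypothetical caller (the helper name below is
made up for this example): failure is a normal outcome and must be
handled, since only trylocks are taken on the way to the page.

static struct page *fetch_page_any_context(int nid)
{
	struct page *page;

	/* Never spins, never warns; safe from tracepoints, IRQ and NMI. */
	page = try_alloc_pages(nid, 0);
	if (!page)
		return NULL;	/* expected and frequent failure path */

	/* The returned page is zeroed (__GFP_ZERO is implied by the API). */
	return page;
}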

The support for gfpflags_allow_spinning() mode for free_page and memcg
comes in the next patches.

This is a first step towards supporting BPF requirements in SLUB
and getting rid of bpf_mem_alloc.
That goal was discussed at LSFMM: https://lwn.net/Articles/974138/

Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h |  22 ++++++++++
 lib/stackdepot.c    |   5 ++-
 mm/internal.h       |   1 +
 mm/page_alloc.c     | 104 ++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 127 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b0fe9f62d15b..82bfb65b8d15 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,6 +39,25 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
 }
 
+static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
+{
+	/*
+	 * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
+	 * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
+	 * All GFP_* flags including GFP_NOWAIT use one or both flags.
+	 * try_alloc_pages() is the only API that doesn't specify either flag.
+	 *
+	 * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
+	 * those are guaranteed to never block on a sleeping lock.
+	 * Here we are enforcing that the allocation doesn't ever spin
+	 * on any locks (i.e. only trylocks). There is no high level
+	 * GFP_$FOO flag for this use in try_alloc_pages() as the
+	 * regular page allocator doesn't fully support this
+	 * allocation mode.
+	 */
+	return !(gfp_flags & __GFP_RECLAIM);
+}
+
 #ifdef CONFIG_HIGHMEM
 #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
 #else
@@ -347,6 +366,9 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
 }
 #define alloc_page_vma(...)			alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
 
+struct page *try_alloc_pages_noprof(int nid, unsigned int order);
+#define try_alloc_pages(...)			alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
+
 extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
 #define __get_free_pages(...)			alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
 
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 245d5b416699..377194969e61 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -591,7 +591,8 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 	depot_stack_handle_t handle = 0;
 	struct page *page = NULL;
 	void *prealloc = NULL;
-	bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;
+	bool allow_spin = gfpflags_allow_spinning(alloc_flags);
+	bool can_alloc = (depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC) && allow_spin;
 	unsigned long flags;
 	u32 hash;
 
@@ -630,7 +631,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 			prealloc = page_address(page);
 	}
 
-	if (in_nmi()) {
+	if (in_nmi() || !allow_spin) {
 		/* We can never allocate in NMI context. */
 		WARN_ON_ONCE(can_alloc);
 		/* Best effort; bail if we fail to take the lock. */
diff --git a/mm/internal.h b/mm/internal.h
index 9826f7dce607..6c3c664aa346 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1174,6 +1174,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_NOFRAGMENT	  0x0
 #endif
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
+#define ALLOC_TRYLOCK		0x400 /* Only use spin_trylock in allocation path */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
 
 /* Flags that allow allocations below the min watermark. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 01eab25edf89..a82bc67abbdb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2306,7 +2306,11 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+			return 0;
+		spin_lock_irqsave(&zone->lock, flags);
+	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 								alloc_flags);
@@ -2906,7 +2910,11 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 
 	do {
 		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
+		if (!spin_trylock_irqsave(&zone->lock, flags)) {
+			if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+				return NULL;
+			spin_lock_irqsave(&zone->lock, flags);
+		}
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
@@ -4511,7 +4519,12 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 
 	might_alloc(gfp_mask);
 
-	if (should_fail_alloc_page(gfp_mask, order))
+	/*
+	 * Don't invoke should_fail logic, since it may call
+	 * get_random_u32() and printk() which need to spin_lock.
+	 */
+	if (!(*alloc_flags & ALLOC_TRYLOCK) &&
+	    should_fail_alloc_page(gfp_mask, order))
 		return false;
 
 	*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
@@ -7028,3 +7041,88 @@ static bool __free_unaccepted(struct page *page)
 }
 
 #endif /* CONFIG_UNACCEPTED_MEMORY */
+
+/**
+ * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
+ * @nid: node to allocate from
+ * @order: allocation order size
+ *
+ * Allocates pages of a given order from the given node. This is safe to
+ * call from any context (from atomic, NMI, and also reentrant
+ * allocator -> tracepoint -> try_alloc_pages_noprof).
+ * Allocation is best effort and expected to fail easily, so nobody should
+ * rely on its success. Failures are not reported via warn_alloc().
+ * See the always-fail conditions below.
+ *
+ * Return: allocated page or NULL on failure.
+ */
+struct page *try_alloc_pages_noprof(int nid, unsigned int order)
+{
+	/*
+	 * Do not specify __GFP_DIRECT_RECLAIM, since direct reclaim is not allowed.
+	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
+	 * is not safe in arbitrary context.
+	 *
+	 * These two are the conditions for gfpflags_allow_spinning() being true.
+	 *
+	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+	 * to warn. Also warn would trigger printk() which is unsafe from
+	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
+	 * since the running context is unknown.
+	 *
+	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
+	 * is safe in any context. Also zeroing the page is mandatory for
+	 * BPF use cases.
+	 *
+	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
+	 * specify it here to highlight that try_alloc_pages()
+	 * doesn't want to deplete reserves.
+	 */
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
+	unsigned int alloc_flags = ALLOC_TRYLOCK;
+	struct alloc_context ac = { };
+	struct page *page;
+
+	/*
+	 * In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
+	 * unsafe in NMI. If spin_trylock() is called from hard IRQ the current
+	 * task may be waiting for one rt_spin_lock, but rt_spin_trylock() will
+	 * mark the task as the owner of another rt_spin_lock which will
+	 * confuse PI logic, so return immediately if called from hard IRQ or
+	 * NMI.
+	 *
+	 * Note, irqs_disabled() case is ok. This function can be called
+	 * from raw_spin_lock_irqsave region.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+		return NULL;
+	if (!pcp_allowed_order(order))
+		return NULL;
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	/* Bailout, since try_to_accept_memory_one() needs to take a lock */
+	if (has_unaccepted_memory())
+		return NULL;
+#endif
+	/* Bailout, since _deferred_grow_zone() needs to take a lock */
+	if (deferred_pages_enabled())
+		return NULL;
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_node_id();
+
+	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
+			    &alloc_gfp, &alloc_flags);
+
+	/*
+	 * Best effort allocation from percpu free list.
+	 * If it's empty attempt to spin_trylock zone->lock.
+	 */
+	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
+
+	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
+
+	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
+	kmsan_alloc_page(page, order, alloc_gfp);
+	return page;
+}
-- 
2.43.5




* [PATCH bpf-next v6 2/6] mm, bpf: Introduce free_pages_nolock()
  2025-01-24  3:56 [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
  2025-01-24  3:56 ` [PATCH bpf-next v6 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
@ 2025-01-24  3:56 ` Alexei Starovoitov
  2025-01-28 18:18   ` Shakeel Butt
  2025-01-24  3:56 ` [PATCH bpf-next v6 3/6] locking/local_lock: Introduce local_trylock_t and local_trylock_irqsave() Alexei Starovoitov
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2025-01-24  3:56 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce free_pages_nolock() that can free pages without taking locks.
It relies on trylock and can be called from any context.
Since spin_trylock() cannot be used in PREEMPT_RT from hard IRQ or NMI,
it uses a lockless linked list to stash the pages, which will be freed
by a subsequent free_pages() from a good context.

Do not use the llist unconditionally. BPF maps continuously
allocate/free, so we cannot unconditionally delay the freeing to the
llist. When the memory becomes free, make it available to the
kernel and BPF users right away if possible, and fall back to the
llist as the last resort.
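
For illustration only (a hypothetical helper, not part of this patch),
the expected pairing with try_alloc_pages() from the previous patch:

static void drop_page_any_context(struct page *page)
{
	if (!page)
		return;
	/*
	 * Safe from IRQ/NMI and while holding a raw_spin_lock. If the
	 * locks cannot be trylocked, the page is stashed on the per-zone
	 * llist and freed later from a good context.
	 */
	free_pages_nolock(page, 0);
}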

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h      |  1 +
 include/linux/mm_types.h |  4 ++
 include/linux/mmzone.h   |  3 ++
 lib/stackdepot.c         |  5 ++-
 mm/page_alloc.c          | 90 +++++++++++++++++++++++++++++++++++-----
 mm/page_owner.c          |  8 +++-
 6 files changed, 98 insertions(+), 13 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 82bfb65b8d15..a8233d09acfa 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -391,6 +391,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
 	__get_free_pages((gfp_mask) | GFP_DMA, (order))
 
 extern void __free_pages(struct page *page, unsigned int order);
+extern void free_pages_nolock(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
 
 #define __free_page(page) __free_pages((page), 0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 825c04b56403..583bf59e2627 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,10 @@ struct page {
 				/* Or, free page */
 				struct list_head buddy_list;
 				struct list_head pcp_list;
+				struct {
+					struct llist_node pcp_llist;
+					unsigned int order;
+				};
 			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b36124145a16..1a854e0a9e3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -953,6 +953,9 @@ struct zone {
 	/* Primarily protects free_area */
 	spinlock_t		lock;
 
+	/* Pages to be freed when next trylock succeeds */
+	struct llist_head	trylock_free_pages;
+
 	/* Write-intensive fields used by compaction and vmstats. */
 	CACHELINE_PADDING(_pad2_);
 
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 377194969e61..73d7b50924ef 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -672,7 +672,10 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 exit:
 	if (prealloc) {
 		/* Stack depot didn't use this memory, free it. */
-		free_pages((unsigned long)prealloc, DEPOT_POOL_ORDER);
+		if (!allow_spin)
+			free_pages_nolock(virt_to_page(prealloc), DEPOT_POOL_ORDER);
+		else
+			free_pages((unsigned long)prealloc, DEPOT_POOL_ORDER);
 	}
 	if (found)
 		handle = found->handle.handle;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a82bc67abbdb..fa750c46e0fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -88,6 +88,9 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_TO_TAIL		((__force fpi_t)BIT(1))
 
+/* Free the page without taking locks. Rely on trylock only. */
+#define FPI_TRYLOCK		((__force fpi_t)BIT(2))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1249,13 +1252,44 @@ static void split_large_buddy(struct zone *zone, struct page *page,
 	} while (1);
 }
 
+static void add_page_to_zone_llist(struct zone *zone, struct page *page,
+				   unsigned int order)
+{
+	/* Remember the order */
+	page->order = order;
+	/* Add the page to the free list */
+	llist_add(&page->pcp_llist, &zone->trylock_free_pages);
+}
+
 static void free_one_page(struct zone *zone, struct page *page,
 			  unsigned long pfn, unsigned int order,
 			  fpi_t fpi_flags)
 {
+	struct llist_head *llhead;
 	unsigned long flags;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+			add_page_to_zone_llist(zone, page, order);
+			return;
+		}
+		spin_lock_irqsave(&zone->lock, flags);
+	}
+
+	/* The lock succeeded. Process deferred pages. */
+	llhead = &zone->trylock_free_pages;
+	if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {
+		struct llist_node *llnode;
+		struct page *p, *tmp;
+
+		llnode = llist_del_all(llhead);
+		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
+			unsigned int p_order = p->order;
+
+			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
+			__count_vm_events(PGFREE, 1 << p_order);
+		}
+	}
 	split_large_buddy(zone, page, pfn, order, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
@@ -2598,7 +2632,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 				   struct page *page, int migratetype,
-				   unsigned int order)
+				   unsigned int order, fpi_t fpi_flags)
 {
 	int high, batch;
 	int pindex;
@@ -2633,6 +2667,14 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	}
 	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
 		pcp->free_count += (1 << order);
+
+	if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+		/*
+		 * Do not attempt to take a zone lock. Let pcp->count get
+		 * over high mark temporarily.
+		 */
+		return;
+	}
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2647,7 +2689,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 /*
  * Free a pcp page
  */
-void free_unref_page(struct page *page, unsigned int order)
+static void __free_unref_page(struct page *page, unsigned int order,
+			      fpi_t fpi_flags)
 {
 	unsigned long __maybe_unused UP_flags;
 	struct per_cpu_pages *pcp;
@@ -2656,7 +2699,7 @@ void free_unref_page(struct page *page, unsigned int order)
 	int migratetype;
 
 	if (!pcp_allowed_order(order)) {
-		__free_pages_ok(page, order, FPI_NONE);
+		__free_pages_ok(page, order, fpi_flags);
 		return;
 	}
 
@@ -2673,24 +2716,34 @@ void free_unref_page(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
+			free_one_page(page_zone(page), page, pfn, order, fpi_flags);
 			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
 	}
 
 	zone = page_zone(page);
+	if (unlikely((fpi_flags & FPI_TRYLOCK) && IS_ENABLED(CONFIG_PREEMPT_RT)
+		     && (in_nmi() || in_hardirq()))) {
+		add_page_to_zone_llist(zone, page, order);
+		return;
+	}
 	pcp_trylock_prepare(UP_flags);
 	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
-		free_unref_page_commit(zone, pcp, page, migratetype, order);
+		free_unref_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
 		pcp_spin_unlock(pcp);
 	} else {
-		free_one_page(zone, page, pfn, order, FPI_NONE);
+		free_one_page(zone, page, pfn, order, fpi_flags);
 	}
 	pcp_trylock_finish(UP_flags);
 }
 
+void free_unref_page(struct page *page, unsigned int order)
+{
+	__free_unref_page(page, order, FPI_NONE);
+}
+
 /*
  * Free a batch of folios
  */
@@ -2779,7 +2832,7 @@ void free_unref_folios(struct folio_batch *folios)
 
 		trace_mm_page_free_batched(&folio->page);
 		free_unref_page_commit(zone, pcp, &folio->page, migratetype,
-				order);
+				       order, FPI_NONE);
 	}
 
 	if (pcp) {
@@ -4843,22 +4896,37 @@ EXPORT_SYMBOL(get_zeroed_page_noprof);
  * Context: May be called in interrupt context or while holding a normal
  * spinlock, but not in NMI context or while holding a raw spinlock.
  */
-void __free_pages(struct page *page, unsigned int order)
+static void ___free_pages(struct page *page, unsigned int order,
+			  fpi_t fpi_flags)
 {
 	/* get PageHead before we drop reference */
 	int head = PageHead(page);
 	struct alloc_tag *tag = pgalloc_tag_get(page);
 
 	if (put_page_testzero(page))
-		free_unref_page(page, order);
+		__free_unref_page(page, order, fpi_flags);
 	else if (!head) {
 		pgalloc_tag_sub_pages(tag, (1 << order) - 1);
 		while (order-- > 0)
-			free_unref_page(page + (1 << order), order);
+			__free_unref_page(page + (1 << order), order,
+					  fpi_flags);
 	}
 }
+void __free_pages(struct page *page, unsigned int order)
+{
+	___free_pages(page, order, FPI_NONE);
+}
 EXPORT_SYMBOL(__free_pages);
 
+/*
+ * Can be called while holding raw_spin_lock or from IRQ and NMI for any
+ * page type (not only those that came from try_alloc_pages)
+ */
+void free_pages_nolock(struct page *page, unsigned int order)
+{
+	___free_pages(page, order, FPI_TRYLOCK);
+}
+
 void free_pages(unsigned long addr, unsigned int order)
 {
 	if (addr != 0) {
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 2d6360eaccbb..90e31d0e3ed7 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -294,7 +294,13 @@ void __reset_page_owner(struct page *page, unsigned short order)
 	page_owner = get_page_owner(page_ext);
 	alloc_handle = page_owner->handle;
 
-	handle = save_stack(GFP_NOWAIT | __GFP_NOWARN);
+	/*
+	 * Do not specify GFP_NOWAIT to make gfpflags_allow_spinning() == false
+	 * to prevent issues in stack_depot_save().
+	 * This is similar to try_alloc_pages() gfp flags, but only used
+	 * to signal stack_depot to avoid spin_locks.
+	 */
+	handle = save_stack(__GFP_NOWARN);
 	__update_page_owner_free_handle(page_ext, handle, order, current->pid,
 					current->tgid, free_ts_nsec);
 	page_ext_put(page_ext);
-- 
2.43.5




* [PATCH bpf-next v6 3/6] locking/local_lock: Introduce local_trylock_t and local_trylock_irqsave()
  2025-01-24  3:56 [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
  2025-01-24  3:56 ` [PATCH bpf-next v6 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
  2025-01-24  3:56 ` [PATCH bpf-next v6 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
@ 2025-01-24  3:56 ` Alexei Starovoitov
  2025-01-28 17:21   ` Sebastian Andrzej Siewior
  2025-01-24  3:56 ` [PATCH bpf-next v6 4/6] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2025-01-24  3:56 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

In !PREEMPT_RT local_lock_irqsave() disables interrupts to protect the
critical section, but it doesn't prevent NMI, so fully reentrant
code cannot use local_lock_irqsave() for exclusive access.

Introduce local_trylock_t and local_trylock_irqsave(), which
disables interrupts and sets active=1, so that a local_trylock_irqsave()
from NMI on the same lock will return false.

In PREEMPT_RT local_lock_irqsave() maps to a preemptible spin_lock().
Map local_trylock_irqsave() to a preemptible spin_trylock().
When in hard IRQ or NMI, return false right away, since
spin_trylock() is not safe there due to PI issues.

Note there is no need to use local_inc for the active variable,
since it's a percpu variable with strict nesting scopes.

Usage:

local_lock_t lock;                     // sizeof(lock) == 0 in !RT
local_lock_irqsave(&lock, ...);        // irqsave as before
if (local_trylock_irqsave(&lock, ...)) // compilation error

local_trylock_t lock;                  // sizeof(lock) == 4 in !RT
local_lock_irqsave(&lock, ...);        // irqsave and active = 1
if (local_trylock_irqsave(&lock, ...)) // if (!active) irqsave
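
For illustration, the caller pattern this enables (a sketch with made-up
names foo_lock/foo_update; patch 4 uses the same shape for
memcg_stock.stock_lock):

static DEFINE_PER_CPU(local_trylock_t, foo_lock) = INIT_LOCAL_LOCK(foo_lock);

static bool foo_update(gfp_t gfp_mask)
{
	unsigned long flags;

	if (!local_trylock_irqsave(&foo_lock, flags)) {
		/* Reentered from NMI/IRQ, or trylock disallowed in RT. */
		if (!gfpflags_allow_spinning(gfp_mask))
			return false;	/* fail early instead of spinning */
		/* Regular context: spinning with IRQs disabled is fine. */
		local_lock_irqsave(&foo_lock, flags);
	}
	/* ... critical section ... */
	local_unlock_irqrestore(&foo_lock, flags);
	return true;
}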

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/local_lock.h          |  9 ++++
 include/linux/local_lock_internal.h | 79 ++++++++++++++++++++++++++++-
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 091dc0b6bdfb..f4bc3e9b2b20 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -30,6 +30,15 @@
 #define local_lock_irqsave(lock, flags)				\
 	__local_lock_irqsave(lock, flags)
 
+/**
+ * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
+ *			   interrupts. Fails in PREEMPT_RT when in hard IRQ or NMI.
+ * @lock:	The lock variable
+ * @flags:	Storage for interrupt flags
+ */
+#define local_trylock_irqsave(lock, flags)			\
+	__local_trylock_irqsave(lock, flags)
+
 /**
  * local_unlock - Release a per CPU local lock
  * @lock:	The lock variable
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8dd71fbbb6d2..14757b7aea99 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -15,6 +15,19 @@ typedef struct {
 #endif
 } local_lock_t;
 
+typedef struct {
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+	struct task_struct	*owner;
+#endif
+	/*
+	 * Same layout as local_lock_t with 'active' field
+	 * at the end, since (local_trylock_t *) will be
+	 * casted to (local_lock_t *).
+	 */
+	int active;
+} local_trylock_t;
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define LOCAL_LOCK_DEBUG_INIT(lockname)		\
 	.dep_map = {					\
@@ -31,6 +44,13 @@ static inline void local_lock_acquire(local_lock_t *l)
 	l->owner = current;
 }
 
+static inline void local_trylock_acquire(local_lock_t *l)
+{
+	lock_map_acquire_try(&l->dep_map);
+	DEBUG_LOCKS_WARN_ON(l->owner);
+	l->owner = current;
+}
+
 static inline void local_lock_release(local_lock_t *l)
 {
 	DEBUG_LOCKS_WARN_ON(l->owner != current);
@@ -45,6 +65,7 @@ static inline void local_lock_debug_init(local_lock_t *l)
 #else /* CONFIG_DEBUG_LOCK_ALLOC */
 # define LOCAL_LOCK_DEBUG_INIT(lockname)
 static inline void local_lock_acquire(local_lock_t *l) { }
+static inline void local_trylock_acquire(local_lock_t *l) { }
 static inline void local_lock_release(local_lock_t *l) { }
 static inline void local_lock_debug_init(local_lock_t *l) { }
 #endif /* !CONFIG_DEBUG_LOCK_ALLOC */
@@ -87,10 +108,37 @@ do {								\
 
 #define __local_lock_irqsave(lock, flags)			\
 	do {							\
+		local_trylock_t *tl;				\
+		local_lock_t *l;				\
 		local_irq_save(flags);				\
-		local_lock_acquire(this_cpu_ptr(lock));		\
+		l = (local_lock_t *)this_cpu_ptr(lock);		\
+		tl = (local_trylock_t *)l;			\
+		_Generic((lock),				\
+			local_trylock_t *: ({			\
+				lockdep_assert(tl->active == 0);\
+				WRITE_ONCE(tl->active, 1);	\
+			}),					\
+			default:(void)0);			\
+		local_lock_acquire(l);				\
 	} while (0)
 
+
+#define __local_trylock_irqsave(lock, flags)			\
+	({							\
+		local_trylock_t *tl;				\
+		local_irq_save(flags);				\
+		tl = this_cpu_ptr(lock);			\
+		if (READ_ONCE(tl->active) == 1) {		\
+			local_irq_restore(flags);		\
+			tl = NULL;				\
+		} else {					\
+			WRITE_ONCE(tl->active, 1);		\
+			local_trylock_acquire(			\
+				(local_lock_t *)tl);		\
+		}						\
+		!!tl;						\
+	})
+
 #define __local_unlock(lock)					\
 	do {							\
 		local_lock_release(this_cpu_ptr(lock));		\
@@ -105,7 +153,17 @@ do {								\
 
 #define __local_unlock_irqrestore(lock, flags)			\
 	do {							\
-		local_lock_release(this_cpu_ptr(lock));		\
+		local_trylock_t *tl;				\
+		local_lock_t *l;				\
+		l = (local_lock_t *)this_cpu_ptr(lock);		\
+		tl = (local_trylock_t *)l;			\
+		_Generic((lock),				\
+			local_trylock_t *: ({			\
+				lockdep_assert(tl->active == 1);\
+				WRITE_ONCE(tl->active, 0);	\
+			}),					\
+			default:(void)0);			\
+		local_lock_release(l);				\
 		local_irq_restore(flags);			\
 	} while (0)
 
@@ -125,6 +183,7 @@ do {								\
  * critical section while staying preemptible.
  */
 typedef spinlock_t local_lock_t;
+typedef spinlock_t local_trylock_t;
 
 #define INIT_LOCAL_LOCK(lockname) __LOCAL_SPIN_LOCK_UNLOCKED((lockname))
 
@@ -148,6 +207,22 @@ typedef spinlock_t local_lock_t;
 		__local_lock(lock);				\
 	} while (0)
 
+#define __local_trylock_irqsave(lock, flags)			\
+	({							\
+		__label__ out;					\
+		int ret = 0;					\
+		typecheck(unsigned long, flags);		\
+		flags = 0;					\
+		if (in_nmi() || in_hardirq())			\
+			goto out;				\
+		migrate_disable();				\
+		ret = spin_trylock(this_cpu_ptr((lock)));	\
+		if (!ret)					\
+			migrate_enable();			\
+	out:							\
+		ret;						\
+	})
+
 #define __local_unlock(__lock)					\
 	do {							\
 		spin_unlock(this_cpu_ptr((__lock)));		\
-- 
2.43.5




* [PATCH bpf-next v6 4/6] memcg: Use trylock to access memcg stock_lock.
  2025-01-24  3:56 [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2025-01-24  3:56 ` [PATCH bpf-next v6 3/6] locking/local_lock: Introduce local_trylock_t and local_trylock_irqsave() Alexei Starovoitov
@ 2025-01-24  3:56 ` Alexei Starovoitov
  2025-01-24  3:56 ` [PATCH bpf-next v6 5/6] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2025-01-24  3:56 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Teach memcg to operate under trylock conditions when spinning locks
cannot be used.

local_trylock might fail, which leads to bypassing the charge cache if
the calling context doesn't allow spinning (gfpflags_allow_spinning).
In those cases charge the memcg counters directly and fail early if
that is not possible. This might cause a premature charge failure,
but it allows opportunistic charging that is safe from the
try_alloc_pages() path.

Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 mm/memcontrol.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..9caca00cb7de 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1722,7 +1722,7 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 }
 
 struct memcg_stock_pcp {
-	local_lock_t stock_lock;
+	local_trylock_t stock_lock;
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
 
@@ -1756,7 +1756,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
  *
  * returns true if successful, false otherwise.
  */
-static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+			  gfp_t gfp_mask)
 {
 	struct memcg_stock_pcp *stock;
 	unsigned int stock_pages;
@@ -1766,7 +1767,11 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (nr_pages > MEMCG_CHARGE_BATCH)
 		return ret;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+		if (!gfpflags_allow_spinning(gfp_mask))
+			return ret;
+		local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	}
 
 	stock = this_cpu_ptr(&memcg_stock);
 	stock_pages = READ_ONCE(stock->nr_pages);
@@ -1851,7 +1856,18 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	unsigned long flags;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+		/*
+		 * In case of unlikely failure to lock percpu stock_lock
+		 * uncharge memcg directly.
+		 */
+		if (mem_cgroup_is_root(memcg))
+			return;
+		page_counter_uncharge(&memcg->memory, nr_pages);
+		if (do_memsw_account())
+			page_counter_uncharge(&memcg->memsw, nr_pages);
+		return;
+	}
 	__refill_stock(memcg, nr_pages);
 	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
@@ -2196,9 +2212,13 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long pflags;
 
 retry:
-	if (consume_stock(memcg, nr_pages))
+	if (consume_stock(memcg, nr_pages, gfp_mask))
 		return 0;
 
+	if (!gfpflags_allow_spinning(gfp_mask))
+		/* Avoid the refill and flush of the older stock */
+		batch = nr_pages;
+
 	if (!do_memsw_account() ||
 	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
 		if (page_counter_try_charge(&memcg->memory, batch, &counter))
-- 
2.43.5




* [PATCH bpf-next v6 5/6] mm, bpf: Use memcg in try_alloc_pages().
  2025-01-24  3:56 [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2025-01-24  3:56 ` [PATCH bpf-next v6 4/6] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
@ 2025-01-24  3:56 ` Alexei Starovoitov
  2025-01-24  3:56 ` [PATCH bpf-next v6 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
  2025-01-24 14:16 ` [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Matthew Wilcox
  6 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2025-01-24  3:56 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Unconditionally use __GFP_ACCOUNT in try_alloc_pages().
The caller is responsible for setting up memcg correctly.
All BPF memory accounting is memcg based.
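
For illustration, the expected caller pattern (a sketch with a made-up
helper name, assuming CONFIG_MEMCG; patch 6 does the same in
bpf_map_alloc_pages() with the map's memcg):

static struct page *alloc_accounted_page(struct mem_cgroup *memcg, int nid)
{
	struct mem_cgroup *old_memcg;
	struct page *page;

	old_memcg = set_active_memcg(memcg);
	/* __GFP_ACCOUNT is implied; the charge goes to 'memcg'. */
	page = try_alloc_pages(nid, 0);
	set_active_memcg(old_memcg);
	return page;
}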

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 mm/page_alloc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fa750c46e0fc..931cedcda788 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7146,7 +7146,8 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
 	 * specify it here to highlight that try_alloc_pages()
 	 * doesn't want to deplete reserves.
 	 */
-	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
+			| __GFP_ACCOUNT;
 	unsigned int alloc_flags = ALLOC_TRYLOCK;
 	struct alloc_context ac = { };
 	struct page *page;
@@ -7190,6 +7191,11 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
 
 	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
 
+	if (memcg_kmem_online() && page &&
+	    unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
+		free_pages_nolock(page, order);
+		page = NULL;
+	}
 	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
 	kmsan_alloc_page(page, order, alloc_gfp);
 	return page;
-- 
2.43.5




* [PATCH bpf-next v6 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs.
  2025-01-24  3:56 [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2025-01-24  3:56 ` [PATCH bpf-next v6 5/6] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
@ 2025-01-24  3:56 ` Alexei Starovoitov
  2025-01-24 14:16 ` [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Matthew Wilcox
  6 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2025-01-24  3:56 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Use try_alloc_pages() and free_pages_nolock() for BPF needs
when the context doesn't allow using the normal alloc_pages().
This is a prerequisite for further work.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h  |  2 +-
 kernel/bpf/arena.c   |  5 ++---
 kernel/bpf/syscall.c | 23 ++++++++++++++++++++---
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f3f50e29d639..e1838a341817 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2348,7 +2348,7 @@ int  generic_map_delete_batch(struct bpf_map *map,
 struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
 struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
 
-int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
+int bpf_map_alloc_pages(const struct bpf_map *map, int nid,
 			unsigned long nr_pages, struct page **page_array);
 #ifdef CONFIG_MEMCG
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 4b22a651b5d5..642399a5fd9f 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -287,7 +287,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGSEGV;
 
 	/* Account into memcg of the process that created bpf_arena */
-	ret = bpf_map_alloc_pages(map, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE, 1, &page);
+	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
 		return VM_FAULT_SIGSEGV;
@@ -465,8 +465,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	if (ret)
 		goto out_free_pages;
 
-	ret = bpf_map_alloc_pages(&arena->map, GFP_KERNEL | __GFP_ZERO,
-				  node_id, page_cnt, pages);
+	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
 	if (ret)
 		goto out;
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0daf098e3207..55588dbd2fce 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -569,7 +569,24 @@ static void bpf_map_release_memcg(struct bpf_map *map)
 }
 #endif
 
-int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
+static bool can_alloc_pages(void)
+{
+	return preempt_count() == 0 && !irqs_disabled() &&
+		!IS_ENABLED(CONFIG_PREEMPT_RT);
+}
+
+static struct page *__bpf_alloc_page(int nid)
+{
+	if (!can_alloc_pages())
+		return try_alloc_pages(nid, 0);
+
+	return alloc_pages_node(nid,
+				GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
+				| __GFP_NOWARN,
+				0);
+}
+
+int bpf_map_alloc_pages(const struct bpf_map *map, int nid,
 			unsigned long nr_pages, struct page **pages)
 {
 	unsigned long i, j;
@@ -582,14 +599,14 @@ int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
 	old_memcg = set_active_memcg(memcg);
 #endif
 	for (i = 0; i < nr_pages; i++) {
-		pg = alloc_pages_node(nid, gfp | __GFP_ACCOUNT, 0);
+		pg = __bpf_alloc_page(nid);
 
 		if (pg) {
 			pages[i] = pg;
 			continue;
 		}
 		for (j = 0; j < i; j++)
-			__free_page(pages[j]);
+			free_pages_nolock(pages[j], 0);
 		ret = -ENOMEM;
 		break;
 	}
-- 
2.43.5




* Re: [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages()
  2025-01-24  3:56 [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (5 preceding siblings ...)
  2025-01-24  3:56 ` [PATCH bpf-next v6 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
@ 2025-01-24 14:16 ` Matthew Wilcox
  2025-01-24 14:19   ` Vlastimil Babka
  6 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2025-01-24 14:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
	houtao1, hannes, shakeel.butt, mhocko, tglx, jannh, tj, linux-mm,
	kernel-team

On Thu, Jan 23, 2025 at 07:56:49PM -0800, Alexei Starovoitov wrote:
> - Considered using __GFP_COMP in try_alloc_pages to simplify
>   free_pages_nolock a bit, but then decided to make it work
>   for all types of pages, since free_pages_nolock() is used by
>   stackdepot and currently it's using non-compound order 2.
>   I felt it's best to leave it as-is and make free_pages_nolock()
>   support all pages.

We're trying to eliminate non-use of __GFP_COMP.  Because people don't
use __GFP_COMP, there's a security check that we can't turn on.  Would
you reconsider this change you made?



* Re: [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages()
  2025-01-24 14:16 ` [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Matthew Wilcox
@ 2025-01-24 14:19   ` Vlastimil Babka
  2025-01-24 16:19     ` Alexei Starovoitov
  0 siblings, 1 reply; 20+ messages in thread
From: Vlastimil Babka @ 2025-01-24 14:19 UTC (permalink / raw)
  To: Matthew Wilcox, Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, tglx, jannh, tj, linux-mm,
	kernel-team, Marco Elver, Andrey Konovalov, Oscar Salvador

On 1/24/25 15:16, Matthew Wilcox wrote:
> On Thu, Jan 23, 2025 at 07:56:49PM -0800, Alexei Starovoitov wrote:
>> - Considered using __GFP_COMP in try_alloc_pages to simplify
>>   free_pages_nolock a bit, but then decided to make it work
>>   for all types of pages, since free_pages_nolock() is used by
>>   stackdepot and currently it's using non-compound order 2.
>>   I felt it's best to leave it as-is and make free_pages_nolock()
>>   support all pages.
> 
> We're trying to eliminate non-use of __GFP_COMP.  Because people don't
> use __GFP_COMP, there's a security check that we can't turn on.  Would
> you reconsider this change you made?

This means changing stackdepot to use __GFP_COMP, which would be a good
thing on its own. But if you consider it off-topic to your series, I can
look at it.



* Re: [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages()
  2025-01-24 14:19   ` Vlastimil Babka
@ 2025-01-24 16:19     ` Alexei Starovoitov
  2025-01-24 16:28       ` Vlastimil Babka
  2025-01-24 16:28       ` Matthew Wilcox
  0 siblings, 2 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2025-01-24 16:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Matthew Wilcox, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Sebastian Sewior, Steven Rostedt,
	Hou Tao, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team,
	Marco Elver, Andrey Konovalov, Oscar Salvador

On Fri, Jan 24, 2025 at 6:19 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/24/25 15:16, Matthew Wilcox wrote:
> > On Thu, Jan 23, 2025 at 07:56:49PM -0800, Alexei Starovoitov wrote:
> >> - Considered using __GFP_COMP in try_alloc_pages to simplify
> >>   free_pages_nolock a bit, but then decided to make it work
> >>   for all types of pages, since free_pages_nolock() is used by
> >>   stackdepot and currently it's using non-compound order 2.
> >>   I felt it's best to leave it as-is and make free_pages_nolock()
> >>   support all pages.
> >
> > We're trying to eliminate non-use of __GFP_COMP.  Because people don't
> > use __GFP_COMP, there's a security check that we can't turn on.  Would
> > you reconsider this change you made?
>
> This means changing stackdepot to use __GFP_COMP. Which would be a good
> thing on its own. But if you consider if off-topic to your series, I can
> look at it.

Ohh. I wasn't aware of that.
I can certainly add __GFP_COMP to try_alloc_pages() and
will check stackdepot too.
I spotted this line:
VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
that line alone was a good enough reason to use __GFP_COMP,
but since it's debug only I could only guess where the future lies.

Should it be something like:

if (WARN_ON(compound && compound_order(page) != order))
 order = compound_order(page);

since proceeding with the wrong order is certain to crash?



* Re: [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages()
  2025-01-24 16:19     ` Alexei Starovoitov
@ 2025-01-24 16:28       ` Vlastimil Babka
  2025-01-24 16:46         ` Alexei Starovoitov
  2025-01-24 16:28       ` Matthew Wilcox
  1 sibling, 1 reply; 20+ messages in thread
From: Vlastimil Babka @ 2025-01-24 16:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Matthew Wilcox, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Sebastian Sewior, Steven Rostedt,
	Hou Tao, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team,
	Marco Elver, Andrey Konovalov, Oscar Salvador

On 1/24/25 17:19, Alexei Starovoitov wrote:
> On Fri, Jan 24, 2025 at 6:19 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 1/24/25 15:16, Matthew Wilcox wrote:
>> > On Thu, Jan 23, 2025 at 07:56:49PM -0800, Alexei Starovoitov wrote:
>> >> - Considered using __GFP_COMP in try_alloc_pages to simplify
>> >>   free_pages_nolock a bit, but then decided to make it work
>> >>   for all types of pages, since free_pages_nolock() is used by
>> >>   stackdepot and currently it's using non-compound order 2.
>> >>   I felt it's best to leave it as-is and make free_pages_nolock()
>> >>   support all pages.
>> >
>> > We're trying to eliminate non-use of __GFP_COMP.  Because people don't
>> > use __GFP_COMP, there's a security check that we can't turn on.  Would
>> > you reconsider this change you made?
>>
>> This means changing stackdepot to use __GFP_COMP. Which would be a good
>> thing on its own. But if you consider if off-topic to your series, I can
>> look at it.
> 
> Ohh. I wasn't aware of that.
> I can certainly add __GFP_COMP to try_alloc_pages() and

Yeah IIRC I suggested that previously.

> will check stackdepot too.

Great, thanks.

> I spotted this line:
> VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
> that line alone was a good enough reason to use __GFP_COMP,
> but since it's debug only I could only guess where the future lies.
> 
> Should it be something like:
> 
> if (WARN_ON(compound && compound_order(page) != order))
>  order = compound_order(page);
> 
> since proceeding with the wrong order is certain to crash.
> ?

That's the common question: should we be paranoid and add overhead to fast
paths in production? Here we do so only for DEBUG_VM, which is meant for easier
debugging during development of new code.

I think it's not worth adding this overhead in normal configs, as the
(increasing) majority of order > 0 parameters should come here from
compound_order() anyway (i.e. put_folio()). As said, we're trying to eliminate
the other cases, so we don't need to cater for them more.



* Re: [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages()
  2025-01-24 16:19     ` Alexei Starovoitov
  2025-01-24 16:28       ` Vlastimil Babka
@ 2025-01-24 16:28       ` Matthew Wilcox
  1 sibling, 0 replies; 20+ messages in thread
From: Matthew Wilcox @ 2025-01-24 16:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Vlastimil Babka, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Sebastian Sewior, Steven Rostedt,
	Hou Tao, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team,
	Marco Elver, Andrey Konovalov, Oscar Salvador

On Fri, Jan 24, 2025 at 08:19:19AM -0800, Alexei Starovoitov wrote:
> I spotted this line:
> VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
> that line alone was a good enough reason to use __GFP_COMP,
> but since it's debug only I could only guess where the future lies.
> 
> Should it be something like:
> 
> if (WARN_ON(compound && compound_order(page) != order))
>  order = compound_order(page);
> 
> since proceeding with the wrong order is certain to crash.
> ?

It's hard to say.  We've got a discrepancy between "order at free call
site" and "order recorded in page".  We might, for example, have a
memory corruption which has overwritten the compound_order() stored in
the struct page, in which case the 'order' passed in is the correct one,
and "correcting" it to the corrupt one stored in struct page would be
the thing which caused the crash.

I'd leave it as VM_BUG_ON().



* Re: [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages()
  2025-01-24 16:28       ` Vlastimil Babka
@ 2025-01-24 16:46         ` Alexei Starovoitov
  0 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2025-01-24 16:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Matthew Wilcox, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Sebastian Sewior, Steven Rostedt,
	Hou Tao, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team,
	Marco Elver, Andrey Konovalov, Oscar Salvador

On Fri, Jan 24, 2025 at 8:28 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/24/25 17:19, Alexei Starovoitov wrote:
> > On Fri, Jan 24, 2025 at 6:19 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 1/24/25 15:16, Matthew Wilcox wrote:
> >> > On Thu, Jan 23, 2025 at 07:56:49PM -0800, Alexei Starovoitov wrote:
> >> >> - Considered using __GFP_COMP in try_alloc_pages to simplify
> >> >>   free_pages_nolock a bit, but then decided to make it work
> >> >>   for all types of pages, since free_pages_nolock() is used by
> >> >>   stackdepot and currently it's using non-compound order 2.
> >> >>   I felt it's best to leave it as-is and make free_pages_nolock()
> >> >>   support all pages.
> >> >
> >> > We're trying to eliminate non-use of __GFP_COMP.  Because people don't
> >> > use __GFP_COMP, there's a security check that we can't turn on.  Would
> >> > you reconsider this change you made?
> >>
> >> This means changing stackdepot to use __GFP_COMP. Which would be a good
> >> thing on its own. But if you consider if off-topic to your series, I can
> >> look at it.
> >
> > Ohh. I wasn't aware of that.
> > I can certainly add __GFP_COMP to try_alloc_pages() and
>
> Yeah IIRC I suggested that previously.

Yes, as a way to simplify free_pages_nolock().
Hence I looked into it and added the above comment to the cover letter.
Now I see that there are more and stronger reasons to use it.

> > will check stackdepot too.
>
> Great, thanks.
>
> > I spotted this line:
> > VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
> > that line alone was a good enough reason to use __GFP_COMP,
> > but since it's debug only I could only guess where the future lies.
> >
> > Should it be something like:
> >
> > if (WARN_ON(compound && compound_order(page) != order))
> >  order = compound_order(page);
> >
> > since proceeding with the wrong order is certain to crash.
> > ?
>
> That's the common question, should we be paranoid and add overhead to fast
> paths in production. Here we do only for DEBUG_VM, which is meant for easier
> debugging during development of new code.
>
> I think it's not worth adding this overhead in normal configs, as the
> (increasing) majority of order > 0 parameters should come here from
> compound_order() anyway (i.e. put_folio()) As said we're trying to eliminate
> the other cases so we don't need to cater for them more.

Understood.
I also agree with Matthew's comment about page corruption.
Whether compound_order(page) or order is correct is indeed a question.

I'll wait for review on patch 3 before resubmitting with GFP_COMP included.



* Re: [PATCH bpf-next v6 3/6] locking/local_lock: Introduce local_trylock_t and local_trylock_irqsave()
  2025-01-24  3:56 ` [PATCH bpf-next v6 3/6] locking/local_lock: Introduce local_trylock_t and local_trylock_irqsave() Alexei Starovoitov
@ 2025-01-28 17:21   ` Sebastian Andrzej Siewior
  2025-01-28 18:50     ` Alexei Starovoitov
  0 siblings, 1 reply; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-01-28 17:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

On 2025-01-23 19:56:52 [-0800], Alexei Starovoitov wrote:
> Usage:
> 
> local_lock_t lock;                     // sizeof(lock) == 0 in !RT
> local_lock_irqsave(&lock, ...);        // irqsave as before
> if (local_trylock_irqsave(&lock, ...)) // compilation error
> 
> local_trylock_t lock;                  // sizeof(lock) == 4 in !RT
> local_lock_irqsave(&lock, ...);        // irqsave and active = 1
> if (local_trylock_irqsave(&lock, ...)) // if (!active) irqsave

so I've been looking at this for a while and I don't like the part where
the type is hidden away. It is then cast back. So I tried something
with _Generic() but then the existing guard implementation complained.
Then I asked myself why we want to hide much of the implementation
and not make it obvious. So I made a few macros to hide most of the
implementation for !RT. Later I figured that if the variable is saved
locally, I save one this_cpu_ptr() invocation. So I wrote it out and the
snippet below is what I have.

is this anywhere near possible to accept?

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 091dc0b6bdfb9..05c254a5d7d3e 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -51,6 +51,65 @@
 #define local_unlock_irqrestore(lock, flags)			\
 	__local_unlock_irqrestore(lock, flags)
 
+/**
+ * localtry_lock_init - Runtime initialize a lock instance
+ */
+#define localtry_lock_init(lock)		__localtry_lock_init(lock)
+
+/**
+ * localtry_lock - Acquire a per CPU local lock
+ * @lock:	The lock variable
+ */
+#define localtry_lock(lock)		__localtry_lock(lock)
+
+/**
+ * localtry_lock_irq - Acquire a per CPU local lock and disable interrupts
+ * @lock:	The lock variable
+ */
+#define localtry_lock_irq(lock)		__localtry_lock_irq(lock)
+
+/**
+ * localtry_lock_irqsave - Acquire a per CPU local lock, save and disable
+ *			 interrupts
+ * @lock:	The lock variable
+ * @flags:	Storage for interrupt flags
+ */
+#define localtry_lock_irqsave(lock, flags)				\
+	__localtry_lock_irqsave(lock, flags)
+
+/**
+ * localtry_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
+ *			      interrupts if acquired
+ * @lock:	The lock variable
+ * @flags:	Storage for interrupt flags
+ *
+ * The function can be used in any context such as NMI or HARDIRQ. Due to
+ * locking constraints it _always_ fails in NMI or HARDIRQ context on PREEMPT_RT.
+ */
+#define localtry_trylock_irqsave(lock, flags)				\
+	__localtry_trylock_irqsave(lock, flags)
+
+/**
+ * localtry_unlock - Release a per CPU local lock
+ * @lock:	The lock variable
+ */
+#define localtry_unlock(lock)		__localtry_unlock(lock)
+
+/**
+ * localtry_unlock_irq - Release a per CPU local lock and enable interrupts
+ * @lock:	The lock variable
+ */
+#define localtry_unlock_irq(lock)		__localtry_unlock_irq(lock)
+
+/**
+ * localtry_unlock_irqrestore - Release a per CPU local lock and restore
+ *			      interrupt flags
+ * @lock:	The lock variable
+ * @flags:      Interrupt flags to restore
+ */
+#define localtry_unlock_irqrestore(lock, flags)			\
+	__localtry_unlock_irqrestore(lock, flags)
+
 DEFINE_GUARD(local_lock, local_lock_t __percpu*,
 	     local_lock(_T),
 	     local_unlock(_T))
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8dd71fbbb6d2b..789b0d878e6c5 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -15,6 +15,11 @@ typedef struct {
 #endif
 } local_lock_t;
 
+typedef struct {
+	local_lock_t	llock;
+	unsigned int	acquired;
+} localtry_lock_t;
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define LOCAL_LOCK_DEBUG_INIT(lockname)		\
 	.dep_map = {					\
@@ -50,6 +55,7 @@ static inline void local_lock_debug_init(local_lock_t *l) { }
 #endif /* !CONFIG_DEBUG_LOCK_ALLOC */
 
 #define INIT_LOCAL_LOCK(lockname)	{ LOCAL_LOCK_DEBUG_INIT(lockname) }
+#define INIT_LOCALTRY_LOCK(lockname)	{ .llock = { LOCAL_LOCK_DEBUG_INIT(lockname.llock) }}
 
 #define __local_lock_init(lock)					\
 do {								\
@@ -118,6 +124,86 @@ do {								\
 #define __local_unlock_nested_bh(lock)				\
 	local_lock_release(this_cpu_ptr(lock))
 
+/* localtry_lock_t variants */
+
+#define __localtry_lock_init(lock)				\
+do {								\
+	__local_lock_init(&(lock)->llock);			\
+	WRITE_ONCE((lock)->acquired, 0);			\
+} while (0)
+
+#define __localtry_lock(lock)					\
+	do {							\
+		localtry_lock_t *lt;				\
+		preempt_disable();				\
+		lt = this_cpu_ptr(lock);			\
+		local_lock_acquire(&lt->llock);			\
+		WRITE_ONCE(lt->acquired, 1);			\
+	} while (0)
+
+#define __localtry_lock_irq(lock)				\
+	do {							\
+		localtry_lock_t *lt;				\
+		local_irq_disable();				\
+		lt = this_cpu_ptr(lock);			\
+		local_lock_acquire(&lt->llock);			\
+		WRITE_ONCE(lt->acquired, 1);			\
+	} while (0)
+
+#define __localtry_lock_irqsave(lock, flags)			\
+	do {							\
+		localtry_lock_t *lt;				\
+		local_irq_save(flags);				\
+		lt = this_cpu_ptr(lock);			\
+		local_lock_acquire(&lt->llock);			\
+		WRITE_ONCE(lt->acquired, 1);			\
+	} while (0)
+
+#define __localtry_trylock_irqsave(lock, flags)			\
+	({							\
+		localtry_lock_t *lt;				\
+		bool _ret;					\
+								\
+		local_irq_save(flags);				\
+		lt = this_cpu_ptr(lock);			\
+		if (!READ_ONCE(lt->acquired)) {			\
+			local_lock_acquire(&lt->llock);		\
+			WRITE_ONCE(lt->acquired, 1);		\
+			_ret = true;				\
+		} else {					\
+			_ret = false;				\
+			local_irq_restore(flags);		\
+		}						\
+		_ret;						\
+	})
+
+#define __localtry_unlock(lock)					\
+	do {							\
+		localtry_lock_t *lt;				\
+		lt = this_cpu_ptr(lock);			\
+		WRITE_ONCE(lt->acquired, 0);			\
+		local_lock_release(&lt->llock);			\
+		preempt_enable();				\
+	} while (0)
+
+#define __localtry_unlock_irq(lock)				\
+	do {							\
+		localtry_lock_t *lt;				\
+		lt = this_cpu_ptr(lock);			\
+		WRITE_ONCE(lt->acquired, 0);			\
+		local_lock_release(&lt->llock);			\
+		local_irq_enable();				\
+	} while (0)
+
+#define __localtry_unlock_irqrestore(lock, flags)		\
+	do {							\
+		localtry_lock_t *lt;				\
+		lt = this_cpu_ptr(lock);			\
+		WRITE_ONCE(lt->acquired, 0);			\
+		local_lock_release(&lt->llock);			\
+		local_irq_restore(flags);			\
+	} while (0)
+
 #else /* !CONFIG_PREEMPT_RT */
 
 /*
@@ -125,8 +211,10 @@ do {								\
  * critical section while staying preemptible.
  */
 typedef spinlock_t local_lock_t;
+typedef spinlock_t localtry_lock_t;
 
 #define INIT_LOCAL_LOCK(lockname) __LOCAL_SPIN_LOCK_UNLOCKED((lockname))
+#define INIT_LOCALTRY_LOCK(lockname) INIT_LOCAL_LOCK(lockname)
 
 #define __local_lock_init(l)					\
 	do {							\
@@ -169,4 +257,31 @@ do {								\
 	spin_unlock(this_cpu_ptr((lock)));			\
 } while (0)
 
+/* localtry_lock_t variants */
+
+#define __localtry_lock_init(lock)			__local_lock_init(lock)
+#define __localtry_lock(lock)				__local_lock(lock)
+#define __localtry_lock_irq(lock)			__local_lock(lock)
+#define __localtry_lock_irqsave(lock, flags)		__local_lock_irqsave(lock, flags)
+#define __localtry_unlock(lock)				__local_unlock(lock)
+#define __localtry_unlock_irq(lock)			__local_unlock(lock)
+#define __localtry_unlock_irqrestore(lock, flags)	__local_unlock_irqrestore(lock, flags)
+
+#define __localtry_trylock_irqsave(lock, flags)			\
+	({							\
+		int __locked;					\
+								\
+		typecheck(unsigned long, flags);		\
+		flags = 0;					\
+		if (in_nmi() | in_hardirq()) {			\
+			__locked = 0;				\
+		} else {					\
+			migrate_disable();			\
+			__locked = spin_trylock(this_cpu_ptr((lock)));	\
+			if (!__locked)				\
+				migrate_enable();		\
+		}						\
+		__locked;					\
+	})
+
 #endif /* CONFIG_PREEMPT_RT */
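
For readers following along, a minimal usage sketch of the localtry_* API as
proposed in the snippet above; the per-CPU structure and function names below
are made up:

#include <linux/local_lock.h>
#include <linux/percpu.h>

/* Hypothetical per-CPU structure protected by the proposed lock type. */
struct demo_pcpu {
	localtry_lock_t	lock;
	unsigned long	count;
};

static DEFINE_PER_CPU(struct demo_pcpu, demo_counts) = {
	.lock = INIT_LOCALTRY_LOCK(lock),
};

/* Known-good (process/softirq) context: unconditional lock. */
static void demo_add(unsigned long n)
{
	unsigned long flags;

	localtry_lock_irqsave(&demo_counts.lock, flags);
	__this_cpu_add(demo_counts.count, n);
	localtry_unlock_irqrestore(&demo_counts.lock, flags);
}

/* Unknown context (may be NMI or hard IRQ): trylock, caller handles failure. */
static bool demo_try_add(unsigned long n)
{
	unsigned long flags;

	if (!localtry_trylock_irqsave(&demo_counts.lock, flags))
		return false;	/* defer or drop; never spin here */
	__this_cpu_add(demo_counts.count, n);
	localtry_unlock_irqrestore(&demo_counts.lock, flags);
	return true;
}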



* Re: [PATCH bpf-next v6 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-24  3:56 ` [PATCH bpf-next v6 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
@ 2025-01-28 18:09   ` Shakeel Butt
  0 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2025-01-28 18:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
	houtao1, hannes, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

On Thu, Jan 23, 2025 at 07:56:50PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs execute from tracepoints and kprobes where the
> running context is unknown, but they need to request additional
> memory. The prior workarounds were using pre-allocated memory and
> BPF-specific freelists to satisfy such allocation requests.
> Instead, introduce a gfpflags_allow_spinning() condition that signals
> to the allocator that the running context is unknown.
> Then rely on the percpu free list of pages to allocate a page.
> try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
> rmqueue_pcplist() will spin_trylock to grab the page from the percpu
> free list. If that fails (due to re-entrancy or the list being empty),
> rmqueue_bulk()/rmqueue_buddy() will attempt to
> spin_trylock zone->lock and grab the page from there.
> spin_trylock() is not safe in PREEMPT_RT when in NMI or in hard IRQ.
> Bail out early in such a case.
> 
> The support for gfpflags_allow_spinning() mode for free_page and memcg
> comes in the next patches.
> 
> This is a first step towards supporting BPF requirements in SLUB
> and getting rid of bpf_mem_alloc.
> That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
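
As a reading aid, a sketch of the calling convention the commit log describes;
the caller below is hypothetical and assumes the try_alloc_pages(nid, order)
signature from the series:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical caller: may run from a tracepoint or kprobe where the
 * context is unknown, so the allocation must never spin on zone or
 * pcp locks and may simply fail.
 */
static void *demo_grab_scratch_page(void)
{
	struct page *page;

	page = try_alloc_pages(NUMA_NO_NODE, 0);
	if (!page)
		return NULL;	/* caller falls back to preallocated memory */
	return page_address(page);
}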



* Re: [PATCH bpf-next v6 2/6] mm, bpf: Introduce free_pages_nolock()
  2025-01-24  3:56 ` [PATCH bpf-next v6 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
@ 2025-01-28 18:18   ` Shakeel Butt
  0 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2025-01-28 18:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
	houtao1, hannes, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

On Thu, Jan 23, 2025 at 07:56:51PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce free_pages_nolock() that can free pages without taking locks.
> It relies on trylock and can be called from any context.
> Since spin_trylock() cannot be used in PREEMPT_RT from hard IRQ or NMI,
> it uses a lockless linked list to stash the pages, which will be freed
> by subsequent free_pages() from a good context.
> 
> Do not use llist unconditionally. BPF maps continuously
> allocate/free, so we cannot unconditionally delay the freeing to
> llist. When the memory becomes free, make it available to the
> kernel and BPF users right away if possible, and fall back to
> llist as the last resort.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
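
And the matching free side, again only a sketch that assumes the
free_pages_nolock(page, order) signature from the series:

#include <linux/gfp.h>

/* Hypothetical counterpart to the allocation sketch above. */
static void demo_drop_scratch_page(struct page *page)
{
	if (!page)
		return;
	/*
	 * Safe in any context: frees via trylock when possible,
	 * otherwise the page is stashed on a lockless llist and
	 * drained later from a good context, as described above.
	 */
	free_pages_nolock(page, 0);
}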



* Re: [PATCH bpf-next v6 3/6] locking/local_lock: Introduce local_trylock_t and local_trylock_irqsave()
  2025-01-28 17:21   ` Sebastian Andrzej Siewior
@ 2025-01-28 18:50     ` Alexei Starovoitov
  2025-01-29  8:17       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2025-01-28 18:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Steven Rostedt, Hou Tao,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team

On Tue, Jan 28, 2025 at 9:21 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2025-01-23 19:56:52 [-0800], Alexei Starovoitov wrote:
> > Usage:
> >
> > local_lock_t lock;                     // sizeof(lock) == 0 in !RT
> > local_lock_irqsave(&lock, ...);        // irqsave as before
> > if (local_trylock_irqsave(&lock, ...)) // compilation error
> >
> > local_trylock_t lock;                  // sizeof(lock) == 4 in !RT
> > local_lock_irqsave(&lock, ...);        // irqsave and active = 1
> > if (local_trylock_irqsave(&lock, ...)) // if (!active) irqsave
>
> so I've been looking at this for a while and I don't like the part where
> the type is hidden away. It is then cast back. So I tried something
> with _Generic() but then the existing guard implementation complained.
> Then I asked myself why we want to hide much of the implementation
> and not make it obvious.

Well, the idea of hiding the extra field with _Generic() is to avoid
the churn:

git grep -E 'local_.*lock_irq'|wc -l
42

I think the API is clean enough and the _Generic() part is not exposed
to users.
Misuse or accidental usage is not possible either.
See the point:
if (local_trylock_irqsave(&lock, ...)) // compilation error

So IMO it's a better tradeoff.

> is this anywhere near possible to accept?

Other than the churn, it's fine.
I can go with it if you insist,
but I think the casting and _Generic() approach is cleaner.
Certainly a bit of an unusual pattern.
Could you sleep on it?

I can do s/local_trylock_t/localtry_lock_t/.
That part is trivial.



* Re: [PATCH bpf-next v6 3/6] locking/local_lock: Introduce local_trylock_t and local_trylock_irqsave()
  2025-01-28 18:50     ` Alexei Starovoitov
@ 2025-01-29  8:17       ` Sebastian Andrzej Siewior
  2025-01-30 20:51         ` Alexei Starovoitov
  2025-02-06 11:13         ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-01-29  8:17 UTC (permalink / raw)
  To: Alexei Starovoitov, Peter Zijlstra
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Vlastimil Babka, Steven Rostedt, Hou Tao, Johannes Weiner,
	Shakeel Butt, Michal Hocko, Matthew Wilcox, Thomas Gleixner,
	Jann Horn, Tejun Heo, linux-mm, Kernel Team

PeterZ, may I summon you.

On 2025-01-28 10:50:33 [-0800], Alexei Starovoitov wrote:
> On Tue, Jan 28, 2025 at 9:21 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> >
> > On 2025-01-23 19:56:52 [-0800], Alexei Starovoitov wrote:
> > > Usage:
> > >
> > > local_lock_t lock;                     // sizeof(lock) == 0 in !RT
> > > local_lock_irqsave(&lock, ...);        // irqsave as before
> > > if (local_trylock_irqsave(&lock, ...)) // compilation error
> > >
> > > local_trylock_t lock;                  // sizeof(lock) == 4 in !RT
> > > local_lock_irqsave(&lock, ...);        // irqsave and active = 1
> > > if (local_trylock_irqsave(&lock, ...)) // if (!active) irqsave
> >
> > so I've been looking at this for a while and I don't like the part where
> > the type is hidden away. It is then cast back. So I tried something
> > with _Generic() but then the existing guard implementation complained.
> > Then I asked myself why we want to hide much of the implementation
> > and not make it obvious.
> 
> > Well, the idea of hiding the extra field with _Generic() is to avoid
> the churn:
> 
> git grep -E 'local_.*lock_irq'|wc -l
> 42

This could also be hidden with a macro defining the general body and
having a placeholder for the "lock primitive".
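
(One possible reading of that idea, purely as a hypothetical sketch and not
something posted in the thread: a shared body macro with the variable step
as a placeholder, so the lock flavours differ only in that step.)

/* Hypothetical illustration only; names are made up. */
#define __demo_localtry_lock_body(lock, disable_stmt)		\
	do {							\
		localtry_lock_t *lt;				\
		disable_stmt;					\
		lt = this_cpu_ptr(lock);			\
		local_lock_acquire(&lt->llock);			\
		WRITE_ONCE(lt->acquired, 1);			\
	} while (0)

#define demo_localtry_lock(lock)				\
	__demo_localtry_lock_body(lock, preempt_disable())
#define demo_localtry_lock_irq(lock)				\
	__demo_localtry_lock_body(lock, local_irq_disable())
#define demo_localtry_lock_irqsave(lock, flags)			\
	__demo_localtry_lock_body(lock, local_irq_save(flags))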

> I think the API is clean enough and the _Generic() part is not exposed
> to users.
> Misuse or accidental usage is not possible either.
> See the point:
> if (local_trylock_irqsave(&lock, ...)) // compilation error
> 
> So IMO it's a better tradeoff.
> 
> > is this anywhere near possible to accept?
> 
> Other than the churn, it's fine.
> I can go with it if you insist,
> but I think the casting and _Generic() approach is cleaner.
> Certainly a bit of an unusual pattern.
> Could you sleep on it?

The cast there is somehow… We could have BUILD_BUG_ON() to ensure a
stable layout of the structs… However, all this is not my call.

PeterZ, do you have any preferences or an outline what you would like to
see here?

> I can do s/local_trylock_t/localtry_lock_t/.
> That part is trivial.

Sebastian



* Re: [PATCH bpf-next v6 3/6] locking/local_lock: Introduce local_trylock_t and local_trylock_irqsave()
  2025-01-29  8:17       ` Sebastian Andrzej Siewior
@ 2025-01-30 20:51         ` Alexei Starovoitov
  2025-02-06 11:13         ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2025-01-30 20:51 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Vlastimil Babka, Steven Rostedt, Hou Tao,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team

On Wed, Jan 29, 2025 at 12:17 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> PeterZ, may I summon you.
>
> On 2025-01-28 10:50:33 [-0800], Alexei Starovoitov wrote:
> > On Tue, Jan 28, 2025 at 9:21 AM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> > >
> > > On 2025-01-23 19:56:52 [-0800], Alexei Starovoitov wrote:
> > > > Usage:
> > > >
> > > > local_lock_t lock;                     // sizeof(lock) == 0 in !RT
> > > > local_lock_irqsave(&lock, ...);        // irqsave as before
> > > > if (local_trylock_irqsave(&lock, ...)) // compilation error
> > > >
> > > > local_trylock_t lock;                  // sizeof(lock) == 4 in !RT
> > > > local_lock_irqsave(&lock, ...);        // irqsave and active = 1
> > > > if (local_trylock_irqsave(&lock, ...)) // if (!active) irqsave
> > >
> > > so I've been looking at this for a while and I don't like the part where
> > > the type is hidden away. It is then cast back. So I tried something
> > > with _Generic() but then the existing guard implementation complained.
> > > Then I asked myself why we want to hide much of the implementation
> > > and not make it obvious.
> >
> > Well, the idea of hiding the extra field with _Generic() is to avoid
> > the churn:
> >
> > git grep -E 'local_.*lock_irq'|wc -l
> > 42
>
> This could also be hidden with a macro defining the general body and
> having a placeholder for the "lock primitive".

What would that look like?

> > I think the API is clean enough and the _Generic() part is not exposed
> > to users.
> > Misuse or accidental usage is not possible either.
> > See the point:
> > if (local_trylock_irqsave(&lock, ...)) // compilation error
> >
> > So IMO it's a better tradeoff.
> >
> > > is this anywhere near possible to accept?
> >
> > Other than the churn, it's fine.
> > I can go with it if you insist,
> > but I think the casting and _Generic() approach is cleaner.
> > Certainly a bit of an unusual pattern.
> > Could you sleep on it?
>
> The cast there is somehow… We could have BUILD_BUG_ON() to ensure a
> stable layout of the structs… However, all this is not my call.
>
> PeterZ, do you have any preferences or an outline what you would like to
> see here?

I still don't get the objection.
This is normal function polymorphism, which is present
in many languages.
Consider spin_lock().
It's already vastly different in PREEMPT_RT vs not.
This is polymorphism. The same function has different
implementations depending on config and argument type.
This patch makes local_lock_irqsave() polymorphic
not only in PREEMPT_RT vs not,
but also depending on local_lock_t vs localtry_lock_t
argument type in !PREEMPT_RT.

Anyway, if I don't hear back from Peter or you soon
I'll just take your localtry_lock_t version of the patch
with you as the author and your SOB (ok?) and
will make the next patch 4 suffer all the code churn.
We cannot afford to get stuck on something as trivial
as this for days.
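
For readers unfamiliar with the mechanism being argued over, a standalone
userspace sketch of _Generic()-based dispatch; the types and functions here
are made up and this is not the kernel patch:

#include <stdio.h>

typedef struct { int dummy; } plain_lock_t;
typedef struct { plain_lock_t l; int active; } try_lock_t;

static void acquire_plain(plain_lock_t *l) { (void)l; /* plain path */ }
static void acquire_try(try_lock_t *l)     { l->active = 1; }

/*
 * One macro name, behaviour selected at compile time by argument type;
 * passing an unsupported type is a compile error.
 */
#define demo_lock(p)				\
	_Generic((p),				\
		 plain_lock_t *: acquire_plain,	\
		 try_lock_t *:   acquire_try)(p)

int main(void)
{
	plain_lock_t a = { 0 };
	try_lock_t b = { { 0 }, 0 };

	demo_lock(&a);	/* dispatches to acquire_plain() */
	demo_lock(&b);	/* dispatches to acquire_try()   */
	printf("b.active = %d\n", b.active);
	return 0;
}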



* Re: [PATCH bpf-next v6 3/6] locking/local_lock: Introduce local_trylock_t and local_trylock_irqsave()
  2025-01-29  8:17       ` Sebastian Andrzej Siewior
  2025-01-30 20:51         ` Alexei Starovoitov
@ 2025-02-06 11:13         ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-06 11:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Peter Zijlstra
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Vlastimil Babka, Steven Rostedt, Hou Tao, Johannes Weiner,
	Shakeel Butt, Michal Hocko, Matthew Wilcox, Thomas Gleixner,
	Jann Horn, Tejun Heo, linux-mm, Kernel Team

On 2025-01-29 09:17:27 [+0100], To Alexei Starovoitov wrote:
> PeterZ, may I summon you.
PeterZ, may I summon you.

Sebastian



end of thread, other threads:[~2025-02-06 11:14 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-24  3:56 [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
2025-01-24  3:56 ` [PATCH bpf-next v6 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
2025-01-28 18:09   ` Shakeel Butt
2025-01-24  3:56 ` [PATCH bpf-next v6 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
2025-01-28 18:18   ` Shakeel Butt
2025-01-24  3:56 ` [PATCH bpf-next v6 3/6] locking/local_lock: Introduce local_trylock_t and local_trylock_irqsave() Alexei Starovoitov
2025-01-28 17:21   ` Sebastian Andrzej Siewior
2025-01-28 18:50     ` Alexei Starovoitov
2025-01-29  8:17       ` Sebastian Andrzej Siewior
2025-01-30 20:51         ` Alexei Starovoitov
2025-02-06 11:13         ` Sebastian Andrzej Siewior
2025-01-24  3:56 ` [PATCH bpf-next v6 4/6] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
2025-01-24  3:56 ` [PATCH bpf-next v6 5/6] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
2025-01-24  3:56 ` [PATCH bpf-next v6 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
2025-01-24 14:16 ` [PATCH bpf-next v6 0/6] bpf, mm: Introduce try_alloc_pages() Matthew Wilcox
2025-01-24 14:19   ` Vlastimil Babka
2025-01-24 16:19     ` Alexei Starovoitov
2025-01-24 16:28       ` Vlastimil Babka
2025-01-24 16:46         ` Alexei Starovoitov
2025-01-24 16:28       ` Matthew Wilcox
