linux-mm.kvack.org archive mirror
* [PATCH bpf-next v4 0/6] bpf, mm: Introduce try_alloc_pages()
@ 2025-01-14  2:19 Alexei Starovoitov
  2025-01-14  2:19 ` [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-14  2:19 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Hi All,

The main motivation is to make the page allocator and slab reentrant and
to remove bpf_mem_alloc.

v3->v4:
Addressed feedback from Michal and Shakeel:
- GFP_TRYLOCK flag is gone. gfpflags_allow_spinning() is used instead.
- Improved comments and commit logs.

v3:
https://lore.kernel.org/bpf/20241218030720.1602449-1-alexei.starovoitov@gmail.com/

v2->v3:
To address the issues spotted by Sebastian, Vlastimil, Steven:
- Made GFP_TRYLOCK internal to mm/internal.h
  try_alloc_pages() and free_pages_nolock() are the only interfaces.
- Since spin_trylock() is not safe in RT from hard IRQ and NMI,
  disable such usage in local_trylock and in try_alloc_pages().
  In that case free_pages_nolock() falls back to the llist right away.
- Process trylock_free_pages llist when preemptible.
- Check for things like unaccepted memory and order <= 3 early.
- Don't call into __alloc_pages_slowpath() at all.
- Inspired by Vlastimil's struct local_tryirq_lock, adopted the idea
  in local_lock_t. The extra 4 bytes in !RT local_lock_t shouldn't
  affect any of the current local_lock_t users. This is patch 3.
- Tested with bpf selftests in RT and !RT and realized how much
  more work is necessary on bpf side to play nice with RT.
  The urgency of this work got higher. The alternative is to
  convert bpf bits left and right to bpf_mem_alloc.

v2:
https://lore.kernel.org/bpf/20241210023936.46871-1-alexei.starovoitov@gmail.com/

v1->v2:
- fixed buggy try_alloc_pages_noprof() in PREEMPT_RT. Thanks Peter.
- optimized all paths by doing spin_trylock_irqsave() first
  and only then checking for gfp_flags & __GFP_TRYLOCK,
  falling back to spin_lock_irqsave() in the regular mode.
  So the new gfp flag does not add performance overhead.
- patches 2-5 are new. They introduce lockless and/or trylock free_pages_nolock()
  and memcg support. So it's in usable shape for bpf in patch 6.

v1:
https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/

Alexei Starovoitov (6):
  mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  mm, bpf: Introduce free_pages_nolock()
  locking/local_lock: Introduce local_trylock_irqsave()
  memcg: Use trylock to access memcg stock_lock.
  mm, bpf: Use memcg in try_alloc_pages().
  bpf: Use try_alloc_pages() to allocate pages for bpf needs.

 include/linux/gfp.h                 |  23 ++++
 include/linux/local_lock.h          |   9 ++
 include/linux/local_lock_internal.h |  76 +++++++++++--
 include/linux/mm_types.h            |   4 +
 include/linux/mmzone.h              |   3 +
 kernel/bpf/syscall.c                |   4 +-
 mm/internal.h                       |   1 +
 mm/memcontrol.c                     |  24 +++-
 mm/page_alloc.c                     | 170 ++++++++++++++++++++++++++--
 9 files changed, 290 insertions(+), 24 deletions(-)

-- 
2.43.5




* [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-14  2:19 [PATCH bpf-next v4 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
@ 2025-01-14  2:19 ` Alexei Starovoitov
  2025-01-14  9:53   ` Peter Zijlstra
  2025-01-14 10:31   ` Michal Hocko
  2025-01-14  2:19 ` [PATCH bpf-next v4 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-14  2:19 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Tracing BPF programs execute from tracepoints and kprobes where
the running context is unknown, but they need to request additional
memory. The prior workarounds were using pre-allocated memory and
BPF specific freelists to satisfy such allocation requests.
Instead, introduce a gfpflags_allow_spinning() condition that signals
to the allocator that the running context is unknown.
Then rely on the percpu free list of pages to allocate a page;
rmqueue_pcplist() should be able to pop a page from it.
If that fails (due to IRQ re-entrancy or the list being empty) then
try_alloc_pages() attempts to spin_trylock zone->lock
and refill the percpu freelist as normal.
A BPF program may execute with IRQs disabled and zone->lock is
a sleeping lock in RT, so trylock is the only option. In theory we
could introduce a percpu reentrance counter and increment it every
time spin_lock_irqsave(&zone->lock, flags) is used, but we cannot
rely on it. Even if this cpu is not in the page_alloc path,
spin_lock_irqsave() is not safe, since the BPF prog might be called
from a tracepoint where preemption is disabled. So trylock only.

Note, free_page and memcg are not yet taught about the
gfpflags_allow_spinning() condition. That support comes in the next patches.

This is a first step towards supporting BPF requirements in SLUB
and getting rid of bpf_mem_alloc.
That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
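
For illustration, a minimal caller sketch (a hypothetical helper, not part of
this patch; the matching free path, free_pages_nolock(), arrives in patch 2):

/* Hypothetical helper: grab a zeroed scratch page from any context. */
static void *grab_scratch_page(void)
{
        struct page *page;

        /* Best effort: may fail; callers must tolerate NULL. */
        page = try_alloc_pages(NUMA_NO_NODE, 0);
        if (!page)
                return NULL;
        /* The page is already zeroed (__GFP_ZERO is set internally). */
        return page_address(page);
}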

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h | 22 ++++++++++++
 mm/internal.h       |  1 +
 mm/page_alloc.c     | 85 +++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 105 insertions(+), 3 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b0fe9f62d15b..b41bb6e01781 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,6 +39,25 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
 }
 
+static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
+{
+	/*
+	 * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
+	 * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
+	 * All GFP_* flags including GFP_NOWAIT use one or both flags.
+	 * try_alloc_pages() is the only API that doesn't specify either flag.
+	 *
+	 * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
+	 * those are guaranteed to never block on a sleeping lock.
+	 * Here we are enforcing that the allocation doesn't ever spin
+	 * on any locks (i.e. only trylocks). There is no high-level
+	 * GFP_$FOO flag for this use in try_alloc_pages() as the
+	 * regular page allocator doesn't fully support this
+	 * allocation mode.
+	 */
+	return !(gfp_flags & __GFP_RECLAIM);
+}
+
 #ifdef CONFIG_HIGHMEM
 #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
 #else
@@ -347,6 +366,9 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
 }
 #define alloc_page_vma(...)			alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
 
+struct page *try_alloc_pages_noprof(int nid, unsigned int order);
+#define try_alloc_pages(...)			alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
+
 extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
 #define __get_free_pages(...)			alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
 
diff --git a/mm/internal.h b/mm/internal.h
index cb8d8e8e3ffa..5454fa610aac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1174,6 +1174,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_NOFRAGMENT	  0x0
 #endif
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
+#define ALLOC_TRYLOCK		0x400 /* Only use spin_trylock in allocation path */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
 
 /* Flags that allow allocations below the min watermark. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1cb4b8c8886d..0f4be88ff131 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2304,7 +2304,11 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+			return 0;
+		spin_lock_irqsave(&zone->lock, flags);
+	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 								alloc_flags);
@@ -2904,7 +2908,11 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 
 	do {
 		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
+		if (!spin_trylock_irqsave(&zone->lock, flags)) {
+			if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+				return NULL;
+			spin_lock_irqsave(&zone->lock, flags);
+		}
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
@@ -4509,7 +4517,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 
 	might_alloc(gfp_mask);
 
-	if (should_fail_alloc_page(gfp_mask, order))
+	if (!(*alloc_flags & ALLOC_TRYLOCK) &&
+	    should_fail_alloc_page(gfp_mask, order))
 		return false;
 
 	*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
@@ -7023,3 +7032,73 @@ static bool __free_unaccepted(struct page *page)
 }
 
 #endif /* CONFIG_UNACCEPTED_MEMORY */
+
+struct page *try_alloc_pages_noprof(int nid, unsigned int order)
+{
+	/*
+	 * Do not specify __GFP_DIRECT_RECLAIM, since direct reclaim is not allowed.
+	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
+	 * is not safe in arbitrary context.
+	 *
+	 * These two are the conditions for gfpflags_allow_spinning() being true.
+	 *
+	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+	 * to warn. Also warn would trigger printk() which is unsafe from
+	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
+	 * since the running context is unknown.
+	 *
+	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
+	 * is safe in any context. Also zeroing the page is mandatory for
+	 * BPF use cases.
+	 *
+	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
+	 * specify it here to highlight that try_alloc_pages()
+	 * doesn't want to deplete reserves.
+	 */
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
+	unsigned int alloc_flags = ALLOC_TRYLOCK;
+	struct alloc_context ac = { };
+	struct page *page;
+
+	/*
+	 * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
+	 * If spin_trylock() is called from hard IRQ the current task may be
+	 * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
+	 * task as the owner of another rt_spin_lock which will confuse PI
+	 * logic, so return immediately if called from hard IRQ or NMI.
+	 *
+	 * Note, irqs_disabled() case is ok. This function can be called
+	 * from raw_spin_lock_irqsave region.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+		return NULL;
+	if (!pcp_allowed_order(order))
+		return NULL;
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	/* Bailout, since try_to_accept_memory_one() needs to take a lock */
+	if (has_unaccepted_memory())
+		return NULL;
+#endif
+	/* Bailout, since _deferred_grow_zone() needs to take a lock */
+	if (deferred_pages_enabled())
+		return NULL;
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_node_id();
+
+	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
+			    &alloc_gfp, &alloc_flags);
+
+	/*
+	 * Best effort allocation from percpu free list.
+	 * If it's empty attempt to spin_trylock zone->lock.
+	 */
+	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
+
+	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
+
+	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
+	kmsan_alloc_page(page, order, alloc_gfp);
+	return page;
+}
-- 
2.43.5




* [PATCH bpf-next v4 2/6] mm, bpf: Introduce free_pages_nolock()
  2025-01-14  2:19 [PATCH bpf-next v4 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
  2025-01-14  2:19 ` [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
@ 2025-01-14  2:19 ` Alexei Starovoitov
  2025-01-14  2:19 ` [PATCH bpf-next v4 3/6] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-14  2:19 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce free_pages_nolock() that can free pages without taking locks.
It relies on trylock and can be called from any context.
Since spin_trylock() cannot be used in RT from hard IRQ or NMI,
it uses a lockless link list to stash the pages, which will be freed
by a subsequent free_pages() from a good context.

Do not use the llist unconditionally. BPF maps continuously
allocate/free, so we cannot unconditionally delay the freeing to
the llist. When the memory becomes free, make it available to the
kernel and BPF users right away if possible, and fall back to the
llist as the last resort.
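
For illustration, how a caller is expected to pair the two new APIs (a
hypothetical snippet, not part of this patch; try_alloc_pages() comes from
the previous patch):

struct page *page = try_alloc_pages(NUMA_NO_NODE, 0);

if (page)
        /* Safe even while holding a raw_spin_lock, in hard IRQ or NMI. */
        free_pages_nolock(page, 0);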

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h      |  1 +
 include/linux/mm_types.h |  4 ++
 include/linux/mmzone.h   |  3 ++
 mm/page_alloc.c          | 79 ++++++++++++++++++++++++++++++++++++----
 4 files changed, 79 insertions(+), 8 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b41bb6e01781..6eba2d80feb8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -391,6 +391,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
 	__get_free_pages((gfp_mask) | GFP_DMA, (order))
 
 extern void __free_pages(struct page *page, unsigned int order);
+extern void free_pages_nolock(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
 
 #define __free_page(page) __free_pages((page), 0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..52547b3e5fd8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,10 @@ struct page {
 				/* Or, free page */
 				struct list_head buddy_list;
 				struct list_head pcp_list;
+				struct {
+					struct llist_node pcp_llist;
+					unsigned int order;
+				};
 			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b36124145a16..1a854e0a9e3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -953,6 +953,9 @@ struct zone {
 	/* Primarily protects free_area */
 	spinlock_t		lock;
 
+	/* Pages to be freed when next trylock succeeds */
+	struct llist_head	trylock_free_pages;
+
 	/* Write-intensive fields used by compaction and vmstats. */
 	CACHELINE_PADDING(_pad2_);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0f4be88ff131..f967725898be 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -88,6 +88,9 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_TO_TAIL		((__force fpi_t)BIT(1))
 
+/* Free the page without taking locks. Rely on trylock only. */
+#define FPI_TRYLOCK		((__force fpi_t)BIT(2))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1247,13 +1250,44 @@ static void split_large_buddy(struct zone *zone, struct page *page,
 	}
 }
 
+static void add_page_to_zone_llist(struct zone *zone, struct page *page,
+				   unsigned int order)
+{
+	/* Remember the order */
+	page->order = order;
+	/* Add the page to the free list */
+	llist_add(&page->pcp_llist, &zone->trylock_free_pages);
+}
+
 static void free_one_page(struct zone *zone, struct page *page,
 			  unsigned long pfn, unsigned int order,
 			  fpi_t fpi_flags)
 {
+	struct llist_head *llhead;
 	unsigned long flags;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+			add_page_to_zone_llist(zone, page, order);
+			return;
+		}
+		spin_lock_irqsave(&zone->lock, flags);
+	}
+
+	/* The lock succeeded. Process deferred pages. */
+	llhead = &zone->trylock_free_pages;
+	if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {
+		struct llist_node *llnode;
+		struct page *p, *tmp;
+
+		llnode = llist_del_all(llhead);
+		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
+			unsigned int p_order = p->order;
+
+			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
+			__count_vm_events(PGFREE, 1 << p_order);
+		}
+	}
 	split_large_buddy(zone, page, pfn, order, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
@@ -2596,7 +2630,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 				   struct page *page, int migratetype,
-				   unsigned int order)
+				   unsigned int order, fpi_t fpi_flags)
 {
 	int high, batch;
 	int pindex;
@@ -2631,6 +2665,14 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	}
 	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
 		pcp->free_count += (1 << order);
+
+	if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+		/*
+		 * Do not attempt to take a zone lock. Let pcp->count get
+		 * over high mark temporarily.
+		 */
+		return;
+	}
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2645,7 +2687,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 /*
  * Free a pcp page
  */
-void free_unref_page(struct page *page, unsigned int order)
+static void __free_unref_page(struct page *page, unsigned int order,
+			      fpi_t fpi_flags)
 {
 	unsigned long __maybe_unused UP_flags;
 	struct per_cpu_pages *pcp;
@@ -2654,7 +2697,7 @@ void free_unref_page(struct page *page, unsigned int order)
 	int migratetype;
 
 	if (!pcp_allowed_order(order)) {
-		__free_pages_ok(page, order, FPI_NONE);
+		__free_pages_ok(page, order, fpi_flags);
 		return;
 	}
 
@@ -2671,24 +2714,33 @@ void free_unref_page(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
+			free_one_page(page_zone(page), page, pfn, order, fpi_flags);
 			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
 	}
 
 	zone = page_zone(page);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq())) {
+		add_page_to_zone_llist(zone, page, order);
+		return;
+	}
 	pcp_trylock_prepare(UP_flags);
 	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
-		free_unref_page_commit(zone, pcp, page, migratetype, order);
+		free_unref_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
 		pcp_spin_unlock(pcp);
 	} else {
-		free_one_page(zone, page, pfn, order, FPI_NONE);
+		free_one_page(zone, page, pfn, order, fpi_flags);
 	}
 	pcp_trylock_finish(UP_flags);
 }
 
+void free_unref_page(struct page *page, unsigned int order)
+{
+	__free_unref_page(page, order, FPI_NONE);
+}
+
 /*
  * Free a batch of folios
  */
@@ -2777,7 +2829,7 @@ void free_unref_folios(struct folio_batch *folios)
 
 		trace_mm_page_free_batched(&folio->page);
 		free_unref_page_commit(zone, pcp, &folio->page, migratetype,
-				order);
+				       order, FPI_NONE);
 	}
 
 	if (pcp) {
@@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL(__free_pages);
 
+/*
+ * Can be called while holding raw_spin_lock or from IRQ and NMI,
+ * but only for pages that came from try_alloc_pages():
+ * order <= 3, !folio, etc
+ */
+void free_pages_nolock(struct page *page, unsigned int order)
+{
+	if (put_page_testzero(page))
+		__free_unref_page(page, order, FPI_TRYLOCK);
+}
+
 void free_pages(unsigned long addr, unsigned int order)
 {
 	if (addr != 0) {
-- 
2.43.5




* [PATCH bpf-next v4 3/6] locking/local_lock: Introduce local_trylock_irqsave()
  2025-01-14  2:19 [PATCH bpf-next v4 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
  2025-01-14  2:19 ` [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
  2025-01-14  2:19 ` [PATCH bpf-next v4 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
@ 2025-01-14  2:19 ` Alexei Starovoitov
  2025-01-14  2:19 ` [PATCH bpf-next v4 4/6] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-14  2:19 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Similar to local_lock_irqsave() introduce local_trylock_irqsave().
This is inspired by 'struct local_tryirq_lock' in:
https://lore.kernel.org/all/20241112-slub-percpu-caches-v1-5-ddc0bdc27e05@suse.cz/

Use spin_trylock in PREEMPT_RT when not in hard IRQ and not in NMI
and fail instantly otherwise, since spin_trylock is not safe from IRQ
due to PI issues.

In !PREEMPT_RT use a simple active flag to prevent IRQs or NMIs
from re-entering the locked region.

Note there is no need to use local_inc for the active flag.
If an IRQ handler grabs the same local_lock after READ_ONCE(lock->active)
has already completed, it has to unlock it before returning.
The same holds for an NMI handler, so there is a strict nesting of scopes.
It's a per cpu lock. Multiple cpus do not access it in parallel.
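
A usage sketch (the per-CPU lock and function names below are made up for
illustration; patch 4 applies the same pattern to memcg_stock.stock_lock):

/* Hypothetical per-CPU lock protecting some per-CPU state. */
static DEFINE_PER_CPU(local_lock_t, demo_lock) = INIT_LOCAL_LOCK(demo_lock);

static bool demo_update(void)
{
        unsigned long flags;

        if (!local_trylock_irqsave(&demo_lock, flags))
                return false;   /* re-entered on this CPU, or RT hard IRQ/NMI */
        /* ... update per-CPU state ... */
        local_unlock_irqrestore(&demo_lock, flags);
        return true;
}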

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/local_lock.h          |  9 ++++
 include/linux/local_lock_internal.h | 76 ++++++++++++++++++++++++++---
 2 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 091dc0b6bdfb..84ee560c4f51 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -30,6 +30,15 @@
 #define local_lock_irqsave(lock, flags)				\
 	__local_lock_irqsave(lock, flags)
 
+/**
+ * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
+ *			   interrupts. Always fails in RT when in_hardirq or NMI.
+ * @lock:	The lock variable
+ * @flags:	Storage for interrupt flags
+ */
+#define local_trylock_irqsave(lock, flags)			\
+	__local_trylock_irqsave(lock, flags)
+
 /**
  * local_unlock - Release a per CPU local lock
  * @lock:	The lock variable
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8dd71fbbb6d2..93672127c73d 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -9,6 +9,7 @@
 #ifndef CONFIG_PREEMPT_RT
 
 typedef struct {
+	int active;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map	dep_map;
 	struct task_struct	*owner;
@@ -22,7 +23,7 @@ typedef struct {
 		.wait_type_inner = LD_WAIT_CONFIG,	\
 		.lock_type = LD_LOCK_PERCPU,		\
 	},						\
-	.owner = NULL,
+	.owner = NULL, .active = 0
 
 static inline void local_lock_acquire(local_lock_t *l)
 {
@@ -31,6 +32,13 @@ static inline void local_lock_acquire(local_lock_t *l)
 	l->owner = current;
 }
 
+static inline void local_trylock_acquire(local_lock_t *l)
+{
+	lock_map_acquire_try(&l->dep_map);
+	DEBUG_LOCKS_WARN_ON(l->owner);
+	l->owner = current;
+}
+
 static inline void local_lock_release(local_lock_t *l)
 {
 	DEBUG_LOCKS_WARN_ON(l->owner != current);
@@ -45,6 +53,7 @@ static inline void local_lock_debug_init(local_lock_t *l)
 #else /* CONFIG_DEBUG_LOCK_ALLOC */
 # define LOCAL_LOCK_DEBUG_INIT(lockname)
 static inline void local_lock_acquire(local_lock_t *l) { }
+static inline void local_trylock_acquire(local_lock_t *l) { }
 static inline void local_lock_release(local_lock_t *l) { }
 static inline void local_lock_debug_init(local_lock_t *l) { }
 #endif /* !CONFIG_DEBUG_LOCK_ALLOC */
@@ -60,6 +69,7 @@ do {								\
 			      0, LD_WAIT_CONFIG, LD_WAIT_INV,	\
 			      LD_LOCK_PERCPU);			\
 	local_lock_debug_init(lock);				\
+	(lock)->active = 0;					\
 } while (0)
 
 #define __spinlock_nested_bh_init(lock)				\
@@ -75,37 +85,73 @@ do {								\
 
 #define __local_lock(lock)					\
 	do {							\
+		local_lock_t *l;				\
 		preempt_disable();				\
-		local_lock_acquire(this_cpu_ptr(lock));		\
+		l = this_cpu_ptr(lock);				\
+		lockdep_assert(l->active == 0);			\
+		WRITE_ONCE(l->active, 1);			\
+		local_lock_acquire(l);				\
 	} while (0)
 
 #define __local_lock_irq(lock)					\
 	do {							\
+		local_lock_t *l;				\
 		local_irq_disable();				\
-		local_lock_acquire(this_cpu_ptr(lock));		\
+		l = this_cpu_ptr(lock);				\
+		lockdep_assert(l->active == 0);			\
+		WRITE_ONCE(l->active, 1);			\
+		local_lock_acquire(l);				\
 	} while (0)
 
 #define __local_lock_irqsave(lock, flags)			\
 	do {							\
+		local_lock_t *l;				\
 		local_irq_save(flags);				\
-		local_lock_acquire(this_cpu_ptr(lock));		\
+		l = this_cpu_ptr(lock);				\
+		lockdep_assert(l->active == 0);			\
+		WRITE_ONCE(l->active, 1);			\
+		local_lock_acquire(l);				\
 	} while (0)
 
+#define __local_trylock_irqsave(lock, flags)			\
+	({							\
+		local_lock_t *l;				\
+		local_irq_save(flags);				\
+		l = this_cpu_ptr(lock);				\
+		if (READ_ONCE(l->active) == 1) {		\
+			local_irq_restore(flags);		\
+			l = NULL;				\
+		} else {					\
+			WRITE_ONCE(l->active, 1);		\
+			local_trylock_acquire(l);		\
+		}						\
+		!!l;						\
+	})
+
 #define __local_unlock(lock)					\
 	do {							\
-		local_lock_release(this_cpu_ptr(lock));		\
+		local_lock_t *l = this_cpu_ptr(lock);		\
+		lockdep_assert(l->active == 1);			\
+		WRITE_ONCE(l->active, 0);			\
+		local_lock_release(l);				\
 		preempt_enable();				\
 	} while (0)
 
 #define __local_unlock_irq(lock)				\
 	do {							\
-		local_lock_release(this_cpu_ptr(lock));		\
+		local_lock_t *l = this_cpu_ptr(lock);		\
+		lockdep_assert(l->active == 1);			\
+		WRITE_ONCE(l->active, 0);			\
+		local_lock_release(l);				\
 		local_irq_enable();				\
 	} while (0)
 
 #define __local_unlock_irqrestore(lock, flags)			\
 	do {							\
-		local_lock_release(this_cpu_ptr(lock));		\
+		local_lock_t *l = this_cpu_ptr(lock);		\
+		lockdep_assert(l->active == 1);			\
+		WRITE_ONCE(l->active, 0);			\
+		local_lock_release(l);				\
 		local_irq_restore(flags);			\
 	} while (0)
 
@@ -148,6 +194,22 @@ typedef spinlock_t local_lock_t;
 		__local_lock(lock);				\
 	} while (0)
 
+#define __local_trylock_irqsave(lock, flags)			\
+	({							\
+		__label__ out;					\
+		int ret = 0;					\
+		typecheck(unsigned long, flags);		\
+		flags = 0;					\
+		if (in_nmi() || in_hardirq())			\
+			goto out;				\
+		migrate_disable();				\
+		ret = spin_trylock(this_cpu_ptr((lock)));	\
+		if (!ret)					\
+			migrate_enable();			\
+	out:							\
+		ret;						\
+	})
+
 #define __local_unlock(__lock)					\
 	do {							\
 		spin_unlock(this_cpu_ptr((__lock)));		\
-- 
2.43.5




* [PATCH bpf-next v4 4/6] memcg: Use trylock to access memcg stock_lock.
  2025-01-14  2:19 [PATCH bpf-next v4 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2025-01-14  2:19 ` [PATCH bpf-next v4 3/6] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
@ 2025-01-14  2:19 ` Alexei Starovoitov
  2025-01-14 10:39   ` Michal Hocko
  2025-01-14  2:19 ` [PATCH bpf-next v4 5/6] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
  2025-01-14  2:19 ` [PATCH bpf-next v4 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
  5 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-14  2:19 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Teach memcg to operate under trylock conditions when
spinning locks cannot be used.
The end result is that __memcg_kmem_charge_page() and
__memcg_kmem_uncharge_page() are safe to use from
any context in RT and !RT.
In !RT an NMI handler may fail to trylock stock_lock.
In RT hard IRQ and NMI handlers will not attempt to trylock.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 mm/memcontrol.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..e4c7049465e0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1756,7 +1756,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
  *
  * returns true if successful, false otherwise.
  */
-static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+			  gfp_t gfp_mask)
 {
 	struct memcg_stock_pcp *stock;
 	unsigned int stock_pages;
@@ -1766,7 +1767,11 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (nr_pages > MEMCG_CHARGE_BATCH)
 		return ret;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+		if (!gfpflags_allow_spinning(gfp_mask))
+			return ret;
+		local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	}
 
 	stock = this_cpu_ptr(&memcg_stock);
 	stock_pages = READ_ONCE(stock->nr_pages);
@@ -1851,7 +1856,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	unsigned long flags;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+		/*
+		 * In case of unlikely failure to lock percpu stock_lock
+		 * uncharge memcg directly.
+		 */
+		mem_cgroup_cancel_charge(memcg, nr_pages);
+		return;
+	}
 	__refill_stock(memcg, nr_pages);
 	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
@@ -2196,9 +2208,13 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long pflags;
 
 retry:
-	if (consume_stock(memcg, nr_pages))
+	if (consume_stock(memcg, nr_pages, gfp_mask))
 		return 0;
 
+	if (!gfpflags_allow_spinning(gfp_mask))
+		/* Avoid the refill and flush of the older stock */
+		batch = nr_pages;
+
 	if (!do_memsw_account() ||
 	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
 		if (page_counter_try_charge(&memcg->memory, batch, &counter))
-- 
2.43.5




* [PATCH bpf-next v4 5/6] mm, bpf: Use memcg in try_alloc_pages().
  2025-01-14  2:19 [PATCH bpf-next v4 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2025-01-14  2:19 ` [PATCH bpf-next v4 4/6] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
@ 2025-01-14  2:19 ` Alexei Starovoitov
  2025-01-14  2:19 ` [PATCH bpf-next v4 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
  5 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-14  2:19 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Unconditionally use __GFP_ACCOUNT in try_alloc_pages().
The caller is responsible for setting up the memcg correctly.
All BPF memory accounting is memcg based.
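
Caller-side sketch of the expected setup (a hypothetical snippet; map_memcg
and nid are placeholders, and the next patch does essentially this in
bpf_map_alloc_pages() under CONFIG_MEMCG):

struct mem_cgroup *old_memcg;
struct page *page;

old_memcg = set_active_memcg(map_memcg);
/* The page, if any, is charged to map_memcg via __GFP_ACCOUNT. */
page = try_alloc_pages(nid, 0);
set_active_memcg(old_memcg);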

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 mm/page_alloc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f967725898be..d80d4212c7c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7118,7 +7118,8 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
 	 * specify it here to highlight that try_alloc_pages()
 	 * doesn't want to deplete reserves.
 	 */
-	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
+			| __GFP_ACCOUNT;
 	unsigned int alloc_flags = ALLOC_TRYLOCK;
 	struct alloc_context ac = { };
 	struct page *page;
@@ -7161,6 +7162,11 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
 
 	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
 
+	if (memcg_kmem_online() && page &&
+	    unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
+		free_pages_nolock(page, order);
+		page = NULL;
+	}
 	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
 	kmsan_alloc_page(page, order, alloc_gfp);
 	return page;
-- 
2.43.5




* [PATCH bpf-next v4 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs.
  2025-01-14  2:19 [PATCH bpf-next v4 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2025-01-14  2:19 ` [PATCH bpf-next v4 5/6] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
@ 2025-01-14  2:19 ` Alexei Starovoitov
  5 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-14  2:19 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Use try_alloc_pages() and free_pages_nolock() in bpf_map_alloc_pages().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/syscall.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0daf098e3207..8bcf48e31a5a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -582,14 +582,14 @@ int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
 	old_memcg = set_active_memcg(memcg);
 #endif
 	for (i = 0; i < nr_pages; i++) {
-		pg = alloc_pages_node(nid, gfp | __GFP_ACCOUNT, 0);
+		pg = try_alloc_pages(nid, 0);
 
 		if (pg) {
 			pages[i] = pg;
 			continue;
 		}
 		for (j = 0; j < i; j++)
-			__free_page(pages[j]);
+			free_pages_nolock(pages[j], 0);
 		ret = -ENOMEM;
 		break;
 	}
-- 
2.43.5




* Re: [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-14  2:19 ` [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
@ 2025-01-14  9:53   ` Peter Zijlstra
  2025-01-14 10:19     ` Michal Hocko
  2025-01-14 10:31   ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2025-01-14  9:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs execute from tracepoints and kprobes where
> running context is unknown, but they need to request additional
> memory.

> The prior workarounds were using pre-allocated memory and
> BPF specific freelists to satisfy such allocation requests.
> Instead, introduce gfpflags_allow_spinning() condition that signals
> to the allocator that running context is unknown.
> Then rely on percpu free list of pages to allocate a page.
> The rmqueue_pcplist() should be able to pop the page from.
> If it fails (due to IRQ re-entrancy or list being empty) then
> try_alloc_pages() attempts to spin_trylock zone->lock
> and refill percpu freelist as normal.

> BPF program may execute with IRQs disabled and zone->lock is
> sleeping in RT, so trylock is the only option. 

how is spin_trylock() from IRQ context not utterly broken in RT?

It can end up trying to priority boost the idle thread, among other crazy
things.





* Re: [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-14  9:53   ` Peter Zijlstra
@ 2025-01-14 10:19     ` Michal Hocko
  2025-01-14 10:39       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2025-01-14 10:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, bpf, andrii, memxor, akpm, vbabka, bigeasy,
	rostedt, houtao1, hannes, shakeel.butt, willy, tglx, jannh, tj,
	linux-mm, kernel-team

On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> > 
> > Tracing BPF programs execute from tracepoints and kprobes where
> > running context is unknown, but they need to request additional
> > memory.
> 
> > The prior workarounds were using pre-allocated memory and
> > BPF specific freelists to satisfy such allocation requests.
> > Instead, introduce gfpflags_allow_spinning() condition that signals
> > to the allocator that running context is unknown.
> > Then rely on percpu free list of pages to allocate a page.
> > The rmqueue_pcplist() should be able to pop the page from.
> > If it fails (due to IRQ re-entrancy or list being empty) then
> > try_alloc_pages() attempts to spin_trylock zone->lock
> > and refill percpu freelist as normal.
> 
> > BPF program may execute with IRQs disabled and zone->lock is
> > sleeping in RT, so trylock is the only option. 
> 
> how is spin_trylock() from IRQ context not utterly broken in RT?

+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+		return NULL;

Deals with that, right?

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-14  2:19 ` [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
  2025-01-14  9:53   ` Peter Zijlstra
@ 2025-01-14 10:31   ` Michal Hocko
  2025-01-15  1:23     ` Alexei Starovoitov
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2025-01-14 10:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
	houtao1, hannes, shakeel.butt, willy, tglx, jannh, tj, linux-mm,
	kernel-team

On Mon 13-01-25 18:19:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs execute from tracepoints and kprobes where
> running context is unknown, but they need to request additional
> memory. The prior workarounds were using pre-allocated memory and
> BPF specific freelists to satisfy such allocation requests.
> Instead, introduce gfpflags_allow_spinning() condition that signals
> to the allocator that running context is unknown.
> Then rely on percpu free list of pages to allocate a page.
> The rmqueue_pcplist() should be able to pop the page from.
> If it fails (due to IRQ re-entrancy or list being empty) then
> try_alloc_pages() attempts to spin_trylock zone->lock
> and refill percpu freelist as normal.
> BPF program may execute with IRQs disabled and zone->lock is
> sleeping in RT, so trylock is the only option. In theory we can
> introduce percpu reentrance counter and increment it every time
> spin_lock_irqsave(&zone->lock, flags) is used, but we cannot rely
> on it. Even if this cpu is not in page_alloc path the
> spin_lock_irqsave() is not safe, since BPF prog might be called
> from tracepoint where preemption is disabled. So trylock only.
> 
> Note, free_page and memcg are not taught about gfpflags_allow_spinning()
> condition. The support comes in the next patches.
> 
> This is a first step towards supporting BPF requirements in SLUB
> and getting rid of bpf_mem_alloc.
> That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

LGTM, I am not entirely clear on the kmsan_alloc_page part though.
As long as that part is correct you can add
Acked-by: Michal Hocko <mhocko@suse.com>

Other than that, try_alloc_pages_noprof begs for some user documentation.

/**
 * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
 * @nid - node to allocate from
 * @order - allocation order size
 *
 * Allocates pages of a given order from the given node. This is safe to
 * call from any context (from atomic, NMI, and also from a reentrant
 * allocator -> tracepoint -> try_alloc_pages_noprof path).
 * Allocation is best effort and expected to fail easily, so nobody should
 * rely on its success. Failures are not reported via warn_alloc().
 *
 * Return: allocated page or NULL on failure.
 */
> +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> +{
> +	/*
> +	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> +	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> +	 * is not safe in arbitrary context.
> +	 *
> +	 * These two are the conditions for gfpflags_allow_spinning() being true.
> +	 *
> +	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
> +	 * to warn. Also warn would trigger printk() which is unsafe from
> +	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
> +	 * since the running context is unknown.
> +	 *
> +	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> +	 * is safe in any context. Also zeroing the page is mandatory for
> +	 * BPF use cases.
> +	 *
> +	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
> +	 * specify it here to highlight that try_alloc_pages()
> +	 * doesn't want to deplete reserves.
> +	 */
> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> +	unsigned int alloc_flags = ALLOC_TRYLOCK;
> +	struct alloc_context ac = { };
> +	struct page *page;
> +
> +	/*
> +	 * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
> +	 * If spin_trylock() is called from hard IRQ the current task may be
> +	 * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
> +	 * task as the owner of another rt_spin_lock which will confuse PI
> +	 * logic, so return immediately if called form hard IRQ or NMI.
> +	 *
> +	 * Note, irqs_disabled() case is ok. This function can be called
> +	 * from raw_spin_lock_irqsave region.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> +		return NULL;
> +	if (!pcp_allowed_order(order))
> +		return NULL;
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	/* Bailout, since try_to_accept_memory_one() needs to take a lock */
> +	if (has_unaccepted_memory())
> +		return NULL;
> +#endif
> +	/* Bailout, since _deferred_grow_zone() needs to take a lock */
> +	if (deferred_pages_enabled())
> +		return NULL;
> +
> +	if (nid == NUMA_NO_NODE)
> +		nid = numa_node_id();
> +
> +	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> +			    &alloc_gfp, &alloc_flags);
> +
> +	/*
> +	 * Best effort allocation from percpu free list.
> +	 * If it's empty attempt to spin_trylock zone->lock.
> +	 */
> +	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> +
> +	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> +
> +	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
> +	kmsan_alloc_page(page, order, alloc_gfp);
> +	return page;
> +}
> -- 
> 2.43.5
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH bpf-next v4 4/6] memcg: Use trylock to access memcg stock_lock.
  2025-01-14  2:19 ` [PATCH bpf-next v4 4/6] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
@ 2025-01-14 10:39   ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2025-01-14 10:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
	houtao1, hannes, shakeel.butt, willy, tglx, jannh, tj, linux-mm,
	kernel-team

On Mon 13-01-25 18:19:20, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Teach memcg to operate under trylock conditions when
> spinning locks cannot be used.
> The end result is __memcg_kmem_charge_page() and
> __memcg_kmem_uncharge_page() are safe to use from
> any context in RT and !RT.

> In !RT the NMI handler may fail to trylock stock_lock.
> In RT hard IRQ and NMI handlers will not attempt to trylock.

I believe this is a local_trylock_irqsave-specific thing that is not that
interesting for this particular code path. It is more useful to mention the
consequences. I would phrase it this way:

local_trylock might fail and this would lead to charge cache bypass if
the calling context doesn't allow spinning (gfpflags_allow_spinning).
In those cases we try to charge the memcg counter directly and fail
early if that is not possible. This might cause a premature charge
failure, but it will allow opportunistic charging that is safe from the
try_alloc_pages path.
 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..e4c7049465e0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1756,7 +1756,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>   *
>   * returns true if successful, false otherwise.
>   */
> -static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> +static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> +			  gfp_t gfp_mask)
>  {
>  	struct memcg_stock_pcp *stock;
>  	unsigned int stock_pages;
> @@ -1766,7 +1767,11 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>  	if (nr_pages > MEMCG_CHARGE_BATCH)
>  		return ret;
>  
> -	local_lock_irqsave(&memcg_stock.stock_lock, flags);
> +	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> +		if (!gfpflags_allow_spinning(gfp_mask))
> +			return ret;
> +		local_lock_irqsave(&memcg_stock.stock_lock, flags);
> +	}
>  
>  	stock = this_cpu_ptr(&memcg_stock);
>  	stock_pages = READ_ONCE(stock->nr_pages);
> @@ -1851,7 +1856,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
>  	unsigned long flags;
>  
> -	local_lock_irqsave(&memcg_stock.stock_lock, flags);
> +	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> +		/*
> +		 * In case of unlikely failure to lock percpu stock_lock
> +		 * uncharge memcg directly.
> +		 */
> +		mem_cgroup_cancel_charge(memcg, nr_pages);
> +		return;
> +	}
>  	__refill_stock(memcg, nr_pages);
>  	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
>  }
> @@ -2196,9 +2208,13 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	unsigned long pflags;
>  
>  retry:
> -	if (consume_stock(memcg, nr_pages))
> +	if (consume_stock(memcg, nr_pages, gfp_mask))
>  		return 0;
>  
> +	if (!gfpflags_allow_spinning(gfp_mask))
> +		/* Avoid the refill and flush of the older stock */
> +		batch = nr_pages;
> +
>  	if (!do_memsw_account() ||
>  	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
>  		if (page_counter_try_charge(&memcg->memory, batch, &counter))
> -- 
> 2.43.5

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-14 10:19     ` Michal Hocko
@ 2025-01-14 10:39       ` Peter Zijlstra
  2025-01-14 10:43         ` Michal Hocko
  2025-01-14 18:29         ` Alexei Starovoitov
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2025-01-14 10:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, bpf, andrii, memxor, akpm, vbabka, bigeasy,
	rostedt, houtao1, hannes, shakeel.butt, willy, tglx, jannh, tj,
	linux-mm, kernel-team

On Tue, Jan 14, 2025 at 11:19:41AM +0100, Michal Hocko wrote:
> On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> > On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > > 
> > > Tracing BPF programs execute from tracepoints and kprobes where
> > > running context is unknown, but they need to request additional
> > > memory.
> > 
> > > The prior workarounds were using pre-allocated memory and
> > > BPF specific freelists to satisfy such allocation requests.
> > > Instead, introduce gfpflags_allow_spinning() condition that signals
> > > to the allocator that running context is unknown.
> > > Then rely on percpu free list of pages to allocate a page.
> > > The rmqueue_pcplist() should be able to pop the page from.
> > > If it fails (due to IRQ re-entrancy or list being empty) then
> > > try_alloc_pages() attempts to spin_trylock zone->lock
> > > and refill percpu freelist as normal.
> > 
> > > BPF program may execute with IRQs disabled and zone->lock is
> > > sleeping in RT, so trylock is the only option. 
> > 
> > how is spin_trylock() from IRQ context not utterly broken in RT?
> 
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> +		return NULL;
> 
> Deals with that, right?

Changelog didn't really mention that, did it? -- it seems to imply quite
the opposite :/

But maybe, I suppose any BPF program needs to expect failure due to this
being trylock. I just worry some programs will malfunction due to never
succeeding -- and RT getting blamed for this.

Maybe I worry too much.





* Re: [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-14 10:39       ` Peter Zijlstra
@ 2025-01-14 10:43         ` Michal Hocko
  2025-01-14 18:29         ` Alexei Starovoitov
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2025-01-14 10:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, bpf, andrii, memxor, akpm, vbabka, bigeasy,
	rostedt, houtao1, hannes, shakeel.butt, willy, tglx, jannh, tj,
	linux-mm, kernel-team

On Tue 14-01-25 11:39:46, Peter Zijlstra wrote:
> On Tue, Jan 14, 2025 at 11:19:41AM +0100, Michal Hocko wrote:
> > On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> > > On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > 
> > > > Tracing BPF programs execute from tracepoints and kprobes where
> > > > running context is unknown, but they need to request additional
> > > > memory.
> > > 
> > > > The prior workarounds were using pre-allocated memory and
> > > > BPF specific freelists to satisfy such allocation requests.
> > > > Instead, introduce gfpflags_allow_spinning() condition that signals
> > > > to the allocator that running context is unknown.
> > > > Then rely on percpu free list of pages to allocate a page.
> > > > The rmqueue_pcplist() should be able to pop the page from.
> > > > If it fails (due to IRQ re-entrancy or list being empty) then
> > > > try_alloc_pages() attempts to spin_trylock zone->lock
> > > > and refill percpu freelist as normal.
> > > 
> > > > BPF program may execute with IRQs disabled and zone->lock is
> > > > sleeping in RT, so trylock is the only option. 
> > > 
> > > how is spin_trylock() from IRQ context not utterly broken in RT?
> > 
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> > +		return NULL;
> > 
> > Deals with that, right?
> 
> Changelog didn't really mention that, did it? -- it seems to imply quite
> the opposite :/

yes, one has to look into the implementation to see all the constraints,
and the changelog could/should be improved in that regard.

> But maybe, I suppose any BPF program needs to expect failure due to this
> being trylock. I just worry some programs will malfunction due to never
> succeeding -- and RT getting blamed for this.

This is a good question. The current implementation has fewer constraints
and there are no data points about the success rate with the new
implementation. But to be completely honest, I am not really sure how
much BPF is used on PREEMPT_RT systems and with RT workloads, so I am not
sure how much of a practical problem that is.

> 
> Maybe I worry too much.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-14 10:39       ` Peter Zijlstra
  2025-01-14 10:43         ` Michal Hocko
@ 2025-01-14 18:29         ` Alexei Starovoitov
  2025-01-14 18:34           ` Steven Rostedt
  1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-14 18:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Vlastimil Babka, Sebastian Sewior, Steven Rostedt,
	Hou Tao, Johannes Weiner, Shakeel Butt, Matthew Wilcox,
	Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team

On Tue, Jan 14, 2025 at 2:39 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jan 14, 2025 at 11:19:41AM +0100, Michal Hocko wrote:
> > On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> > > On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > Tracing BPF programs execute from tracepoints and kprobes where
> > > > running context is unknown, but they need to request additional
> > > > memory.
> > >
> > > > The prior workarounds were using pre-allocated memory and
> > > > BPF specific freelists to satisfy such allocation requests.
> > > > Instead, introduce gfpflags_allow_spinning() condition that signals
> > > > to the allocator that running context is unknown.
> > > > Then rely on percpu free list of pages to allocate a page.
> > > > The rmqueue_pcplist() should be able to pop the page from.
> > > > If it fails (due to IRQ re-entrancy or list being empty) then
> > > > try_alloc_pages() attempts to spin_trylock zone->lock
> > > > and refill percpu freelist as normal.
> > >
> > > > BPF program may execute with IRQs disabled and zone->lock is
> > > > sleeping in RT, so trylock is the only option.
> > >
> > > how is spin_trylock() from IRQ context not utterly broken in RT?
> >
> > +     if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> > +             return NULL;
> >
> > Deals with that, right?
>
> Changelog didn't really mention that, did it? -- it seems to imply quite
> the opposite :/

Hmm. Until you said it I didn't read it as "imply the opposite" :(

The cover letter is pretty clear...
"
- Since spin_trylock() is not safe in RT from hard IRQ and NMI
  disable such usage in lock_trylock and in try_alloc_pages().
"

and the patch 2 commit log is clear too...

"
Since spin_trylock() cannot be used in RT from hard IRQ or NMI
it uses lockless link list...
"

and further in patch 3 commit log...

"
Use spin_trylock in PREEMPT_RT when not in hard IRQ and not in NMI
and fail instantly otherwise, since spin_trylock is not safe from IRQ
due to PI issues.
"

I guess I can reword this particular sentence in patch 1 commit log,
but before jumping to an incorrect conclusion please read the
whole set.

> But maybe, I suppose any BPF program needs to expect failure due to this
> being trylock. I just worry some programs will malfunction due to never
> succeeding -- and RT getting blamed for this.
>
> Maybe I worry too much.

Humans will find a way to blame BPF and/or RT for all of their problems
anyway. Just days ago BPF was blamed in RT for causing IPIs during JIT.
Valentin's patches are going to address that but ain't noone has time
to explain that continuously.

Seriously, though, the number of things that still run in hard irq context
in RT is so small that if some tracing BPF prog is attached there
it should be using prealloc mode. Full prealloc is still
the default for bpf hash map.



* Re: [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-14 18:29         ` Alexei Starovoitov
@ 2025-01-14 18:34           ` Steven Rostedt
  0 siblings, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2025-01-14 18:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Michal Hocko, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Vlastimil Babka,
	Sebastian Sewior, Hou Tao, Johannes Weiner, Shakeel Butt,
	Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
	Kernel Team

On Tue, 14 Jan 2025 10:29:04 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> Seriously, though, the number of things that still run in hard irq context
> in RT is so small that if some tracing BPF prog is attached there
> it should be using prealloc mode. Full prealloc is still
> the default for bpf hash map.

The one thing to watch out for is hrtimer trace events. They will be called
in hard irq context even in RT. If a BPF program is attached to one of
them, then that could be an issue.
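
For example, something along these lines (hypothetical prog, the usual
vmlinux.h/libbpf boilerplate assumed) runs straight from the timer
interrupt, i.e. hard irq context even on RT:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("tp_btf/hrtimer_expire_entry")
int BPF_PROG(on_hrtimer_expire, struct hrtimer *timer, ktime_t *now)
{
        /* Any non-prealloc'd map update here can only be satisfied by
         * the trylock paths, so it has to tolerate failure. */
        return 0;
}

char LICENSE[] SEC("license") = "GPL";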

-- Steve


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-14 10:31   ` Michal Hocko
@ 2025-01-15  1:23     ` Alexei Starovoitov
  2025-01-15  8:35       ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-15  1:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
	Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
	Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
	Kernel Team

On Tue, Jan 14, 2025 at 2:31 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 13-01-25 18:19:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Tracing BPF programs execute from tracepoints and kprobes where
> > running context is unknown, but they need to request additional
> > memory. The prior workarounds were using pre-allocated memory and
> > BPF specific freelists to satisfy such allocation requests.
> > Instead, introduce gfpflags_allow_spinning() condition that signals
> > to the allocator that running context is unknown.
> > Then rely on the percpu free list of pages to allocate a page.
> > rmqueue_pcplist() should be able to pop a page from it.
> > If that fails (due to IRQ re-entrancy or the list being empty) then
> > try_alloc_pages() attempts to spin_trylock zone->lock
> > and refill percpu freelist as normal.
> > BPF program may execute with IRQs disabled and zone->lock is
> > sleeping in RT, so trylock is the only option. In theory we can
> > introduce percpu reentrance counter and increment it every time
> > spin_lock_irqsave(&zone->lock, flags) is used, but we cannot rely
> > on it. Even if this cpu is not in page_alloc path the
> > spin_lock_irqsave() is not safe, since BPF prog might be called
> > from tracepoint where preemption is disabled. So trylock only.
> >
> > Note, free_page and memcg are not taught about gfpflags_allow_spinning()
> > condition. The support comes in the next patches.
> >
> > This is a first step towards supporting BPF requirements in SLUB
> > and getting rid of bpf_mem_alloc.
> > That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> LGTM, I am not entirely clear on kmsan_alloc_page part though.

Which part is still confusing?
I hoped the comment below was enough:
   * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
   * is safe in any context. Also zeroing the page is mandatory for
   * BPF use cases.

and once you zoom into:
void kmsan_alloc_page(struct page *page, unsigned int order, gfp_t flags)
{
        bool initialized = (flags & __GFP_ZERO) || !kmsan_enabled;
...
        if (initialized) {
                __memset(page_address(shadow), 0, PAGE_SIZE * pages);
                __memset(page_address(origin), 0, PAGE_SIZE * pages);
                return;
        }

So it's safe to call it and it's necessary to call it when KMSAN is on.

This was the easiest code path to analyze from a doesn't-take-spinlocks point of view.
I feel the comment is enough.

If/when people want to support the !__GFP_ZERO case with KMSAN, they would
need to make stack_depot_save() work under the
!gfpflags_allow_spinning() condition.

Since __GFP_ZERO is necessary for the BPF use case, I left all
the extra work for future follow-ups.
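
On the caller side it boils down to roughly this (condensed, not the exact
diff, the page-grabbing helper name is a placeholder):

        gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO;   /* zeroing is needed for bpf anyway */
        struct page *page = pcp_or_zone_trylock_alloc(nid, order, alloc_gfp);  /* placeholder */

        if (page)
                /*
                 * With __GFP_ZERO set kmsan_alloc_page() takes the
                 * 'initialized' branch above: two __memset()s of the
                 * shadow/origin pages, no stack_depot_save(), no locks.
                 */
                kmsan_alloc_page(page, order, alloc_gfp);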

> As long as that part is correct you can add
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Other than that try_alloc_pages_noprof begs some user documentation.
>
> /**
>  * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
>  * @nid: node to allocate from
>  * @order: allocation order size
>  *
>  * Allocates pages of a given order from the given node. This is safe to
>  * call from any context (atomic, NMI, and even a reentrant
>  * allocator -> tracepoint -> try_alloc_pages_noprof).
>  * Allocation is best effort and expected to fail easily, so nobody should
>  * rely on its success. Failures are not reported via warn_alloc().
>  *
>  * Return: allocated page or NULL on failure.
>  */

Nicely worded. Will add.
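
And callers are expected to look roughly like this (sketch; assuming
free_pages_nolock() from patch 2 mirrors the __free_pages() signature,
use_the_page() is a hypothetical consumer):

        struct page *page = try_alloc_pages(nid, order);

        if (!page)              /* expected and common, no warn_alloc() splat */
                return -ENOMEM; /* fall back to preallocated memory or retry later */

        use_the_page(page);     /* the page comes back zeroed */
        free_pages_nolock(page, order);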

Thanks for all the reviews. Appreciate it!


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-15  1:23     ` Alexei Starovoitov
@ 2025-01-15  8:35       ` Michal Hocko
  2025-01-15 22:33         ` Alexei Starovoitov
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2025-01-15  8:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
	Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
	Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
	Kernel Team

On Tue 14-01-25 17:23:20, Alexei Starovoitov wrote:
> On Tue, Jan 14, 2025 at 2:31 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > LGTM, I am not entirely clear on kmsan_alloc_page part though.
> 
> Which part is still confusing?

It is not so much confusing as that I have no idea how the kmsan code is
supposed to work. Why do we even need to memset if __GFP_ZERO is used?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-01-15  8:35       ` Michal Hocko
@ 2025-01-15 22:33         ` Alexei Starovoitov
  0 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 22:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
	Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
	Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
	Kernel Team

On Wed, Jan 15, 2025 at 12:35 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 14-01-25 17:23:20, Alexei Starovoitov wrote:
> > On Tue, Jan 14, 2025 at 2:31 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > LGTM, I am not entirely clear on kmsan_alloc_page part though.
> >
> > Which part is still confusing?
>
> It is not so much confusing as that I have no idea how the kmsan code is
> supposed to work. Why do we even need to memset if __GFP_ZERO is used?

kmsan's main objective is to find uninitialized memory accesses.
So when the page is handed out already initialized, kmsan needs to notice that.
It's similar to the kasan_unpoison logic.
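
Conceptually (sketch, not from the patch, NULL checks omitted):

        struct page *p1 = alloc_page(GFP_ATOMIC);               /* KMSAN shadow: uninitialized */
        struct page *p2 = alloc_page(GFP_ATOMIC | __GFP_ZERO);  /* KMSAN shadow memset to 0    */
        u64 a = *(u64 *)page_address(p1);
        u64 b = *(u64 *)page_address(p2);

        if (a > b)      /* KMSAN reports a branch on uninitialized value (from p1) */
                pr_info("p1 contents are garbage\n");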


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-01-15 22:34 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-14  2:19 [PATCH bpf-next v4 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
2025-01-14  2:19 ` [PATCH bpf-next v4 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
2025-01-14  9:53   ` Peter Zijlstra
2025-01-14 10:19     ` Michal Hocko
2025-01-14 10:39       ` Peter Zijlstra
2025-01-14 10:43         ` Michal Hocko
2025-01-14 18:29         ` Alexei Starovoitov
2025-01-14 18:34           ` Steven Rostedt
2025-01-14 10:31   ` Michal Hocko
2025-01-15  1:23     ` Alexei Starovoitov
2025-01-15  8:35       ` Michal Hocko
2025-01-15 22:33         ` Alexei Starovoitov
2025-01-14  2:19 ` [PATCH bpf-next v4 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
2025-01-14  2:19 ` [PATCH bpf-next v4 3/6] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
2025-01-14  2:19 ` [PATCH bpf-next v4 4/6] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
2025-01-14 10:39   ` Michal Hocko
2025-01-14  2:19 ` [PATCH bpf-next v4 5/6] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
2025-01-14  2:19 ` [PATCH bpf-next v4 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox