* [PATCH bpf-next v5 0/7] bpf, mm: Introduce try_alloc_pages()
@ 2025-01-15 2:17 Alexei Starovoitov
2025-01-15 2:17 ` [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
` (6 more replies)
0 siblings, 7 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 2:17 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Hi All,
The main motivation is to make page and slab allocation reentrant and
to remove bpf_mem_alloc.
v4->v5:
- Fixed patch 1 and 4 commit logs and comments per Michal suggestions.
Added Acks.
- Added patch 6 to make failslab, kfence, kmemleak compliant
with trylock mode. It's a prerequisite for the reentrant slab patches.
v4:
https://lore.kernel.org/bpf/20250114021922.92609-1-alexei.starovoitov@gmail.com/
v3->v4:
Addressed feedback from Michal and Shakeel:
- GFP_TRYLOCK flag is gone. gfpflags_allow_spinning() is used instead.
- Improved comments and commit logs.
v3:
https://lore.kernel.org/bpf/20241218030720.1602449-1-alexei.starovoitov@gmail.com/
v2->v3:
To address the issues spotted by Sebastian, Vlastimil, Steven:
- Made GFP_TRYLOCK internal to mm/internal.h
try_alloc_pages() and free_pages_nolock() are the only interfaces.
- Since spin_trylock() is not safe in RT from hard IRQ and NMI,
disable such usage in local_trylock_irqsave() and in try_alloc_pages().
In such cases free_pages_nolock() falls back to the llist right away.
- Process trylock_free_pages llist when preemptible.
- Check for things like unaccepted memory and order <= 3 early.
- Don't call into __alloc_pages_slowpath() at all.
- Inspired by Vlastimil's 'struct local_tryirq_lock', adopted the approach
in local_lock_t. The extra 4 bytes in !RT local_lock_t shouldn't
affect any of the current local_lock_t users. This is patch 3.
- Tested with bpf selftests in RT and !RT and realized how much
more work is necessary on bpf side to play nice with RT.
The urgency of this work got higher. The alternative is to
convert bpf bits left and right to bpf_mem_alloc.
v2:
https://lore.kernel.org/bpf/20241210023936.46871-1-alexei.starovoitov@gmail.com/
v1->v2:
- fixed buggy try_alloc_pages_noprof() in PREEMPT_RT. Thanks Peter.
- optimized all paths by doing spin_trylock_irqsave() first
and only then checking for gfp_flags & __GFP_TRYLOCK,
then falling back to spin_lock_irqsave() in regular mode,
so the new gfp flag does not add performance overhead.
- patches 2-5 are new. They introduce lockless and/or trylock free_pages_nolock()
and memcg support. So it's in usable shape for bpf in patch 6.
v1:
https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/
Alexei Starovoitov (7):
mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
mm, bpf: Introduce free_pages_nolock()
locking/local_lock: Introduce local_trylock_irqsave()
memcg: Use trylock to access memcg stock_lock.
mm, bpf: Use memcg in try_alloc_pages().
mm: Make failslab, kfence, kmemleak aware of trylock mode
bpf: Use try_alloc_pages() to allocate pages for bpf needs.
include/linux/gfp.h | 23 ++++
include/linux/local_lock.h | 9 ++
include/linux/local_lock_internal.h | 76 ++++++++++--
include/linux/mm_types.h | 4 +
include/linux/mmzone.h | 3 +
kernel/bpf/syscall.c | 4 +-
mm/failslab.c | 3 +
mm/internal.h | 1 +
mm/kfence/core.c | 4 +
mm/kmemleak.c | 3 +
mm/memcontrol.c | 24 +++-
mm/page_alloc.c | 183 ++++++++++++++++++++++++++--
12 files changed, 313 insertions(+), 24 deletions(-)
--
2.43.5
* [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2025-01-15 2:17 [PATCH bpf-next v5 0/7] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
@ 2025-01-15 2:17 ` Alexei Starovoitov
2025-01-15 11:19 ` Vlastimil Babka
2025-01-17 18:19 ` Sebastian Andrzej Siewior
2025-01-15 2:17 ` [PATCH bpf-next v5 2/7] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
` (5 subsequent siblings)
6 siblings, 2 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 2:17 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Tracing BPF programs execute from tracepoints and kprobes where
running context is unknown, but they need to request additional
memory. The prior workarounds were using pre-allocated memory and
BPF specific freelists to satisfy such allocation requests.
Instead, introduce gfpflags_allow_spinning() condition that signals
to the allocator that running context is unknown.
Then rely on percpu free list of pages to allocate a page.
try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
rmqueue_pcplist() will spin_trylock to grab the page from percpu
free list. If it fails (due to re-entrancy or list being empty)
then rmqueue_bulk()/rmqueue_buddy() will attempt to
spin_trylock zone->lock and grab the page from there.
spin_trylock() is not safe in RT when in NMI or in hard IRQ.
Bail out early in such a case.
The support for gfpflags_allow_spinning() mode for free_page and memcg
comes in the next patches.
This is a first step towards supporting BPF requirements in SLUB
and getting rid of bpf_mem_alloc.
That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
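For illustration, a minimal caller-side sketch of the expected usage from
an unknown context (not part of this patch; the function name is made up):

static int demo_grab_page(void **out)
{
	struct page *page;

	/* Only trylocks are taken internally; NULL is a normal outcome. */
	page = try_alloc_pages(NUMA_NO_NODE, 0);
	if (!page)
		return -ENOMEM;

	*out = page_address(page);	/* the page arrives already zeroed */
	return 0;
}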
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/gfp.h | 22 ++++++++++
mm/internal.h | 1 +
mm/page_alloc.c | 98 +++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 118 insertions(+), 3 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b0fe9f62d15b..b41bb6e01781 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,6 +39,25 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
}
+static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
+{
+ /*
+ * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
+ * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
+ * All GFP_* flags including GFP_NOWAIT use one or both flags.
+ * try_alloc_pages() is the only API that doesn't specify either flag.
+ *
+ * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
+ * those are guaranteed to never block on a sleeping lock.
+ * Here we are enforcing that the allaaction doesn't ever spin
+ * on any locks (i.e. only trylocks). There is no highlevel
+ * GFP_$FOO flag for this use in try_alloc_pages() as the
+ * regular page allocator doesn't fully support this
+ * allocation mode.
+ */
+ return !(gfp_flags & __GFP_RECLAIM);
+}
+
#ifdef CONFIG_HIGHMEM
#define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
#else
@@ -347,6 +366,9 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
}
#define alloc_page_vma(...) alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
+struct page *try_alloc_pages_noprof(int nid, unsigned int order);
+#define try_alloc_pages(...) alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
+
extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
#define __get_free_pages(...) alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
diff --git a/mm/internal.h b/mm/internal.h
index cb8d8e8e3ffa..5454fa610aac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1174,6 +1174,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_NOFRAGMENT 0x0
#endif
#define ALLOC_HIGHATOMIC 0x200 /* Allows access to MIGRATE_HIGHATOMIC */
+#define ALLOC_TRYLOCK 0x400 /* Only use spin_trylock in allocation path */
#define ALLOC_KSWAPD 0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
/* Flags that allow allocations below the min watermark. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1cb4b8c8886d..74c2a7af1a77 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2304,7 +2304,11 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long flags;
int i;
- spin_lock_irqsave(&zone->lock, flags);
+ if (!spin_trylock_irqsave(&zone->lock, flags)) {
+ if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+ return 0;
+ spin_lock_irqsave(&zone->lock, flags);
+ }
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
alloc_flags);
@@ -2904,7 +2908,11 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
do {
page = NULL;
- spin_lock_irqsave(&zone->lock, flags);
+ if (!spin_trylock_irqsave(&zone->lock, flags)) {
+ if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+ return NULL;
+ spin_lock_irqsave(&zone->lock, flags);
+ }
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
if (!page) {
@@ -4509,7 +4517,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
might_alloc(gfp_mask);
- if (should_fail_alloc_page(gfp_mask, order))
+ if (!(*alloc_flags & ALLOC_TRYLOCK) &&
+ should_fail_alloc_page(gfp_mask, order))
return false;
*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
@@ -7023,3 +7032,86 @@ static bool __free_unaccepted(struct page *page)
}
#endif /* CONFIG_UNACCEPTED_MEMORY */
+
+/**
+ * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
+ * @nid: node to allocate from
+ * @order: allocation order size
+ *
+ * Allocates pages of a given order from the given node. This is safe to
+ * call from any context (from atomic, NMI, and also reentrant
+ * allocator -> tracepoint -> try_alloc_pages_noprof).
+ * Allocation is best effort and expected to fail easily, so nobody should
+ * rely on its success. Failures are not reported via warn_alloc().
+ *
+ * Return: allocated page or NULL on failure.
+ */
+struct page *try_alloc_pages_noprof(int nid, unsigned int order)
+{
+ /*
+ * Do not specify __GFP_DIRECT_RECLAIM, since direct reclaim is not allowed.
+ * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
+ * is not safe in arbitrary context.
+ *
+ * These two are the conditions for gfpflags_allow_spinning() being true.
+ *
+ * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+ * to warn. Also warn would trigger printk() which is unsafe from
+ * various contexts. We cannot use printk_deferred_enter() to mitigate,
+ * since the running context is unknown.
+ *
+ * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
+ * is safe in any context. Also zeroing the page is mandatory for
+ * BPF use cases.
+ *
+ * Though __GFP_NOMEMALLOC is not checked in the code path below,
+ * specify it here to highlight that try_alloc_pages()
+ * doesn't want to deplete reserves.
+ */
+ gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
+ unsigned int alloc_flags = ALLOC_TRYLOCK;
+ struct alloc_context ac = { };
+ struct page *page;
+
+ /*
+ * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
+ * If spin_trylock() is called from hard IRQ the current task may be
+ * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
+ * task as the owner of another rt_spin_lock which will confuse PI
+ * logic, so return immediately if called from hard IRQ or NMI.
+ *
+ * Note, irqs_disabled() case is ok. This function can be called
+ * from raw_spin_lock_irqsave region.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+ return NULL;
+ if (!pcp_allowed_order(order))
+ return NULL;
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ /* Bailout, since try_to_accept_memory_one() needs to take a lock */
+ if (has_unaccepted_memory())
+ return NULL;
+#endif
+ /* Bailout, since _deferred_grow_zone() needs to take a lock */
+ if (deferred_pages_enabled())
+ return NULL;
+
+ if (nid == NUMA_NO_NODE)
+ nid = numa_node_id();
+
+ prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
+ &alloc_gfp, &alloc_flags);
+
+ /*
+ * Best effort allocation from percpu free list.
+ * If it's empty attempt to spin_trylock zone->lock.
+ */
+ page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
+
+ /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
+
+ trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
+ kmsan_alloc_page(page, order, alloc_gfp);
+ return page;
+}
--
2.43.5
* [PATCH bpf-next v5 2/7] mm, bpf: Introduce free_pages_nolock()
2025-01-15 2:17 [PATCH bpf-next v5 0/7] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
2025-01-15 2:17 ` [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
@ 2025-01-15 2:17 ` Alexei Starovoitov
2025-01-15 11:47 ` Vlastimil Babka
2025-01-17 18:20 ` Sebastian Andrzej Siewior
2025-01-15 2:17 ` [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
` (4 subsequent siblings)
6 siblings, 2 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 2:17 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Introduce free_pages_nolock() that can free pages without taking locks.
It relies on trylock and can be called from any context.
Since spin_trylock() cannot be used in RT from hard IRQ or NMI
it uses a lockless linked list to stash the pages which will be freed
by subsequent free_pages() from good context.
Do not use llist unconditionally. BPF maps continuously
allocate/free, so we cannot unconditionally delay the freeing to
llist. When the memory becomes free make it available to the
kernel and BPF users right away if possible, and fall back to
llist as the last resort.
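For context, a hedged sketch of the intended pairing with try_alloc_pages()
from the previous patch (illustration only; the function name is made up):

static void demo_alloc_and_free(void)
{
	struct page *page;

	page = try_alloc_pages(NUMA_NO_NODE, 0);
	if (page)
		/* Safe from any context: trylock path or deferred llist free. */
		free_pages_nolock(page, 0);
}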
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/gfp.h | 1 +
include/linux/mm_types.h | 4 ++
include/linux/mmzone.h | 3 ++
mm/page_alloc.c | 79 ++++++++++++++++++++++++++++++++++++----
4 files changed, 79 insertions(+), 8 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b41bb6e01781..6eba2d80feb8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -391,6 +391,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
__get_free_pages((gfp_mask) | GFP_DMA, (order))
extern void __free_pages(struct page *page, unsigned int order);
+extern void free_pages_nolock(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
#define __free_page(page) __free_pages((page), 0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..52547b3e5fd8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,10 @@ struct page {
/* Or, free page */
struct list_head buddy_list;
struct list_head pcp_list;
+ struct {
+ struct llist_node pcp_llist;
+ unsigned int order;
+ };
};
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b36124145a16..1a854e0a9e3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -953,6 +953,9 @@ struct zone {
/* Primarily protects free_area */
spinlock_t lock;
+ /* Pages to be freed when next trylock succeeds */
+ struct llist_head trylock_free_pages;
+
/* Write-intensive fields used by compaction and vmstats. */
CACHELINE_PADDING(_pad2_);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 74c2a7af1a77..a9c639e3db91 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -88,6 +88,9 @@ typedef int __bitwise fpi_t;
*/
#define FPI_TO_TAIL ((__force fpi_t)BIT(1))
+/* Free the page without taking locks. Rely on trylock only. */
+#define FPI_TRYLOCK ((__force fpi_t)BIT(2))
+
/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1247,13 +1250,44 @@ static void split_large_buddy(struct zone *zone, struct page *page,
}
}
+static void add_page_to_zone_llist(struct zone *zone, struct page *page,
+ unsigned int order)
+{
+ /* Remember the order */
+ page->order = order;
+ /* Add the page to the free list */
+ llist_add(&page->pcp_llist, &zone->trylock_free_pages);
+}
+
static void free_one_page(struct zone *zone, struct page *page,
unsigned long pfn, unsigned int order,
fpi_t fpi_flags)
{
+ struct llist_head *llhead;
unsigned long flags;
- spin_lock_irqsave(&zone->lock, flags);
+ if (!spin_trylock_irqsave(&zone->lock, flags)) {
+ if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+ add_page_to_zone_llist(zone, page, order);
+ return;
+ }
+ spin_lock_irqsave(&zone->lock, flags);
+ }
+
+ /* The lock succeeded. Process deferred pages. */
+ llhead = &zone->trylock_free_pages;
+ if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {
+ struct llist_node *llnode;
+ struct page *p, *tmp;
+
+ llnode = llist_del_all(llhead);
+ llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
+ unsigned int p_order = p->order;
+
+ split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
+ __count_vm_events(PGFREE, 1 << p_order);
+ }
+ }
split_large_buddy(zone, page, pfn, order, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
@@ -2596,7 +2630,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
struct page *page, int migratetype,
- unsigned int order)
+ unsigned int order, fpi_t fpi_flags)
{
int high, batch;
int pindex;
@@ -2631,6 +2665,14 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
}
if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
pcp->free_count += (1 << order);
+
+ if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+ /*
+ * Do not attempt to take a zone lock. Let pcp->count get
+ * over high mark temporarily.
+ */
+ return;
+ }
high = nr_pcp_high(pcp, zone, batch, free_high);
if (pcp->count >= high) {
free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2645,7 +2687,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
/*
* Free a pcp page
*/
-void free_unref_page(struct page *page, unsigned int order)
+static void __free_unref_page(struct page *page, unsigned int order,
+ fpi_t fpi_flags)
{
unsigned long __maybe_unused UP_flags;
struct per_cpu_pages *pcp;
@@ -2654,7 +2697,7 @@ void free_unref_page(struct page *page, unsigned int order)
int migratetype;
if (!pcp_allowed_order(order)) {
- __free_pages_ok(page, order, FPI_NONE);
+ __free_pages_ok(page, order, fpi_flags);
return;
}
@@ -2671,24 +2714,33 @@ void free_unref_page(struct page *page, unsigned int order)
migratetype = get_pfnblock_migratetype(page, pfn);
if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
+ free_one_page(page_zone(page), page, pfn, order, fpi_flags);
return;
}
migratetype = MIGRATE_MOVABLE;
}
zone = page_zone(page);
+ if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq())) {
+ add_page_to_zone_llist(zone, page, order);
+ return;
+ }
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (pcp) {
- free_unref_page_commit(zone, pcp, page, migratetype, order);
+ free_unref_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
pcp_spin_unlock(pcp);
} else {
- free_one_page(zone, page, pfn, order, FPI_NONE);
+ free_one_page(zone, page, pfn, order, fpi_flags);
}
pcp_trylock_finish(UP_flags);
}
+void free_unref_page(struct page *page, unsigned int order)
+{
+ __free_unref_page(page, order, FPI_NONE);
+}
+
/*
* Free a batch of folios
*/
@@ -2777,7 +2829,7 @@ void free_unref_folios(struct folio_batch *folios)
trace_mm_page_free_batched(&folio->page);
free_unref_page_commit(zone, pcp, &folio->page, migratetype,
- order);
+ order, FPI_NONE);
}
if (pcp) {
@@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
}
EXPORT_SYMBOL(__free_pages);
+/*
+ * Can be called while holding raw_spin_lock or from IRQ and NMI,
+ * but only for pages that came from try_alloc_pages():
+ * order <= 3, !folio, etc
+ */
+void free_pages_nolock(struct page *page, unsigned int order)
+{
+ if (put_page_testzero(page))
+ __free_unref_page(page, order, FPI_TRYLOCK);
+}
+
void free_pages(unsigned long addr, unsigned int order)
{
if (addr != 0) {
--
2.43.5
* [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave()
2025-01-15 2:17 [PATCH bpf-next v5 0/7] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
2025-01-15 2:17 ` [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
2025-01-15 2:17 ` [PATCH bpf-next v5 2/7] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
@ 2025-01-15 2:17 ` Alexei Starovoitov
2025-01-15 2:23 ` Alexei Starovoitov
` (2 more replies)
2025-01-15 2:17 ` [PATCH bpf-next v5 4/7] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
` (3 subsequent siblings)
6 siblings, 3 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 2:17 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Similar to local_lock_irqsave() introduce local_trylock_irqsave().
This is inspired by 'struct local_tryirq_lock' in:
https://lore.kernel.org/all/20241112-slub-percpu-caches-v1-5-ddc0bdc27e05@suse.cz/
Use spin_trylock in PREEMPT_RT when not in hard IRQ and not in NMI
and fail instantly otherwise, since spin_trylock is not safe from IRQ
due to PI issues.
In !PREEMPT_RT use simple active flag to prevent IRQs or NMIs
reentering locked region.
Note there is no need to use local_inc for active flag.
If IRQ handler grabs the same local_lock after READ_ONCE(lock->active)
already completed it has to unlock it before returning.
Similar with NMI handler. So there is a strict nesting of scopes.
It's a per cpu lock. Multiple cpus do not access it in parallel.
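A hedged usage sketch of the new primitive (illustration only; names are
made up), showing the trylock-or-bail pattern the next patch applies to
the memcg stock_lock:

static DEFINE_PER_CPU(local_lock_t, demo_lock) = INIT_LOCAL_LOCK(demo_lock);

static bool demo_try_update(void)
{
	unsigned long flags;

	if (!local_trylock_irqsave(&demo_lock, flags))
		return false;	/* contended, or RT in hard IRQ/NMI */

	/* ... touch per-CPU state ... */

	local_unlock_irqrestore(&demo_lock, flags);
	return true;
}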
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/local_lock.h | 9 ++++
include/linux/local_lock_internal.h | 76 ++++++++++++++++++++++++++---
2 files changed, 78 insertions(+), 7 deletions(-)
diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 091dc0b6bdfb..84ee560c4f51 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -30,6 +30,15 @@
#define local_lock_irqsave(lock, flags) \
__local_lock_irqsave(lock, flags)
+/**
+ * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
+ * interrupts. Always fails in RT when in_hardirq or NMI.
+ * @lock: The lock variable
+ * @flags: Storage for interrupt flags
+ */
+#define local_trylock_irqsave(lock, flags) \
+ __local_trylock_irqsave(lock, flags)
+
/**
* local_unlock - Release a per CPU local lock
* @lock: The lock variable
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8dd71fbbb6d2..93672127c73d 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -9,6 +9,7 @@
#ifndef CONFIG_PREEMPT_RT
typedef struct {
+ int active;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
struct task_struct *owner;
@@ -22,7 +23,7 @@ typedef struct {
.wait_type_inner = LD_WAIT_CONFIG, \
.lock_type = LD_LOCK_PERCPU, \
}, \
- .owner = NULL,
+ .owner = NULL, .active = 0
static inline void local_lock_acquire(local_lock_t *l)
{
@@ -31,6 +32,13 @@ static inline void local_lock_acquire(local_lock_t *l)
l->owner = current;
}
+static inline void local_trylock_acquire(local_lock_t *l)
+{
+ lock_map_acquire_try(&l->dep_map);
+ DEBUG_LOCKS_WARN_ON(l->owner);
+ l->owner = current;
+}
+
static inline void local_lock_release(local_lock_t *l)
{
DEBUG_LOCKS_WARN_ON(l->owner != current);
@@ -45,6 +53,7 @@ static inline void local_lock_debug_init(local_lock_t *l)
#else /* CONFIG_DEBUG_LOCK_ALLOC */
# define LOCAL_LOCK_DEBUG_INIT(lockname)
static inline void local_lock_acquire(local_lock_t *l) { }
+static inline void local_trylock_acquire(local_lock_t *l) { }
static inline void local_lock_release(local_lock_t *l) { }
static inline void local_lock_debug_init(local_lock_t *l) { }
#endif /* !CONFIG_DEBUG_LOCK_ALLOC */
@@ -60,6 +69,7 @@ do { \
0, LD_WAIT_CONFIG, LD_WAIT_INV, \
LD_LOCK_PERCPU); \
local_lock_debug_init(lock); \
+ (lock)->active = 0; \
} while (0)
#define __spinlock_nested_bh_init(lock) \
@@ -75,37 +85,73 @@ do { \
#define __local_lock(lock) \
do { \
+ local_lock_t *l; \
preempt_disable(); \
- local_lock_acquire(this_cpu_ptr(lock)); \
+ l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 0); \
+ WRITE_ONCE(l->active, 1); \
+ local_lock_acquire(l); \
} while (0)
#define __local_lock_irq(lock) \
do { \
+ local_lock_t *l; \
local_irq_disable(); \
- local_lock_acquire(this_cpu_ptr(lock)); \
+ l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 0); \
+ WRITE_ONCE(l->active, 1); \
+ local_lock_acquire(l); \
} while (0)
#define __local_lock_irqsave(lock, flags) \
do { \
+ local_lock_t *l; \
local_irq_save(flags); \
- local_lock_acquire(this_cpu_ptr(lock)); \
+ l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 0); \
+ WRITE_ONCE(l->active, 1); \
+ local_lock_acquire(l); \
} while (0)
+#define __local_trylock_irqsave(lock, flags) \
+ ({ \
+ local_lock_t *l; \
+ local_irq_save(flags); \
+ l = this_cpu_ptr(lock); \
+ if (READ_ONCE(l->active) == 1) { \
+ local_irq_restore(flags); \
+ l = NULL; \
+ } else { \
+ WRITE_ONCE(l->active, 1); \
+ local_trylock_acquire(l); \
+ } \
+ !!l; \
+ })
+
#define __local_unlock(lock) \
do { \
- local_lock_release(this_cpu_ptr(lock)); \
+ local_lock_t *l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 1); \
+ WRITE_ONCE(l->active, 0); \
+ local_lock_release(l); \
preempt_enable(); \
} while (0)
#define __local_unlock_irq(lock) \
do { \
- local_lock_release(this_cpu_ptr(lock)); \
+ local_lock_t *l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 1); \
+ WRITE_ONCE(l->active, 0); \
+ local_lock_release(l); \
local_irq_enable(); \
} while (0)
#define __local_unlock_irqrestore(lock, flags) \
do { \
- local_lock_release(this_cpu_ptr(lock)); \
+ local_lock_t *l = this_cpu_ptr(lock); \
+ lockdep_assert(l->active == 1); \
+ WRITE_ONCE(l->active, 0); \
+ local_lock_release(l); \
local_irq_restore(flags); \
} while (0)
@@ -148,6 +194,22 @@ typedef spinlock_t local_lock_t;
__local_lock(lock); \
} while (0)
+#define __local_trylock_irqsave(lock, flags) \
+ ({ \
+ __label__ out; \
+ int ret = 0; \
+ typecheck(unsigned long, flags); \
+ flags = 0; \
+ if (in_nmi() || in_hardirq()) \
+ goto out; \
+ migrate_disable(); \
+ ret = spin_trylock(this_cpu_ptr((lock))); \
+ if (!ret) \
+ migrate_enable(); \
+ out: \
+ ret; \
+ })
+
#define __local_unlock(__lock) \
do { \
spin_unlock(this_cpu_ptr((__lock))); \
--
2.43.5
* [PATCH bpf-next v5 4/7] memcg: Use trylock to access memcg stock_lock.
2025-01-15 2:17 [PATCH bpf-next v5 0/7] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
` (2 preceding siblings ...)
2025-01-15 2:17 ` [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
@ 2025-01-15 2:17 ` Alexei Starovoitov
2025-01-15 16:07 ` Vlastimil Babka
2025-01-16 0:12 ` Shakeel Butt
2025-01-15 2:17 ` [PATCH bpf-next v5 5/7] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
` (2 subsequent siblings)
6 siblings, 2 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 2:17 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Teach memcg to operate under trylock conditions when spinning locks
cannot be used.
local_trylock might fail and this would lead to charge cache bypass if
the calling context doesn't allow spinning (gfpflags_allow_spinning).
In those cases charge the memcg counter directly and fail early if
that is not possible. This might cause a premature charge failure,
but it will allow opportunistic charging that is safe from the
try_alloc_pages() path.
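The resulting locking decision in consume_stock() can be distilled as
(illustration only, simplified from the hunk below):

	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
		/* Contended: spin only if the calling context allows it. */
		if (!gfpflags_allow_spinning(gfp_mask))
			return ret;	/* bypass the per-CPU stock */
		local_lock_irqsave(&memcg_stock.stock_lock, flags);
	}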
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
mm/memcontrol.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..e4c7049465e0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1756,7 +1756,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
*
* returns true if successful, false otherwise.
*/
-static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+ gfp_t gfp_mask)
{
struct memcg_stock_pcp *stock;
unsigned int stock_pages;
@@ -1766,7 +1767,11 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
if (nr_pages > MEMCG_CHARGE_BATCH)
return ret;
- local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+ if (!gfpflags_allow_spinning(gfp_mask))
+ return ret;
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ }
stock = this_cpu_ptr(&memcg_stock);
stock_pages = READ_ONCE(stock->nr_pages);
@@ -1851,7 +1856,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
unsigned long flags;
- local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+ /*
+ * In case of unlikely failure to lock percpu stock_lock
+ * uncharge memcg directly.
+ */
+ mem_cgroup_cancel_charge(memcg, nr_pages);
+ return;
+ }
__refill_stock(memcg, nr_pages);
local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}
@@ -2196,9 +2208,13 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned long pflags;
retry:
- if (consume_stock(memcg, nr_pages))
+ if (consume_stock(memcg, nr_pages, gfp_mask))
return 0;
+ if (!gfpflags_allow_spinning(gfp_mask))
+ /* Avoid the refill and flush of the older stock */
+ batch = nr_pages;
+
if (!do_memsw_account() ||
page_counter_try_charge(&memcg->memsw, batch, &counter)) {
if (page_counter_try_charge(&memcg->memory, batch, &counter))
--
2.43.5
* [PATCH bpf-next v5 5/7] mm, bpf: Use memcg in try_alloc_pages().
2025-01-15 2:17 [PATCH bpf-next v5 0/7] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
` (3 preceding siblings ...)
2025-01-15 2:17 ` [PATCH bpf-next v5 4/7] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
@ 2025-01-15 2:17 ` Alexei Starovoitov
2025-01-15 17:51 ` Vlastimil Babka
2025-01-16 0:24 ` Shakeel Butt
2025-01-15 2:17 ` [PATCH bpf-next v5 6/7] mm: Make failslab, kfence, kmemleak aware of trylock mode Alexei Starovoitov
2025-01-15 2:17 ` [PATCH bpf-next v5 7/7] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
6 siblings, 2 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 2:17 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Unconditionally use __GFP_ACCOUNT in try_alloc_pages().
The caller is responsible for setting up the memcg correctly.
All BPF memory accounting is memcg based.
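A sketch of what that caller responsibility looks like in practice,
mirroring bpf_map_alloc_pages() as updated in patch 7 (helper names are
those in kernel/bpf/syscall.c; 'map' and 'nid' come from the surrounding
function, error handling omitted):

	struct mem_cgroup *memcg, *old_memcg;
	struct page *page;

	memcg = bpf_map_get_memcg(map);
	old_memcg = set_active_memcg(memcg);
	page = try_alloc_pages(nid, 0);	/* charged to 'memcg' via __GFP_ACCOUNT */
	set_active_memcg(old_memcg);
	mem_cgroup_put(memcg);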
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
mm/page_alloc.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9c639e3db91..c87fd6cc3909 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7131,7 +7131,8 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
* specify it here to highlight that try_alloc_pages()
* doesn't want to deplete reserves.
*/
- gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
+ gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
+ | __GFP_ACCOUNT;
unsigned int alloc_flags = ALLOC_TRYLOCK;
struct alloc_context ac = { };
struct page *page;
@@ -7174,6 +7175,11 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
+ if (memcg_kmem_online() && page &&
+ unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
+ free_pages_nolock(page, order);
+ page = NULL;
+ }
trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
kmsan_alloc_page(page, order, alloc_gfp);
return page;
--
2.43.5
* [PATCH bpf-next v5 6/7] mm: Make failslab, kfence, kmemleak aware of trylock mode
2025-01-15 2:17 [PATCH bpf-next v5 0/7] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
` (4 preceding siblings ...)
2025-01-15 2:17 ` [PATCH bpf-next v5 5/7] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
@ 2025-01-15 2:17 ` Alexei Starovoitov
2025-01-15 17:57 ` Vlastimil Babka
2025-01-15 2:17 ` [PATCH bpf-next v5 7/7] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
6 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 2:17 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
When gfpflags_allow_spinning() == false, spinlocks cannot be taken.
Make failslab, kfence, kmemleak compliant.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
mm/failslab.c | 3 +++
mm/kfence/core.c | 4 ++++
mm/kmemleak.c | 3 +++
3 files changed, 10 insertions(+)
diff --git a/mm/failslab.c b/mm/failslab.c
index c3901b136498..86c7304ef25a 100644
--- a/mm/failslab.c
+++ b/mm/failslab.c
@@ -27,6 +27,9 @@ int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
if (gfpflags & __GFP_NOFAIL)
return 0;
+ if (!gfpflags_allow_spinning(gfpflags))
+ return 0;
+
if (failslab.ignore_gfp_reclaim &&
(gfpflags & __GFP_DIRECT_RECLAIM))
return 0;
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 67fc321db79b..e5f2d63f3220 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -1096,6 +1096,10 @@ void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
if (s->flags & SLAB_SKIP_KFENCE)
return NULL;
+ /* Bailout, since kfence_guarded_alloc() needs to take a lock */
+ if (!gfpflags_allow_spinning(flags))
+ return NULL;
+
allocation_gate = atomic_inc_return(&kfence_allocation_gate);
if (allocation_gate > 1)
return NULL;
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 2a945c07ae99..64cb44948e9e 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -648,6 +648,9 @@ static struct kmemleak_object *__alloc_object(gfp_t gfp)
{
struct kmemleak_object *object;
+ if (!gfpflags_allow_spinning(gfp))
+ return NULL;
+
object = mem_pool_alloc(gfp);
if (!object) {
pr_warn("Cannot allocate a kmemleak_object structure\n");
--
2.43.5
* [PATCH bpf-next v5 7/7] bpf: Use try_alloc_pages() to allocate pages for bpf needs.
2025-01-15 2:17 [PATCH bpf-next v5 0/7] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
` (5 preceding siblings ...)
2025-01-15 2:17 ` [PATCH bpf-next v5 6/7] mm: Make failslab, kfence, kmemleak aware of trylock mode Alexei Starovoitov
@ 2025-01-15 2:17 ` Alexei Starovoitov
2025-01-15 18:02 ` Vlastimil Babka
6 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 2:17 UTC (permalink / raw)
To: bpf
Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Use try_alloc_pages() and free_pages_nolock() to allocate and free pages
for bpf needs.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/bpf/syscall.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0daf098e3207..8bcf48e31a5a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -582,14 +582,14 @@ int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
old_memcg = set_active_memcg(memcg);
#endif
for (i = 0; i < nr_pages; i++) {
- pg = alloc_pages_node(nid, gfp | __GFP_ACCOUNT, 0);
+ pg = try_alloc_pages(nid, 0);
if (pg) {
pages[i] = pg;
continue;
}
for (j = 0; j < i; j++)
- __free_page(pages[j]);
+ free_pages_nolock(pages[j], 0);
ret = -ENOMEM;
break;
}
--
2.43.5
* Re: [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave()
2025-01-15 2:17 ` [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
@ 2025-01-15 2:23 ` Alexei Starovoitov
2025-01-15 7:22 ` Sebastian Sewior
2025-01-15 14:22 ` Vlastimil Babka
2025-01-17 20:33 ` Sebastian Andrzej Siewior
2 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 2:23 UTC (permalink / raw)
To: bpf
Cc: Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
Tejun Heo, linux-mm, Kernel Team
On Tue, Jan 14, 2025 at 6:18 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Similar to local_lock_irqsave() introduce local_trylock_irqsave().
> This is inspired by 'struct local_tryirq_lock' in:
> https://lore.kernel.org/all/20241112-slub-percpu-caches-v1-5-ddc0bdc27e05@suse.cz/
>
> Use spin_trylock in PREEMPT_RT when not in hard IRQ and not in NMI
> and fail instantly otherwise, since spin_trylock is not safe from IRQ
> due to PI issues.
>
> In !PREEMPT_RT use simple active flag to prevent IRQs or NMIs
> reentering locked region.
>
> Note there is no need to use local_inc for active flag.
> If IRQ handler grabs the same local_lock after READ_ONCE(lock->active)
> already completed it has to unlock it before returning.
> Similar with NMI handler. So there is a strict nesting of scopes.
> It's a per cpu lock. Multiple cpus do not access it in parallel.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> include/linux/local_lock.h | 9 ++++
> include/linux/local_lock_internal.h | 76 ++++++++++++++++++++++++++---
> 2 files changed, 78 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
> index 091dc0b6bdfb..84ee560c4f51 100644
> --- a/include/linux/local_lock.h
> +++ b/include/linux/local_lock.h
> @@ -30,6 +30,15 @@
> #define local_lock_irqsave(lock, flags) \
> __local_lock_irqsave(lock, flags)
>
> +/**
> + * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
> + * interrupts. Always fails in RT when in_hardirq or NMI.
> + * @lock: The lock variable
> + * @flags: Storage for interrupt flags
> + */
> +#define local_trylock_irqsave(lock, flags) \
> + __local_trylock_irqsave(lock, flags)
> +
> /**
> * local_unlock - Release a per CPU local lock
> * @lock: The lock variable
> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> index 8dd71fbbb6d2..93672127c73d 100644
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -9,6 +9,7 @@
> #ifndef CONFIG_PREEMPT_RT
>
> typedef struct {
> + int active;
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> struct lockdep_map dep_map;
> struct task_struct *owner;
> @@ -22,7 +23,7 @@ typedef struct {
> .wait_type_inner = LD_WAIT_CONFIG, \
> .lock_type = LD_LOCK_PERCPU, \
> }, \
> - .owner = NULL,
> + .owner = NULL, .active = 0
Sebastian,
could you please review/ack this patch ?
I looked through all current users of local_lock and the extra active
flag will be in the noise in all cases.
So I don't see any runtime/memory concerns,
while the extra lockdep_assert checks to catch reentrance issues
in RT and !RT will certainly help long term.
* Re: [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave()
2025-01-15 2:23 ` Alexei Starovoitov
@ 2025-01-15 7:22 ` Sebastian Sewior
0 siblings, 0 replies; 37+ messages in thread
From: Sebastian Sewior @ 2025-01-15 7:22 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Vlastimil Babka, Steven Rostedt, Hou Tao,
Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
On 2025-01-14 18:23:21 [-0800], Alexei Starovoitov wrote:
>
> Sebastian,
>
> could you please review/ack this patch ?
I will look at this (and the other things sent) tomorrow, or FRI at the
latest.
Sebastian
* Re: [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2025-01-15 2:17 ` [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
@ 2025-01-15 11:19 ` Vlastimil Babka
2025-01-15 23:00 ` Alexei Starovoitov
2025-01-15 23:16 ` Shakeel Butt
2025-01-17 18:19 ` Sebastian Andrzej Siewior
1 sibling, 2 replies; 37+ messages in thread
From: Vlastimil Babka @ 2025-01-15 11:19 UTC (permalink / raw)
To: Alexei Starovoitov, bpf
Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On 1/15/25 03:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Tracing BPF programs execute from tracepoints and kprobes where
> running context is unknown, but they need to request additional
> memory. The prior workarounds were using pre-allocated memory and
> BPF specific freelists to satisfy such allocation requests.
> Instead, introduce gfpflags_allow_spinning() condition that signals
> to the allocator that running context is unknown.
> Then rely on percpu free list of pages to allocate a page.
> try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
> rmqueue_pcplist() will spin_trylock to grab the page from percpu
> free list. If it fails (due to re-entrancy or list being empty)
> then rmqueue_bulk()/rmqueue_buddy() will attempt to
> spin_trylock zone->lock and grab the page from there.
> spin_trylock() is not safe in RT when in NMI or in hard IRQ.
> Bail out early in such a case.
>
> The support for gfpflags_allow_spinning() mode for free_page and memcg
> comes in the next patches.
>
> This is a first step towards supporting BPF requirements in SLUB
> and getting rid of bpf_mem_alloc.
> That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Some nits below:
> ---
> include/linux/gfp.h | 22 ++++++++++
> mm/internal.h | 1 +
> mm/page_alloc.c | 98 +++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 118 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index b0fe9f62d15b..b41bb6e01781 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -39,6 +39,25 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
> }
>
> +static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
> +{
> + /*
> + * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
> + * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
> + * All GFP_* flags including GFP_NOWAIT use one or both flags.
> + * try_alloc_pages() is the only API that doesn't specify either flag.
> + *
> + * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
> + * those are guaranteed to never block on a sleeping lock.
> + * Here we are enforcing that the allaaction doesn't ever spin
allocation
> + * on any locks (i.e. only trylocks). There is no highlevel
> + * GFP_$FOO flag for this use in try_alloc_pages() as the
> + * regular page allocator doesn't fully support this
> + * allocation mode.
> + */
> + return !(gfp_flags & __GFP_RECLAIM);
> +}
This function seems unused, guess the following patches will use.
> @@ -4509,7 +4517,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>
> might_alloc(gfp_mask);
>
> - if (should_fail_alloc_page(gfp_mask, order))
> + if (!(*alloc_flags & ALLOC_TRYLOCK) &&
> + should_fail_alloc_page(gfp_mask, order))
Is it because should_fail_alloc_page() might take some lock or whatnot?
Maybe comment?
> return false;
>
> *alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
> @@ -7023,3 +7032,86 @@ static bool __free_unaccepted(struct page *page)
> }
>
> #endif /* CONFIG_UNACCEPTED_MEMORY */
> +
> +/**
> + * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
> + * @nid: node to allocate from
> + * @order: allocation order size
> + *
> + * Allocates pages of a given order from the given node. This is safe to
> + * call from any context (from atomic, NMI, and also reentrant
> + * allocator -> tracepoint -> try_alloc_pages_noprof).
> + * Allocation is best effort and expected to fail easily, so nobody should
> + * rely on its success. Failures are not reported via warn_alloc().
> + *
> + * Return: allocated page or NULL on failure.
> + */
> +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> +{
> + /*
> + * Do not specify __GFP_DIRECT_RECLAIM, since direct reclaim is not allowed.
> + * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> + * is not safe in arbitrary context.
> + *
> + * These two are the conditions for gfpflags_allow_spinning() being true.
> + *
> + * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
> + * to warn. Also warn would trigger printk() which is unsafe from
> + * various contexts. We cannot use printk_deferred_enter() to mitigate,
> + * since the running context is unknown.
> + *
> + * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> + * is safe in any context. Also zeroing the page is mandatory for
> + * BPF use cases.
> + *
> + * Though __GFP_NOMEMALLOC is not checked in the code path below,
> + * specify it here to highlight that try_alloc_pages()
> + * doesn't want to deplete reserves.
> + */
> + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> + unsigned int alloc_flags = ALLOC_TRYLOCK;
> + struct alloc_context ac = { };
> + struct page *page;
> +
> + /*
> + * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
> + * If spin_trylock() is called from hard IRQ the current task may be
> + * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
> + * task as the owner of another rt_spin_lock which will confuse PI
> + * logic, so return immediately if called from hard IRQ or NMI.
> + *
> + * Note, irqs_disabled() case is ok. This function can be called
> + * from raw_spin_lock_irqsave region.
> + */
> + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> + return NULL;
> + if (!pcp_allowed_order(order))
> + return NULL;
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> + /* Bailout, since try_to_accept_memory_one() needs to take a lock */
> + if (has_unaccepted_memory())
> + return NULL;
> +#endif
> + /* Bailout, since _deferred_grow_zone() needs to take a lock */
> + if (deferred_pages_enabled())
> + return NULL;
Is it fine for BPF that things will fail to allocate until all memory is
deferred-initialized and accepted? I guess it's easy to teach those places
later to evaluate if they can take the lock.
> +
> + if (nid == NUMA_NO_NODE)
> + nid = numa_node_id();
> +
> + prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> + &alloc_gfp, &alloc_flags);
> +
> + /*
> + * Best effort allocation from percpu free list.
> + * If it's empty attempt to spin_trylock zone->lock.
> + */
> + page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
What about set_page_owner() from post_alloc_hook() and its stackdepot
saving? I guess it's not an issue until try_alloc_pages() gets used later, so
just a mental note that it has to be resolved before then. Or is it actually safe?
> +
> + /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> +
> + trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
> + kmsan_alloc_page(page, order, alloc_gfp);
> + return page;
> +}
* Re: [PATCH bpf-next v5 2/7] mm, bpf: Introduce free_pages_nolock()
2025-01-15 2:17 ` [PATCH bpf-next v5 2/7] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
@ 2025-01-15 11:47 ` Vlastimil Babka
2025-01-15 23:15 ` Alexei Starovoitov
2025-01-17 18:20 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2025-01-15 11:47 UTC (permalink / raw)
To: Alexei Starovoitov, bpf
Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On 1/15/25 03:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce free_pages_nolock() that can free pages without taking locks.
> It relies on trylock and can be called from any context.
> Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> it uses a lockless linked list to stash the pages which will be freed
> by subsequent free_pages() from good context.
>
> Do not use llist unconditionally. BPF maps continuously
> allocate/free, so we cannot unconditionally delay the freeing to
> llist. When the memory becomes free make it available to the
> kernel and BPF users right away if possible, and fall back to
> llist as the last resort.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
With:
> @@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
> }
> EXPORT_SYMBOL(__free_pages);
>
> +/*
> + * Can be called while holding raw_spin_lock or from IRQ and NMI,
> + * but only for pages that came from try_alloc_pages():
> + * order <= 3, !folio, etc
I think order > 3 is fine, as !pcp_allowed_order() case is handled too? And
what does "!folio" mean?
> + */
> +void free_pages_nolock(struct page *page, unsigned int order)
> +{
> + if (put_page_testzero(page))
> + __free_unref_page(page, order, FPI_TRYLOCK);
Hmm this will reach reset_page_owner() and thus stackdepot so same mental
note as for patch 1.
> +}
> +
> void free_pages(unsigned long addr, unsigned int order)
> {
> if (addr != 0) {
* Re: [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave()
2025-01-15 2:17 ` [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
2025-01-15 2:23 ` Alexei Starovoitov
@ 2025-01-15 14:22 ` Vlastimil Babka
2025-01-16 2:20 ` Alexei Starovoitov
2025-01-17 20:33 ` Sebastian Andrzej Siewior
2 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2025-01-15 14:22 UTC (permalink / raw)
To: Alexei Starovoitov, bpf
Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On 1/15/25 03:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Similar to local_lock_irqsave() introduce local_trylock_irqsave().
> This is inspired by 'struct local_tryirq_lock' in:
> https://lore.kernel.org/all/20241112-slub-percpu-caches-v1-5-ddc0bdc27e05@suse.cz/
Let's see what locking maintainers say about adding the flag to every
local_lock even if it doesn't use the trylock operation.
> Use spin_trylock in PREEMPT_RT when not in hard IRQ and not in NMI
> and fail instantly otherwise, since spin_trylock is not safe from IRQ
> due to PI issues.
>
> In !PREEMPT_RT use simple active flag to prevent IRQs or NMIs
> reentering locked region.
>
> Note there is no need to use local_inc for active flag.
> If IRQ handler grabs the same local_lock after READ_ONCE(lock->active)
IRQ handler AFAICS can't do that since __local_trylock_irqsave() is the only
trylock operation and it still does local_irq_save(). Could you have added a
__local_trylock() operation instead? Guess not for your use case because I
see in Patch 4 you want to use the local_unlock_irqrestore() universally for
sections that are earlier locked either by local_trylock_irqsave() or
local_lock_irqsave(). But I wonder if those can be changed (will reply on
that patch).
The motivation in my case was to avoid the overhead of irqsave/restore on
!PREEMPT_RT. If there was a separate "flavor" of local_lock that would
support the trylock operation, I think it would not need the _irq and
_irqsave variants at all, and it would also avoid adding the "active" flag
on !PREEMPT_RT. Meanwhile on PREEMPT_RT, a single implementation could
likely handle both flavors with no downsides?
> already completed it has to unlock it before returning.
> Similar with NMI handler. So there is a strict nesting of scopes.
> It's a per cpu lock. Multiple cpus do not access it in parallel.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> include/linux/local_lock.h | 9 ++++
> include/linux/local_lock_internal.h | 76 ++++++++++++++++++++++++++---
> 2 files changed, 78 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
> index 091dc0b6bdfb..84ee560c4f51 100644
> --- a/include/linux/local_lock.h
> +++ b/include/linux/local_lock.h
> @@ -30,6 +30,15 @@
> #define local_lock_irqsave(lock, flags) \
> __local_lock_irqsave(lock, flags)
>
> +/**
> + * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
> + * interrupts. Always fails in RT when in_hardirq or NMI.
> + * @lock: The lock variable
> + * @flags: Storage for interrupt flags
> + */
> +#define local_trylock_irqsave(lock, flags) \
> + __local_trylock_irqsave(lock, flags)
> +
> /**
> * local_unlock - Release a per CPU local lock
> * @lock: The lock variable
> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> index 8dd71fbbb6d2..93672127c73d 100644
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -9,6 +9,7 @@
> #ifndef CONFIG_PREEMPT_RT
>
> typedef struct {
> + int active;
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> struct lockdep_map dep_map;
> struct task_struct *owner;
> @@ -22,7 +23,7 @@ typedef struct {
> .wait_type_inner = LD_WAIT_CONFIG, \
> .lock_type = LD_LOCK_PERCPU, \
> }, \
> - .owner = NULL,
> + .owner = NULL, .active = 0
>
> static inline void local_lock_acquire(local_lock_t *l)
> {
> @@ -31,6 +32,13 @@ static inline void local_lock_acquire(local_lock_t *l)
> l->owner = current;
> }
>
> +static inline void local_trylock_acquire(local_lock_t *l)
> +{
> + lock_map_acquire_try(&l->dep_map);
> + DEBUG_LOCKS_WARN_ON(l->owner);
> + l->owner = current;
> +}
> +
> static inline void local_lock_release(local_lock_t *l)
> {
> DEBUG_LOCKS_WARN_ON(l->owner != current);
> @@ -45,6 +53,7 @@ static inline void local_lock_debug_init(local_lock_t *l)
> #else /* CONFIG_DEBUG_LOCK_ALLOC */
> # define LOCAL_LOCK_DEBUG_INIT(lockname)
> static inline void local_lock_acquire(local_lock_t *l) { }
> +static inline void local_trylock_acquire(local_lock_t *l) { }
> static inline void local_lock_release(local_lock_t *l) { }
> static inline void local_lock_debug_init(local_lock_t *l) { }
> #endif /* !CONFIG_DEBUG_LOCK_ALLOC */
> @@ -60,6 +69,7 @@ do { \
> 0, LD_WAIT_CONFIG, LD_WAIT_INV, \
> LD_LOCK_PERCPU); \
> local_lock_debug_init(lock); \
> + (lock)->active = 0; \
> } while (0)
>
> #define __spinlock_nested_bh_init(lock) \
> @@ -75,37 +85,73 @@ do { \
>
> #define __local_lock(lock) \
> do { \
> + local_lock_t *l; \
> preempt_disable(); \
> - local_lock_acquire(this_cpu_ptr(lock)); \
> + l = this_cpu_ptr(lock); \
> + lockdep_assert(l->active == 0); \
> + WRITE_ONCE(l->active, 1); \
> + local_lock_acquire(l); \
> } while (0)
>
> #define __local_lock_irq(lock) \
> do { \
> + local_lock_t *l; \
> local_irq_disable(); \
> - local_lock_acquire(this_cpu_ptr(lock)); \
> + l = this_cpu_ptr(lock); \
> + lockdep_assert(l->active == 0); \
> + WRITE_ONCE(l->active, 1); \
> + local_lock_acquire(l); \
> } while (0)
>
> #define __local_lock_irqsave(lock, flags) \
> do { \
> + local_lock_t *l; \
> local_irq_save(flags); \
> - local_lock_acquire(this_cpu_ptr(lock)); \
> + l = this_cpu_ptr(lock); \
> + lockdep_assert(l->active == 0); \
> + WRITE_ONCE(l->active, 1); \
> + local_lock_acquire(l); \
> } while (0)
>
> +#define __local_trylock_irqsave(lock, flags) \
> + ({ \
> + local_lock_t *l; \
> + local_irq_save(flags); \
> + l = this_cpu_ptr(lock); \
> + if (READ_ONCE(l->active) == 1) { \
> + local_irq_restore(flags); \
> + l = NULL; \
> + } else { \
> + WRITE_ONCE(l->active, 1); \
> + local_trylock_acquire(l); \
> + } \
> + !!l; \
> + })
> +
> #define __local_unlock(lock) \
> do { \
> - local_lock_release(this_cpu_ptr(lock)); \
> + local_lock_t *l = this_cpu_ptr(lock); \
> + lockdep_assert(l->active == 1); \
> + WRITE_ONCE(l->active, 0); \
> + local_lock_release(l); \
> preempt_enable(); \
> } while (0)
>
> #define __local_unlock_irq(lock) \
> do { \
> - local_lock_release(this_cpu_ptr(lock)); \
> + local_lock_t *l = this_cpu_ptr(lock); \
> + lockdep_assert(l->active == 1); \
> + WRITE_ONCE(l->active, 0); \
> + local_lock_release(l); \
> local_irq_enable(); \
> } while (0)
>
> #define __local_unlock_irqrestore(lock, flags) \
> do { \
> - local_lock_release(this_cpu_ptr(lock)); \
> + local_lock_t *l = this_cpu_ptr(lock); \
> + lockdep_assert(l->active == 1); \
> + WRITE_ONCE(l->active, 0); \
> + local_lock_release(l); \
> local_irq_restore(flags); \
> } while (0)
>
> @@ -148,6 +194,22 @@ typedef spinlock_t local_lock_t;
> __local_lock(lock); \
> } while (0)
>
> +#define __local_trylock_irqsave(lock, flags) \
> + ({ \
> + __label__ out; \
> + int ret = 0; \
> + typecheck(unsigned long, flags); \
> + flags = 0; \
> + if (in_nmi() || in_hardirq()) \
> + goto out; \
> + migrate_disable(); \
> + ret = spin_trylock(this_cpu_ptr((lock))); \
> + if (!ret) \
> + migrate_enable(); \
> + out: \
> + ret; \
> + })
> +
> #define __local_unlock(__lock) \
> do { \
> spin_unlock(this_cpu_ptr((__lock))); \
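For context, the intended calling pattern for the new local_trylock_irqsave()
helper is roughly the following (a minimal sketch mirroring what patch 4 does
for memcg_stock; "my_pcp" and "gfp_mask" are made-up placeholders, not code
from the series):

        unsigned long flags;

        if (!local_trylock_irqsave(&my_pcp.lock, flags)) {
                /* lock is busy or the context cannot spin; bail out if
                 * the gfp flags do not allow spinning on locks
                 */
                if (!gfpflags_allow_spinning(gfp_mask))
                        return false;
                /* regular context: fall back to the normal lock */
                local_lock_irqsave(&my_pcp.lock, flags);
        }
        /* ... touch per-cpu state ... */
        local_unlock_irqrestore(&my_pcp.lock, flags);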
* Re: [PATCH bpf-next v5 4/7] memcg: Use trylock to access memcg stock_lock.
2025-01-15 2:17 ` [PATCH bpf-next v5 4/7] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
@ 2025-01-15 16:07 ` Vlastimil Babka
2025-01-16 0:12 ` Shakeel Butt
1 sibling, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2025-01-15 16:07 UTC (permalink / raw)
To: Alexei Starovoitov, bpf
Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On 1/15/25 03:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Teach memcg to operate under trylock conditions when spinning locks
> cannot be used.
>
> local_trylock might fail and this would lead to charge cache bypass if
> the calling context doesn't allow spinning (gfpflags_allow_spinning).
> In those cases charge the memcg counter directly and fail early if
> that is not possible. This might cause a premature charge failure
> but it will allow an opportunistic charging that is safe from
> try_alloc_pages path.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/memcontrol.c | 24 ++++++++++++++++++++----
> 1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..e4c7049465e0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1756,7 +1756,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> *
> * returns true if successful, false otherwise.
> */
> -static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> +static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> + gfp_t gfp_mask)
> {
> struct memcg_stock_pcp *stock;
> unsigned int stock_pages;
> @@ -1766,7 +1767,11 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> if (nr_pages > MEMCG_CHARGE_BATCH)
> return ret;
>
> - local_lock_irqsave(&memcg_stock.stock_lock, flags);
> + if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> + if (!gfpflags_allow_spinning(gfp_mask))
> + return ret;
> + local_lock_irqsave(&memcg_stock.stock_lock, flags);
The last line can practically only happen on RT, right? On non-RT, irqsave
means we could only fail the trylock from an NMI, and then we should have
gfp_flags that don't allow spinning.
So suppose we used local_trylock(), local_lock() and local_unlock() (no
_irqsave) instead, as I mentioned in reply to 3/7. The RT implementation
would be AFAICS the same. On !RT the trylock could now fail from an IRQ
context in addition to NMI context, but that should also have a gfp_mask
that does not allow spinning, so it should work fine.
It would however mean converting all users of the lock, i.e. also
consume_obj_stock() etc., but AFAIU that will be necessary anyway to have
opportunistic slab allocations?
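A minimal sketch of that shape, assuming a local_trylock() primitive existed
(it is not part of this series), applied to consume_stock():

        if (!local_trylock(&memcg_stock.stock_lock)) {
                /* on !RT only IRQ/NMI can get here; their gfp_mask
                 * should not allow spinning, so bail out
                 */
                if (!gfpflags_allow_spinning(gfp_mask))
                        return ret;
                local_lock(&memcg_stock.stock_lock);
        }
        /* ... consume from the per-cpu stock ... */
        local_unlock(&memcg_stock.stock_lock);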
> + }
>
> stock = this_cpu_ptr(&memcg_stock);
> stock_pages = READ_ONCE(stock->nr_pages);
> @@ -1851,7 +1856,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> {
> unsigned long flags;
>
> - local_lock_irqsave(&memcg_stock.stock_lock, flags);
> + if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> + /*
> + * In case of unlikely failure to lock percpu stock_lock
> + * uncharge memcg directly.
> + */
> + mem_cgroup_cancel_charge(memcg, nr_pages);
> + return;
> + }
> __refill_stock(memcg, nr_pages);
> local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
> }
> @@ -2196,9 +2208,13 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> unsigned long pflags;
>
> retry:
> - if (consume_stock(memcg, nr_pages))
> + if (consume_stock(memcg, nr_pages, gfp_mask))
> return 0;
>
> + if (!gfpflags_allow_spinning(gfp_mask))
> + /* Avoid the refill and flush of the older stock */
> + batch = nr_pages;
> +
> if (!do_memsw_account() ||
> page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> if (page_counter_try_charge(&memcg->memory, batch, &counter))
* Re: [PATCH bpf-next v5 5/7] mm, bpf: Use memcg in try_alloc_pages().
2025-01-15 2:17 ` [PATCH bpf-next v5 5/7] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
@ 2025-01-15 17:51 ` Vlastimil Babka
2025-01-16 0:24 ` Shakeel Butt
1 sibling, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2025-01-15 17:51 UTC (permalink / raw)
To: Alexei Starovoitov, bpf
Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On 1/15/25 03:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Unconditionally use __GFP_ACCOUNT in try_alloc_pages().
> The caller is responsible to setup memcg correctly.
> All BPF memory accounting is memcg based.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/page_alloc.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a9c639e3db91..c87fd6cc3909 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7131,7 +7131,8 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> * specify it here to highlight that try_alloc_pages()
> * doesn't want to deplete reserves.
> */
> - gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
> + | __GFP_ACCOUNT;
> unsigned int alloc_flags = ALLOC_TRYLOCK;
> struct alloc_context ac = { };
> struct page *page;
> @@ -7174,6 +7175,11 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
>
> /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
>
> + if (memcg_kmem_online() && page &&
> + unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
> + free_pages_nolock(page, order);
> + page = NULL;
> + }
> trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
> kmsan_alloc_page(page, order, alloc_gfp);
> return page;
* Re: [PATCH bpf-next v5 6/7] mm: Make failslab, kfence, kmemleak aware of trylock mode
2025-01-15 2:17 ` [PATCH bpf-next v5 6/7] mm: Make failslab, kfence, kmemleak aware of trylock mode Alexei Starovoitov
@ 2025-01-15 17:57 ` Vlastimil Babka
2025-01-16 2:23 ` Alexei Starovoitov
0 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2025-01-15 17:57 UTC (permalink / raw)
To: Alexei Starovoitov, bpf
Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On 1/15/25 03:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> When gfpflags_allow_spinning() == false spin_locks cannot be taken.
> Make failslab, kfence, kmemleak compliant.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
All these are related to slab, so this would rather belong to a followup
series that expands the support from page allocator to slab, no?
> ---
> mm/failslab.c | 3 +++
> mm/kfence/core.c | 4 ++++
> mm/kmemleak.c | 3 +++
> 3 files changed, 10 insertions(+)
>
> diff --git a/mm/failslab.c b/mm/failslab.c
> index c3901b136498..86c7304ef25a 100644
> --- a/mm/failslab.c
> +++ b/mm/failslab.c
> @@ -27,6 +27,9 @@ int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
> if (gfpflags & __GFP_NOFAIL)
> return 0;
>
> + if (!gfpflags_allow_spinning(gfpflags))
> + return 0;
> +
> if (failslab.ignore_gfp_reclaim &&
> (gfpflags & __GFP_DIRECT_RECLAIM))
> return 0;
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> index 67fc321db79b..e5f2d63f3220 100644
> --- a/mm/kfence/core.c
> +++ b/mm/kfence/core.c
> @@ -1096,6 +1096,10 @@ void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
> if (s->flags & SLAB_SKIP_KFENCE)
> return NULL;
>
> + /* Bailout, since kfence_guarded_alloc() needs to take a lock */
> + if (!gfpflags_allow_spinning(flags))
> + return NULL;
> +
> allocation_gate = atomic_inc_return(&kfence_allocation_gate);
> if (allocation_gate > 1)
> return NULL;
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index 2a945c07ae99..64cb44948e9e 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -648,6 +648,9 @@ static struct kmemleak_object *__alloc_object(gfp_t gfp)
> {
> struct kmemleak_object *object;
>
> + if (!gfpflags_allow_spinning(gfp))
> + return NULL;
> +
> object = mem_pool_alloc(gfp);
> if (!object) {
> pr_warn("Cannot allocate a kmemleak_object structure\n");
* Re: [PATCH bpf-next v5 7/7] bpf: Use try_alloc_pages() to allocate pages for bpf needs.
2025-01-15 2:17 ` [PATCH bpf-next v5 7/7] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
@ 2025-01-15 18:02 ` Vlastimil Babka
2025-01-16 2:25 ` Alexei Starovoitov
0 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2025-01-15 18:02 UTC (permalink / raw)
To: Alexei Starovoitov, bpf
Cc: andrii, memxor, akpm, peterz, bigeasy, rostedt, houtao1, hannes,
shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On 1/15/25 03:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Use try_alloc_pages() and free_pages_nolock()
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> kernel/bpf/syscall.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0daf098e3207..8bcf48e31a5a 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -582,14 +582,14 @@ int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
This makes the gfp parameter unused? And the callers are passing GFP_KERNEL
anyway? Isn't try_alloc_pages() rather useful for some context that did not
even try to allocate until now, but now it could?
Also unless my concerns about page_owner were wrong, this is where they
could manifest.
> old_memcg = set_active_memcg(memcg);
> #endif
> for (i = 0; i < nr_pages; i++) {
> - pg = alloc_pages_node(nid, gfp | __GFP_ACCOUNT, 0);
> + pg = try_alloc_pages(nid, 0);
>
> if (pg) {
> pages[i] = pg;
> continue;
> }
> for (j = 0; j < i; j++)
> - __free_page(pages[j]);
> + free_pages_nolock(pages[j], 0);
> ret = -ENOMEM;
> break;
> }
* Re: [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2025-01-15 11:19 ` Vlastimil Babka
@ 2025-01-15 23:00 ` Alexei Starovoitov
2025-01-15 23:47 ` Shakeel Butt
2025-01-15 23:16 ` Shakeel Butt
1 sibling, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 23:00 UTC (permalink / raw)
To: Vlastimil Babka
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
On Wed, Jan 15, 2025 at 3:19 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Tracing BPF programs execute from tracepoints and kprobes where
> > running context is unknown, but they need to request additional
> > memory. The prior workarounds were using pre-allocated memory and
> > BPF specific freelists to satisfy such allocation requests.
> > Instead, introduce gfpflags_allow_spinning() condition that signals
> > to the allocator that running context is unknown.
> > Then rely on percpu free list of pages to allocate a page.
> > try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
> > rmqueue_pcplist() will spin_trylock to grab the page from percpu
> > free list. If it fails (due to re-entrancy or list being empty)
> > then rmqueue_bulk()/rmqueue_buddy() will attempt to
> > spin_trylock zone->lock and grab the page from there.
> > spin_trylock() is not safe in RT when in NMI or in hard IRQ.
> > Bailout early in such case.
> >
> > The support for gfpflags_allow_spinning() mode for free_page and memcg
> > comes in the next patches.
> >
> > This is a first step towards supporting BPF requirements in SLUB
> > and getting rid of bpf_mem_alloc.
> > That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> >
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> Some nits below:
>
> > ---
> > include/linux/gfp.h | 22 ++++++++++
> > mm/internal.h | 1 +
> > mm/page_alloc.c | 98 +++++++++++++++++++++++++++++++++++++++++++--
> > 3 files changed, 118 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index b0fe9f62d15b..b41bb6e01781 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -39,6 +39,25 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> > return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
> > }
> >
> > +static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
> > +{
> > + /*
> > + * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
> > + * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
> > + * All GFP_* flags including GFP_NOWAIT use one or both flags.
> > + * try_alloc_pages() is the only API that doesn't specify either flag.
> > + *
> > + * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
> > + * those are guaranteed to never block on a sleeping lock.
> > + * Here we are enforcing that the allaaction doesn't ever spin
>
> allocation
oops.
> > + * on any locks (i.e. only trylocks). There is no highlevel
> > + * GFP_$FOO flag for this use in try_alloc_pages() as the
> > + * regular page allocator doesn't fully support this
> > + * allocation mode.
> > + */
> > + return !!(gfp_flags & __GFP_RECLAIM);
> > +}
>
> This function seems unused, guess the following patches will use.
>
> > @@ -4509,7 +4517,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> >
> > might_alloc(gfp_mask);
> >
> > - if (should_fail_alloc_page(gfp_mask, order))
> > + if (!(*alloc_flags & ALLOC_TRYLOCK) &&
> > + should_fail_alloc_page(gfp_mask, order))
>
> Is it because should_fail_alloc_page() might take some lock or whatnot?
> Maybe comment?
This is mainly because of all the printk-s that the fail* logic does.
We've seen way too many syzbot reports because of printk from
unsupported contexts.
This part of the comment:
+ * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+ * to warn. Also warn would trigger printk() which is unsafe from
+ * various contexts. We cannot use printk_deferred_enter() to mitigate,
+ * since the running context is unknown.
and also because of get_random_u32() inside the fail* logic,
which grabs a spin_lock.
We've seen syzbot reports about get_random() deadlocking.
Fixing printk and fail*/get_random is necessary too,
but orthogonal work for these patches.
Here I'm skipping this path in prepare_alloc_pages() to avoid
dealing with more syzbot reports than are already there.
With try_alloc_pages() reachable from bpf attached to a kprobe
inside the printk core logic it would be too easy for syzbot.
And then people will yell at bpf for causing problems
whereas the root cause is printk being broken.
We see this finger pointing all too often.
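For reference, one way the check could be commented (an illustrative sketch,
not the actual patch):

        /*
         * Skip fault injection in trylock mode. should_fail_alloc_page() can
         * printk() and take locks (e.g. via get_random_u32()), which is not
         * safe when the running context is unknown.
         */
        if (!(*alloc_flags & ALLOC_TRYLOCK) &&
            should_fail_alloc_page(gfp_mask, order))
                return false;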
>
> > return false;
> >
> > *alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
> > @@ -7023,3 +7032,86 @@ static bool __free_unaccepted(struct page *page)
> > }
> >
> > #endif /* CONFIG_UNACCEPTED_MEMORY */
> > +
> > +/**
> > + * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
> > + * @nid - node to allocate from
> > + * @order - allocation order size
> > + *
> > + * Allocates pages of a given order from the given node. This is safe to
> > + * call from any context (from atomic, NMI, and also reentrant
> > + * allocator -> tracepoint -> try_alloc_pages_noprof).
> > + * Allocation is best effort and to be expected to fail easily so nobody should
> > + * rely on the success. Failures are not reported via warn_alloc().
> > + *
> > + * Return: allocated page or NULL on failure.
> > + */
> > +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> > +{
> > + /*
> > + * Do not specify __GFP_DIRECT_RECLAIM, since direct reclaim is not allowed.
> > + * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> > + * is not safe in arbitrary context.
> > + *
> > + * These two are the conditions for gfpflags_allow_spinning() being false.
> > + *
> > + * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
> > + * to warn. Also warn would trigger printk() which is unsafe from
> > + * various contexts. We cannot use printk_deferred_enter() to mitigate,
> > + * since the running context is unknown.
> > + *
> > + * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> > + * is safe in any context. Also zeroing the page is mandatory for
> > + * BPF use cases.
> > + *
> > + * Though __GFP_NOMEMALLOC is not checked in the code path below,
> > + * specify it here to highlight that try_alloc_pages()
> > + * doesn't want to deplete reserves.
> > + */
> > + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> > + unsigned int alloc_flags = ALLOC_TRYLOCK;
> > + struct alloc_context ac = { };
> > + struct page *page;
> > +
> > + /*
> > + * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
> > + * If spin_trylock() is called from hard IRQ the current task may be
> > + * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
> > + * task as the owner of another rt_spin_lock which will confuse PI
> > + * logic, so return immediately if called from hard IRQ or NMI.
> > + *
> > + * Note, irqs_disabled() case is ok. This function can be called
> > + * from raw_spin_lock_irqsave region.
> > + */
> > + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> > + return NULL;
> > + if (!pcp_allowed_order(order))
> > + return NULL;
> > +
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > + /* Bailout, since try_to_accept_memory_one() needs to take a lock */
> > + if (has_unaccepted_memory())
> > + return NULL;
> > +#endif
> > + /* Bailout, since _deferred_grow_zone() needs to take a lock */
> > + if (deferred_pages_enabled())
> > + return NULL;
>
> Is it fine for BPF that things will fail to allocate until all memory is
> deferred-initialized and accepted? I guess it's easy to teach those places
> later to evaluate if they can take the lock.
Exactly. If it becomes an issue it can be addressed.
I didn't want to complicate the code just-because.
> > +
> > + if (nid == NUMA_NO_NODE)
> > + nid = numa_node_id();
> > +
> > + prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> > + &alloc_gfp, &alloc_flags);
> > +
> > + /*
> > + * Best effort allocation from percpu free list.
> > + * If it's empty attempt to spin_trylock zone->lock.
> > + */
> > + page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
>
> What about set_page_owner() from post_alloc_hook() and it's stackdepot
> saving. I guess not an issue until try_alloc_pages() gets used later, so
> just a mental note that it has to be resolved before. Or is it actually safe?
set_page_owner() should be fine.
save_stack() has in_page_owner recursion protection mechanism.
stack_depot_save_flags() may be problematic if there is another
path to it.
I guess I can do:
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 245d5b416699..61772bc4b811 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -630,7 +630,7 @@ depot_stack_handle_t
stack_depot_save_flags(unsigned long *entries,
prealloc = page_address(page);
}
- if (in_nmi()) {
+ if (in_nmi() || !gfpflags_allow_spinning(alloc_flags)) {
/* We can never allocate in NMI context. */
WARN_ON_ONCE(can_alloc);
/* Best effort; bail if we fail to take the lock. */
if (!raw_spin_trylock_irqsave(&pool_lock, flags))
goto exit;
as part of this patch,
but not convinced whether it's necessary.
stack_depot* is effectively noinstr.
kprobe-bpf cannot be placed in there and afaict
it doesn't call any tracepoints.
So in_nmi() is the only way to reenter and that's already covered.
* Re: [PATCH bpf-next v5 2/7] mm, bpf: Introduce free_pages_nolock()
2025-01-15 11:47 ` Vlastimil Babka
@ 2025-01-15 23:15 ` Alexei Starovoitov
2025-01-16 8:31 ` Vlastimil Babka
0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 23:15 UTC (permalink / raw)
To: Vlastimil Babka
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
On Wed, Jan 15, 2025 at 3:47 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce free_pages_nolock() that can free pages without taking locks.
> > It relies on trylock and can be called from any context.
> > Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> > it uses lockless link list to stash the pages which will be freed
> > by subsequent free_pages() from good context.
> >
> > Do not use llist unconditionally. BPF maps continuously
> > allocate/free, so we cannot unconditionally delay the freeing to
> > llist. When the memory becomes free make it available to the
> > kernel and BPF users right away if possible, and fallback to
> > llist as the last resort.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> With:
>
> > @@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
> > }
> > EXPORT_SYMBOL(__free_pages);
> >
> > +/*
> > + * Can be called while holding raw_spin_lock or from IRQ and NMI,
> > + * but only for pages that came from try_alloc_pages():
> > + * order <= 3, !folio, etc
>
> I think order > 3 is fine, as !pcp_allowed_order() case is handled too?
try_alloc_pages() has:
if (!pcp_allowed_order(order))
return NULL;
to make sure it tries pcp first.
bpf has no use for order > 1. Even 3 is overkill,
but it's kinda free to support order <= 3, so why not.
> And
> what does "!folio" mean?
That's what we discussed last year.
__free_pages() has all the extra 'if (!head)' stuff and
support for dropping a ref on a middle page.
!folio captures this more broadly.
> > + */
> > +void free_pages_nolock(struct page *page, unsigned int order)
> > +{
> > + if (put_page_testzero(page))
> > + __free_unref_page(page, order, FPI_TRYLOCK);
>
> Hmm this will reach reset_page_owner() and thus stackdepot so same mental
> note as for patch 1.
save_stack() has recursion protection already, so it should be fine.
> > +}
> > +
> > void free_pages(unsigned long addr, unsigned int order)
> > {
> > if (addr != 0) {
>
* Re: [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2025-01-15 11:19 ` Vlastimil Babka
2025-01-15 23:00 ` Alexei Starovoitov
@ 2025-01-15 23:16 ` Shakeel Butt
1 sibling, 0 replies; 37+ messages in thread
From: Shakeel Butt @ 2025-01-15 23:16 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Alexei Starovoitov, bpf, andrii, memxor, akpm, peterz, bigeasy,
rostedt, houtao1, hannes, mhocko, willy, tglx, jannh, tj,
linux-mm, kernel-team
On Wed, Jan 15, 2025 at 12:19:26PM +0100, Vlastimil Babka wrote:
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
[...]
> > +
> > + if (nid == NUMA_NO_NODE)
> > + nid = numa_node_id();
> > +
> > + prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> > + &alloc_gfp, &alloc_flags);
> > +
> > + /*
> > + * Best effort allocation from percpu free list.
> > + * If it's empty attempt to spin_trylock zone->lock.
> > + */
> > + page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
>
> What about set_page_owner() from post_alloc_hook() and it's stackdepot
> saving. I guess not an issue until try_alloc_pages() gets used later, so
> just a mental note that it has to be resolved before. Or is it actually safe?
>
stack_depot_save() does not seem safe here.
> > +
> > + /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> > +
> > + trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
> > + kmsan_alloc_page(page, order, alloc_gfp);
> > + return page;
> > +}
>
* Re: [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2025-01-15 23:00 ` Alexei Starovoitov
@ 2025-01-15 23:47 ` Shakeel Butt
2025-01-16 2:44 ` Alexei Starovoitov
0 siblings, 1 reply; 37+ messages in thread
From: Shakeel Butt @ 2025-01-15 23:47 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Vlastimil Babka, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Sebastian Sewior, Steven Rostedt,
Hou Tao, Johannes Weiner, Michal Hocko, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
On Wed, Jan 15, 2025 at 03:00:08PM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 15, 2025 at 3:19 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> >
Sorry missed your response here.
> > What about set_page_owner() from post_alloc_hook() and it's stackdepot
> > saving. I guess not an issue until try_alloc_pages() gets used later, so
> > just a mental note that it has to be resolved before. Or is it actually safe?
>
> set_page_owner() should be fine.
> save_stack() has in_page_owner recursion protection mechanism.
>
> stack_depot_save_flags() may be problematic if there is another
> path to it.
> I guess I can do:
>
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index 245d5b416699..61772bc4b811 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -630,7 +630,7 @@ depot_stack_handle_t
> stack_depot_save_flags(unsigned long *entries,
> prealloc = page_address(page);
> }
There is alloc_pages(gfp_nested_mask(alloc_flags)...) just a couple of
lines above. How about setting can_alloc to false along with the change
below for this case? Or we can set ALLOC_TRYLOCK in core alloc_pages()
for !gfpflags_allow_spinning().
>
> - if (in_nmi()) {
> + if (in_nmi() || !gfpflags_allow_spinning(alloc_flags)) {
> /* We can never allocate in NMI context. */
> WARN_ON_ONCE(can_alloc);
> /* Best effort; bail if we fail to take the lock. */
> if (!raw_spin_trylock_irqsave(&pool_lock, flags))
> goto exit;
>
> as part of this patch,
> but not convinced whether it's necessary.
> stack_depot* is effectively noinstr.
> kprobe-bpf cannot be placed in there and afaict
> it doesn't call any tracepoints.
> So in_nmi() is the only way to reenter and that's already covered.
Are the locks in stack_depot* only the issue for bpf programs triggered
inside stack_depot*?
* Re: [PATCH bpf-next v5 4/7] memcg: Use trylock to access memcg stock_lock.
2025-01-15 2:17 ` [PATCH bpf-next v5 4/7] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
2025-01-15 16:07 ` Vlastimil Babka
@ 2025-01-16 0:12 ` Shakeel Butt
2025-01-16 2:22 ` Alexei Starovoitov
1 sibling, 1 reply; 37+ messages in thread
From: Shakeel Butt @ 2025-01-16 0:12 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
houtao1, hannes, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On Tue, Jan 14, 2025 at 06:17:43PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Teach memcg to operate under trylock conditions when spinning locks
> cannot be used.
>
> local_trylock might fail and this would lead to charge cache bypass if
> the calling context doesn't allow spinning (gfpflags_allow_spinning).
> In those cases charge the memcg counter directly and fail early if
> that is not possible. This might cause a premature charge failure
> but it will allow an opportunistic charging that is safe from
> try_alloc_pages path.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> @@ -1851,7 +1856,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> {
> unsigned long flags;
>
> - local_lock_irqsave(&memcg_stock.stock_lock, flags);
> + if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> + /*
> + * In case of unlikely failure to lock percpu stock_lock
> + * uncharge memcg directly.
> + */
> + mem_cgroup_cancel_charge(memcg, nr_pages);
mem_cgroup_cancel_charge() has been removed by a patch in mm-tree. Maybe
we can either revive mem_cgroup_cancel_charge() or simply inline it
here.
> + return;
> + }
> __refill_stock(memcg, nr_pages);
> local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
> }
* Re: [PATCH bpf-next v5 5/7] mm, bpf: Use memcg in try_alloc_pages().
2025-01-15 2:17 ` [PATCH bpf-next v5 5/7] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
2025-01-15 17:51 ` Vlastimil Babka
@ 2025-01-16 0:24 ` Shakeel Butt
1 sibling, 0 replies; 37+ messages in thread
From: Shakeel Butt @ 2025-01-16 0:24 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
houtao1, hannes, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On Tue, Jan 14, 2025 at 06:17:44PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Unconditionally use __GFP_ACCOUNT in try_alloc_pages().
> The caller is responsible to setup memcg correctly.
> All BPF memory accounting is memcg based.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
* Re: [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave()
2025-01-15 14:22 ` Vlastimil Babka
@ 2025-01-16 2:20 ` Alexei Starovoitov
0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-16 2:20 UTC (permalink / raw)
To: Vlastimil Babka
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
On Wed, Jan 15, 2025 at 6:22 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Similar to local_lock_irqsave() introduce local_trylock_irqsave().
> > This is inspired by 'struct local_tryirq_lock' in:
> > https://lore.kernel.org/all/20241112-slub-percpu-caches-v1-5-ddc0bdc27e05@suse.cz/
>
> Let's see what locking maintainers say about adding the flag to every
> local_lock even if it doesn't use the trylock operation.
As I replied to Sebastian, there are very few users, and the
hot users of local_lock, like networking, use it in RT only.
local_lock_nested_bh() stays a true nop in !RT.
This patch doesn't change that.
So the active flag on !RT is not on a critical path
(at least as far as I could tell from studying the code).
> > Use spin_trylock in PREEMPT_RT when not in hard IRQ and not in NMI
> > and fail instantly otherwise, since spin_trylock is not safe from IRQ
> > due to PI issues.
> >
> > In !PREEMPT_RT use simple active flag to prevent IRQs or NMIs
> > reentering locked region.
> >
> > Note there is no need to use local_inc for active flag.
> > If IRQ handler grabs the same local_lock after READ_ONCE(lock->active)
>
> IRQ handler AFAICS can't do that since __local_trylock_irqsave() is the only
> trylock operation and it still does local_irq_save(). Could you have added a
> __local_trylock() operation instead? Guess not for your use case because I
> see in Patch 4 you want to use the local_unlock_irqrestore() universally for
> sections that are earlier locked either by local_trylock_irqsave() or
> local_lock_irqsave(). But I wonder if those can be changed (will reply on
> that patch).
Pasting your reply from patch 4 here to reply to both...
Yes, I'm only adding local_trylock_irqsave() and not local_trylock(),
since memcg and slab are using local_lock_irqsave() in numerous
places, and adding local_trylock() here would be just dead code.
> The motivation in my case was to avoid the overhead of irqsave/restore on
> !PREEMPT_RT. If there was a separate "flavor" of local_lock that would
> support the trylock operation, I think it would not need the _irq and
> _irqsave variants at all, and it would also avoid adding the "active" flag
> on !PREEMPT_RT. Meanwhile on PREEMPT_RT, a single implementation could
> likely handle both flavors with no downsides?
I agree with the desire to use local_lock() in slab and memcg long term,
but this is something that definitely should _not_ be done in this patch set.
try_alloc_page() needs to learn to walk before we teach it to run.
> The last line can practically only happen on RT, right? On non-RT, irqsave
> means we could only fail the trylock from an NMI, and then we should have
> gfp_flags that don't allow spinning.
Correct.
> So suppose we used local_trylock(), local_lock() and local_unlock() (no
> _irqsave) instead, as I mentioned in reply to 3/7. The RT implementation
> would be AFAICS the same. On !RT the trylock could now fail from an IRQ
> context in addition to NMI context, but that should also have a gfp_mask
> that does not allow spinning, so it should work fine.
Also correct.
> It would however mean converting all users of the lock, i.e. also
> consume_obj_stock() etc., but AFAIU that will be necessary anyway to have
> opportunistic slab allocations?
Exactly. And as soon as we do that we start to conflict between trees.
But the main concern is that change like that needs to be
thoroughly analyzed.
I'm not convinced that stock_lock as preempt_disable() will work for memcg.
People do GFP_NOWAIT allocations from IRQ context and assume they work.
If memcg's local_irqsave (aka local_lock_irqsave) is replaced
with preempt_disable, an IRQ can happen in the middle of a memcg
counter update,
so ALL stock_lock operations would have to local_TRYlock()
with a fallback in case a kmalloc(GFP_NOWAIT) from IRQ happens to reenter.
Same issue with slub.
local_lock_irqsave(&s->cpu_slab->lock)
as an irq-disabled region works for kmalloc(GFP_NOWAIT) users.
If it becomes preempt_disable I suspect it will break things.
For example, perf and bpf use irq_work to do wakeups and allocations.
slub's s->cpu_slab protected by preempt_disable
would mean that 'perf record -a' can trigger in the
middle of the slab partial / deactivate-slab logic, and
perf will be doing wakeups right there. I suspect it will be sad.
Right now the irq_work handler is called only
after the last local_unlock_irqrestore enables irqs.
So replacing local_lock_irqsave in slab and memcg with
local_lock is not something to take lightly.
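To illustrate the concern (not code from the patches; the stock manipulation
is abbreviated): with a plain local_lock(), which is just preempt_disable()
on !RT, an interrupt can still fire in the middle of the per-cpu update:

        local_lock(&memcg_stock.stock_lock);    /* !RT: only disables preemption */
        stock_pages = READ_ONCE(stock->nr_pages);
        /* <-- an IRQ here may do kmalloc(GFP_NOWAIT) and re-enter this code */
        WRITE_ONCE(stock->nr_pages, stock_pages - nr_pages);
        local_unlock(&memcg_stock.stock_lock);

        /*
         * With local_lock_irqsave() the whole update runs with IRQs disabled,
         * so a GFP_NOWAIT allocation from IRQ context cannot land in the
         * middle; only NMI remains, and that is what the trylock handles.
         */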
* Re: [PATCH bpf-next v5 4/7] memcg: Use trylock to access memcg stock_lock.
2025-01-16 0:12 ` Shakeel Butt
@ 2025-01-16 2:22 ` Alexei Starovoitov
2025-01-16 20:07 ` Joshua Hahn
0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-16 2:22 UTC (permalink / raw)
To: Shakeel Butt, joshua.hahnjy, Muchun Song, Andrew Morton,
SeongJae Park, open list:CONTROL GROUP (CGROUP),
linux-mm
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Peter Zijlstra,
Vlastimil Babka, Sebastian Sewior, Steven Rostedt, Hou Tao,
Johannes Weiner, Michal Hocko, Matthew Wilcox, Thomas Gleixner,
Jann Horn, Tejun Heo, Kernel Team
On Wed, Jan 15, 2025 at 4:12 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jan 14, 2025 at 06:17:43PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Teach memcg to operate under trylock conditions when spinning locks
> > cannot be used.
> >
> > local_trylock might fail and this would lead to charge cache bypass if
> > the calling context doesn't allow spinning (gfpflags_allow_spinning).
> > In those cases charge the memcg counter directly and fail early if
> > that is not possible. This might cause a pre-mature charge failing
> > but it will allow an opportunistic charging that is safe from
> > try_alloc_pages path.
> >
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>
> > @@ -1851,7 +1856,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> > {
> > unsigned long flags;
> >
> > - local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > + if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> > + /*
> > + * In case of unlikely failure to lock percpu stock_lock
> > + * uncharge memcg directly.
> > + */
> > + mem_cgroup_cancel_charge(memcg, nr_pages);
>
> mem_cgroup_cancel_charge() has been removed by a patch in mm-tree. Maybe
> we can either revive mem_cgroup_cancel_charge() or simply inline it
> here.
Ouch.
this one?
https://lore.kernel.org/all/20241211203951.764733-4-joshua.hahnjy@gmail.com/
Joshua,
could you hold on to that clean up?
Or leave mem_cgroup_cancel_charge() in place ?
* Re: [PATCH bpf-next v5 6/7] mm: Make failslab, kfence, kmemleak aware of trylock mode
2025-01-15 17:57 ` Vlastimil Babka
@ 2025-01-16 2:23 ` Alexei Starovoitov
0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-16 2:23 UTC (permalink / raw)
To: Vlastimil Babka
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
On Wed, Jan 15, 2025 at 9:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > When gfpflags_allow_spinning() == false spin_locks cannot be taken.
> > Make failslab, kfence, kmemleak compliant.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> All these are related to slab, so this would rather belong to a followup
> series that expands the support from page allocator to slab, no?
Sure. I can drop it for now.
It was more of a preview of things to come
and of how gfpflags_allow_spinning() fits in other places.
* Re: [PATCH bpf-next v5 7/7] bpf: Use try_alloc_pages() to allocate pages for bpf needs.
2025-01-15 18:02 ` Vlastimil Babka
@ 2025-01-16 2:25 ` Alexei Starovoitov
0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-16 2:25 UTC (permalink / raw)
To: Vlastimil Babka
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
On Wed, Jan 15, 2025 at 10:03 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Use try_alloc_pages() and free_pages_nolock()
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> > kernel/bpf/syscall.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 0daf098e3207..8bcf48e31a5a 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -582,14 +582,14 @@ int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
>
> This makes the gfp parameter unused? And the callers are passing GFP_KERNEL
> anyway? Isn't try_alloc_pages() rather useful for some context that did not
> even try to allocate until now, but now it could?
Correct. gfp arg is unused and currently it's only called
from sleepable bpf kfunc with GFP_KERNEL.
I have more patches on top that change all that:
remove gfp argument, etc.
Just didn't send them as part of this set, since it's all bpf internal
stuff.
* Re: [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2025-01-15 23:47 ` Shakeel Butt
@ 2025-01-16 2:44 ` Alexei Starovoitov
0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-16 2:44 UTC (permalink / raw)
To: Shakeel Butt
Cc: Vlastimil Babka, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Sebastian Sewior, Steven Rostedt,
Hou Tao, Johannes Weiner, Michal Hocko, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
On Wed, Jan 15, 2025 at 3:47 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Jan 15, 2025 at 03:00:08PM -0800, Alexei Starovoitov wrote:
> > On Wed, Jan 15, 2025 at 3:19 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > >
>
> Sorry missed your response here.
>
> > > What about set_page_owner() from post_alloc_hook() and it's stackdepot
> > > saving. I guess not an issue until try_alloc_pages() gets used later, so
> > > just a mental note that it has to be resolved before. Or is it actually safe?
> >
> > set_page_owner() should be fine.
> > save_stack() has in_page_owner recursion protection mechanism.
> >
> > stack_depot_save_flags() may be problematic if there is another
> > path to it.
> > I guess I can do:
> >
> > diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> > index 245d5b416699..61772bc4b811 100644
> > --- a/lib/stackdepot.c
> > +++ b/lib/stackdepot.c
> > @@ -630,7 +630,7 @@ depot_stack_handle_t
> > stack_depot_save_flags(unsigned long *entries,
> > prealloc = page_address(page);
> > }
>
> There is alloc_pages(gfp_nested_mask(alloc_flags)...) just couple of
> lines above. How about setting can_alloc false along with the below
> change for this case?
argh. Just noticed this code path:
free_pages_prepare
  __reset_page_owner
    save_stack(GFP_NOWAIT | __GFP_NOWARN);
      stack_depot_save(entries, nr_entries, flags);
        stack_depot_save_flags(entries, nr_entries, alloc_flags,
                               STACK_DEPOT_FLAG_CAN_ALLOC);
          bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;
          if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
                  page = alloc_pages(gfp_nested_mask(alloc_flags),
                                     DEPOT_POOL_ORDER);
> Or we can set ALLOC_TRYLOCK in core alloc_pages()
> for !gfpflags_allow_spinning().
so the gfpflags_allow_spinning() approach doesn't work out of the box,
since free_pages doesn't take gfp flags.
Passing FPI flags everywhere is too much churn.
I guess using the same gfp flags as try_alloc_pages()
in __reset_page_owner() should work.
That will make gfpflags_allow_spinning()==false in stack_depot.
In an earlier email I convinced myself that
current->in_page_owner recursion protection in save_stack()
is enough, but looks like it's not.
We could be hitting a tracepoint somewhere inside alloc_pages(), then:
alloc_pages -> tracepoint -> bpf
-> free_pages_nolock -> __free_unref_page -> free_pages_prepare
-> save_stack(GFP_NOWAIT | __GFP_NOWARN);
and this save_stack() is called for the first time,
so it's not recursing into itself. Then:
-> stack_depot_save_flags -> alloc_pages(GFP_NOWAIT) -> boom.
So it looks like __reset_page_owner() has to use gfp flags for which
gfpflags_allow_spinning() is false, and
stack_depot_save_flags() has to do:
if (!gfpflags_allow_spinning(alloc_flags))
        can_alloc = false;
and take pool_lock with
raw_spin_trylock_irqsave(&pool_lock, ..)
Will code this up. Unless there are better ideas.
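A rough sketch of that combination (based on the page_owner/stackdepot code
quoted above; illustrative only, not the actual follow-up patch):

        /* __reset_page_owner(): pass gfp flags that fail
         * gfpflags_allow_spinning(), e.g. __GFP_NOWARN with no
         * __GFP_KSWAPD_RECLAIM/__GFP_DIRECT_RECLAIM bits:
         */
        handle = save_stack(__GFP_NOWARN);

        /* stack_depot_save_flags(): */
        bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;

        if (!gfpflags_allow_spinning(alloc_flags))
                can_alloc = false;      /* never reach alloc_pages() from here */

        if (in_nmi() || !gfpflags_allow_spinning(alloc_flags)) {
                /* Best effort; bail if we fail to take the lock. */
                if (!raw_spin_trylock_irqsave(&pool_lock, flags))
                        goto exit;
        }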
* Re: [PATCH bpf-next v5 2/7] mm, bpf: Introduce free_pages_nolock()
2025-01-15 23:15 ` Alexei Starovoitov
@ 2025-01-16 8:31 ` Vlastimil Babka
0 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2025-01-16 8:31 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
On 1/16/25 00:15, Alexei Starovoitov wrote:
> On Wed, Jan 15, 2025 at 3:47 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 1/15/25 03:17, Alexei Starovoitov wrote:
>> > From: Alexei Starovoitov <ast@kernel.org>
>> >
>> > Introduce free_pages_nolock() that can free pages without taking locks.
>> > It relies on trylock and can be called from any context.
>> > Since spin_trylock() cannot be used in RT from hard IRQ or NMI
>> > it uses lockless link list to stash the pages which will be freed
>> > by subsequent free_pages() from good context.
>> >
>> > Do not use llist unconditionally. BPF maps continuously
>> > allocate/free, so we cannot unconditionally delay the freeing to
>> > llist. When the memory becomes free make it available to the
>> > kernel and BPF users right away if possible, and fallback to
>> > llist as the last resort.
>> >
>> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>
>> With:
>>
>> > @@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
>> > }
>> > EXPORT_SYMBOL(__free_pages);
>> >
>> > +/*
>> > + * Can be called while holding raw_spin_lock or from IRQ and NMI,
>> > + * but only for pages that came from try_alloc_pages():
>> > + * order <= 3, !folio, etc
>>
>> I think order > 3 is fine, as !pcp_allowed_order() case is handled too?
>
> try_alloc_page() has:
> if (!pcp_allowed_order(order))
> return NULL;
Ah ok I missed that the comment describes what pages try_alloc_pages()
produces, not what's accepted here.
> to make sure it tries pcp first.
> bpf has no use for order > 1. Even 3 is overkill,
> but it's kinda free to support order <= 3, so why not.
>
>> And
>> what does "!folio" mean?
>
> That's what we discussed last year.
> __free_pages() has all the extra stuff if (!head) and
> support for dropping ref on the middle page.
> !folio captures this more broadly.
Aha! But in that case I realize we're actually wrong. It needs to be a folio
(compound page) if it's order > 0 in order to drop that tricky
!head freeing code. It's order > 0 pages that are not compound that are
problematic and should eventually be removed from the kernel.
The solution is to add __GFP_COMP in try_alloc_pages_noprof(). This will
have no effect on order-0 pages that BPF uses.
Instead of "!folio" the comment could then say "compound".
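A sketch of what that might look like in try_alloc_pages_noprof() (against
the flags from patch 1; illustrative only):

        gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
                          | __GFP_COMP;

and the free_pages_nolock() comment would then read
"order <= 3, compound, etc" instead of "order <= 3, !folio, etc".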
* Re: [PATCH bpf-next v5 4/7] memcg: Use trylock to access memcg stock_lock.
2025-01-16 2:22 ` Alexei Starovoitov
@ 2025-01-16 20:07 ` Joshua Hahn
2025-01-17 17:36 ` Johannes Weiner
0 siblings, 1 reply; 37+ messages in thread
From: Joshua Hahn @ 2025-01-16 20:07 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Shakeel Butt, joshua.hahnjy, Muchun Song, Andrew Morton,
SeongJae Park, open list:CONTROL GROUP (CGROUP),
linux-mm, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Johannes Weiner, Michal Hocko,
Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo,
Kernel Team
On Wed, 15 Jan 2025 18:22:28 -0800 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Wed, Jan 15, 2025 at 4:12 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Jan 14, 2025 at 06:17:43PM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Teach memcg to operate under trylock conditions when spinning locks
> > > cannot be used.
> > >
> > > local_trylock might fail and this would lead to charge cache bypass if
> > > the calling context doesn't allow spinning (gfpflags_allow_spinning).
> > > In those cases charge the memcg counter directly and fail early if
> > > that is not possible. This might cause a premature charge failure
> > > but it will allow an opportunistic charging that is safe from
> > > try_alloc_pages path.
> > >
> > > Acked-by: Michal Hocko <mhocko@suse.com>
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >
> > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> >
> > > @@ -1851,7 +1856,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> > > {
> > > unsigned long flags;
> > >
> > > - local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > > + if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> > > + /*
> > > + * In case of unlikely failure to lock percpu stock_lock
> > > + * uncharge memcg directly.
> > > + */
> > > + mem_cgroup_cancel_charge(memcg, nr_pages);
> >
> > mem_cgroup_cancel_charge() has been removed by a patch in mm-tree. Maybe
> > we can either revive mem_cgroup_cancel_charge() or simply inline it
> > here.
>
> Ouch.
>
> this one?
> https://lore.kernel.org/all/20241211203951.764733-4-joshua.hahnjy@gmail.com/
>
> Joshua,
>
> could you hold on to that clean up?
> Or leave mem_cgroup_cancel_charge() in place ?
Hi Alexei,
Yes, that makes sense to me. The goal of the patch was to remove the
last users of mem_cgroup_cancel_charge() and then remove the function itself,
but if there are still users of the function, the patch doesn't make sense : -)
Have a great day!
Joshua
Hi Andrew,
I think that the patch was moved into mm-stable earlier this week.
I was wondering if it would be possible to revert the patch and
replace it with this one below. The only difference is that I leave
mem_cgroup_cancel_charge untouched in this version.
I'm also not sure if this is the best way to send the revised patch.
Please let me know if there is another way I should do this to make
it easiest for you!
Thank you for your time!
Joshua
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d1ee98dc3a38..c8d0554e5490 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -620,8 +620,6 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
page_counter_read(&memcg->memory);
}
-void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg);
-
int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp);
/**
@@ -646,9 +644,6 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
return __mem_cgroup_charge(folio, mm, gfp);
}
-int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
- long nr_pages);
-
int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp);
int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
@@ -1137,23 +1132,12 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
return false;
}
-static inline void mem_cgroup_commit_charge(struct folio *folio,
- struct mem_cgroup *memcg)
-{
-}
-
static inline int mem_cgroup_charge(struct folio *folio,
struct mm_struct *mm, gfp_t gfp)
{
return 0;
}
-static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
- gfp_t gfp, long nr_pages)
-{
- return 0;
-}
-
static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dd171bdf1bcc..aeff2af8d722 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2402,18 +2402,6 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
folio->memcg_data = (unsigned long)memcg;
}
-/**
- * mem_cgroup_commit_charge - commit a previously successful try_charge().
- * @folio: folio to commit the charge to.
- * @memcg: memcg previously charged.
- */
-void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
-{
- css_get(&memcg->css);
- commit_charge(folio, memcg);
- memcg1_commit_charge(folio, memcg);
-}
-
static inline void __mod_objcg_mlstate(struct obj_cgroup *objcg,
struct pglist_data *pgdat,
enum node_stat_item idx, int nr)
@@ -4501,7 +4489,9 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
if (ret)
goto out;
- mem_cgroup_commit_charge(folio, memcg);
+ css_get(&memcg->css);
+ commit_charge(folio, memcg);
+ memcg1_commit_charge(folio, memcg);
out:
return ret;
}
@@ -4527,40 +4517,6 @@ bool memcg_accounts_hugetlb(void)
#endif
}
-/**
- * mem_cgroup_hugetlb_try_charge - try to charge the memcg for a hugetlb folio
- * @memcg: memcg to charge.
- * @gfp: reclaim mode.
- * @nr_pages: number of pages to charge.
- *
- * This function is called when allocating a huge page folio to determine if
- * the memcg has the capacity for it. It does not commit the charge yet,
- * as the hugetlb folio itself has not been obtained from the hugetlb pool.
- *
- * Once we have obtained the hugetlb folio, we can call
- * mem_cgroup_commit_charge() to commit the charge. If we fail to obtain the
- * folio, we should instead call mem_cgroup_cancel_charge() to undo the effect
- * of try_charge().
- *
- * Returns 0 on success. Otherwise, an error code is returned.
- */
-int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
- long nr_pages)
-{
- /*
- * If hugetlb memcg charging is not enabled, do not fail hugetlb allocation,
- * but do not attempt to commit charge later (or cancel on error) either.
- */
- if (mem_cgroup_disabled() || !memcg ||
- !cgroup_subsys_on_dfl(memory_cgrp_subsys) || !memcg_accounts_hugetlb())
- return -EOPNOTSUPP;
-
- if (try_charge(memcg, gfp, nr_pages))
- return -ENOMEM;
-
- return 0;
-}
-
int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp)
{
struct mem_cgroup *memcg = get_mem_cgroup_from_current();
* Re: [PATCH bpf-next v5 4/7] memcg: Use trylock to access memcg stock_lock.
2025-01-16 20:07 ` Joshua Hahn
@ 2025-01-17 17:36 ` Johannes Weiner
0 siblings, 0 replies; 37+ messages in thread
From: Johannes Weiner @ 2025-01-17 17:36 UTC (permalink / raw)
To: Joshua Hahn
Cc: Alexei Starovoitov, Shakeel Butt, Muchun Song, Andrew Morton,
SeongJae Park, open list:CONTROL GROUP (CGROUP),
linux-mm, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Peter Zijlstra, Vlastimil Babka, Sebastian Sewior,
Steven Rostedt, Hou Tao, Michal Hocko, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, Kernel Team
On Thu, Jan 16, 2025 at 12:07:28PM -0800, Joshua Hahn wrote:
> On Wed, 15 Jan 2025 18:22:28 -0800 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > On Wed, Jan 15, 2025 at 4:12 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > On Tue, Jan 14, 2025 at 06:17:43PM -0800, Alexei Starovoitov wrote:
> > > > @@ -1851,7 +1856,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> > > > {
> > > > unsigned long flags;
> > > >
> > > > - local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > > > + if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
> > > > + /*
> > > > + * In case of unlikely failure to lock percpu stock_lock
> > > > + * uncharge memcg directly.
> > > > + */
> > > > + mem_cgroup_cancel_charge(memcg, nr_pages);
> > >
> > > mem_cgroup_cancel_charge() has been removed by a patch in mm-tree. Maybe
> > > we can either revive mem_cgroup_cancel_charge() or simply inline it
> > > here.
> >
> > Ouch.
> >
> > this one?
> > https://lore.kernel.org/all/20241211203951.764733-4-joshua.hahnjy@gmail.com/
> >
> > Joshua,
> >
> > could you hold on to that clean up?
> > Or leave mem_cgroup_cancel_charge() in place ?
>
> Hi Andrew,
>
> I think that the patch was moved into mm-stable earlier this week.
> I was wondering if it would be possible to revert the patch and
> replace it with this one below. The only difference is that I leave
> mem_cgroup_cancel_charge untouched in this version.
Let's not revert.
This is a bit of a weird function to keep around without the rest of
the transaction API. It doesn't need external linkage, either.
Alexei, can you please just open-code the two page_counter calls?
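For reference, a minimal sketch of what that open-coding could look like in
refill_stock()'s trylock-failure branch, assuming the existing
mem_cgroup_is_root()/page_counter_uncharge()/do_memsw_account() helpers in
memcontrol.c; this is illustrative only, not the posted patch:
	if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
		/*
		 * Unlikely failure to lock the percpu stock_lock:
		 * uncharge the memcg directly (open-coded
		 * mem_cgroup_cancel_charge()).
		 */
		if (!mem_cgroup_is_root(memcg)) {
			page_counter_uncharge(&memcg->memory, nr_pages);
			if (do_memsw_account())
				page_counter_uncharge(&memcg->memsw, nr_pages);
		}
		return;
	}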
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
2025-01-15 2:17 ` [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
2025-01-15 11:19 ` Vlastimil Babka
@ 2025-01-17 18:19 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-01-17 18:19 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, andrii, memxor, akpm, peterz, vbabka, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On 2025-01-14 18:17:40 [-0800], Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Tracing BPF programs execute from tracepoints and kprobes where
> running context is unknown, but they need to request additional
> memory. The prior workarounds were using pre-allocated memory and
> BPF specific freelists to satisfy such allocation requests.
> Instead, introduce gfpflags_allow_spinning() condition that signals
> to the allocator that running context is unknown.
> Then rely on percpu free list of pages to allocate a page.
> try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
> rmqueue_pcplist() will spin_trylock to grab the page from percpu
> free list. If it fails (due to re-entrancy or list being empty)
> then rmqueue_bulk()/rmqueue_buddy() will attempt to
> spin_trylock zone->lock and grab the page from there.
> spin_trylock() is not safe in RT when in NMI or in hard IRQ.
> Bail out early in such a case.
>
> The support for gfpflags_allow_spinning() mode for free_page and memcg
> comes in the next patches.
>
> This is a first step towards supporting BPF requirements in SLUB
> and getting rid of bpf_mem_alloc.
> That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
>
> Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
could you…
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
…
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1cb4b8c8886d..74c2a7af1a77 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7023,3 +7032,86 @@ static bool __free_unaccepted(struct page *page)
> }
>
> #endif /* CONFIG_UNACCEPTED_MEMORY */
> +
> +/**
> + * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
> + * @nid - node to allocate from
> + * @order - allocation order size
> + *
> + * Allocates pages of a given order from the given node. This is safe to
> + * call from any context (from atomic, NMI, and also reentrant
> + * allocator -> tracepoint -> try_alloc_pages_noprof).
> + * Allocation is best effort and expected to fail easily, so nobody should
> + * rely on its success. Failures are not reported via warn_alloc().
Could you maybe add a pointer like "See AlwaysFailRestrictions below."
or something similar, to make the user aware of the comment further down
where certain always-fail restrictions are mentioned, such as PREEMPT_RT +
NMI or deferred_pages_enabled()? It is easy to miss otherwise.
I'm curious how this turns out in the long run :)
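One possible shape for such a pointer in the kernel-doc, as a sketch (the
wording is purely illustrative, not from the patch):
 * Note: some contexts always fail, e.g. NMI/hard IRQ on PREEMPT_RT and
 * while deferred_pages_enabled(); see the comments inside the function body.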
> + *
> + * Return: allocated page or NULL on failure.
> + */
> +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> +{
> + /*
> + * Do not specify __GFP_DIRECT_RECLAIM, since direct reclaim is not allowed.
> + * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> + * is not safe in arbitrary context.
> + *
> + * These two are the conditions for gfpflags_allow_spinning() being true.
> + *
> + * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
> + * to warn. Also warn would trigger printk() which is unsafe from
> + * various contexts. We cannot use printk_deferred_enter() to mitigate,
> + * since the running context is unknown.
> + *
> + * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> + * is safe in any context. Also zeroing the page is mandatory for
> + * BPF use cases.
> + *
> + * Though __GFP_NOMEMALLOC is not checked in the code path below,
> + * specify it here to highlight that try_alloc_pages()
> + * doesn't want to deplete reserves.
> + */
> + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> + unsigned int alloc_flags = ALLOC_TRYLOCK;
> + struct alloc_context ac = { };
> + struct page *page;
> +
> + /*
> + * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
PREEMPT_RT please. s/may/will
> + * If spin_trylock() is called from hard IRQ the current task may be
> + * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
> + * task as the owner of another rt_spin_lock which will confuse PI
> + * logic, so return immediately if called from hard IRQ or NMI.
> + *
> + * Note, irqs_disabled() case is ok. This function can be called
> + * from raw_spin_lock_irqsave region.
> + */
> + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> + return NULL;
> + if (!pcp_allowed_order(order))
> + return NULL;
…
Sebastian
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v5 2/7] mm, bpf: Introduce free_pages_nolock()
2025-01-15 2:17 ` [PATCH bpf-next v5 2/7] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
2025-01-15 11:47 ` Vlastimil Babka
@ 2025-01-17 18:20 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-01-17 18:20 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, andrii, memxor, akpm, peterz, vbabka, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On 2025-01-14 18:17:41 [-0800], Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce free_pages_nolock() that can free pages without taking locks.
> It relies on trylock and can be called from any context.
> Since spin_trylock() cannot be used in RT from hard IRQ or NMI,
> it uses a lockless link list to stash the pages, which will be freed
> by a subsequent free_pages() from a good context.
>
> Do not use llist unconditionally. BPF maps continuously
> allocate/free, so we cannot unconditionally delay the freeing to
> llist. When the memory becomes free make it available to the
> kernel and BPF users right away if possible, and fall back to
> llist as the last resort.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
…
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 74c2a7af1a77..a9c639e3db91 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1247,13 +1250,44 @@ static void split_large_buddy(struct zone *zone, struct page *page,
…
> static void free_one_page(struct zone *zone, struct page *page,
> unsigned long pfn, unsigned int order,
> fpi_t fpi_flags)
> {
> + struct llist_head *llhead;
> unsigned long flags;
>
> - spin_lock_irqsave(&zone->lock, flags);
> + if (!spin_trylock_irqsave(&zone->lock, flags)) {
> + if (unlikely(fpi_flags & FPI_TRYLOCK)) {
> + add_page_to_zone_llist(zone, page, order);
> + return;
> + }
> + spin_lock_irqsave(&zone->lock, flags);
> + }
> +
> + /* The lock succeeded. Process deferred pages. */
> + llhead = &zone->trylock_free_pages;
> + if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {
Thank you.
> + struct llist_node *llnode;
> + struct page *p, *tmp;
> +
> + llnode = llist_del_all(llhead);
> + llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
> + unsigned int p_order = p->order;
> +
> + split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
> + __count_vm_events(PGFREE, 1 << p_order);
> + }
> + }
Sebastian
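For readers skimming the thread, a minimal caller-side sketch of how the two
interfaces from patches 1 and 2 are meant to pair up. The helper names below
are invented for illustration and the signatures are assumed from the patch
descriptions; see patch 7 for the real BPF usage:
/* Illustrative helpers only; any context, including NMI, is acceptable. */
static struct page *grab_scratch_page(int nid)
{
	/* May return NULL at any time; callers must tolerate failure. */
	return try_alloc_pages(nid, 0);
}
static void drop_scratch_page(struct page *page)
{
	/* Never spins: defers to the per-zone llist if zone->lock is contended. */
	free_pages_nolock(page, 0);
}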
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave()
2025-01-15 2:17 ` [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
2025-01-15 2:23 ` Alexei Starovoitov
2025-01-15 14:22 ` Vlastimil Babka
@ 2025-01-17 20:33 ` Sebastian Andrzej Siewior
2025-01-21 15:59 ` Vlastimil Babka
2 siblings, 1 reply; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-01-17 20:33 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, andrii, memxor, akpm, peterz, vbabka, rostedt, houtao1,
hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On 2025-01-14 18:17:42 [-0800], Alexei Starovoitov wrote:
> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> index 8dd71fbbb6d2..93672127c73d 100644
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -75,37 +85,73 @@ do { \
>
> #define __local_lock(lock) \
> do { \
> + local_lock_t *l; \
> preempt_disable(); \
> - local_lock_acquire(this_cpu_ptr(lock)); \
> + l = this_cpu_ptr(lock); \
> + lockdep_assert(l->active == 0); \
> + WRITE_ONCE(l->active, 1); \
> + local_lock_acquire(l); \
> } while (0)
…
> +#define __local_trylock_irqsave(lock, flags) \
> + ({ \
> + local_lock_t *l; \
> + local_irq_save(flags); \
> + l = this_cpu_ptr(lock); \
> + if (READ_ONCE(l->active) == 1) { \
> + local_irq_restore(flags); \
> + l = NULL; \
> + } else { \
> + WRITE_ONCE(l->active, 1); \
> + local_trylock_acquire(l); \
> + } \
> + !!l; \
> + })
> +
Part of the selling for local_lock_t was that it does not affect
!PREEMPT_RT builds. By adding `active' you extend every data structure
and you have an extra write on every local_lock(). It was meant to
replace preempt_disable()/ local_irq_save() based locking with something
that actually does locking on PREEMPT_RT without risking my life once
people with pitchforks come talk about the new locking :)
I admire that you try to make PREEMPT_RT and !PREEMPT_RT similar in a
way that both detect recursive locking, which is not meant to be supported.
Realistically speaking, as of today we don't have any recursive-lock
detection other than lockdep, so that should be enough, given that the bots
use it often and hopefully local testing does too.
Your assert in local_lock() does not work without lockdep, so only
local_trylock_irqsave() ends up detecting recursion; with lockdep enabled,
local_lock() produces two splats (one from the assert and a second from
lockdep itself).
I would say you could get rid of the `active' field and solely rely on
lockdep and the owner field. So __local_trylock_irqsave() could maybe
use local_trylock_acquire() to conditionally acquire the lock if `owner'
is NULL.
This makes it possible to have recursive code without lockdep, but with
lockdep enabled the testcase should fail if it relies on recursion.
Other than that, I don't see any advantage. Would that work?
Sebastian
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave()
2025-01-17 20:33 ` Sebastian Andrzej Siewior
@ 2025-01-21 15:59 ` Vlastimil Babka
2025-01-21 16:43 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2025-01-21 15:59 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Alexei Starovoitov
Cc: bpf, andrii, memxor, akpm, peterz, rostedt, houtao1, hannes,
shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
kernel-team
On 1/17/25 9:33 PM, Sebastian Andrzej Siewior wrote:
> On 2025-01-14 18:17:42 [-0800], Alexei Starovoitov wrote:
>> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
>> index 8dd71fbbb6d2..93672127c73d 100644
>> --- a/include/linux/local_lock_internal.h
>> +++ b/include/linux/local_lock_internal.h
>> @@ -75,37 +85,73 @@ do { \
>>
>> #define __local_lock(lock) \
>> do { \
>> + local_lock_t *l; \
>> preempt_disable(); \
>> - local_lock_acquire(this_cpu_ptr(lock)); \
>> + l = this_cpu_ptr(lock); \
>> + lockdep_assert(l->active == 0); \
>> + WRITE_ONCE(l->active, 1); \
>> + local_lock_acquire(l); \
>> } while (0)
>
> …
>
>> +#define __local_trylock_irqsave(lock, flags) \
>> + ({ \
>> + local_lock_t *l; \
>> + local_irq_save(flags); \
>> + l = this_cpu_ptr(lock); \
>> + if (READ_ONCE(l->active) == 1) { \
>> + local_irq_restore(flags); \
>> + l = NULL; \
>> + } else { \
>> + WRITE_ONCE(l->active, 1); \
>> + local_trylock_acquire(l); \
>> + } \
>> + !!l; \
>> + })
>> +
>
> Part of the selling for local_lock_t was that it does not affect
> !PREEMPT_RT builds. By adding `active' you extend every data structure
> and you have an extra write on every local_lock(). It was meant to
> replace preempt_disable()/ local_irq_save() based locking with something
> that actually does locking on PREEMPT_RT without risking my life once
> people with pitchforks come talk about the new locking :)
>
> I admire that you try to make PREEMPT_RT and !PREEMPT_RT similar in a
> way that both detect recursive locking which not meant to be supported.
>
> Realistically speaking as of today we don't have any recursive lock
> detection other than lockdep. So it should be enough given that the bots
> use it often and hopefully local testing.
> Your assert in local_lock() does not work without lockdep. It will only
> make local_trylock_irqsave() detect recursion and lead to two splats
> with lockdep enabled in local_lock() (one from the assert and the second
> from lockdep).
>
> I would say you could get rid of the `active' field and solely rely on
> lockdep and the owner field. So __local_trylock_irqsave() could maybe
> use local_trylock_acquire() to conditionally acquire the lock if `owner'
> is NULL.
>
> This makes it possible to have recursive code without lockdep, but with
> lockdep enabled the testcase should fail if it relies on recursion.
> Other than that, I don't see any advantage. Would that work?
I don't think it would work, or am I missing something? The goal is to
allow the operation (alloc, free) to opportunistically succeed in e.g.
nmi context, but only if we didn't interrupt anything that holds the
lock. Otherwise we must allow for failure - hence trylock.
(a possible extension that I mentioned is to also stop doing irqsave to
avoid its overhead and thus also operations from an irq context would be
opportunistic)
But if we detect the "trylock must fail" cases only using lockdep, we'll
deadlock without lockdep. So e.g. the "active" flag has to be there?
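Concretely, the nesting that the 'active' flag is there to catch looks
roughly like this (illustrative sketch, not code from the series):
	/* task context, !PREEMPT_RT */
	local_lock_irqsave(&memcg_stock.stock_lock, flags);
	/* ...half-way through updating the per-CPU stock... */
		/* NMI fires; a tracing BPF program tries to charge memory: */
		if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags2))
			goto fallback;	/* must take this path */
	/*
	 * Without the 'active' flag (and without lockdep) nothing records
	 * that the interrupted context is inside the lock, so the trylock
	 * would "succeed" and the NMI path would modify the half-updated
	 * per-CPU state.
	 */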
So yes this goes beyond the original purpose of local_lock. Do you think
it should be a different lock type then, which would mean the other
users of current local_lock that don't want the opportunistic nesting
via trylock, would not inflict the "active" flag overhead?
AFAICS the RT implementation of local_lock could then be shared for both
of these types, but I might be missing some nuance there.
> Sebastian
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave()
2025-01-21 15:59 ` Vlastimil Babka
@ 2025-01-21 16:43 ` Sebastian Andrzej Siewior
2025-01-22 1:35 ` Alexei Starovoitov
0 siblings, 1 reply; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-01-21 16:43 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Alexei Starovoitov, bpf, andrii, memxor, akpm, peterz, rostedt,
houtao1, hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj,
linux-mm, kernel-team
On 2025-01-21 16:59:40 [+0100], Vlastimil Babka wrote:
> I don't think it would work, or am I missing something? The goal is to
> allow the operation (alloc, free) to opportunistically succeed in e.g.
> nmi context, but only if we didn't interrupt anything that holds the
> lock. Otherwise we must allow for failure - hence trylock.
> (a possible extension that I mentioned is to also stop doing irqsave to
> avoid its overhead and thus also operations from an irq context would be
> opportunistic)
> But if we detect the "trylock must fail" cases only using lockdep, we'll
> deadlock without lockdep. So e.g. the "active" flag has to be there?
You are right. I noticed that myself but didn't get to reply…
> So yes this goes beyond the original purpose of local_lock. Do you think
> it should be a different lock type then, which would mean the other
> users of current local_lock that don't want the opportunistic nesting
> via trylock, would not inflict the "active" flag overhead?
>
> AFAICS the RT implementation of local_lock could then be shared for both
> of these types, but I might be missing some nuance there.
I was thinking about this over the weekend. This implementation
extends the data structure by 4 bytes and adds a mandatory read/write
on every lock/unlock operation. This is what makes it a bit different
from the original.
If the local_lock_t is replaced with spinlock_t then the data structure
is still extended by four bytes (assuming no lockdep) and we still have a
mandatory read/write operation. The whole thing still does not work on
PREEMPT_RT but it isn't much different from what we have now. This is
kind of my favorite. This could be further optimized to avoid the atomic
operation given it is always local per-CPU memory. Maybe a
local_spinlock_t :)
Sebastian
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave()
2025-01-21 16:43 ` Sebastian Andrzej Siewior
@ 2025-01-22 1:35 ` Alexei Starovoitov
0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-01-22 1:35 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Vlastimil Babka, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Hou Tao,
Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
On Tue, Jan 21, 2025 at 8:43 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2025-01-21 16:59:40 [+0100], Vlastimil Babka wrote:
> > I don't think it would work, or am I missing something? The goal is to
> > allow the operation (alloc, free) to opportunistically succeed in e.g.
> > nmi context, but only if we didn't interrupt anything that holds the
> > lock. Otherwise we must allow for failure - hence trylock.
> > (a possible extension that I mentioned is to also stop doing irqsave to
> > avoid its overhead and thus also operations from an irq context would be
> > opportunistic)
> > But if we detect the "trylock must fail" cases only using lockdep, we'll
> > deadlock without lockdep. So e.g. the "active" flag has to be there?
>
> You are right. I noticed that myself but didn't get to reply…
>
> > So yes this goes beyond the original purpose of local_lock. Do you think
> > it should be a different lock type then, which would mean the other
> > users of current local_lock that don't want the opportunistic nesting
> > via trylock, would not inflict the "active" flag overhead?
> >
> > AFAICS the RT implementation of local_lock could then be shared for both
> > of these types, but I might be missing some nuance there.
>
> I was thinking about this over the weekend and this implementation
> extends the data structure by 4 bytes and has this mandatory read/ write
> on every lock/ unlock operation. This is what makes it a bit different
> than the original.
>
> If the local_lock_t is replaced with spinlock_t then the data structure
> is still extended by four bytes (assuming no lockdep) and we have a
> mandatory read/ write operation. The whole thing still does not work on
> PREEMPT_RT but it isn't much different from what we have now. This is
> kind of my favorite. This could be further optimized to avoid the atomic
> operation given it is always local per-CPU memory. Maybe a
> local_spinlock_t :)
I think the concern of people with pitchforks is overblown.
These are the only users of local_lock_t:
drivers/block/zram/zcomp.h: local_lock_t lock;
drivers/char/random.c: local_lock_t lock;
drivers/connector/cn_proc.c: local_lock_t lock;
drivers/md/raid5.h: local_lock_t lock;
kernel/softirq.c: local_lock_t lock;
mm/memcontrol.c: local_lock_t stock_lock;
mm/mlock.c: local_lock_t lock;
mm/slub.c: local_lock_t lock;
mm/swap.c: local_lock_t lock;
mm/swap.c: local_lock_t lock_irq;
mm/zsmalloc.c: local_lock_t lock;
net/bridge/br_netfilter_hooks.c: local_lock_t bh_lock;
net/core/skbuff.c: local_lock_t bh_lock;
net/ipv4/tcp_sigpool.c: local_lock_t bh_lock;
Networking is the one that really cares about performance,
and there 'int active' adds 4 bytes but no run-time overhead,
since it uses local_lock_nested_bh(), which doesn't touch 'active'.
memcontrol.c and slub.c are the ones that need the new trylock
logic with the 'active' field to protect against NMI on !RT.
They will change to the new local_trylock_t anyway.
What's left is not perf critical. A single WRITE_ONCE in lock
and another WRITE_ONCE in unlock is really in the noise.
Hence I went with the simple approach you see in this patch.
I can go with a new local_trylock_t and the _Generic() trick.
The patch will look like this:
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8dd71fbbb6d2..ed4623e0c71a 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -15,6 +15,14 @@ typedef struct {
#endif
} local_lock_t;
+typedef struct {
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+ struct task_struct *owner;
+#endif
+ int active;
+} local_trylock_t;
#define __local_lock_irqsave(lock, flags) \
do { \
+ local_lock_t *l; \
local_irq_save(flags); \
- local_lock_acquire(this_cpu_ptr(lock)); \
+ l = (local_lock_t *)this_cpu_ptr(lock); \
+ _Generic((lock), \
+ local_trylock_t *: \
+ ({ \
+ lockdep_assert(((local_trylock_t *)l)->active == 0); \
+ WRITE_ONCE(((local_trylock_t *)l)->active, 1); \
+ }), \
+ default:(void)0); \
+ local_lock_acquire(l); \
} while (0)
+
+#define __local_trylock_irqsave(lock, flags) \
+ ({ \
+ local_trylock_t *l; \
+ local_irq_save(flags); \
+ l = this_cpu_ptr(lock); \
+ if (READ_ONCE(l->active) == 1) { \
+ local_irq_restore(flags); \
+ l = NULL; \
+ } else { \
+ WRITE_ONCE(l->active, 1); \
+ local_trylock_acquire((local_lock_t *)l); \
+ } \
+ !!l; \
+ })
and use new local_trylock_t where necessary.
Like:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..8fe141e93a0b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1722,7 +1722,7 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
}
struct memcg_stock_pcp {
- local_lock_t stock_lock;
+ local_trylock_t stock_lock;
All lines where stock_lock is used will stay as-is.
So no code churn.
The _Generic() above isn't pretty, but it's not horrible either.
I guess that's a decent trade-off.
Better ideas?
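A tiny standalone C11 illustration of the _Generic dispatch being proposed
(userspace sketch with invented type and macro names, not the kernel code;
only the selected branch is evaluated, but every branch must still compile,
hence the cast through void *):
#include <assert.h>
typedef struct { int owner; } local_lock_t;
typedef struct { int owner; int active; } local_trylock_t;
/* Marks ->active only when handed a local_trylock_t; a plain local_lock_t
 * selects the default no-op branch, so it pays no runtime cost. */
#define mark_active(lock)						\
	_Generic((lock),						\
		 local_trylock_t *:					\
			(void)(((local_trylock_t *)(void *)(lock))->active = 1), \
		 default: (void)0)
int main(void)
{
	local_lock_t plain = { 0 };
	local_trylock_t trylock = { 0, 0 };
	mark_active(&plain);	/* no-op: default branch selected */
	mark_active(&trylock);	/* sets trylock.active = 1 */
	assert(trylock.active == 1);
	return 0;
}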
^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2025-01-22 1:36 UTC | newest]
Thread overview: 37+ messages
2025-01-15 2:17 [PATCH bpf-next v5 0/7] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
2025-01-15 2:17 ` [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
2025-01-15 11:19 ` Vlastimil Babka
2025-01-15 23:00 ` Alexei Starovoitov
2025-01-15 23:47 ` Shakeel Butt
2025-01-16 2:44 ` Alexei Starovoitov
2025-01-15 23:16 ` Shakeel Butt
2025-01-17 18:19 ` Sebastian Andrzej Siewior
2025-01-15 2:17 ` [PATCH bpf-next v5 2/7] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
2025-01-15 11:47 ` Vlastimil Babka
2025-01-15 23:15 ` Alexei Starovoitov
2025-01-16 8:31 ` Vlastimil Babka
2025-01-17 18:20 ` Sebastian Andrzej Siewior
2025-01-15 2:17 ` [PATCH bpf-next v5 3/7] locking/local_lock: Introduce local_trylock_irqsave() Alexei Starovoitov
2025-01-15 2:23 ` Alexei Starovoitov
2025-01-15 7:22 ` Sebastian Sewior
2025-01-15 14:22 ` Vlastimil Babka
2025-01-16 2:20 ` Alexei Starovoitov
2025-01-17 20:33 ` Sebastian Andrzej Siewior
2025-01-21 15:59 ` Vlastimil Babka
2025-01-21 16:43 ` Sebastian Andrzej Siewior
2025-01-22 1:35 ` Alexei Starovoitov
2025-01-15 2:17 ` [PATCH bpf-next v5 4/7] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
2025-01-15 16:07 ` Vlastimil Babka
2025-01-16 0:12 ` Shakeel Butt
2025-01-16 2:22 ` Alexei Starovoitov
2025-01-16 20:07 ` Joshua Hahn
2025-01-17 17:36 ` Johannes Weiner
2025-01-15 2:17 ` [PATCH bpf-next v5 5/7] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
2025-01-15 17:51 ` Vlastimil Babka
2025-01-16 0:24 ` Shakeel Butt
2025-01-15 2:17 ` [PATCH bpf-next v5 6/7] mm: Make failslab, kfence, kmemleak aware of trylock mode Alexei Starovoitov
2025-01-15 17:57 ` Vlastimil Babka
2025-01-16 2:23 ` Alexei Starovoitov
2025-01-15 2:17 ` [PATCH bpf-next v5 7/7] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
2025-01-15 18:02 ` Vlastimil Babka
2025-01-16 2:25 ` Alexei Starovoitov