* [PATCH] vmalloc: support __GFP_RETRY_MAYFAIL and __GFP_NORETRY
2026-03-02 11:47 [PATCH] mm/vmalloc: Fix incorrect size reporting on allocation failure Uladzislau Rezki (Sony)
@ 2026-03-02 11:47 ` Uladzislau Rezki (Sony)
2026-03-02 17:38 ` Mikulas Patocka
2026-03-02 14:52 ` [PATCH] mm/vmalloc: Fix incorrect size reporting on allocation failure Dev Jain
2026-03-02 17:41 ` Mikulas Patocka
2 siblings, 1 reply; 6+ messages in thread
From: Uladzislau Rezki (Sony) @ 2026-03-02 11:47 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Michal Hocko, Mikulas Patocka, Vishal Moola, Baoquan He, LKML,
Uladzislau Rezki
From: Michal Hocko <mhocko@suse.com>
__GFP_RETRY_MAYFAIL and __GFP_NORETRY haven't been supported so far
because their semantics (i.e. not triggering the OOM killer) cannot be
honoured by the existing vmalloc page table allocation, which allows
the OOM killer to be invoked.
Example: __vmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
<snip>
vmalloc_test/55 invoked oom-killer:
gfp_mask=0x40dc0(
GFP_KERNEL|__GFP_ZERO|__GFP_COMP), order=0, oom_score_adj=0
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:0 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0
slab_reclaimable:700 slab_unreclaimable:33708
mapped:0 shmem:0 pagetables:5174
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:850 free_pcp:319 free_cma:0
CPU: 4 UID: 0 PID: 639 Comm: vmalloc_test/55 ...
Hardware name: QEMU Standard PC (i440FX + PIIX, ...
Call Trace:
<TASK>
dump_stack_lvl+0x5d/0x80
dump_header+0x43/0x1b3
out_of_memory.cold+0x8/0x78
__alloc_pages_slowpath.constprop.0+0xef5/0x1130
__alloc_frozen_pages_noprof+0x312/0x330
alloc_pages_mpol+0x7d/0x160
alloc_pages_noprof+0x50/0xa0
__pte_alloc_kernel+0x1e/0x1f0
...
<snip>
There are use cases for these modifiers where a large allocation request
should rather fail than trigger the OOM killer, which wouldn't be able to
handle the situation anyway [1].
While we cannot easily change the existing page table allocation code, we can
piggyback on the scoped NOWAIT allocation that we already have in
place for them. The rationale is that the bulk of the consumed memory sits
in the pages backing the vmalloc allocation. Page tables account for only
a tiny fraction. Moreover, page tables for virtually allocated
areas are never reclaimed, so the longer the system runs, the less likely
it is that new ones need to be allocated. It makes sense to allow an
approximation of __GFP_RETRY_MAYFAIL
and __GFP_NORETRY even if the page table allocation part is much weaker.
This doesn't break the failure mode while it allows for the no-OOM
semantics.
[1] https://lore.kernel.org/all/32bd9bed-a939-69c4-696d-f7f9a5fe31d8@redhat.com/T/#u
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
mm/vmalloc.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a06f4b3ea367..975592b0ec89 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3798,6 +3798,8 @@ static void defer_vm_area_cleanup(struct vm_struct *area)
* non-blocking (no __GFP_DIRECT_RECLAIM) - memalloc_noreclaim_save()
* GFP_NOFS - memalloc_nofs_save()
* GFP_NOIO - memalloc_noio_save()
+ * __GFP_RETRY_MAYFAIL, __GFP_NORETRY - memalloc_noreclaim_save()
+ * to prevent OOMs
*
* Returns a flag cookie to pair with restore.
*/
@@ -3806,7 +3808,8 @@ memalloc_apply_gfp_scope(gfp_t gfp_mask)
{
unsigned int flags = 0;
- if (!gfpflags_allow_blocking(gfp_mask))
+ if (!gfpflags_allow_blocking(gfp_mask) ||
+ (gfp_mask & (__GFP_RETRY_MAYFAIL | __GFP_NORETRY)))
flags = memalloc_noreclaim_save();
else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
flags = memalloc_nofs_save();
@@ -3940,7 +3943,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
* GFP_KERNEL_ACCOUNT. Xfs uses __GFP_NOLOCKDEP.
*/
#define GFP_VMALLOC_SUPPORTED (GFP_KERNEL | GFP_ATOMIC | GFP_NOWAIT |\
- __GFP_NOFAIL | __GFP_ZERO | __GFP_NORETRY |\
+ __GFP_NOFAIL | __GFP_ZERO |\
+ __GFP_NORETRY | __GFP_RETRY_MAYFAIL |\
GFP_NOFS | GFP_NOIO | GFP_KERNEL_ACCOUNT |\
GFP_USER | __GFP_NOLOCKDEP)
@@ -3971,12 +3975,15 @@ static gfp_t vmalloc_fix_flags(gfp_t flags)
* virtual range with protection @prot.
*
* Supported GFP classes: %GFP_KERNEL, %GFP_ATOMIC, %GFP_NOWAIT,
- * %GFP_NOFS and %GFP_NOIO. Zone modifiers are not supported.
+ * %__GFP_RETRY_MAYFAIL, %__GFP_NORETRY, %GFP_NOFS and %GFP_NOIO.
+ * Zone modifiers are not supported.
* Please note %GFP_ATOMIC and %GFP_NOWAIT are supported only
* by __vmalloc().
*
- * Retry modifiers: only %__GFP_NOFAIL is supported; %__GFP_NORETRY
- * and %__GFP_RETRY_MAYFAIL are not supported.
+ * Retry modifiers: only %__GFP_NOFAIL is fully supported;
+ * %__GFP_NORETRY and %__GFP_RETRY_MAYFAIL are supported with limitation,
+ * i.e. page tables are allocated with NOWAIT semantic so they might fail
+ * under moderate memory pressure.
*
* %__GFP_NOWARN can be used to suppress failure messages.
*
--
2.47.3
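The scope selection that the patch changes can be modelled in userspace as the sketch below. It is not kernel code: every DEMO_* constant and demo_* name is hypothetical, the bit values are arbitrary, and the NOIO branch is an assumption inferred from the comment block in the first hunk, since the hunk shown above ends before it.

```c
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Hypothetical bit values, chosen only for this illustration. */
#define DEMO_GFP_IO             0x01u
#define DEMO_GFP_FS             0x02u
#define DEMO_GFP_DIRECT_RECLAIM 0x04u
#define DEMO_GFP_RETRY_MAYFAIL  0x08u
#define DEMO_GFP_NORETRY        0x10u
#define DEMO_GFP_KERNEL \
	(DEMO_GFP_IO | DEMO_GFP_FS | DEMO_GFP_DIRECT_RECLAIM)

/* Which scoped-allocation mode the page table allocation would run under. */
enum demo_scope {
	DEMO_SCOPE_NONE,
	DEMO_SCOPE_NORECLAIM,
	DEMO_SCOPE_NOFS,
	DEMO_SCOPE_NOIO,
};

/* Stand-in for gfpflags_allow_blocking(): blocking needs direct reclaim. */
static bool demo_allow_blocking(gfp_t gfp)
{
	return (gfp & DEMO_GFP_DIRECT_RECLAIM) != 0;
}

/* Mirrors the patched condition: retry modifiers now pick the no-reclaim scope. */
static enum demo_scope demo_pick_scope(gfp_t gfp)
{
	if (!demo_allow_blocking(gfp) ||
	    (gfp & (DEMO_GFP_RETRY_MAYFAIL | DEMO_GFP_NORETRY)))
		return DEMO_SCOPE_NORECLAIM;
	else if ((gfp & (DEMO_GFP_FS | DEMO_GFP_IO)) == DEMO_GFP_IO)
		return DEMO_SCOPE_NOFS;
	else if ((gfp & (DEMO_GFP_FS | DEMO_GFP_IO)) == 0)
		return DEMO_SCOPE_NOIO;	/* assumed branch, not shown in the hunk */

	return DEMO_SCOPE_NONE;
}
```

In this model a plain DEMO_GFP_KERNEL request gets no scope, while DEMO_GFP_KERNEL | DEMO_GFP_RETRY_MAYFAIL is routed to the no-reclaim scope, the same one already used for non-blocking requests.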
* Re: [PATCH] vmalloc: support __GFP_RETRY_MAYFAIL and __GFP_NORETRY
2026-03-02 11:47 ` [PATCH] vmalloc: support __GFP_RETRY_MAYFAIL and __GFP_NORETRY Uladzislau Rezki (Sony)
@ 2026-03-02 17:38 ` Mikulas Patocka
2026-03-02 18:51 ` Michal Hocko
0 siblings, 1 reply; 6+ messages in thread
From: Mikulas Patocka @ 2026-03-02 17:38 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Michal Hocko, Vishal Moola, Baoquan He, LKML
On Mon, 2 Mar 2026, Uladzislau Rezki (Sony) wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> __GFP_RETRY_MAYFAIL and __GFP_NORETRY haven't been supported so far
> because their semantic (i.e. to not trigger OOM killer) is not possible
> with the existing vmalloc page table allocation which is allowing for
> the OOM killer.
>
> Example: __vmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>
> <snip>
> vmalloc_test/55 invoked oom-killer:
> gfp_mask=0x40dc0(
> GFP_KERNEL|__GFP_ZERO|__GFP_COMP), order=0, oom_score_adj=0
> active_anon:0 inactive_anon:0 isolated_anon:0
> active_file:0 inactive_file:0 isolated_file:0
> unevictable:0 dirty:0 writeback:0
> slab_reclaimable:700 slab_unreclaimable:33708
> mapped:0 shmem:0 pagetables:5174
> sec_pagetables:0 bounce:0
> kernel_misc_reclaimable:0
> free:850 free_pcp:319 free_cma:0
> CPU: 4 UID: 0 PID: 639 Comm: vmalloc_test/55 ...
> Hardware name: QEMU Standard PC (i440FX + PIIX, ...
> Call Trace:
> <TASK>
> dump_stack_lvl+0x5d/0x80
> dump_header+0x43/0x1b3
> out_of_memory.cold+0x8/0x78
> __alloc_pages_slowpath.constprop.0+0xef5/0x1130
> __alloc_frozen_pages_noprof+0x312/0x330
> alloc_pages_mpol+0x7d/0x160
> alloc_pages_noprof+0x50/0xa0
> __pte_alloc_kernel+0x1e/0x1f0
> ...
> <snip>
>
> There are usecases for these modifiers when a large allocation request
> should rather fail than trigger OOM killer which wouldn't be able to
> handle the situation anyway [1].
>
> While we cannot change existing page table allocation code easily we can
> piggy back on scoped NOWAIT allocation for them that we already have in
> place. The rationale is that the bulk of the consumed memory is sitting
> in pages backing the vmalloc allocation. Page tables are only
> participating a tiny fraction. Moreover page tables for virtually allocated
> areas are never reclaimed so the longer the system runs to less likely
> they are. It makes sense to allow an approximation of __GFP_RETRY_MAYFAIL
> and __GFP_NORETRY even if the page table allocation part is much weaker.
> This doesn't break the failure mode while it allows for the no OOM
> semantic.
>
> [1] https://lore.kernel.org/all/32bd9bed-a939-69c4-696d-f7f9a5fe31d8@redhat.com/T/#u
>
> Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
> mm/vmalloc.c | 17 ++++++++++++-----
> 1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a06f4b3ea367..975592b0ec89 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3798,6 +3798,8 @@ static void defer_vm_area_cleanup(struct vm_struct *area)
> * non-blocking (no __GFP_DIRECT_RECLAIM) - memalloc_noreclaim_save()
> * GFP_NOFS - memalloc_nofs_save()
> * GFP_NOIO - memalloc_noio_save()
> + * __GFP_RETRY_MAYFAIL, __GFP_NORETRY - memalloc_noreclaim_save()
> + * to prevent OOMs
> *
> * Returns a flag cookie to pair with restore.
> */
> @@ -3806,7 +3808,8 @@ memalloc_apply_gfp_scope(gfp_t gfp_mask)
> {
> unsigned int flags = 0;
>
> - if (!gfpflags_allow_blocking(gfp_mask))
> + if (!gfpflags_allow_blocking(gfp_mask) ||
> + (gfp_mask & (__GFP_RETRY_MAYFAIL | __GFP_NORETRY)))
> flags = memalloc_noreclaim_save();
I wouldn't do this because:
1. it makes the __GFP_RETRY_MAYFAIL allocations unreliable.
2. The comment at memalloc_noreclaim_save says that it may deplete memory
reserves: "This should only be used when the caller guarantees the
allocation will allow more memory to be freed very shortly, i.e. it needs
to allocate some memory in the process of freeing memory, and cannot
reclaim due to potential recursion."
I think that the cleanest solution to this problem would be to get rid of
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO and instead introduce two per-thread
variables "gfp_t set_flags" and "gfp_t clear_flags", and set and clear gfp
flags according to them in the allocator:
"gfp = (gfp | current->set_flags) & ~current->clear_flags".
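For illustration only, the proposed masking can be modelled in userspace as follows. This is a sketch under stated assumptions: struct demo_task stands in for the per-thread state, the DEMO_* bit values are arbitrary, and none of these names exist in the kernel.

```c
typedef unsigned int gfp_t;

/* Hypothetical bit values, chosen only for this illustration. */
#define DEMO_GFP_IO      0x1u
#define DEMO_GFP_FS      0x2u
#define DEMO_GFP_RECLAIM 0x4u
#define DEMO_GFP_KERNEL  (DEMO_GFP_IO | DEMO_GFP_FS | DEMO_GFP_RECLAIM)

/* Stand-in for the two proposed per-thread variables. */
struct demo_task {
	gfp_t set_flags;	/* bits forced on for every allocation */
	gfp_t clear_flags;	/* bits forced off for every allocation */
};

/* The expression from the proposal, applied to a caller-supplied mask. */
static gfp_t demo_apply_task_flags(const struct demo_task *t, gfp_t gfp)
{
	return (gfp | t->set_flags) & ~t->clear_flags;
}
```

With clear_flags set to DEMO_GFP_FS, a DEMO_GFP_KERNEL request would be reduced to DEMO_GFP_IO | DEMO_GFP_RECLAIM in this model, which is the effect a NOFS scope is meant to have. Note that clearing is applied last, so a bit present in both fields ends up cleared.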
Mikulas
* Re: [PATCH] vmalloc: support __GFP_RETRY_MAYFAIL and __GFP_NORETRY
2026-03-02 17:38 ` Mikulas Patocka
@ 2026-03-02 18:51 ` Michal Hocko
0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2026-03-02 18:51 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Uladzislau Rezki (Sony),
linux-mm, Andrew Morton, Vishal Moola, Baoquan He, LKML
On Mon 02-03-26 18:38:53, Mikulas Patocka wrote:
>
>
> On Mon, 2 Mar 2026, Uladzislau Rezki (Sony) wrote:
>
> > From: Michal Hocko <mhocko@suse.com>
> >
> > __GFP_RETRY_MAYFAIL and __GFP_NORETRY haven't been supported so far
> > because their semantic (i.e. to not trigger OOM killer) is not possible
> > with the existing vmalloc page table allocation which is allowing for
> > the OOM killer.
> >
> > Example: __vmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> >
> > <snip>
> > vmalloc_test/55 invoked oom-killer:
> > gfp_mask=0x40dc0(
> > GFP_KERNEL|__GFP_ZERO|__GFP_COMP), order=0, oom_score_adj=0
> > active_anon:0 inactive_anon:0 isolated_anon:0
> > active_file:0 inactive_file:0 isolated_file:0
> > unevictable:0 dirty:0 writeback:0
> > slab_reclaimable:700 slab_unreclaimable:33708
> > mapped:0 shmem:0 pagetables:5174
> > sec_pagetables:0 bounce:0
> > kernel_misc_reclaimable:0
> > free:850 free_pcp:319 free_cma:0
> > CPU: 4 UID: 0 PID: 639 Comm: vmalloc_test/55 ...
> > Hardware name: QEMU Standard PC (i440FX + PIIX, ...
> > Call Trace:
> > <TASK>
> > dump_stack_lvl+0x5d/0x80
> > dump_header+0x43/0x1b3
> > out_of_memory.cold+0x8/0x78
> > __alloc_pages_slowpath.constprop.0+0xef5/0x1130
> > __alloc_frozen_pages_noprof+0x312/0x330
> > alloc_pages_mpol+0x7d/0x160
> > alloc_pages_noprof+0x50/0xa0
> > __pte_alloc_kernel+0x1e/0x1f0
> > ...
> > <snip>
> >
> > There are usecases for these modifiers when a large allocation request
> > should rather fail than trigger OOM killer which wouldn't be able to
> > handle the situation anyway [1].
> >
> > While we cannot change existing page table allocation code easily we can
> > piggy back on scoped NOWAIT allocation for them that we already have in
> > place. The rationale is that the bulk of the consumed memory is sitting
> > in pages backing the vmalloc allocation. Page tables are only
> > participating a tiny fraction. Moreover page tables for virtually allocated
> > areas are never reclaimed so the longer the system runs to less likely
> > they are. It makes sense to allow an approximation of __GFP_RETRY_MAYFAIL
> > and __GFP_NORETRY even if the page table allocation part is much weaker.
> > This doesn't break the failure mode while it allows for the no OOM
> > semantic.
> >
> > [1] https://lore.kernel.org/all/32bd9bed-a939-69c4-696d-f7f9a5fe31d8@redhat.com/T/#u
> >
> > Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> > mm/vmalloc.c | 17 ++++++++++++-----
> > 1 file changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index a06f4b3ea367..975592b0ec89 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3798,6 +3798,8 @@ static void defer_vm_area_cleanup(struct vm_struct *area)
> > * non-blocking (no __GFP_DIRECT_RECLAIM) - memalloc_noreclaim_save()
> > * GFP_NOFS - memalloc_nofs_save()
> > * GFP_NOIO - memalloc_noio_save()
> > + * __GFP_RETRY_MAYFAIL, __GFP_NORETRY - memalloc_noreclaim_save()
> > + * to prevent OOMs
> > *
> > * Returns a flag cookie to pair with restore.
> > */
> > @@ -3806,7 +3808,8 @@ memalloc_apply_gfp_scope(gfp_t gfp_mask)
> > {
> > unsigned int flags = 0;
> >
> > - if (!gfpflags_allow_blocking(gfp_mask))
> > + if (!gfpflags_allow_blocking(gfp_mask) ||
> > + (gfp_mask & (__GFP_RETRY_MAYFAIL | __GFP_NORETRY)))
> > flags = memalloc_noreclaim_save();
>
> I wouldn't do this because:
>
> 1. it makes the __GFP_RETRY_MAYFAIL allocations unreliable.
__GFP_RETRY_MAYFAIL doesn't provide any reliability. It just promises
not to OOM while trying hard. I believe this implementation doesn't
break that promise.
> 2. The comment at memalloc_noreclaim_save says that it may deplete memory
> reserves: "This should only be used when the caller guarantees the
> allocation will allow more memory to be freed very shortly, i.e. it needs
> to allocate some memory in the process of freeing memory, and cannot
> reclaim due to potential recursion."
Yes, this allocation clearly doesn't guarantee to free more memory. That
comment is rather dated. Anyway, the crux is to make sure that the
allocation is not unbounded. The idea behind this decision is that the
page tables are only a tiny fraction of the resulting memory allocated.
Moreover, this virtually allocated space is recycled, so over time
fewer and fewer page tables should need to be allocated as well.
> I think that the cleanest solution to this problem would be to get rid of
> PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO and instead introduce two per-thread
> variables "gfp_t set_flags" and "gfp_t clear_flags" and set and clear gfp
> flags according to them in the allocator: "gfp = (gfp |
> current->set_flags) & ~current->clear_flags";
We've been through discussions like this one way too many times and the
conclusion is that no, this will not work. The gfp space we have and
need to support without rewriting a large part of the kernel is simply
incompatible with a saner interface. Yeah, I hate that as well but
here we are. We need to be creative to stay sensible and not introduce
even more weirdness to the interface.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/vmalloc: Fix incorrect size reporting on allocation failure
2026-03-02 11:47 [PATCH] mm/vmalloc: Fix incorrect size reporting on allocation failure Uladzislau Rezki (Sony)
2026-03-02 11:47 ` [PATCH] vmalloc: support __GFP_RETRY_MAYFAIL and __GFP_NORETRY Uladzislau Rezki (Sony)
@ 2026-03-02 14:52 ` Dev Jain
2026-03-02 17:41 ` Mikulas Patocka
2 siblings, 0 replies; 6+ messages in thread
From: Dev Jain @ 2026-03-02 14:52 UTC (permalink / raw)
To: Uladzislau Rezki (Sony), linux-mm, Andrew Morton
Cc: Michal Hocko, Mikulas Patocka, Vishal Moola, Baoquan He, LKML
On 02/03/26 5:17 pm, Uladzislau Rezki (Sony) wrote:
> When __vmalloc_area_node() fails to allocate pages, the failure
> message may report an incorrect allocation size, for example:
>
> vmalloc error: size 0, failed to allocate pages, ...
>
> This happens because the warning prints area->nr_pages * PAGE_SIZE.
> At this point, area->nr_pages may be zero or partly populated thus
> it is not valid.
>
> Report the originally requested allocation size instead by using
> nr_small_pages * PAGE_SIZE, which reflects the actual number of
> pages being requested by user.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
LGTM
Reviewed-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/vmalloc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 61caa55a4402..a06f4b3ea367 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3901,7 +3901,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> if (!fatal_signal_pending(current) && page_order == 0)
> warn_alloc(gfp_mask, NULL,
> "vmalloc error: size %lu, failed to allocate pages",
> - area->nr_pages * PAGE_SIZE);
> + nr_small_pages * PAGE_SIZE);
> goto fail;
> }
>
* Re: [PATCH] mm/vmalloc: Fix incorrect size reporting on allocation failure
2026-03-02 11:47 [PATCH] mm/vmalloc: Fix incorrect size reporting on allocation failure Uladzislau Rezki (Sony)
2026-03-02 11:47 ` [PATCH] vmalloc: support __GFP_RETRY_MAYFAIL and __GFP_NORETRY Uladzislau Rezki (Sony)
2026-03-02 14:52 ` [PATCH] mm/vmalloc: Fix incorrect size reporting on allocation failure Dev Jain
@ 2026-03-02 17:41 ` Mikulas Patocka
2 siblings, 0 replies; 6+ messages in thread
From: Mikulas Patocka @ 2026-03-02 17:41 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Michal Hocko, Vishal Moola, Baoquan He, LKML
On Mon, 2 Mar 2026, Uladzislau Rezki (Sony) wrote:
> When __vmalloc_area_node() fails to allocate pages, the failure
> message may report an incorrect allocation size, for example:
>
> vmalloc error: size 0, failed to allocate pages, ...
>
> This happens because the warning prints area->nr_pages * PAGE_SIZE.
> At this point, area->nr_pages may be zero or partly populated thus
> it is not valid.
>
> Report the originally requested allocation size instead by using
> nr_small_pages * PAGE_SIZE, which reflects the actual number of
> pages being requested by user.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
> mm/vmalloc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 61caa55a4402..a06f4b3ea367 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3901,7 +3901,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> if (!fatal_signal_pending(current) && page_order == 0)
> warn_alloc(gfp_mask, NULL,
> "vmalloc error: size %lu, failed to allocate pages",
> - area->nr_pages * PAGE_SIZE);
> + nr_small_pages * PAGE_SIZE);
> goto fail;
> }
>
> --
> 2.47.3
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Mikulas
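The effect of the one-line fix can be shown with a small userspace model. This is only a sketch: struct demo_area, DEMO_PAGE_SIZE and both helper functions are hypothetical stand-ins, and the 4096-byte page size is an assumption.

```c
#define DEMO_PAGE_SIZE 4096UL

/* Stand-in for the vm_struct field read by the old warning. */
struct demo_area {
	unsigned long nr_pages;	/* pages actually populated so far */
};

/* Old behaviour: size derived from the (possibly zero) populated count. */
static unsigned long demo_reported_size_old(const struct demo_area *area)
{
	return area->nr_pages * DEMO_PAGE_SIZE;
}

/* New behaviour: size derived from the number of pages originally requested. */
static unsigned long demo_reported_size_new(unsigned long nr_small_pages)
{
	return nr_small_pages * DEMO_PAGE_SIZE;
}
```

For a 16-page request where no page was populated before the failure, the old helper yields 0 (matching the "size 0" message quoted above) and the new helper yields 65536.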