linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Barry Song <21cnbao@gmail.com>
To: hailong.liu@oppo.com
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	 Christoph Hellwig <hch@infradead.org>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Vlastimil Babka <vbabka@suse.cz>,  Michal Hocko <mhocko@suse.com>,
	Baoquan He <bhe@redhat.com>, Matthew Wilcox <willy@infradead.org>,
	 "Tangquan . Zheng" <zhengtangquan@oppo.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0
Date: Thu, 25 Jul 2024 18:21:33 +1200	[thread overview]
Message-ID: <CAGsJ_4zWsh50Pzp0ntskT=eyfStALxD5BMNdWFrYJewrrx7V5Q@mail.gmail.com> (raw)
In-Reply-To: <20240725035318.471-1-hailong.liu@oppo.com>

On Thu, Jul 25, 2024 at 3:53 PM <hailong.liu@oppo.com> wrote:
>
> From: "Hailong.Liu" <hailong.liu@oppo.com>
>
> The scenario where the issue occurs is as follows:
> CONFIG: vmap_allow_huge = true && 2M is for PMD_SIZE
> kvmalloc(2M, __GFP_NOFAIL|GFP_XXX)
>     __vmalloc_node_range(vm_flags=VM_ALLOW_HUGE_VMAP)
>         vm_area_alloc_pages(order=9) --->allocs order9 failed and fallback to order0
>                                         and phys_addr is aligned with PMD_SIZE
>             vmap_pages_range
>                 vmap_pages_range_noflush
>                     __vmap_pages_range_noflush(page_shift = 21) ----> incorrect vmap *huge* here
>
> In fact, as long as page_shift is not equal to PAGE_SHIFT, there
> might be issues with the __vmap_pages_range_noflush().
>
> The patch also remove VM_ALLOW_HUGE_VMAP in kvmalloc_node(), There
> are several reasons for this:
> - This increases memory footprint because ALIGNMENT.
> - This increases the likelihood of kvmalloc allocation failures.
> - Without this it fixes the origin issue of kvmalloc with __GFP_NOFAIL may return NULL.
> Besides if drivers want to vmap huge, user vmalloc_huge instead.
>
> Fix it by disabling fallback and remove VM_ALLOW_HUGE_VMAP in
> kvmalloc_node().
> Fixes: e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
>
> CC: Barry Song <21cnbao@gmail.com>
> CC: Baoquan He <bhe@redhat.com>
> CC: Matthew Wilcox <willy@infradead.org>
> Reported-by: Tangquan.Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>

The implementation of HUGE_VMAP appears to be quite disorganized.
A major change is needed.

1. when allocating 2.1MB kvmalloc, we shouldn't allocate 4MB memory, which
is now done by HUGE_VMAP. This is even worse than PMD-mapped THP for
userspace. We don't even do this for THP. vmap could be done by 1PMD map
+ 0.1MB PTE mapping instead.

2. We need to allow fallback to order-0 pages if we're unable to allocate 2MB.
In this case, we should perform PMD/PTE mapping based on how the pages
are acquired, rather than assuming they always form contiguous 2MB blocks.

3. Memory is entirely corrupted after Michael's "mm, vmalloc: fix high order
__GFP_NOFAIL allocations". but without it, forcing 2MB  allocation was
making OOM.

> ---
>  mm/util.c    | 2 +-
>  mm/vmalloc.c | 9 ---------
>  2 files changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 669397235787..b23133b738cf 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -657,7 +657,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
>          * protection games.
>          */
>         return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> -                       flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> +                       flags, PAGE_KERNEL, 0,
>                         node, __builtin_return_address(0));

I'd vote +1 for this. we don't want to waste memory, for example, wasting
1.9MB memory while allocating 2.1MB kvmalloc. but this should be a separate
patch.

>  }
>  EXPORT_SYMBOL(kvmalloc_node);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 03c78fae06f3..1914768f473e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3577,15 +3577,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>                         page = alloc_pages(alloc_gfp, order);
>                 else
>                         page = alloc_pages_node(nid, alloc_gfp, order);
> -               if (unlikely(!page)) {
> -                       if (!nofail)
> -                               break;
> -
> -                       /* fall back to the zero order allocations */
> -                       alloc_gfp |= __GFP_NOFAIL;
> -                       order = 0;
> -                       continue;
> -               }
>
>                 /*
>                  * Higher order allocations must be able to be treated as
> --
> After 1) I check the code and I can't find a resonable band-aid to fix
> this. so the v2 patch works but ugly. Glad to hear a better solution :)

This is still incorrect because it undoes Michal's work. We also need to break
the loop if (!nofail), which you're currently omitting.

To avoid reverting Michal's work, the simplest "fix" would be,

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index caf032f0bd69..0011ca30df1c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3775,7 +3775,7 @@ void *__vmalloc_node_range_noprof(unsigned long
size, unsigned long align,
                return NULL;
        }

-       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
+       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP) &
!(gfp_mask & __GFP_NOFAIL)) {
                unsigned long size_per_node;

                /*
>
> [1] https://lore.kernel.org/lkml/20240724182827.nlgdckimtg2gwns5@oppo.com/
> 2.34.1

Thanks
Barry


  reply	other threads:[~2024-07-25  6:21 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-25  3:53 hailong.liu
2024-07-25  6:21 ` Barry Song [this message]
2024-07-25  9:17   ` Hailong Liu
2024-07-25  9:34     ` Barry Song
2024-07-25  9:58       ` Hailong Liu
2024-07-25 10:22         ` Barry Song
2024-07-25 11:39 ` Baoquan He
2024-07-25 16:40   ` Hailong Liu
2024-07-26  1:31     ` Barry Song
2024-07-26  2:31     ` Baoquan He
2024-07-26  3:53       ` Barry Song
2024-07-29  1:48         ` Baoquan He
2024-07-30  3:24           ` Hailong Liu
2024-07-30  4:29             ` Baoquan He
2024-07-26  4:00       ` Hailong Liu
2024-07-26  5:03         ` Hailong Liu
2024-07-26  5:29           ` Barry Song
2024-07-26  8:37             ` Baoquan He
2024-07-26  8:48               ` Hailong Liu
2024-07-26  9:00                 ` Hailong Liu
2024-07-26  9:29                 ` Baoquan He
2024-07-26  9:15               ` Barry Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGsJ_4zWsh50Pzp0ntskT=eyfStALxD5BMNdWFrYJewrrx7V5Q@mail.gmail.com' \
    --to=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=bhe@redhat.com \
    --cc=hailong.liu@oppo.com \
    --cc=hch@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lstoakes@gmail.com \
    --cc=mhocko@suse.com \
    --cc=urezki@gmail.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=zhengtangquan@oppo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox