Re: [RFC PATCH v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0

From: Baoquan He <bhe@redhat.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Hailong Liu <hailong.liu@oppo.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	Christoph Hellwig <hch@infradead.org>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>,
	Matthew Wilcox <willy@infradead.org>,
	Tangquan Zheng <zhengtangquan@oppo.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0
Date: Fri, 26 Jul 2024 16:37:55 +0800
Message-ID: <ZqNgY1T/VuWQvtfL@MiWiFi-R3L-srv>
In-Reply-To: <CAGsJ_4xOauOwkHO5MTKHBP=fpeoNiP_9VJ31G4gBECFvLG4y0g@mail.gmail.com>

On 07/26/24 at 05:29pm, Barry Song wrote:
> On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu <hailong.liu@oppo.com> wrote:
> >
> > On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > > [...]
> > > > > The logic of this patch is somewhat similar to my first one. If the
> > > > > high-order allocation fails, it falls back to normal mapping.
> > > > >
> > > > > However, I also save the fallback position. The pages before this position
> > > > > are used for huge mapping, and the ones >= position for normal mapping, as
> > > > > Barry said: "support the combination of PMD and PTE mapping". This will take
> > > > > some time, as it needs to address the corner cases and do some tests.
> > > >
> > > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > > and vmalloc_huge().
> > > >
> > > > vmalloc_huge() is called from the three interfaces below, which are all
> > > > invoked during boot. Basically they can succeed in getting the required
> > > > contiguous physical memory. I guess that's why Tangquan only spotted this
> > > > issue on kvmalloc invocations when the required size exceeds e.g. 2M. For
> > > > kvmalloc_node(), the code comment above __kvmalloc_node_noprof() already
> > > > says that it's a best effort behaviour.
> > > >
> > > Take __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as an example.
> > > Because of the huge alignment requirement, the real size is 4M.
> > > If the first order-9 allocation succeeds and the next one fails, then
> > > because of the fallback the page layout would be order9, then 512 * order0.
> > > order9 supports huge mapping, but order0 does not.
> > > With the patch above, we would call vmap_small_pages_range_noflush() and do
> > > normal mapping, so the huge mapping would not exist.
> > >
> > > >  mm/mm_init.c <<alloc_large_system_hash>>
> > > >  table = vmalloc_huge(size, gfp_flags);
> > > >  net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> > > >  new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > > >  net/ipv4/udp.c <<udp_pernet_table_alloc>>
> > > >  udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > > >
> > > > Maybe we should add a code comment or documentation to let people know
> > > > that contiguous physical pages are not guaranteed for vmalloc_huge() if
> > > > it is used after boot.
> > > >
> > > > >
> > > > > IMO, the draft can fix the current issue, it also does not have significant side
> > > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > > other to address the current mapping issue.
> > > > >
> > > > > --
> > > > > help you, help me,
> > > > > Hailong.
> > > > >
> > > >
> > > >
> > I checked the code. The issue only happens when gfp_mask has __GFP_NOFAIL and
> > the allocation falls back to order 0. Actually, without this commit
> > e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations"),
> > if the __vmalloc_area_node() allocation failed, it would goto fail and try order-0.
> >
> > fail:
> >         if (shift > PAGE_SHIFT) {
> >                 shift = PAGE_SHIFT;
> >                 align = real_align;
> >                 size = real_size;
> >                 goto again;
> >         }
> >
> > So do we really need to fall back to order-0 if nofail?
> 
> Good catch, this is what I missed. I feel we can revert Michal's fix.
> And just remove __GFP_NOFAIL bit when we are still allocating
> by high-order. When "goto again" happens, we will allocate by
> order-0, in this case, we keep the __GFP_NOFAIL.

With Michal's patch, the fallback can satisfy the allocation in the
nofail case because it falls back to order-0 with __GFP_NOFAIL set. The
'if (shift > PAGE_SHIFT)' check and its handling may be problematic,
since we can jump to fail because the vmap_pages_range() invocation
failed, or because the huge pages were only partially allocated before
the allocation broke down. In both cases it ignores the pages already
allocated and does everything again.
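
To make the partial-allocation case concrete, the 2.1M example quoted
above can be worked through in a small userspace sketch. This is not
kernel code: the MODEL_* constants assume 4 KiB base pages and 2 MiB PMD
mappings, and struct model_layout, model_align_up() and
model_partial_fallback() are hypothetical names used only here.

```c
#include <stddef.h>

/* Assumed values for illustration: 4 KiB base pages, 2 MiB PMD mappings. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PMD_SHIFT  21

/* Hypothetical summary of how an area is backed after a partial fallback. */
struct model_layout {
	size_t total_pages;   /* base pages covering the aligned size */
	size_t huge_chunks;   /* order-9 chunks that were allocated */
	size_t order0_pages;  /* base pages allocated after the fallback */
};

/* Round size up to a power-of-two alignment. */
static size_t model_align_up(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}

/*
 * Model an allocation where the first huge_ok order-9 chunks succeed and
 * everything after that falls back to order-0 pages.
 */
static struct model_layout model_partial_fallback(size_t size, size_t huge_ok)
{
	size_t pages_per_huge = (size_t)1 << (MODEL_PMD_SHIFT - MODEL_PAGE_SHIFT);
	size_t real_size = model_align_up(size, (size_t)1 << MODEL_PMD_SHIFT);
	struct model_layout l;

	l.total_pages = real_size >> MODEL_PAGE_SHIFT;
	/* Cannot have more huge chunks than fit in the aligned size. */
	if (huge_ok > l.total_pages / pages_per_huge)
		huge_ok = l.total_pages / pages_per_huge;
	l.huge_chunks = huge_ok;
	l.order0_pages = l.total_pages - huge_ok * pages_per_huge;
	return l;
}
```

With size = 2 MiB + 100 KiB (roughly the 2.1M case) and huge_ok = 1,
this gives 1024 base pages in total, one order-9 chunk and 512 order-0
pages, which is the mixed layout described in the quoted example.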

The only part of the 'if (shift > PAGE_SHIFT)' handling that makes
sense is falling back to real_size and real_align. BUT we need to handle
the failures separately, e.g.:
1) __get_vm_area_node() failed;
2) vm_area_alloc_pages() failed when shift > PAGE_SHIFT and non-nofail;
3) vmap_pages_range() failed.
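
As a rough illustration of handling the three cases separately, here is
a userspace sketch, not a proposed patch. enum model_fail_stage and
model_should_retry_order0() are hypothetical names, and MODEL_PAGE_SHIFT
assumes 4 KiB pages. The decision in each branch is a placeholder
assumption: cases 1 and 3 simply mirror the quoted 'fail:' label, and
case 2 adds the non-nofail condition from the list above.

```c
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12U	/* assumed 4 KiB base pages */

/* Hypothetical tag for the step that failed; one value per list item. */
enum model_fail_stage {
	MODEL_FAIL_GET_VM_AREA,	/* 1) models __get_vm_area_node() failing */
	MODEL_FAIL_ALLOC_PAGES,	/* 2) models vm_area_alloc_pages() failing */
	MODEL_FAIL_MAP_PAGES,	/* 3) models vmap_pages_range() failing */
};

/*
 * Return true if the caller should retry with shift = PAGE_SHIFT,
 * real_size and real_align. Each stage gets its own branch so that the
 * policy for one failure does not silently apply to the others.
 */
static bool model_should_retry_order0(enum model_fail_stage stage,
				      unsigned int shift, bool nofail)
{
	bool huge = shift > MODEL_PAGE_SHIFT;

	switch (stage) {
	case MODEL_FAIL_GET_VM_AREA:
		/* Placeholder: same as the existing 'fail:' label. */
		return huge;
	case MODEL_FAIL_ALLOC_PAGES:
		/* Only the huge, non-nofail case retries from scratch. */
		return huge && !nofail;
	case MODEL_FAIL_MAP_PAGES:
		/* Placeholder: same as the existing 'fail:' label. */
		return huge;
	}
	return false;
}
```

For example, model_should_retry_order0(MODEL_FAIL_ALLOC_PAGES, 21, true)
returns false, while the same call with nofail = false returns true.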

Honestly, I don't see where nofail is mishandled. Could you point
it out specifically? I may have missed it.

Thanks
Baoquan
