linux-mm.kvack.org archive mirror
From: Michal Hocko <mhocko@suse.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: akpm@linux-foundation.org, muchun.song@linux.dev,
	osalvador@suse.de, david@redhat.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: hugetlb: remove __GFP_THISNODE flag when dissolving the old hugetlb
Date: Fri, 2 Feb 2024 10:55:05 +0100	[thread overview]
Message-ID: <Zby7-dTtPIy2k5pj@tiehlicka> (raw)
In-Reply-To: <f1606912-5bcc-46be-b4f4-666149eab7bd@linux.alibaba.com>

On Fri 02-02-24 17:29:02, Baolin Wang wrote:
> On 2/2/2024 4:17 PM, Michal Hocko wrote:
[...]
> > > Agree. So how about the following change?
> > > (1) Disallow falling back to other nodes when handling an in-use hugetlb
> > > page, which ensures consistent behavior in handling hugetlb.
> > 
> > I can see two cases here. alloc_contig_range which is an internal kernel
> > user and then we have memory offlining. The former shouldn't break the
> > per-node hugetlb pool reservations, the latter might not have any other
> > choice (the whole node could get offline and that resembles breaking cpu
> > affinity if the cpu is gone).
> 
> IMO, that's not always true for memory offlining: when handling a free
> hugetlb page, it disallows falling back to other nodes, which is
> inconsistent.

It's been some time since I last looked into that code, so I am not 100%
sure how the free pool is currently handled. The above is the way I
_think_ it should work from a usability POV.

> Not only memory offlining: the longterm pinning path (in
> migrate_longterm_unpinnable_pages()) and memory failure handling (in
> soft_offline_in_use_page()) can also break the per-node hugetlb pool
> reservations.

Bad

> > Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
> > users expectations but hugetlb migration already tries hard to allocate
> > a replacement hugetlb so the system must be under a heavy memory
> > pressure if that fails, right? Is it possible that the hugetlb
> > reservation is just overshot here? Maybe the memory is just terribly
> > fragmented though?
> > 
> > Could you be more specific about numbers in your failure case?
> 
> Sure. Our customer's machine contains several NUMA nodes, and the system
> reserves a large amount of CMA memory, occupying 50% of the total memory,
> which is used for a virtual machine; it also reserves lots of hugetlb
> pages, which can occupy 50% of the CMA. So before the virtual machine
> starts, hugetlb can use 50% of the CMA, but when the virtual machine
> starts, the CMA is claimed by the virtual machine and the hugetlb pages
> must be migrated out of the CMA.

Would it make more sense for hugetlb pages to _not_ use CMA in this
case? I mean, would you be better off overall if the hugetlb pool were
preallocated before the CMA is reserved? I realize this just works
around the current limitations, but it could be better than nothing.

> Because the system has several nodes, one node's memory can be
> exhausted, which will fail the hugetlb migration with the
> __GFP_THISNODE flag.

Is the workload NUMA aware? I.e. do you bind virtual machines to
specific nodes?

-- 
Michal Hocko
SUSE Labs


Thread overview: 12+ messages
2024-02-01 13:31 Baolin Wang
2024-02-01 15:27 ` Michal Hocko
2024-02-02  1:35   ` Baolin Wang
2024-02-02  8:17     ` Michal Hocko
2024-02-02  9:29       ` Baolin Wang
2024-02-02  9:55         ` Michal Hocko [this message]
2024-02-05  2:50           ` Baolin Wang
2024-02-05  9:15             ` Michal Hocko
2024-02-05 13:06               ` Baolin Wang
2024-02-05 14:23                 ` Michal Hocko
2024-02-06  8:18                   ` Baolin Wang
2024-02-06 13:19                     ` Michal Hocko
