From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Mon, 5 Feb 2024 21:06:17 +0800
Message-ID: <2613b670-84f8-4f97-ab4e-0d480fc1a3f8@linux.alibaba.com>
Subject: Re: [RFC PATCH] mm: hugetlb: remove __GFP_THISNODE flag when dissolving the old hugetlb
To: Michal Hocko
Cc: akpm@linux-foundation.org, muchun.song@linux.dev, osalvador@suse.de, david@redhat.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com> <3f31cd89-f349-4f9e-bc29-35f29f489633@linux.alibaba.com> <909cee7d-0201-4429-b85d-7d2662516e45@linux.alibaba.com>
On 2/5/2024 5:15 PM, Michal Hocko wrote:
> On Mon 05-02-24 10:50:32, Baolin Wang wrote:
>>
>> On 2/2/2024 5:55 PM, Michal Hocko wrote:
>>> On Fri 02-02-24 17:29:02, Baolin Wang wrote:
>>>> On 2/2/2024 4:17 PM, Michal Hocko wrote:
>>> [...]
>>>>>> Agree. So how about the change below?
>>>>>> (1) Disallow falling back to other nodes when handling an in-use
>>>>>> hugetlb, which ensures consistent behavior in handling hugetlb.
>>>>>
>>>>> I can see two cases here. alloc_contig_range, which is an internal
>>>>> kernel user, and then we have memory offlining. The former shouldn't
>>>>> break the per-node hugetlb pool reservations; the latter might not
>>>>> have any other choice (the whole node could go offline, and that
>>>>> resembles breaking CPU affinity when the CPU is gone).
>>>>
>>>> IMO, that is not always true for memory offlining: when handling a
>>>> free hugetlb, it disallows falling back, which is inconsistent.
>>>
>>> It's been some time since I've looked into that code, so I am not 100%
>>> sure how the free pool is currently handled. The above is the way I
>>> _think_ it should work from the usability POV.
>>
>> Please see alloc_and_dissolve_hugetlb_folio().
>
> This is the alloc_contig_range path rather than the offlining path. Page
> offlining migrates in-use pages to a _different_ node (as long as one is
> available) via do_migrate_range, and it dissolves free hugetlb pages via
> dissolve_free_huge_pages. So the node's pool is altered, but as this is
> an explicit offlining operation, I think there is no choice but to go
> differently.
>
>>>> Not only memory offlining, but also longterm pinning (in
>>>> migrate_longterm_unpinnable_pages()) and memory failure (in
>>>> soft_offline_in_use_page()) can also break the per-node hugetlb pool
>>>> reservations.
>>>
>>> Bad
>>>
>>>>> Now I can see how a hugetlb page sitting inside a CMA region breaks
>>>>> CMA users' expectations, but hugetlb migration already tries hard to
>>>>> allocate a replacement hugetlb, so the system must be under heavy
>>>>> memory pressure if that fails, right? Is it possible that the hugetlb
>>>>> reservation is just overshot here? Maybe the memory is just terribly
>>>>> fragmented, though?
>>>>>
>>>>> Could you be more specific about the numbers in your failure case?
>>>>
>>>> Sure. Our customer's machine contains several NUMA nodes, and the
>>>> system reserves a large CMA area occupying 50% of the total memory,
>>>> which is used for the virtual machine; meanwhile, it also reserves
>>>> lots of hugetlb, which can occupy 50% of the CMA. So before starting
>>>> the virtual machine, hugetlb can use 50% of the CMA, but when starting
>>>> the virtual machine, the CMA will be used by the virtual machine and
>>>> the hugetlb must be migrated out of the CMA.
>>>
>>> Would it make more sense for hugetlb pages to _not_ use CMA in this
>>> case? I mean, would you be better off overall if the hugetlb pool were
>>> preallocated before the CMA is reserved? I do realize this is just
>>> working around the current limitations, but it could be better than
>>> nothing.
>>
>> In this case, the CMA area is large and occupies 50% of the total
>> memory. The purpose is that, if no virtual machines are launched, CMA
>> memory can be used by hugetlb as much as possible. Once the virtual
>> machines need to be launched, it is necessary to allocate CMA memory as
>> much as possible, for example by migrating hugetlb out of CMA memory.
>
> I am afraid that your assumption doesn't correspond to the existing
> implementation. hugetlb allocations are movable, but they are certainly
> not as movable as regular pages. So you have to consider a bigger margin
> and spare memory to achieve more reliable movability.
>
> Have you tried to handle this from userspace? It seems that you know
> when there is CMA demand, so you could rebalance hugetlb pools at that
> moment, no?

Maybe this can help, but it just mitigates the issue ...

>> After more thinking, I think we should still drop the __GFP_THISNODE
>> flag in alloc_and_dissolve_hugetlb_folio(). Firstly, not only can it
>> potentially cause CMA allocation to fail, but it might also cause
>> memory offline to fail, as I said in the commit message. Secondly,
>> there have been no user reports complaining about breaking the
>> per-node hugetlb pool, although longterm pinning, memory failure, and
>> memory offline can potentially break the per-node hugetlb pool.
>
> It is quite possible that traditional users (like large DBs) do not use
> CMA heavily, so such a problem was not observed so far. That doesn't
> mean those problems do not really matter.

CMA is just one case; as I mentioned before, other situations can also
break the per-node hugetlb pool now. Let's focus on the main point: why
should we keep inconsistent behavior in handling free and in-use hugetlb
for alloc_contig_range()? That's really confusing.