From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Mon, 5 Feb 2024 21:06:17 +0800
Message-ID: <2613b670-84f8-4f97-ab4e-0d480fc1a3f8@linux.alibaba.com>
Subject: Re: [RFC PATCH] mm: hugetlb: remove __GFP_THISNODE flag when dissolving the old hugetlb
To: Michal Hocko
Cc: akpm@linux-foundation.org, muchun.song@linux.dev, osalvador@suse.de, david@redhat.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com> <3f31cd89-f349-4f9e-bc29-35f29f489633@linux.alibaba.com> <909cee7d-0201-4429-b85d-7d2662516e45@linux.alibaba.com>
On 2/5/2024 5:15 PM, Michal Hocko wrote:
> On Mon 05-02-24 10:50:32, Baolin Wang wrote:
>>
>> On 2/2/2024 5:55 PM, Michal Hocko wrote:
>>> On Fri 02-02-24 17:29:02, Baolin Wang wrote:
>>>> On 2/2/2024 4:17 PM, Michal Hocko wrote:
>>> [...]
>>>>>> Agree. So how about the change below?
>>>>>> (1) Disallow falling back to other nodes when handling an in-use
>>>>>> hugetlb, which ensures consistent behavior in handling hugetlb.
>>>>>
>>>>> I can see two cases here. alloc_contig_range, which is an internal
>>>>> kernel user, and then we have memory offlining. The former shouldn't
>>>>> break the per-node hugetlb pool reservations; the latter might not
>>>>> have any other choice (the whole node could go offline, and that
>>>>> resembles breaking CPU affinity when the CPU is gone).
>>>>
>>>> IMO, that is not always true for memory offlining: when handling a
>>>> free hugetlb, it disallows falling back, which is inconsistent.
>>>
>>> It's been some time since I've looked into that code, so I am not 100%
>>> sure how the free pool is currently handled. The above is the way I
>>> _think_ it should work from the usability POV.
>>
>> Please see alloc_and_dissolve_hugetlb_folio().
>
> This is the alloc_contig_range path rather than the offlining path. Page
> offlining migrates in-use pages to a _different_ node (as long as one is
> available) via do_migrate_range, and it dissolves free hugetlb pages via
> dissolve_free_huge_pages. So the node's pool is altered, but as this is
> an explicit offlining operation, I think there is no choice but to go
> differently.
>
>>>> Not only memory offlining, but also longterm pinning (in
>>>> migrate_longterm_unpinnable_pages()) and memory failure (in
>>>> soft_offline_in_use_page()) can also break the per-node hugetlb pool
>>>> reservations.
>>>
>>> Bad
>>>
>>>>> Now I can see how a hugetlb page sitting inside a CMA region breaks
>>>>> CMA users' expectations, but hugetlb migration already tries hard to
>>>>> allocate a replacement hugetlb, so the system must be under heavy
>>>>> memory pressure if that fails, right? Is it possible that the hugetlb
>>>>> reservation is just overshot here? Maybe the memory is just terribly
>>>>> fragmented, though?
>>>>>
>>>>> Could you be more specific about the numbers in your failure case?
>>>>
>>>> Sure. Our customer's machine contains several NUMA nodes, and the
>>>> system reserves a large CMA area occupying 50% of the total memory,
>>>> which is used for the virtual machine; meanwhile, it also reserves
>>>> lots of hugetlb, which can occupy 50% of the CMA. So before starting
>>>> the virtual machine, hugetlb can use 50% of the CMA, but when starting
>>>> the virtual machine, the CMA will be used by the virtual machine and
>>>> the hugetlb must be migrated out of the CMA.
>>>
>>> Would it make more sense for hugetlb pages to _not_ use CMA in this
>>> case? I mean, would you be better off overall if the hugetlb pool were
>>> preallocated before the CMA is reserved? I do realize this is just
>>> working around the current limitations, but it could be better than
>>> nothing.
>>
>> In this case, the CMA area is large and occupies 50% of the total
>> memory. The purpose is that, if no virtual machines are launched, CMA
>> memory can be used by hugetlb as much as possible. Once the virtual
>> machines need to be launched, it is necessary to allocate CMA memory as
>> much as possible, for example by migrating hugetlb out of CMA memory.
>
> I am afraid that your assumption doesn't correspond to the existing
> implementation. hugetlb allocations are movable, but they are certainly
> not as movable as regular pages. So you have to consider a bigger margin
> and spare memory to achieve more reliable movability.
>
> Have you tried to handle this from userspace? It seems that you know
> when there is CMA demand, so you could rebalance hugetlb pools at that
> moment, no?

Maybe this can help, but it just mitigates the issue ...

>> After more thinking, I think we should still drop the __GFP_THISNODE
>> flag in alloc_and_dissolve_hugetlb_folio(). Firstly, not only can it
>> potentially cause CMA allocation to fail, but it might also cause
>> memory offline to fail, as I said in the commit message. Secondly,
>> there have been no user reports complaining about breaking the
>> per-node hugetlb pool, although longterm pinning, memory failure, and
>> memory offline can potentially break the per-node hugetlb pool.
>
> It is quite possible that traditional users (like large DBs) do not use
> CMA heavily, so such a problem was not observed so far. That doesn't
> mean those problems do not really matter.

CMA is just one case; as I mentioned before, other situations can also
break the per-node hugetlb pool now. Let's focus on the main point: why
should we keep inconsistent behavior in handling free and in-use hugetlb
for alloc_contig_range()? That's really confusing.