From: Baolin Wang
To: Michal Hocko
Cc: akpm@linux-foundation.org, muchun.song@linux.dev, osalvador@suse.de, david@redhat.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: hugetlb: remove __GFP_THISNODE flag when dissolving the old hugetlb
Date: Fri, 2 Feb 2024 17:29:02 +0800
References: <6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com> <3f31cd89-f349-4f9e-bc29-35f29f489633@linux.alibaba.com>

On 2/2/2024
4:17 PM, Michal Hocko wrote:
> On Fri 02-02-24 09:35:58, Baolin Wang wrote:
>>
>>
>> On 2/1/2024 11:27 PM, Michal Hocko wrote:
>>> On Thu 01-02-24 21:31:13, Baolin Wang wrote:
>>>> Since commit 369fa227c219 ("mm: make alloc_contig_range handle free
>>>> hugetlb pages"), the alloc_contig_range() can handle free hugetlb pages
>>>> by allocating a new fresh hugepage, and replacing the old one in the
>>>> free hugepage pool.
>>>>
>>>> However, our customers can still see the failure of alloc_contig_range()
>>>> when seeing a free hugetlb page. The reason is that, there are few memory
>>>> on the old hugetlb page's node, and it can not allocate a fresh hugetlb
>>>> page on the old hugetlb page's node in isolate_or_dissolve_huge_page() with
>>>> setting __GFP_THISNODE flag. This makes sense to some degree.
>>>>
>>>> Later, the commit ae37c7ff79f1 ("mm: make alloc_contig_range handle
>>>> in-use hugetlb pages") handles the in-use hugetlb pages by isolating it
>>>> and doing migration in __alloc_contig_migrate_range(), but it can allow
>>>> fallbacking to other numa node when allocating a new hugetlb in
>>>> alloc_migration_target().
>>>>
>>>> This introduces inconsistency to handling free and in-use hugetlb.
>>>> Considering the CMA allocation and memory hotplug relying on the
>>>> alloc_contig_range() are important in some scenarios, as well as keeping
>>>> the consistent hugetlb handling, we should remove the __GFP_THISNODE flag
>>>> in isolate_or_dissolve_huge_page() to allow fallbacking to other numa node,
>>>> which can solve the failure of alloc_contig_range() in our case.
>>>
>>> I do agree that the inconsistency is not really good but I am not sure
>>> dropping __GFP_THISNODE is the right way forward. Breaking pre-allocated
>>> per-node pools might result in unexpected failures when node bound
>>> workloads doesn't get what is assumed available. Keep in mind that our
>>> user APIs allow to pre-allocate per-node pools separately.
>>
>> Yes, I agree, that is also what I concerned. But sometimes users don't care
>> about the distribution of per-node hugetlb, instead they are more concerned
>> about the success of cma allocation or memory hotplug.
>
> Yes, sometimes the exact per-node distribution is not really important.
> But the kernel has no way of knowing that right now. And we have to make
> a conservative guess here.
>
>>> The in-use hugetlb is a very similar case. While having a temporarily
>>> misplaced page doesn't really look terrible once that hugetlb page is
>>> released back into the pool we are back to the case above. Either we
>>> make sure that the node affinity is restored later on or it shouldn't be
>>> migrated to a different node at all.
>>
>> Agree. So how about below changing?
>> (1) disallow fallbacking to other nodes when handling in-use hugetlb, which
>> can ensure consistent behavior in handling hugetlb.
>
> I can see two cases here. alloc_contig_range which is an internal kernel
> user and then we have memory offlining. The former shouldn't break the
> per-node hugetlb pool reservations, the latter might not have any other
> choice (the whole node could get offline and that resembles breaking cpu
> affinity if the cpu is gone).

IMO, that is not always true for memory offlining: when handling a free
hugetlb page, it disallows falling back to other nodes, which is
inconsistent. Besides memory offlining, longterm pinning (in
migrate_longterm_unpinnable_pages()) and memory failure (in
soft_offline_in_use_page()) can also break the per-node hugetlb pool
reservations.

> Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
> users expectations but hugetlb migration already tries hard to allocate
> a replacement hugetlb so the system must be under a heavy memory
> pressure if that fails, right? Is it possible that the hugetlb
> reservation is just overshooted here? Maybe the memory is just terribly
> fragmented though?
>
> Could you be more specific about numbers in your failure case?

Sure. Our customer's machine contains several NUMA nodes, and the system
reserves a large amount of CMA memory, 50% of the total memory, which is
used for the virtual machine. Meanwhile it also reserves lots of hugetlb
pages, which can occupy 50% of the CMA. So before the virtual machine
starts, hugetlb can use 50% of the CMA, but when the virtual machine
starts, the CMA will be used by the virtual machine and the hugetlb pages
must be migrated out of the CMA. Since there are several nodes in the
system, one node's memory can be exhausted, which makes the hugetlb
migration fail with the __GFP_THISNODE flag.

>> (2) introduce a new sysctl (may be named as "hugetlb_allow_fallback_nodes")
>> for users to control to allow fallbacking, that can solve the CMA or memory
>> hotplug failures that users are more concerned about.
>
> I do not think this is a good idea. The policy might be different on
> each node and this would get messy pretty quickly. If anything we could
> try to detect a dedicated per node pool allocation instead. It is quite
> likely that if admin preallocates pool without any memory policy then
> the exact distribution of pages doesn't play a huge role.

I also agree. Now I think the policy is already messy when handling
hugetlb migration:

1. CMA allocation: can or can not break the per-node hugetlb pool
   reservations.
  1.1 handling free hugetlb: can not break per-node hugetlb pool
      reservations.
  1.2 handling in-use hugetlb: can break per-node hugetlb pool
      reservations.
2. memory offlining: can or can not break per-node hugetlb pool
   reservations.
  2.1 handling free hugetlb: can not break
  2.2 handling in-use hugetlb: can break
3. longterm pinning: can break per-node hugetlb pool reservations.
4. memory soft-offline: can break per-node hugetlb pool reservations.

What a messy policy. And now we have no documentation to describe this
messy policy.
So we need to make things clearer when handling hugetlb migration, with
proper documentation.
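
To make the list above concrete, here is a minimal userspace sketch in C. It
is not kernel code: the names migration_path, paths[] and alloc_replacement()
are hypothetical and exist only for this illustration, and treating items 3
and 4 as in-use pages is my assumption based on the function names mentioned
earlier. Each table entry records whether a path may take the replacement
hugetlb page from another node, and alloc_replacement() models a node-bound
allocation (as with __GFP_THISNODE) versus one that may fall back.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the policy list; not kernel code. */
enum page_state { HUGETLB_FREE, HUGETLB_IN_USE };

struct migration_path {
	const char *name;
	enum page_state state;
	bool may_leave_node;	/* true: replacement may come from another node */
};

/* One row per case in the list. Rows 3 and 4 are assumed to be in-use. */
static const struct migration_path paths[] = {
	{ "CMA allocation",      HUGETLB_FREE,   false },
	{ "CMA allocation",      HUGETLB_IN_USE, true  },
	{ "memory offlining",    HUGETLB_FREE,   false },
	{ "memory offlining",    HUGETLB_IN_USE, true  },
	{ "longterm pinning",    HUGETLB_IN_USE, true  },
	{ "memory soft-offline", HUGETLB_IN_USE, true  },
};

/*
 * Pick a node for the replacement page. free_pages[n] is the number of
 * free huge-page-sized chunks on node n. Returns the chosen node, or -1
 * if no replacement can be allocated under the given policy.
 */
static int alloc_replacement(const unsigned long *free_pages, int nr_nodes,
			     int nid, bool may_leave_node)
{
	if (nid < 0 || nid >= nr_nodes)
		return -1;

	/* Prefer the node the old page lives on. */
	if (free_pages[nid] > 0)
		return nid;

	/* Node-bound request: fail instead of falling back. */
	if (!may_leave_node)
		return -1;

	/* Fallback allowed: take the first node with free memory. */
	for (int n = 0; n < nr_nodes; n++)
		if (free_pages[n] > 0)
			return n;

	return -1;
}
```

With free_pages = { 0, 8 } and the old page on node 0, the "CMA allocation,
free" row fails (returns -1), which corresponds to our failure case, while
the "CMA allocation, in-use" row succeeds by taking the page from node 1.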