Message-ID: <4b3c4ecf-2a3f-4868-8b6a-5c1b1920ca79@126.com>
Date: Thu, 16 Jan 2025 09:33:21 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH] mm: compaction: use the actual allocation context to determine the watermarks for costly order during async memory compaction
To: Vlastimil Babka, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, 21cnbao@gmail.com, david@redhat.com, baolin.wang@linux.alibaba.com, hannes@cmpxchg.org, liuzixing@hygon.cn
References: <1736929894-19228-1-git-send-email-yangge1116@126.com>
From: Ge Yang
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
On 2025/1/15 17:56, Vlastimil Babka wrote:
> On 1/15/25 09:31, yangge1116@126.com wrote:
>> From: yangge
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> Long term GUP cannot allocate memory from the CMA area, so a maximum
>> of 16GB of non-CMA memory on a NUMA node can be used as virtual
>> machine memory. There is 16GB of free CMA memory on a NUMA node,
>> which is sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to consistently return true.
>>
>> For costly allocations, if the __compaction_suitable() function always
>> returns true, it causes the __alloc_pages_slowpath() function to fail
>> to exit at the appropriate point. This prevents timely fallback to
>> allocating memory on other nodes, ultimately resulting in excessively
>> long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>     if (compact_result == COMPACT_SKIPPED ||
>>         compact_result == COMPACT_DEFERRED)
>>         goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> We could use the real unmovable allocation context to have
>> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't
>> pass the order-0 check anymore once the non-CMA part is exhausted.
>> There is some risk that in some different scenario the compaction
>> could in fact migrate pages from the exhausted non-CMA part of the
>> zone to the CMA part and succeed, and we'll skip it instead. But only
>> __GFP_NORETRY allocations should be affected in the immediate "goto
>> nopage" when compaction is skipped; others will attempt with
>> DEF_COMPACT_PRIORITY anyway and won't fail without trying to
>> compact-migrate the non-CMA pageblocks into CMA pageblocks first, so
>> it should be fine.
>>
>> After this fix, it only takes a few tens of seconds to start a 32GB
>> virtual machine with device passthrough.
>
> So did you verify it works?

After multiple tests, it has been confirmed to work properly. Thank you.

> I just realized there might still be cases it
> won't help. There might be enough free order-0 pages in the non-CMA
> pageblocks (so the additional check will not stop us) but fragmented
> and impossible to compact due to unmovable pages. Then we won't avoid
> your issue, right?
>

The pages that are pinned are mostly Transparent Huge Pages (THP).
Therefore, it is not common to find free order-0 pages in non-CMA
pageblocks that are fragmented and impossible to compact due to the
presence of unmovable pages. This patch resolves my issue.
>> Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
>> Signed-off-by: yangge
>
> In case it really helps reliably:
>
> Acked-by: Vlastimil Babka
>
> Some nits below:
>
>> ---
>>  mm/compaction.c | 31 +++++++++++++++++++++++++++----
>>  1 file changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..9032bb6 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2490,7 +2490,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>>   */
>>  static enum compact_result
>>  compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>> -				 int highest_zoneidx, unsigned int alloc_flags)
>> +				 int highest_zoneidx, unsigned int alloc_flags,
>> +				 bool async)
>>  {
>>  	unsigned long watermark;
>>
>> @@ -2499,6 +2500,25 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>>  			      alloc_flags))
>>  		return COMPACT_SUCCESS;
>>
>> +	/*
>> +	 * For costly orders, during the async memory compaction process, use the
>> +	 * actual allocation context to determine the watermarks. There's some risk
>> +	 * that in some different scenario the compaction could in fact migrate
>> +	 * pages from the exhausted non-CMA part of the zone to the CMA part and
>> +	 * succeed, and we'll skip it instead. But only __GFP_NORETRY allocations
>> +	 * should be affected in the immediate "goto nopage" when compaction is
>> +	 * skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't
>> +	 * fail without trying to compact-migrate the non-CMA pageblocks into CMA
>> +	 * pageblocks first, so it should be fine.
>
> I think it's explaining too much about why not do this than why do this. How
> about:
>
> For unmovable allocations (without ALLOC_CMA), check if there is enough free
> memory in the non-CMA pageblocks. Otherwise compaction could form the
> high-order page in CMA pageblocks, which would not help the allocation to
> succeed.
> However, limit the check to costly order async compaction (such as
> opportunistic THP attempts) because there is the possibility that compaction
> would migrate pages from non-CMA to CMA pageblock.
>
>> +	 */
>> +	if (order > PAGE_ALLOC_COSTLY_ORDER && async) {
>
> We could also check for !(alloc_flags & ALLOC_CMA) here to avoid the
> watermark check in the normal THP allocation case (not from pinned gup),
> because then it just repeats the watermark check that was done above.
>
>> +		watermark = low_wmark_pages(zone) + compact_gap(order);
>> +		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
>> +					 alloc_flags & ALLOC_CMA,
>
> And then here we can just pass 0.
>
>> +					 zone_page_state(zone, NR_FREE_PAGES)))
>> +			return COMPACT_SKIPPED;
>> +	}
>> +
>>  	if (!compaction_suitable(zone, order, highest_zoneidx))
>>  		return COMPACT_SKIPPED;
>>
>> @@ -2534,7 +2554,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>>  	if (!is_via_compact_memory(cc->order)) {
>>  		ret = compaction_suit_allocation_order(cc->zone, cc->order,
>>  						       cc->highest_zoneidx,
>> -						       cc->alloc_flags);
>> +						       cc->alloc_flags,
>> +						       cc->mode == MIGRATE_ASYNC);
>>  		if (ret != COMPACT_CONTINUE)
>>  			return ret;
>>  	}
>> @@ -3037,7 +3058,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
>>
>>  		ret = compaction_suit_allocation_order(zone,
>>  				pgdat->kcompactd_max_order,
>> -				highest_zoneidx, ALLOC_WMARK_MIN);
>> +				highest_zoneidx, ALLOC_WMARK_MIN,
>> +				0);
>
> It's bool, so false instead of 0.
>
>>  		if (ret == COMPACT_CONTINUE)
>>  			return true;
>>  	}
>> @@ -3078,7 +3100,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>>  			continue;
>>
>>  		ret = compaction_suit_allocation_order(zone,
>> -				cc.order, zoneid, ALLOC_WMARK_MIN);
>> +				cc.order, zoneid, ALLOC_WMARK_MIN,
>> +				cc.mode == MIGRATE_ASYNC);
>
> We could also just pass false here as kcompactd uses MIGRATE_SYNC_LIGHT and
> has no real alloc_context.
>
>>  		if (ret != COMPACT_CONTINUE)
>>  			continue;
>>