linux-mm.kvack.org archive mirror
From: Ge Yang <yangge1116@126.com>
To: Vlastimil Babka <vbabka@suse.cz>, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	21cnbao@gmail.com, david@redhat.com,
	baolin.wang@linux.alibaba.com, hannes@cmpxchg.org,
	liuzixing@hygon.cn
Subject: Re: [PATCH] mm: compaction: use the actual allocation context to determine the watermarks for costly order during async memory compaction
Date: Thu, 16 Jan 2025 09:33:21 +0800
Message-ID: <4b3c4ecf-2a3f-4868-8b6a-5c1b1920ca79@126.com>
In-Reply-To: <d09cbac5-d3fd-4212-a138-8ab6496c8f4f@suse.cz>



On 2025/1/15 17:56, Vlastimil Babka wrote:
> On 1/15/25 09:31, yangge1116@126.com wrote:
>> From: yangge <yangge1116@126.com>
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> Long term GUP cannot allocate memory from CMA area, so a maximum of
>> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>> sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to consistently return true.
>>
>> For costly allocations, if the __compaction_suitable() function always
>> returns true, it causes the __alloc_pages_slowpath() function to fail
>> to exit at the appropriate point. This prevents timely fallback to
>> allocating memory on other nodes, ultimately resulting in excessively
>> long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>      if (compact_result == COMPACT_SKIPPED ||
>>          compact_result == COMPACT_DEFERRED)
>>          goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> We could use the real unmovable allocation context to have
>> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't
>> pass the order-0 check anymore once the non-CMA part is exhausted. There
>> is some risk that in some different scenario the compaction could in
>> fact migrate pages from the exhausted non-CMA part of the zone to the
>> CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY
>> allocations should be affected in the immediate "goto nopage" when
>> compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY
>> anyway and won't fail without trying to compact-migrate the non-CMA
>> pageblocks into CMA pageblocks first, so it should be fine.
>>
>> After this fix, it only takes a few tens of seconds to start a 32GB
>> virtual machine with device passthrough functionality.
> 
> So did you verify it works?

After multiple tests, I have confirmed that it works properly. Thank you.

> I just realized there might be still cases it
> won't help. There might be enough free order-0 pages in the non-CMA
> pageblocks (so the additional check will not stop us) but fragmented and
> impossible to compact due to unmovable pages. Then we won't avoid your
> issue, right?
> 

In my workload, most of the pinned pages are Transparent Huge Pages (THP),
so it is uncommon for the free order-0 pages in non-CMA pageblocks to be
fragmented and impossible to compact because of unmovable pages. This
patch resolves my issue.

>> Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
>> Signed-off-by: yangge <yangge1116@126.com>
> 
> In case it really helps reliably:
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Some nits below:
> 
>> ---
>>   mm/compaction.c | 31 +++++++++++++++++++++++++++----
>>   1 file changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..9032bb6 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2490,7 +2490,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>>    */
>>   static enum compact_result
>>   compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>> -				 int highest_zoneidx, unsigned int alloc_flags)
>> +				 int highest_zoneidx, unsigned int alloc_flags,
>> +				 bool async)
>>   {
>>   	unsigned long watermark;
>>   
>> @@ -2499,6 +2500,25 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>>   			      alloc_flags))
>>   		return COMPACT_SUCCESS;
>>   
>> +	/*
>> +	 * For costly orders, during the async memory compaction process, use the
>> +	 * actual allocation context to determine the watermarks. There's some risk
>> +	 * that in some different scenario the compaction could in fact migrate
>> +	 * pages from the exhausted non-CMA part of the zone to the CMA part and
>> +	 * succeed, and we'll skip it instead. But only __GFP_NORETRY allocations
>> +	 * should be affected in the immediate "goto nopage" when compaction is
>> +	 * skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't
>> +	 * fail without trying to compact-migrate the non-CMA pageblocks into CMA
>> +	 * pageblocks first, so it should be fine.
> 
> I think it's explaining too much about why not do this than why do this. How
> about:
> 
> For unmovable allocations (without ALLOC_CMA), check if there is enough free
> memory in the non-CMA pageblocks. Otherwise compaction could form the
> high-order page in CMA pageblocks, which would not help the allocation to
> succeed. However, limit the check to costly order async compaction (such as
> opportunistic THP attempts) because there is the possibility that compaction
> would migrate pages from non-CMA to CMA pageblock.
> 
>> +	 */
>> +	if (order > PAGE_ALLOC_COSTLY_ORDER && async) {
> 
> We could also check for !(alloc_flags & ALLOC_CMA) here to avoid the
> watermark check in the normal THP allocation case (not from pinned gup),
> because then it just repeats the watermark check that was done above.
> 
>> +		watermark = low_wmark_pages(zone) + compact_gap(order);
>> +		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
>> +					   alloc_flags & ALLOC_CMA,
> 
> And then here we can just pass 0.
> 
>> +					   zone_page_state(zone, NR_FREE_PAGES)))
>> +			return COMPACT_SKIPPED;
>> +	}
>> +
>>   	if (!compaction_suitable(zone, order, highest_zoneidx))
>>   		return COMPACT_SKIPPED;
>>   
>> @@ -2534,7 +2554,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>>   	if (!is_via_compact_memory(cc->order)) {
>>   		ret = compaction_suit_allocation_order(cc->zone, cc->order,
>>   						       cc->highest_zoneidx,
>> -						       cc->alloc_flags);
>> +						       cc->alloc_flags,
>> +						       cc->mode == MIGRATE_ASYNC);
>>   		if (ret != COMPACT_CONTINUE)
>>   			return ret;
>>   	}
>> @@ -3037,7 +3058,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
>>   
>>   		ret = compaction_suit_allocation_order(zone,
>>   				pgdat->kcompactd_max_order,
>> -				highest_zoneidx, ALLOC_WMARK_MIN);
>> +				highest_zoneidx, ALLOC_WMARK_MIN,
>> +				0);
> 
> It's bool, so false instead of 0.
> 
>>   		if (ret == COMPACT_CONTINUE)
>>   			return true;
>>   	}
>> @@ -3078,7 +3100,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>>   			continue;
>>   
>>   		ret = compaction_suit_allocation_order(zone,
>> -				cc.order, zoneid, ALLOC_WMARK_MIN);
>> +				cc.order, zoneid, ALLOC_WMARK_MIN,
>> +				cc.mode == MIGRATE_ASYNC);
> 
> We could also just pass false here as kcompactd uses MIGRATE_SYNC_LIGHT and
> has no real alloc_context.
> 
>>   		if (ret != COMPACT_CONTINUE)
>>   			continue;
>>   
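
For reference, here is a minimal userspace sketch of how the check could
look with the nits above folded in (test for !ALLOC_CMA up front, so the
order-0 check always excludes CMA pages). It is only a model under
simplifying assumptions, not the patch itself: struct zone_model,
usable_free() and skip_compaction() are hypothetical names, and lowmem
reserves and highatomic reserves are ignored.

```c
/* Userspace model only, NOT kernel code. All names below except
 * compact_gap() and PAGE_ALLOC_COSTLY_ORDER are hypothetical. */
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* same value as in the kernel */

struct zone_model {
	unsigned long free;		/* all free pages, CMA included */
	unsigned long free_cma;		/* free pages inside CMA pageblocks */
	unsigned long low_wmark;	/* low watermark, in pages */
};

/* Same formula as the kernel's compact_gap(): twice the block size. */
static unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

/* Free pages the caller may use; CMA pages count only with can_use_cma. */
static unsigned long usable_free(const struct zone_model *z, bool can_use_cma)
{
	assert(z->free_cma <= z->free);
	return can_use_cma ? z->free : z->free - z->free_cma;
}

/*
 * Returns true when compaction should be skipped: a costly-order async
 * attempt that cannot use CMA, with too few free non-CMA pages to pass
 * an order-0 check against low watermark + compact_gap(order).
 */
static bool skip_compaction(const struct zone_model *z, unsigned int order,
			    bool can_use_cma, bool async)
{
	if (order <= PAGE_ALLOC_COSTLY_ORDER || !async || can_use_cma)
		return false;

	return usable_free(z, false) <= z->low_wmark + compact_gap(order);
}
```

With 16GB of free CMA (4194304 pages of 4KB) and only a few thousand free
non-CMA pages, skip_compaction() returns true for an order-9 async request
that cannot use CMA, and false as soon as can_use_cma is set. The model
counts pages only, so fragmented free non-CMA memory still passes the
check, which is the remaining case you pointed out.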


