Message-ID: <39709cd7-0de6-4247-a70a-9f81fccbce6a@suse.cz>
Date: Tue, 14 Jan 2025 12:21:23 +0100
Subject: Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
To: yangge1116@126.com, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, 21cnbao@gmail.com, david@redhat.com, baolin.wang@linux.alibaba.com, hannes@cmpxchg.org, liuzixing@hygon.cn
References: <1736335854-548-1-git-send-email-yangge1116@126.com>
From: Vlastimil Babka
In-Reply-To: <1736335854-548-1-git-send-email-yangge1116@126.com>

On 1/8/25 12:30, yangge1116@126.com wrote:
> From: yangge
>
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
>
> During the start-up of the virtual machine, it will call
> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> Long term GUP cannot allocate memory from CMA area, so a maximum of
> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
> memory. There is 16GB of free CMA memory on a NUMA node, which is
> sufficient to pass the order-0 watermark check, causing the
> __compaction_suitable() function to consistently return true.
> However, if there aren't enough migratable pages available, performing
> memory compaction is also meaningless. Besides checking whether
> the order-0 watermark is met, __compaction_suitable() also needs
> to determine whether there are sufficient migratable pages available
> for memory compaction.
>
> For costly allocations, because __compaction_suitable() always
> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> place, resulting in excessively long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
>
> When the 16G of non-CMA memory on a single node is exhausted, we will
> fallback to allocating memory on other nodes. In order to quickly
> fallback to remote nodes, we should skip memory compaction when
> migratable pages are insufficient. After this fix, it only takes a
> few tens of seconds to start a 32GB virtual machine with device
> passthrough functionality.
>
> Signed-off-by: yangge
> ---
>
> V3:
> - fix build error
>
> V2:
> - consider unevictable folios
>
>  mm/compaction.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..a9f1261 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>  				  int highest_zoneidx,
>  				  unsigned long wmark_target)
>  {
> +	pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
> +	unsigned long sum, nr_pinned;
>  	unsigned long watermark;
> +
> +	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
> +		node_page_state(pgdat, NR_INACTIVE_ANON) +
> +		node_page_state(pgdat, NR_ACTIVE_FILE) +
> +		node_page_state(pgdat, NR_ACTIVE_ANON) +
> +		node_page_state(pgdat, NR_UNEVICTABLE);

In addition to what Johannes pointed out, these are whole-node numbers
and compaction works on a zone level.

> +
> +	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
> +		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);

Statistics of *events* used to derive current *state*... I don't think we
do that anywhere else? I'm not sure if we make sure vmstat events may
never be missed, as they are only for statistics. IIUC we allow some rare
races to have less expensive synchronization?

But anyway let's try looking for a different solution.

Assuming this is a THP allocation attempt (__GFP_THISNODE even?)
and we are in the "For costly allocations, try direct compaction first"
part of __alloc_pages_slowpath() right?

Hopefully also when done from the pin_user_pages_remote(...,
FOLL_LONGTERM, ...) context the allocation gfp_mask correctly lacks
__GFP_MOVABLE? I guess it has to, otherwise it would allocate from the
CMA pageblocks.

Then I wonder if we could use the real allocation context to determine
watermarks, as __compaction_suitable() is passing ALLOC_CMA instead
because it's checking only for migration targets, which have to be CMA
compatible by definition. But we could use the real unmovable allocation
context to have __zone_watermark_unusable_free() subtract CMA pages, and
thus we won't pass the order-0 check anymore once the non-CMA part is
exhausted.

There's some risk that in some different scenario the compaction could in
fact migrate pages from the exhausted non-CMA part of the zone to the CMA
part and succeed, and we'll skip it instead. But that should be rare?

Anyway given that concern I'm not sure about changing
__compaction_suitable() for every caller like this. We could (at least
initially) target this heuristic only for COMPACT_PRIO_ASYNC which is
being used for this THP opportunistic attempt.

So for example:

- add a new bool flag to compact_control that is true for
  COMPACT_PRIO_ASYNC
- pass cc pointer to compaction_suit_allocation_order()
- in that function, add another check if the new cc flag is true,
  between the current zone_watermark_ok() and compaction_suitable()
  checks, which works like __compaction_suitable() but uses alloc_flags
  (which should not be ALLOC_CMA in our pinned allocation case) instead
  of ALLOC_CMA, return COMPACT_SKIPPED if it fails.

> +	/*
> +	 * Gup-pinned pages are non-migratable. After subtracting these pages,
> +	 * we need to check if the remaining pages are sufficient for memory
> +	 * compaction.
> +	 */
> +	if ((sum - nr_pinned) < (1 << order))
> +		return false;
> +
>  	/*
>  	 * Watermarks for order-0 must be met for compaction to be able to
>  	 * isolate free pages for migration targets. This means that the