Message-ID: <4b3c4ecf-2a3f-4868-8b6a-5c1b1920ca79@126.com>
Date: Thu, 16 Jan 2025 09:33:21 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH] mm: compaction: use the actual allocation context to determine the watermarks for costly order during async memory compaction
To: Vlastimil Babka, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, 21cnbao@gmail.com, david@redhat.com, baolin.wang@linux.alibaba.com, hannes@cmpxchg.org, liuzixing@hygon.cn
References: <1736929894-19228-1-git-send-email-yangge1116@126.com>
From: Ge Yang
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
On 2025/1/15 17:56, Vlastimil Babka wrote:
> On 1/15/25 09:31, yangge1116@126.com wrote:
>> From: yangge
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> Long term GUP cannot allocate memory from the CMA area, so a maximum
>> of 16GB of non-CMA memory on a NUMA node can be used as virtual
>> machine memory. There is 16GB of free CMA memory on a NUMA node,
>> which is sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to consistently return true.
>>
>> For costly allocations, if the __compaction_suitable() function always
>> returns true, it causes the __alloc_pages_slowpath() function to fail
>> to exit at the appropriate point. This prevents timely fallback to
>> allocating memory on other nodes, ultimately resulting in excessively
>> long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>     if (compact_result == COMPACT_SKIPPED ||
>>         compact_result == COMPACT_DEFERRED)
>>         goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> We could use the real unmovable allocation context to have
>> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't
>> pass the order-0 check anymore once the non-CMA part is exhausted.
>> There is some risk that in some different scenario the compaction
>> could in fact migrate pages from the exhausted non-CMA part of the
>> zone to the CMA part and succeed, and we'll skip it instead. But only
>> __GFP_NORETRY allocations should be affected in the immediate "goto
>> nopage" when compaction is skipped; others will attempt with
>> DEF_COMPACT_PRIORITY anyway and won't fail without trying to
>> compact-migrate the non-CMA pageblocks into CMA pageblocks first, so
>> it should be fine.
>>
>> After this fix, it only takes a few tens of seconds to start a 32GB
>> virtual machine with device passthrough.
>
> So did you verify it works?

After multiple tests, it has been confirmed to work properly. Thank you.

> I just realized there might still be cases it
> won't help. There might be enough free order-0 pages in the non-CMA
> pageblocks (so the additional check will not stop us) but fragmented
> and impossible to compact due to unmovable pages. Then we won't avoid
> your issue, right?
>

The pages that are pinned are mostly Transparent Huge Pages (THP).
Therefore, it is not common to find free order-0 pages in non-CMA
pageblocks that are fragmented and impossible to compact due to the
presence of unmovable pages. This patch resolves my issue.
>> Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
>> Signed-off-by: yangge
>
> In case it really helps reliably:
>
> Acked-by: Vlastimil Babka
>
> Some nits below:
>
>> ---
>>  mm/compaction.c | 31 +++++++++++++++++++++++++++----
>>  1 file changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..9032bb6 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2490,7 +2490,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>>   */
>>  static enum compact_result
>>  compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>> -				 int highest_zoneidx, unsigned int alloc_flags)
>> +				 int highest_zoneidx, unsigned int alloc_flags,
>> +				 bool async)
>>  {
>>  	unsigned long watermark;
>>
>> @@ -2499,6 +2500,25 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>>  			      alloc_flags))
>>  		return COMPACT_SUCCESS;
>>
>> +	/*
>> +	 * For costly orders, during the async memory compaction process, use the
>> +	 * actual allocation context to determine the watermarks. There's some risk
>> +	 * that in some different scenario the compaction could in fact migrate
>> +	 * pages from the exhausted non-CMA part of the zone to the CMA part and
>> +	 * succeed, and we'll skip it instead. But only __GFP_NORETRY allocations
>> +	 * should be affected in the immediate "goto nopage" when compaction is
>> +	 * skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't
>> +	 * fail without trying to compact-migrate the non-CMA pageblocks into CMA
>> +	 * pageblocks first, so it should be fine.
>
> I think it's explaining too much about why not do this than why do this. How
> about:
>
> For unmovable allocations (without ALLOC_CMA), check if there is enough free
> memory in the non-CMA pageblocks. Otherwise compaction could form the
> high-order page in CMA pageblocks, which would not help the allocation to
> succeed.
> However, limit the check to costly order async compaction (such as
> opportunistic THP attempts) because there is the possibility that compaction
> would migrate pages from non-CMA to CMA pageblock.
>
>> +	 */
>> +	if (order > PAGE_ALLOC_COSTLY_ORDER && async) {
>
> We could also check for !(alloc_flags & ALLOC_CMA) here to avoid the
> watermark check in the normal THP allocation case (not from pinned gup),
> because then it just repeats the watermark check that was done above.
>
>> +		watermark = low_wmark_pages(zone) + compact_gap(order);
>> +		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
>> +					 alloc_flags & ALLOC_CMA,
>
> And then here we can just pass 0.
>
>> +					 zone_page_state(zone, NR_FREE_PAGES)))
>> +			return COMPACT_SKIPPED;
>> +	}
>> +
>>  	if (!compaction_suitable(zone, order, highest_zoneidx))
>>  		return COMPACT_SKIPPED;
>>
>> @@ -2534,7 +2554,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>>  	if (!is_via_compact_memory(cc->order)) {
>>  		ret = compaction_suit_allocation_order(cc->zone, cc->order,
>>  						       cc->highest_zoneidx,
>> -						       cc->alloc_flags);
>> +						       cc->alloc_flags,
>> +						       cc->mode == MIGRATE_ASYNC);
>>  		if (ret != COMPACT_CONTINUE)
>>  			return ret;
>>  	}
>> @@ -3037,7 +3058,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
>>
>>  		ret = compaction_suit_allocation_order(zone,
>>  				pgdat->kcompactd_max_order,
>> -				highest_zoneidx, ALLOC_WMARK_MIN);
>> +				highest_zoneidx, ALLOC_WMARK_MIN,
>> +				0);
>
> It's bool, so false instead of 0.
>
>>  		if (ret == COMPACT_CONTINUE)
>>  			return true;
>>  	}
>> @@ -3078,7 +3100,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>>  			continue;
>>
>>  		ret = compaction_suit_allocation_order(zone,
>> -				cc.order, zoneid, ALLOC_WMARK_MIN);
>> +				cc.order, zoneid, ALLOC_WMARK_MIN,
>> +				cc.mode == MIGRATE_ASYNC);
>
> We could also just pass false here as kcompactd uses MIGRATE_SYNC_LIGHT and
> has no real alloc_context.
>
>>  		if (ret != COMPACT_CONTINUE)
>>  			continue;
>>