Subject: Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
From: Ge Yang <yangge1116@126.com>
Date: Tue, 14 Jan 2025 10:51:52 +0800
To: Johannes Weiner
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, 21cnbao@gmail.com, david@redhat.com, baolin.wang@linux.alibaba.com, liuzixing@hygon.cn, Vlastimil Babka
In-Reply-To: <20250113154657.GA829144@cmpxchg.org>
References: <1736335854-548-1-git-send-email-yangge1116@126.com> <20250113154657.GA829144@cmpxchg.org>
On 2025/1/13 23:46, Johannes Weiner wrote:
> CC Vlastimil
>
> On Wed, Jan 08, 2025 at 07:30:54PM +0800, yangge1116@126.com wrote:
>> From: yangge
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> During the start-up of the virtual machine, it calls
>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>> Long-term GUP cannot allocate memory from the CMA area, so at most
>> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
>> memory. The 16GB of free CMA memory on a NUMA node is sufficient to
>> pass the order-0 watermark check, causing the __compaction_suitable()
>> function to consistently return true. However, if there aren't enough
>> migratable pages available, performing memory compaction is
>> meaningless. Besides checking whether the order-0 watermark is met,
>> __compaction_suitable() also needs to determine whether there are
>> sufficient migratable pages available for memory compaction.
>>
>> For costly allocations, because __compaction_suitable() always
>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>> place, resulting in excessively long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>     if (compact_result == COMPACT_SKIPPED ||
>>         compact_result == COMPACT_DEFERRED)
>>         goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> When the 16GB of non-CMA memory on a single node is exhausted, we
>> fall back to allocating memory on other nodes. In order to fall back
>> to remote nodes quickly, we should skip memory compaction when
>> migratable pages are insufficient.
>> After this fix, it only takes a
>> few tens of seconds to start a 32GB virtual machine with device
>> passthrough functionality.
>>
>> Signed-off-by: yangge
>> ---
>>
>> V3:
>> - fix build error
>>
>> V2:
>> - consider unevictable folios
>>
>>  mm/compaction.c | 20 ++++++++++++++++++++
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..a9f1261 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>>  					int highest_zoneidx,
>>  					unsigned long wmark_target)
>>  {
>> +	pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
>> +	unsigned long sum, nr_pinned;
>>  	unsigned long watermark;
>> +
>> +	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>> +	      node_page_state(pgdat, NR_INACTIVE_ANON) +
>> +	      node_page_state(pgdat, NR_ACTIVE_FILE) +
>> +	      node_page_state(pgdat, NR_ACTIVE_ANON) +
>> +	      node_page_state(pgdat, NR_UNEVICTABLE);
>
> What about PAGE_MAPPING_MOVABLE pages that aren't on this list? For
> example, zsmalloc backend pages can be a large share of allocated
> memory, and they are compactable. You would give up on compaction
> prematurely and cause unnecessary allocation failures.
>

Yes, indeed there are pages that are not on the LRU lists but still
support migration. Currently, technologies such as balloon, z3fold,
and zsmalloc use such pages. Perhaps we could add an item to
node_stat_item to keep statistics on these pages.

> That scenario is way more common than the one you're trying to fix.
>
> I think trying to make this list complete, and maintaining it, is
> painstaking and error prone. And errors are hard to detect: they will
> just manifest as spurious failures in higher order requests that you'd
> need to catch with tracing enabled in the right moments.
>
> So I'm not a fan of this approach.
>
> Compaction is already skipped when previous runs were not successful.
> See defer_compaction() and compaction_deferred(). Why is this not
> helping here?

	if (prio != COMPACT_PRIO_ASYNC &&
	    (status == COMPACT_COMPLETE || status == COMPACT_PARTIAL_SKIPPED))
		defer_compaction(zone, order);

defer_compaction(zone, order) is only executed when prio !=
COMPACT_PRIO_ASYNC. In __alloc_pages_slowpath(), during the first call
to __alloc_pages_direct_compact(), prio is equal to COMPACT_PRIO_ASYNC,
so defer_compaction(zone, order) is not executed. Instead, the
allocation eventually proceeds to the time-consuming
__alloc_pages_direct_reclaim(). This could be avoided in scenarios
where memory compaction is not suitable.

>
>> +	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>> +		    node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
>
> Likewise, as Barry notes, not all pinned pages are necessarily LRU
> pages. remap_vmalloc_range() pages come to mind. You can't do subset
> math on potentially disjunct sets.

Indeed, some problem scenarios cannot be solved this way, but this
approach does resolve some of them. We haven't come up with a better
solution yet.

>
>> +	/*
>> +	 * Gup-pinned pages are non-migratable. After subtracting these
>> +	 * pages, we need to check if the remaining pages are sufficient
>> +	 * for memory compaction.
>> +	 */
>> +	if ((sum - nr_pinned) < (1 << order))
>> +		return false;
>> +