From: Ge Yang
Date: Mon, 13 Jan 2025 19:23:40 +0800
Subject: Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, david@redhat.com, baolin.wang@linux.alibaba.com, hannes@cmpxchg.org, liuzixing@hygon.cn
References: <1736335854-548-1-git-send-email-yangge1116@126.com>

On 2025/1/13 18:05, Barry Song wrote:
> On Mon, Jan 13, 2025 at 10:04 PM Ge Yang wrote:
>>
>>
>>
>> On 2025/1/13 16:47, Barry Song wrote:
>>> On Thu, Jan 9, 2025 at 12:31 AM wrote:
>>>>
>>>> From: yangge
>>>>
>>>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>>>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>>>> and starting a 32GB virtual machine with device passthrough is
>>>> extremely slow, taking almost an hour.
>>>>
>>>> During the start-up of the virtual machine, it will call
>>>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>>>> Long term GUP cannot allocate memory from CMA area, so a maximum of
>>>> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
>>>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>>>> sufficient to pass the order-0 watermark check, causing the
>>>> __compaction_suitable() function to consistently return true.
>>>> However, if there aren't enough migratable pages available, performing
>>>> memory compaction is also meaningless. Besides checking whether
>>>> the order-0 watermark is met, __compaction_suitable() also needs
>>>> to determine whether there are sufficient migratable pages available
>>>> for memory compaction.
>>>>
>>>> For costly allocations, because __compaction_suitable() always
>>>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>>>> place, resulting in excessively long virtual machine startup times.
>>>> Call trace:
>>>> __alloc_pages_slowpath
>>>>     if (compact_result == COMPACT_SKIPPED ||
>>>>         compact_result == COMPACT_DEFERRED)
>>>>         goto nopage; // should exit __alloc_pages_slowpath() from here
>>>>
>>>> When the 16GB of non-CMA memory on a single node is exhausted, we will
>>>> fall back to allocating memory on other nodes. In order to quickly
>>>> fall back to remote nodes, we should skip memory compaction when
>>>> migratable pages are insufficient.
>>>> After this fix, it only takes a few tens of seconds to start a 32GB
>>>> virtual machine with device passthrough functionality.
>>>>
>>>> Signed-off-by: yangge
>>>> ---
>>>>
>>>> V3:
>>>> - fix build error
>>>>
>>>> V2:
>>>> - consider unevictable folios
>>>>
>>>>  mm/compaction.c | 20 ++++++++++++++++++++
>>>>  1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>>> index 07bd227..a9f1261 100644
>>>> --- a/mm/compaction.c
>>>> +++ b/mm/compaction.c
>>>> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>>>>                                    int highest_zoneidx,
>>>>                                    unsigned long wmark_target)
>>>>  {
>>>> +       pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
>>>> +       unsigned long sum, nr_pinned;
>>>>         unsigned long watermark;
>>>> +
>>>> +       sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>>>> +               node_page_state(pgdat, NR_INACTIVE_ANON) +
>>>> +               node_page_state(pgdat, NR_ACTIVE_FILE) +
>>>> +               node_page_state(pgdat, NR_ACTIVE_ANON) +
>>>> +               node_page_state(pgdat, NR_UNEVICTABLE);
>>>> +
>>>> +       nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>>>> +               node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
>>>> +
>>>
>>> Does the sum of all LRU pages equal non-CMA memory?
>>> I'm quite confused for two reasons:
>>> 1. CMA pages can be LRU pages.
>>> 2. Free pages might not belong to any LRUs.
>> No.
>>
>> If all the pages in the LRU are pinned, it seems unnecessary to perform
>> memory compaction, as the migration of pinned pages is unlikely to succeed.
>> Besides checking whether the order-0 watermark is met,
>> __compaction_suitable() also needs to determine whether there are
>> sufficient migratable pages available for memory compaction.
>
> Ok, but I am not convinced that this is a correct patch. If all your
> CMA pages are used by userspace (in other words, they are in LRUs),
> the sum could become quite large, and `nr_pinned` might include
> non-CMA pages. In that case, `sum - nr_pinned` would also be quite large.
> The "return false" logic wouldn't work as intended.
>
> I suspect the issue disappeared simply because your CMA is not being
> used at all.
>

Part of the CMA is in use. Because __compaction_suitable() always returns
true, swapping is triggered, which evicts the in-use CMA pages to disk.
Ultimately only pinned pages remain on the LRU (Least Recently Used) lists.

>>>
>>>
>>>> +       /*
>>>> +        * Gup-pinned pages are non-migratable. After subtracting these pages,
>>>> +        * we need to check if the remaining pages are sufficient for memory
>>>> +        * compaction.
>>>> +        */
>>>> +       if ((sum - nr_pinned) < (1 << order))
>>>> +               return false;
>>>> +
>>>>         /*
>>>>          * Watermarks for order-0 must be met for compaction to be able to
>>>>          * isolate free pages for migration targets. This means that the
>>>> --
>>>> 2.7.4
>>>>
>>>>
>>>
>
> Thanks
> Barry
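
For readers following the thread, the disputed check can be modeled in user space. The sketch below is not kernel code: `struct node_counters` and the helper names are hypothetical stand-ins for the `node_page_state()` reads in the patch, and a 4 KiB page size is assumed when converting 16GB into a page count. It only mirrors the arithmetic of the patch, including its unsigned subtraction.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space stand-in for the per-node counters that the
 * patch reads with node_page_state(). All values are in pages.
 */
struct node_counters {
	unsigned long inactive_file;
	unsigned long inactive_anon;
	unsigned long active_file;
	unsigned long active_anon;
	unsigned long unevictable;
	unsigned long pin_acquired;	/* models NR_FOLL_PIN_ACQUIRED */
	unsigned long pin_released;	/* models NR_FOLL_PIN_RELEASED */
};

/* Sum of the LRU counters, as "sum" is computed in the patch. */
static unsigned long lru_pages(const struct node_counters *c)
{
	return c->inactive_file + c->inactive_anon +
	       c->active_file + c->active_anon + c->unevictable;
}

/* Outstanding pins, as "nr_pinned" is computed in the patch. */
static unsigned long pinned_pages(const struct node_counters *c)
{
	return c->pin_acquired - c->pin_released;
}

/*
 * Mirrors the patch's test: true means the patched
 * __compaction_suitable() would return false (skip compaction).
 * The subtraction is unsigned, so it wraps to a very large value if
 * the pinned count ever exceeds the LRU sum; this sketch keeps that
 * behaviour rather than guarding against it.
 */
static bool too_few_migratable(const struct node_counters *c, int order)
{
	return (lru_pages(c) - pinned_pages(c)) < (1UL << order);
}
```

With 4194304 LRU pages (16GB at the assumed 4 KiB page size) that are all pinned, `too_few_migratable(&c, 9)` is true, which is the situation Ge Yang describes. If unpinned CMA-backed pages also sit on the LRU lists, the same call returns false and compaction is still attempted, which is the situation Barry raises.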