From: Baolin Wang
Date: Wed, 8 Jan 2025 10:50:42 +0800
Subject: Re: [PATCH] mm: compaction: skip memory compaction when there are not enough migratable pages
To: Ge Yang, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, 21cnbao@gmail.com, david@redhat.com, hannes@cmpxchg.org, liuzixing@hygon.cn
Message-ID: <180269be-f344-49e8-86da-23dda0bb31a0@linux.alibaba.com>
In-Reply-To: <2889f0bf-b0ae-4f1a-b91c-fb4b59eb2d97@126.com>
References: <1735981122-2085-1-git-send-email-yangge1116@126.com> <2889f0bf-b0ae-4f1a-b91c-fb4b59eb2d97@126.com>

On 2025/1/6 16:49, Ge Yang wrote:
>
>
> On 2025/1/6 16:12, Baolin Wang wrote:
>>
>>
>> On 2025/1/4 16:58, yangge1116@126.com wrote:
>>> From: yangge
>>>
>>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>>> and starting a 32GB virtual machine with device passthrough is
>>> extremely slow, taking almost an hour.
>>>
>>> During the start-up of the virtual machine, it will call
>>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>>> Long term GUP cannot allocate memory from CMA area, so a maximum of
>>> 16 GB of non-CMA memory on a NUMA node can be used as virtual machine
>>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>>> sufficient to pass the order-0 watermark check, causing the
>>> __compaction_suitable() function to consistently return true.
>>> However, if there aren't enough migratable pages available, performing
>>> memory compaction is also meaningless. Besides checking whether
>>> the order-0 watermark is met, __compaction_suitable() also needs
>>> to determine whether there are sufficient migratable pages available
>>> for memory compaction.
>>>
>>> For costly allocations, because __compaction_suitable() always
>>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>>> place, resulting in excessively long virtual machine startup times.
>>> Call trace:
>>> __alloc_pages_slowpath
>>>      if (compact_result == COMPACT_SKIPPED ||
>>>          compact_result == COMPACT_DEFERRED)
>>>          goto nopage; // should exit __alloc_pages_slowpath() from here
>>>
>>> When the 16G of non-CMA memory on a single node is exhausted, we will
>>> fall back to allocating memory on other nodes. In order to quickly
>>> fall back to remote nodes, we should skip memory compaction when
>>> migratable pages are insufficient. After this fix, it only takes a
>>> few tens of seconds to start a 32GB virtual machine with device
>>> passthrough functionality.
>>>
>>> Signed-off-by: yangge
>>> ---
>>>   mm/compaction.c | 19 +++++++++++++++++++
>>>   1 file changed, 19 insertions(+)
>>>
>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>> index 07bd227..1c469b3 100644
>>> --- a/mm/compaction.c
>>> +++ b/mm/compaction.c
>>> @@ -2383,7 +2383,26 @@ static bool __compaction_suitable(struct zone *zone, int order,
>>>                     int highest_zoneidx,
>>>                     unsigned long wmark_target)
>>>   {
>>> +    pg_data_t *pgdat = zone->zone_pgdat;
>>> +    unsigned long sum, nr_pinned;
>>>       unsigned long watermark;
>>> +
>>> +    sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>>> +        node_page_state(pgdat, NR_INACTIVE_ANON) +
>>> +        node_page_state(pgdat, NR_ACTIVE_FILE) +
>>> +        node_page_state(pgdat, NR_ACTIVE_ANON);
>>> +
>>> +    nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>>> +        node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
>>> +
>>> +    /*
>>> +     * Gup-pinned pages are non-migratable. After subtracting these pages,
>>> +     * we need to check if the remaining pages are sufficient for memory
>>> +     * compaction.
>>> +     */
>>> +    if ((sum - nr_pinned) < (1 << order))
>>> +        return false;
>>> +
>>
>> IMO, using the node's statistics to determine whether the zone is
>> suitable for compaction doesn't make sense. It is possible that even
>> though the normal zone has long-term pinned pages, the movable zone
>> can still be suitable for compaction.
> If all the memory used on a node is pinned, then this memory cannot be
> migrated anymore, and memory compaction operations would not succeed.
> I haven't used movable zone before, can you explain why memory
> compaction is still necessary? Thank you.

Please consider unevictable folios that are not in the active/inactive
file/anon LRU lists, yet can still be migrated.