From: Ge Yang
To: Vlastimil Babka, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, 21cnbao@gmail.com,
    david@redhat.com, baolin.wang@linux.alibaba.com, hannes@cmpxchg.org,
    liuzixing@hygon.cn
Subject: Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
Date: Tue, 14 Jan 2025 20:24:34 +0800
References: <1736335854-548-1-git-send-email-yangge1116@126.com>
    <39709cd7-0de6-4247-a70a-9f81fccbce6a@suse.cz>
In-Reply-To: <39709cd7-0de6-4247-a70a-9f81fccbce6a@suse.cz>

On 2025/1/14 19:21, Vlastimil Babka wrote:
> On 1/8/25 12:30, yangge1116@126.com wrote:
>> From: yangge
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> During the start-up of the virtual machine, it will call
>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>> Long term GUP cannot allocate memory from CMA area, so a maximum of
>> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>> sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to consistently return true.
>> However, if there aren't enough migratable pages available, performing
>> memory compaction is also meaningless. Besides checking whether
>> the order-0 watermark is met, __compaction_suitable() also needs
>> to determine whether there are sufficient migratable pages available
>> for memory compaction.
>>
>> For costly allocations, because __compaction_suitable() always
>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>> place, resulting in excessively long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>     if (compact_result == COMPACT_SKIPPED ||
>>         compact_result == COMPACT_DEFERRED)
>>         goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> When the 16GB of non-CMA memory on a single node is exhausted, we will
>> fall back to allocating memory on other nodes. In order to quickly
>> fall back to remote nodes, we should skip memory compaction when
>> migratable pages are insufficient.
>> After this fix, it only takes a few tens of seconds to start a 32GB
>> virtual machine with device passthrough functionality.
>>
>> Signed-off-by: yangge
>> ---
>>
>> V3:
>> - fix build error
>>
>> V2:
>> - consider unevictable folios
>>
>>  mm/compaction.c | 20 ++++++++++++++++++++
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..a9f1261 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>>                                    int highest_zoneidx,
>>                                    unsigned long wmark_target)
>>  {
>> +        pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
>> +        unsigned long sum, nr_pinned;
>>          unsigned long watermark;
>> +
>> +        sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>> +                node_page_state(pgdat, NR_INACTIVE_ANON) +
>> +                node_page_state(pgdat, NR_ACTIVE_FILE) +
>> +                node_page_state(pgdat, NR_ACTIVE_ANON) +
>> +                node_page_state(pgdat, NR_UNEVICTABLE);
>
> In addition to what Johannes pointed out, these are whole-node numbers and
> compaction works on a zone level.
>
>> +
>> +        nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>> +                node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
>
> Statistics of *events* used to derive current *state*... I don't think we do
> that anywhere else? I'm not sure if we make sure vmstat events may never be
> missed, as they are only for statistics. IIUC we allow some rare races to
> have less expensive synchronization?
>
> But anyway let's try looking for a different solution.
>
> Assuming this is a THP allocation attempt (__GFP_THISNODE even?)

Yes, Transparent Huge Pages are allocated using the __GFP_THISNODE flag.

> and we are in the "For costly allocations, try direct compaction first"
> part of __alloc_pages_slowpath() right?

Yes, memory is allocated using the following strategy:

static struct page *alloc_pages_mpol()
{
        /* 1. Try to allocate the THP on the local node only. */
        page = __alloc_frozen_pages_noprof(__GFP_THISNODE, ...);
        if (page || !(gfp & __GFP_DIRECT_RECLAIM))
                return page;

        /* 2. Fall back to remote NUMA nodes. */
        page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);
}

> Hopefully also when done from the pin_user_pages_remote(..., FOLL_LONGTERM,
> ...) context the allocation gfp_mask correctly lacks __GFP_MOVABLE? I guess
> it has to, otherwise it would allocate from the CMA pageblocks.

Yes.

> Then I wonder if we could use the real allocation context to determine
> watermarks, as __compaction_suitable() is passing ALLOC_CMA instead because
> it's checking only for migration targets, which have to be CMA compatible by
> definition. But we could use the real unmovable allocation context to have
> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't pass
> the order-0 check anymore once the non-CMA part is exhausted.
>
> There's some risk that in some different scenario the compaction could in
> fact migrate pages from the exhausted non-CMA part of the zone to the CMA
> part and succeed, and we'll skip it instead. But that should be rare?

Here is the previous discussion:
https://lore.kernel.org/lkml/1734436004-1212-1-git-send-email-yangge1116@126.com/

> Anyway given that concern I'm not sure about changing
> __compaction_suitable() for every caller like this. We could (at least
> initially) target this heuristic only for COMPACT_PRIO_ASYNC which is being
> used for this THP opportunistic attempt.
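
To put numbers on the watermark reasoning above, here is a small user-space
model (not kernel code) of the order-0 check. The struct, the function names,
the watermark value and the 1024 leftover non-CMA pages are all assumptions
made for illustration; only the 16GB of free CMA memory and the exhausted
non-CMA memory come from the report. It assumes 4KiB pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of one zone; not a kernel structure. */
struct zone_model {
	unsigned long free_pages;     /* all free pages, CMA included */
	unsigned long free_cma_pages; /* free pages in CMA pageblocks */
	unsigned long watermark;      /* order-0 watermark target, in pages */
};

/*
 * Order-0 check in the spirit of the kernel's watermark test: free pages
 * that the allocation context cannot use are subtracted before comparing.
 */
static bool order0_watermark_ok(const struct zone_model *z, bool can_use_cma)
{
	unsigned long usable = z->free_pages;

	/* CMA free pages are a subset of all free pages. */
	assert(z->free_cma_pages <= z->free_pages);
	if (!can_use_cma)
		usable -= z->free_cma_pages;
	return usable > z->watermark;
}

/* Scenario from the report: 16GB of free CMA, non-CMA nearly exhausted. */
static struct zone_model exhausted_non_cma_zone(void)
{
	const unsigned long pages_per_gb = (1UL << 30) / 4096;
	struct zone_model z = {
		.free_pages = 16 * pages_per_gb + 1024, /* leftover: assumed */
		.free_cma_pages = 16 * pages_per_gb,
		.watermark = 16384,                     /* assumed value */
	};

	return z;
}
```

In this model, order0_watermark_ok(&z, true) passes for the zone returned by
exhausted_non_cma_zone(), which corresponds to checking with ALLOC_CMA, while
order0_watermark_ok(&z, false) fails, which corresponds to checking with the
real non-movable allocation context.
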
>
> So for example:
> - add a new bool flag to compact_control that is true for COMPACT_PRIO_ASYNC
> - pass cc pointer to compaction_suit_allocation_order()
> - in that function, add another check if the new cc flag is true,
> between the current zone_watermark_ok() and compaction_suitable() checks,
> which works like __compaction_suitable() but uses alloc_flags (which should
> not be ALLOC_CMA in our pinned allocation case) instead of ALLOC_CMA, return
> COMPACT_SKIPPED if it fails.

I will send a new version of the patch based on the suggestions here.
Thank you.

>> +        /*
>> +         * Gup-pinned pages are non-migratable. After subtracting these pages,
>> +         * we need to check if the remaining pages are sufficient for memory
>> +         * compaction.
>> +         */
>> +        if ((sum - nr_pinned) < (1 << order))
>> +                return false;
>> +
>>          /*
>>           * Watermarks for order-0 must be met for compaction to be able to
>>           * isolate free pages for migration targets. This means that the