Date: Mon, 13 Jan 2025 10:46:57 -0500
From: Johannes Weiner
To: yangge1116@126.com
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, 21cnbao@gmail.com, david@redhat.com, baolin.wang@linux.alibaba.com, liuzixing@hygon.cn, Vlastimil Babka
Subject: Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
Message-ID: <20250113154657.GA829144@cmpxchg.org>
References: <1736335854-548-1-git-send-email-yangge1116@126.com>
In-Reply-To: <1736335854-548-1-git-send-email-yangge1116@126.com>

CC Vlastimil

On Wed, Jan 08, 2025 at 07:30:54PM
+0800, yangge1116@126.com wrote:
> From: yangge
>
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
>
> During the start-up of the virtual machine, it will call
> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> Long term GUP cannot allocate memory from CMA area, so a maximum of
> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
> memory. There is 16GB of free CMA memory on a NUMA node, which is
> sufficient to pass the order-0 watermark check, causing the
> __compaction_suitable() function to consistently return true.
> However, if there aren't enough migratable pages available, performing
> memory compaction is also meaningless. Besides checking whether
> the order-0 watermark is met, __compaction_suitable() also needs
> to determine whether there are sufficient migratable pages available
> for memory compaction.
>
> For costly allocations, because __compaction_suitable() always
> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> place, resulting in excessively long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
>
> When the 16G of non-CMA memory on a single node is exhausted, we will
> fallback to allocating memory on other nodes. In order to quickly
> fallback to remote nodes, we should skip memory compaction when
> migratable pages are insufficient. After this fix, it only takes a
> few tens of seconds to start a 32GB virtual machine with device
> passthrough functionality.
>
> Signed-off-by: yangge
> ---
>
> V3:
> - fix build error
>
> V2:
> - consider unevictable folios
>
>  mm/compaction.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..a9f1261 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>  				  int highest_zoneidx,
>  				  unsigned long wmark_target)
>  {
> +	pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
> +	unsigned long sum, nr_pinned;
>  	unsigned long watermark;
> +
> +	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
> +		node_page_state(pgdat, NR_INACTIVE_ANON) +
> +		node_page_state(pgdat, NR_ACTIVE_FILE) +
> +		node_page_state(pgdat, NR_ACTIVE_ANON) +
> +		node_page_state(pgdat, NR_UNEVICTABLE);

What about PAGE_MAPPING_MOVABLE pages that aren't on this list? For
example, zsmalloc backend pages can be a large share of allocated
memory, and they are compactable. You would give up on compaction
prematurely and cause unnecessary allocation failures.

That scenario is way more common than the one you're trying to fix.

I think trying to make this list complete, and maintaining it, is
painstaking and error-prone. And errors are hard to detect: they will
just manifest as spurious failures in higher-order requests that you'd
need to catch with tracing enabled at the right moments.

So I'm not a fan of this approach.

Compaction is already skipped when previous runs were not successful.
See defer_compaction() and compaction_deferred(). Why is this not
helping here?

> +	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
> +		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);

Likewise, as Barry notes, not all pinned pages are necessarily LRU
pages. remap_vmalloc_range() pages come to mind. You can't do subset
math on potentially disjoint sets.

> +	/*
> +	 * Gup-pinned pages are non-migratable. After subtracting these pages,
> +	 * we need to check if the remaining pages are sufficient for memory
> +	 * compaction.
> +	 */
> +	if ((sum - nr_pinned) < (1 << order))
> +		return false;
> +
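
To make the disjoint-set problem in the quoted check concrete, here is a
minimal userspace sketch, not kernel code. It assumes only what the hunk
shows: sum and nr_pinned are unsigned long and are compared against
1 << order. The helper names quoted_check_skips() and
saturating_check_skips() are hypothetical and exist only for illustration.
If pinned pages that are not on the LRU push nr_pinned above sum, the
subtraction wraps around and the check never triggers.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the quoted hunk: same arithmetic, plain
 * unsigned long inputs instead of node_page_state() counters.
 * Returns true when the check would skip compaction.
 */
static bool quoted_check_skips(unsigned long sum, unsigned long nr_pinned,
			       int order)
{
	/* Wraps to a huge value when nr_pinned > sum. */
	return (sum - nr_pinned) < (1UL << order);
}

/*
 * Same check with the difference clamped at zero. This only removes the
 * wraparound; it is an assumption for illustration, not a proposed fix.
 */
static bool saturating_check_skips(unsigned long sum, unsigned long nr_pinned,
				   int order)
{
	unsigned long movable = sum > nr_pinned ? sum - nr_pinned : 0;

	return movable < (1UL << order);
}
```

With sum = 100, nr_pinned = 200 and order = 9, quoted_check_skips()
returns false because the difference wraps, while
saturating_check_skips() returns true. Clamping avoids the wraparound
but the estimate of migratable pages is still wrong whenever the pinned
pages are not a subset of the summed LRU counters.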