From: SeongJae Park <sj@kernel.org>
To: Gregory Price <gourry@gourry.net>
Cc: SeongJae Park <sj@kernel.org>, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, akpm@linux-foundation.org, mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, donettom@linux.ibm.com
Subject: Re: [RFC PATCH v4 5/6] migrate,sysfs: add pagecache promotion
Date: Mon, 14 Apr 2025 17:41:05 -0700
Message-Id: <20250415004105.121462-1-sj@kernel.org>
In-Reply-To: <20250411221111.493193-6-gourry@gourry.net>

On Fri, 11 Apr 2025 18:11:10 -0400 Gregory Price <gourry@gourry.net> wrote:

> adds /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> When page cache lands on lower tiers, there is no way for promotion
> to occur unless it becomes memory-mapped and exposed to NUMA hint
> faults.  Just adding a mechanism to promote pages unconditionally,
> however, opens up significant possibility of performance regressions.
>
> Similar to the `demotion_enabled` sysfs entry, provide a sysfs toggle
> to enable and disable page cache promotion.  This option will enable
> opportunistic promotion of unmapped page cache during syscall access.
>
> This option is intended for operational conditions where demoted page
> cache will eventually contain memory which becomes hot - and where
> said memory is likely to cause performance issues due to being trapped
> on the lower tier of memory.
>
> A page cache folio is considered a promotion candidate when:
> 0) tiering and pagecache-promotion are enabled

"Tiering" here means NUMA_BALANCING_MEMORY_TIERING, right?  Why do you
make this feature depend on it?  If there is a good reason for the
dependency, what do you think about 1. making pagecache_promotion_enabled
automatically enable NUMA_BALANCING_MEMORY_TIERING, or 2. adding another
flag for NUMA balancing (e.g., echo 4 > /proc/sys/kernel/numa_balancing)
that enables this feature and mapped pages promotion together?
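For example, option 2 might look roughly like the sketch below.
NUMA_BALANCING_PAGECACHE and numa_pagecache_promotion_allowed() are names
made up here purely for illustration; the three existing mode bits and
sysctl_numa_balancing_mode are in include/linux/sched/sysctl.h.

#define NUMA_BALANCING_DISABLED		0x0
#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2
#define NUMA_BALANCING_PAGECACHE	0x4	/* hypothetical new mode bit */

/*
 * Hypothetical gate for the unmapped page cache promotion path:
 * "echo 4 > /proc/sys/kernel/numa_balancing" would then enable tiered
 * promotion of unmapped page cache together with mapped pages, without
 * a separate sysfs toggle.
 */
static inline bool numa_pagecache_promotion_allowed(void)
{
	return sysctl_numa_balancing_mode & NUMA_BALANCING_PAGECACHE;
}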
> 1) the folio resides on a node not in the top tier
> 2) the folio is already marked referenced and active.
> 3) Multiple accesses in (referenced & active) state occur quickly.

I don't clearly understand what 3) means, particularly the criteria of
"quick", and how the speed is measured.  Could you please clarify?

>
> Since promotion is not safe to execute unconditionally from within
> folio_mark_accessed, we defer promotion to a new task_work captured
> in the task_struct.  This ensures that the task doing the access has
> some hand in promoting pages - even among deduplicated read-only files.
>
> We limit the total number of folios on the promotion list to the
> promotion rate limit, to bound the amount of inline work done during
> large reads - avoiding significant overhead.  We do not use the
> existing rate-limit check function, as that is checked during the
> migration anyway.
>
> The promotion node is always the local node of the promoting cpu.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  .../ABI/testing/sysfs-kernel-mm-numa | 20 +++++++
>  include/linux/memory-tiers.h         |  2 +
>  include/linux/migrate.h              |  5 ++
>  include/linux/sched.h                |  4 ++
>  include/linux/sched/sysctl.h         |  1 +
>  init/init_task.c                     |  2 +
>  kernel/sched/fair.c                  | 24 +++++++-
>  mm/memory-tiers.c                    | 27 +++++++++
>  mm/migrate.c                         | 55 +++++++++++++++++++
>  mm/swap.c                            |  8 +++
>  10 files changed, 147 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> index 77e559d4ed80..ebb041891db2 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> @@ -22,3 +22,23 @@ Description: Enable/disable demoting pages during reclaim
>  		the guarantees of cpusets.  This should not be enabled
>  		on systems which need strict cpuset location
>  		guarantees.
> +
> +What:		/sys/kernel/mm/numa/pagecache_promotion_enabled

This is not for any page cache page but unmapped page cache pages,
right?  I think making the name more explicit about that could avoid
confusion.

> +Date:		January 2025

Captain, it's April ;)

> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Enable/disable promoting pages during file access
> +
> +		Page migration during file access is intended for systems
> +		with tiered memory configurations that have significant
> +		unmapped file cache usage.  By default, file cache memory
> +		on slower tiers will not be opportunistically promoted by
> +		normal NUMA hint faults, because the system has no way to
> +		track them.  This option enables opportunistic promotion
> +		of pages that are accessed via syscall (e.g. read/write)
> +		if multiple accesses occur in quick succession.

I again think it would be nice to clarify how quick it should be.

> +
> +		It may move data to a NUMA node that does not fall into
> +		the cpuset of the allocating process which might be
> +		construed to violate the guarantees of cpusets.  This
> +		should not be enabled on systems which need strict cpuset
> +		location guarantees.

[...]

> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c

[...]

> @@ -957,11 +958,37 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
>  	return count;
>  }
>
> +static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
> +						struct kobj_attribute *attr,
> +						char *buf)
> +{
> +	return sysfs_emit(buf, "%s\n",
> +			  numa_pagecache_promotion_enabled ? "true" : "false");
> +}

How about using str_true_false(), like demotion_enabled_show() does?
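That is, something like this minimal sketch (str_true_false() is declared
in include/linux/string_choices.h and returns the strings "true"/"false"):

static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
						struct kobj_attribute *attr,
						char *buf)
{
	/* same output as the open-coded ternary above */
	return sysfs_emit(buf, "%s\n",
			  str_true_false(numa_pagecache_promotion_enabled));
}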
[...]

> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -44,6 +44,8 @@
>  #include
>  #include
>  #include
> +#include
> +#include
>
>  #include
>
> @@ -2762,5 +2764,58 @@ int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
>  	BUG_ON(!list_empty(folio_list));
>  	return nr_remaining ? -EAGAIN : 0;
>  }
> +
> +/**
> + * promotion_candidate: report a promotion candidate folio
> + *
> + * The folio will be isolated from LRU if selected, and task_work will
> + * putback the folio on promotion failure.
> + *
> + * Candidates may not be promoted and may be returned to the LRU.

Is this for situations that are different from the cases the above
sentence explains?  If so, could you clarify that?

> + *
> + * Takes a folio reference that will be released in task work.
> + */
> +void promotion_candidate(struct folio *folio)
> +{
> +	struct task_struct *task = current;
> +	struct list_head *promo_list = &task->promo_list;
> +	struct callback_head *work = &task->numa_promo_work;
> +	int nid = folio_nid(folio);
> +	int flags, last_cpupid;
> +
> +	/* do not migrate toptier folios or in kernel context */
> +	if (node_is_toptier(nid) || task->flags & PF_KTHREAD)
> +		return;
> +
> +	/*
> +	 * Limit per-syscall migration rate to balancing rate limit. This avoids

Isn't this per-task work rather than per-syscall?

> +	 * excessive work during large reads knowing that task work is likely to
> +	 * hit the rate limit and put excess folios back on the LRU anyway.
> +	 */
> +	if (task->promo_count >= sysctl_numa_balancing_promote_rate_limit)
> +		return;
> +
> +	/* Isolate the folio to prepare for migration */
> +	nid = numa_migrate_check(folio, NULL, 0, &flags, folio_test_dirty(folio),
> +				 &last_cpupid);
> +	if (nid == NUMA_NO_NODE)
> +		return;
> +
> +	if (migrate_misplaced_folio_prepare(folio, NULL, nid))
> +		return;
> +
> +	/*
> +	 * If work is pending, add this folio to the list.  Otherwise, ensure
> +	 * the task will execute the work, or we can leak folios.
> +	 */
> +	if (list_empty(promo_list) && task_work_add(task, work, TWA_RESUME)) {
> +		folio_putback_lru(folio);
> +		return;
> +	}
> +	list_add_tail(&folio->lru, promo_list);
> +	task->promo_count += folio_nr_pages(folio);
> +	return;
> +}
> +EXPORT_SYMBOL(promotion_candidate);

Why export this symbol?

Thanks,
SJ

[...]
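P.S.  The deferred half of this pattern is not shown in the quoted hunks.
Reconstructed from the commit message, it would presumably look something
like the sketch below; numa_promotion_work() is a hypothetical name here,
not taken from the patch.

static void numa_promotion_work(struct callback_head *work)
{
	struct task_struct *task = current;
	LIST_HEAD(folio_list);

	/* runs on return to userspace via task_work_add(..., TWA_RESUME) */
	list_splice_init(&task->promo_list, &folio_list);
	task->promo_count = 0;

	/*
	 * Promote to the local node of the promoting CPU; folios that
	 * fail to migrate are put back on the LRU by the batch helper.
	 */
	migrate_misplaced_folio_batch(&folio_list, numa_node_id());
}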