From: SeongJae Park <sj@kernel.org>
To: Gregory Price <gourry@gourry.net>
Cc: SeongJae Park <sj@kernel.org>, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, akpm@linux-foundation.org, mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, donettom@linux.ibm.com
Subject: Re: [RFC PATCH v4 5/6] migrate,sysfs: add pagecache promotion
Date: Mon, 14 Apr 2025 17:41:05 -0700
Message-Id: <20250415004105.121462-1-sj@kernel.org>
In-Reply-To: <20250411221111.493193-6-gourry@gourry.net>

On Fri, 11 Apr 2025 18:11:10 -0400 Gregory Price <gourry@gourry.net> wrote:

> adds /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> When page cache lands on lower tiers, there is no way for promotion
> to occur unless it becomes memory-mapped and exposed to NUMA hint
> faults.  Just adding a mechanism to promote pages unconditionally,
> however, opens up significant possibility of performance regressions.
>
> Similar to the `demotion_enabled` sysfs entry, provide a sysfs toggle
> to enable and disable page cache promotion.  This option will enable
> opportunistic promotion of unmapped page cache during syscall access.
>
> This option is intended for operational conditions where demoted page
> cache will eventually contain memory which becomes hot - and where
> said memory is likely to cause performance issues due to being trapped
> on the lower tier of memory.
>
> A page cache folio is considered a promotion candidate when:
> 0) tiering and pagecache-promotion are enabled

"Tiering" here means NUMA_BALANCING_MEMORY_TIERING, right?  Why do you
make this feature depend on it?  If there is a good reason for the
dependency, what do you think about 1. making pagecache_promotion_enabled
automatically enable NUMA_BALANCING_MEMORY_TIERING, or 2. adding another
flag for NUMA balancing (e.g., echo 4 > /proc/sys/kernel/numa_balancing)
that enables this feature and mapped pages promotion together?
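For example, option 2 might look roughly like the sketch below.
NUMA_BALANCING_PAGECACHE and numa_pagecache_promotion_allowed() are names
made up here purely for illustration; the three existing mode bits and
sysctl_numa_balancing_mode are in include/linux/sched/sysctl.h.

#define NUMA_BALANCING_DISABLED		0x0
#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2
#define NUMA_BALANCING_PAGECACHE	0x4	/* hypothetical new mode bit */

/*
 * Hypothetical gate for the unmapped page cache promotion path:
 * "echo 4 > /proc/sys/kernel/numa_balancing" would then enable tiered
 * promotion of unmapped page cache together with mapped pages, without
 * a separate sysfs toggle.
 */
static inline bool numa_pagecache_promotion_allowed(void)
{
	return sysctl_numa_balancing_mode & NUMA_BALANCING_PAGECACHE;
}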
> 1) the folio resides on a node not in the top tier
> 2) the folio is already marked referenced and active.
> 3) Multiple accesses in (referenced & active) state occur quickly.

I don't clearly understand what 3) means, particularly the criteria of
"quick", and how the speed is measured.  Could you please clarify?

>
> Since promotion is not safe to execute unconditionally from within
> folio_mark_accessed, we defer promotion to a new task_work captured
> in the task_struct.  This ensures that the task doing the access has
> some hand in promoting pages - even among deduplicated read-only files.
>
> We limit the total number of folios on the promotion list to the
> promotion rate limit, to bound the amount of inline work done during
> large reads - avoiding significant overhead.  We do not use the
> existing rate-limit check function, as that is checked during the
> migration anyway.
>
> The promotion node is always the local node of the promoting cpu.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  .../ABI/testing/sysfs-kernel-mm-numa | 20 +++++++
>  include/linux/memory-tiers.h         |  2 +
>  include/linux/migrate.h              |  5 ++
>  include/linux/sched.h                |  4 ++
>  include/linux/sched/sysctl.h         |  1 +
>  init/init_task.c                     |  2 +
>  kernel/sched/fair.c                  | 24 +++++++-
>  mm/memory-tiers.c                    | 27 +++++++++
>  mm/migrate.c                         | 55 +++++++++++++++++++
>  mm/swap.c                            |  8 +++
>  10 files changed, 147 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> index 77e559d4ed80..ebb041891db2 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> @@ -22,3 +22,23 @@ Description: Enable/disable demoting pages during reclaim
>  		the guarantees of cpusets.  This should not be enabled
>  		on systems which need strict cpuset location
>  		guarantees.
> +
> +What:		/sys/kernel/mm/numa/pagecache_promotion_enabled

This is not for any page cache page but unmapped page cache pages,
right?  I think making the name more explicit about that could avoid
confusion.

> +Date:		January 2025

Captain, it's April ;)

> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Enable/disable promoting pages during file access
> +
> +		Page migration during file access is intended for systems
> +		with tiered memory configurations that have significant
> +		unmapped file cache usage.  By default, file cache memory
> +		on slower tiers will not be opportunistically promoted by
> +		normal NUMA hint faults, because the system has no way to
> +		track them.  This option enables opportunistic promotion
> +		of pages that are accessed via syscall (e.g. read/write)
> +		if multiple accesses occur in quick succession.

I again think it would be nice to clarify how quick it should be.

> +
> +		It may move data to a NUMA node that does not fall into
> +		the cpuset of the allocating process which might be
> +		construed to violate the guarantees of cpusets.  This
> +		should not be enabled on systems which need strict cpuset
> +		location guarantees.

[...]

> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c

[...]

> @@ -957,11 +958,37 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
>  	return count;
>  }
>
> +static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
> +						struct kobj_attribute *attr,
> +						char *buf)
> +{
> +	return sysfs_emit(buf, "%s\n",
> +			  numa_pagecache_promotion_enabled ? "true" : "false");
> +}

How about using str_true_false(), like demotion_enabled_show() does?
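That is, something like this minimal sketch (str_true_false() is declared
in include/linux/string_choices.h and returns the strings "true"/"false"):

static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
						struct kobj_attribute *attr,
						char *buf)
{
	/* same output as the open-coded ternary above */
	return sysfs_emit(buf, "%s\n",
			  str_true_false(numa_pagecache_promotion_enabled));
}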
[...]

> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -44,6 +44,8 @@
>  #include
>  #include
>  #include
> +#include
> +#include
>
>  #include
>
> @@ -2762,5 +2764,58 @@ int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
>  	BUG_ON(!list_empty(folio_list));
>  	return nr_remaining ? -EAGAIN : 0;
>  }
> +
> +/**
> + * promotion_candidate: report a promotion candidate folio
> + *
> + * The folio will be isolated from LRU if selected, and task_work will
> + * putback the folio on promotion failure.
> + *
> + * Candidates may not be promoted and may be returned to the LRU.

Is this for situations that are different from the cases the above
sentence explains?  If so, could you clarify that?

> + *
> + * Takes a folio reference that will be released in task work.
> + */
> +void promotion_candidate(struct folio *folio)
> +{
> +	struct task_struct *task = current;
> +	struct list_head *promo_list = &task->promo_list;
> +	struct callback_head *work = &task->numa_promo_work;
> +	int nid = folio_nid(folio);
> +	int flags, last_cpupid;
> +
> +	/* do not migrate toptier folios or in kernel context */
> +	if (node_is_toptier(nid) || task->flags & PF_KTHREAD)
> +		return;
> +
> +	/*
> +	 * Limit per-syscall migration rate to balancing rate limit. This avoids

Isn't this per-task work rather than per-syscall?

> +	 * excessive work during large reads knowing that task work is likely to
> +	 * hit the rate limit and put excess folios back on the LRU anyway.
> +	 */
> +	if (task->promo_count >= sysctl_numa_balancing_promote_rate_limit)
> +		return;
> +
> +	/* Isolate the folio to prepare for migration */
> +	nid = numa_migrate_check(folio, NULL, 0, &flags, folio_test_dirty(folio),
> +				 &last_cpupid);
> +	if (nid == NUMA_NO_NODE)
> +		return;
> +
> +	if (migrate_misplaced_folio_prepare(folio, NULL, nid))
> +		return;
> +
> +	/*
> +	 * If work is pending, add this folio to the list.  Otherwise, ensure
> +	 * the task will execute the work, or we can leak folios.
> +	 */
> +	if (list_empty(promo_list) && task_work_add(task, work, TWA_RESUME)) {
> +		folio_putback_lru(folio);
> +		return;
> +	}
> +	list_add_tail(&folio->lru, promo_list);
> +	task->promo_count += folio_nr_pages(folio);
> +	return;
> +}
> +EXPORT_SYMBOL(promotion_candidate);

Why export this symbol?

Thanks,
SJ

[...]
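P.S.  The deferred half of this pattern is not shown in the quoted hunks.
Reconstructed from the commit message, it would presumably look something
like the sketch below; numa_promotion_work() is a hypothetical name here,
not taken from the patch.

static void numa_promotion_work(struct callback_head *work)
{
	struct task_struct *task = current;
	LIST_HEAD(folio_list);

	/* runs on return to userspace via task_work_add(..., TWA_RESUME) */
	list_splice_init(&task->promo_list, &folio_list);
	task->promo_count = 0;

	/*
	 * Promote to the local node of the promoting CPU; folios that
	 * fail to migrate are put back on the LRU by the batch helper.
	 */
	migrate_misplaced_folio_batch(&folio_list, numa_node_id());
}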