Date: Mon, 10 Mar 2025 23:36:10 +0800
From: Chen Yu
To: Tim Chen
Cc: Chen Yu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Andrew Morton, Rik van Riel, Mel Gorman, Johannes Weiner, Michal Hocko,
    Roman Gushchin, Shakeel Butt, Muchun Song, "Liam R. Howlett",
    Lorenzo Stoakes, "Huang, Ying", Tim Chen, Aubrey Li, Michael Wang,
    Kaiyang Zhao, David Rientjes, Raghavendra K T, cgroups@vger.kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 2/3] sched/numa: Introduce per cgroup numa balance control
In-Reply-To: <0d1cc457c6a97178fc68880957757f3c27088f53.camel@linux.intel.com>
References: <0d1cc457c6a97178fc68880957757f3c27088f53.camel@linux.intel.com>

On 2025-03-07 at 14:54:10 -0800, Tim Chen wrote:
> On Tue, 2025-02-25 at 22:00 +0800, Chen Yu wrote:
> > [Problem Statement]
> > Currently, NUMA balancing is configured system-wide. However,
> >
> >
> > A simple example to show how to use per-cgroup Numa balancing:
> >
> > Step1
> > //switch to global per cgroup Numa balancing,
> > //All cgroup's Numa balance is disabled by default.
> > echo 4 > /proc/sys/kernel/numa_balancing
> >
>
> Can you add documentation of this additional feature
> for numa_balancing in
> admin-guide/sysctl/kernel.rst
>

OK, will refine in next version.

> Should you make NUMA_BALANCING_NORMAL and NUMA_BALANCING_CGROUP
> mutually exclusive in? In other words
> echo 5 > /proc/sys/kernel/numa_balancing should result in numa_balancing to be 1?
>
> Otherwise tg_numa_balance_enabled() can return 0 with NUMA_BALANCING_CGROUP
> bit turned on even though you have NUMA_BALANCING_NORMAL bit on.
>

I see, will fix tg_numa_balance_enabled() in next version, thanks!
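[Editorial sketch, for reference only: one possible shape of such a fix, not
the code planned for the next version, is to let a global
NUMA_BALANCING_NORMAL setting take precedence and consult the per-cgroup flag
only when cgroup-based control governs the task. It reuses only symbols
already introduced by this patch.]

/*
 * Illustrative sketch (not the posted patch): honor the global
 * NUMA_BALANCING_NORMAL bit first, and only gate on the per-cgroup
 * nlb_enabled flag when cgroup-based control is in effect.
 */
static bool tg_numa_balance_enabled(struct task_struct *p)
{
	struct task_group *tg = task_group(p);

	/* Global "normal" mode stays authoritative for every task. */
	if (sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL)
		return true;

	/* Cgroup mode: defer to the task group's own switch. */
	if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP))
		return tg->nlb_enabled;

	return true;
}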

Best,
Chenyu

> Tim
>
> > Suggested-by: Tim Chen
> > Signed-off-by: Chen Yu
> > ---
> >  include/linux/sched/sysctl.h |  1 +
> >  kernel/sched/core.c          | 32 ++++++++++++++++++++++++++++++++
> >  kernel/sched/fair.c          | 18 ++++++++++++++++++
> >  kernel/sched/sched.h         |  3 +++
> >  mm/mprotect.c                |  5 +++--
> >  5 files changed, 57 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > index 5a64582b086b..1e4d5a9ddb26 100644
> > --- a/include/linux/sched/sysctl.h
> > +++ b/include/linux/sched/sysctl.h
> > @@ -22,6 +22,7 @@ enum sched_tunable_scaling {
> >  #define NUMA_BALANCING_DISABLED		0x0
> >  #define NUMA_BALANCING_NORMAL		0x1
> >  #define NUMA_BALANCING_MEMORY_TIERING	0x2
> > +#define NUMA_BALANCING_CGROUP		0x4
> >  
> >  #ifdef CONFIG_NUMA_BALANCING
> >  extern int sysctl_numa_balancing_mode;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 44efc725054a..f4f048b3da68 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -10023,6 +10023,31 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
> >  }
> >  #endif
> >  
> > +#ifdef CONFIG_NUMA_BALANCING
> > +static DEFINE_MUTEX(numa_balance_mutex);
> > +static int numa_balance_write_u64(struct cgroup_subsys_state *css,
> > +				  struct cftype *cftype, u64 enable)
> > +{
> > +	struct task_group *tg;
> > +	int ret;
> > +
> > +	guard(mutex)(&numa_balance_mutex);
> > +	tg = css_tg(css);
> > +	if (tg->nlb_enabled == enable)
> > +		return 0;
> > +
> > +	tg->nlb_enabled = enable;
> > +
> > +	return ret;
> > +}
> > +
> > +static u64 numa_balance_read_u64(struct cgroup_subsys_state *css,
> > +				 struct cftype *cft)
> > +{
> > +	return css_tg(css)->nlb_enabled;
> > +}
> > +#endif /* CONFIG_NUMA_BALANCING */
> > +
> >  static struct cftype cpu_files[] = {
> >  #ifdef CONFIG_GROUP_SCHED_WEIGHT
> >  	{
> > @@ -10071,6 +10096,13 @@ static struct cftype cpu_files[] = {
> >  		.seq_show = cpu_uclamp_max_show,
> >  		.write = cpu_uclamp_max_write,
> >  	},
> > +#endif
> > +#ifdef CONFIG_NUMA_BALANCING
> > +	{
> > +		.name = "numa_load_balance",
> > +		.read_u64 = numa_balance_read_u64,
> > +		.write_u64 = numa_balance_write_u64,
> > +	},
> >  #endif
> >  	{ }	/* terminate */
> >  };
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1c0ef435a7aa..526cb33b007c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3146,6 +3146,18 @@ void task_numa_free(struct task_struct *p, bool final)
> >  	}
> >  }
> >  
> > +/* return true if the task group has enabled the numa balance */
> > +static bool tg_numa_balance_enabled(struct task_struct *p)
> > +{
> > +	struct task_group *tg = task_group(p);
> > +
> > +	if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP) &&
> > +	    !tg->nlb_enabled)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  /*
> >   * Got a PROT_NONE fault for a page on @node.
> >   */
> > @@ -3174,6 +3186,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
> >  			     !cpupid_valid(last_cpupid)))
> >  		return;
> >  
> > +	if (!tg_numa_balance_enabled(p))
> > +		return;
> > +
> >  	/* Allocate buffer to track faults on a per-node basis */
> >  	if (unlikely(!p->numa_faults)) {
> >  		int size = sizeof(*p->numa_faults) *
> > @@ -3596,6 +3611,9 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
> >  	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
> >  		return;
> >  
> > +	if (!tg_numa_balance_enabled(curr))
> > +		return;
> > +
> >  	/*
> >  	 * Using runtime rather than walltime has the dual advantage that
> >  	 * we (mostly) drive the selection from busy threads and that the
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 38e0e323dda2..9f478fb2c03a 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -491,6 +491,9 @@ struct task_group {
> >  	/* Effective clamp values used for a task group */
> >  	struct uclamp_se uclamp[UCLAMP_CNT];
> >  #endif
> > +#ifdef CONFIG_NUMA_BALANCING
> > +	u64 nlb_enabled;
> > +#endif
> >  
> >  };
> >  
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 516b1d847e2c..ddaaf20ef94c 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -155,10 +155,11 @@ static long change_pte_range(struct mmu_gather *tlb,
> >  		toptier = node_is_toptier(nid);
> >  
> >  		/*
> > -		 * Skip scanning top tier node if normal numa
> > +		 * Skip scanning top tier node if normal/cgroup numa
> >  		 * balancing is disabled
> >  		 */
> > -		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> > +		if (!(sysctl_numa_balancing_mode &
> > +		      (NUMA_BALANCING_CGROUP | NUMA_BALANCING_NORMAL)) &&
> >  		    toptier)
> >  			continue;
> >  		if (folio_use_access_time(folio))
>