From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price
Cc: Michal Hocko, Johannes Weiner, Gregory Price, linux-mm@kvack.org
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
In-Reply-To: (Gregory Price's message of "Wed, 1 Nov 2023 23:18:59 -0400")
References: <20231031003810.4532-1-gregory.price@memverge.com> <20231031152142.GA3029315@cmpxchg.org>
Date: Fri, 03 Nov 2023 15:45:13 +0800
Message-ID: <87fs1nz3ee.fsf@yhuang6-desk2.ccr.corp.intel.com>
Gregory Price writes:

> On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:
>> On Wed 01-11-23 12:58:55, Gregory Price wrote:
>> > Basically consider: `numactl --interleave=all ...`
>> >
>> > If `--weights=...`: when a node hotplug event occurs, there is no
>> > recourse for adding a weight for the new node (it will default to 1).
>>
>> Correct, and this is what I was asking about in an earlier email. How
>> much do we really need to consider this setup? Is this something nice
>> to have, or does the nature of the technology require it to be fully
>> dynamic, expecting new nodes to come up at any moment?
>>
>
> Dynamic Capacity is expected to cause a NUMA node to change size (in
> number of memory blocks) rather than cause NUMA nodes to come and go,
> so maybe handling full node hotplug is a bit of an overreach.

Will the node's maximum bandwidth change with the number of memory
blocks?

> Good call, I'll stop considering this problem for now.
>
>> > If the node is removed from the system, I believe (need to validate
>> > this, but IIRC) the node will be removed from any registered cpusets.
>> > As a result, that falls down to mempolicy, and the node is removed.
>>
>> I do not think we do anything like that. Userspace might decide to
>> change the numa mask when a node is offlined, but I do not think we do
>> anything like that automagically.
>
> mpol_rebind_policy, called by update_tasks_nodemask:
> https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016
>
> falls down from cpuset_hotplug_workfn:
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771
>
> /*
>  * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
>  * Call this routine anytime after node_states[N_MEMORY] changes.
>  * See cpuset_update_active_cpus() for CPU hotplug handling.
>  */
> static int cpuset_track_online_nodes(struct notifier_block *self,
>                                      unsigned long action, void *arg)
> {
>         schedule_work(&cpuset_hotplug_work);
>         return NOTIFY_OK;
> }
>
> void __init cpuset_init_smp(void)
> {
>         ...
>         hotplug_memory_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
> }
>
> This causes one of three situations:
> MPOL_F_STATIC_NODES: overwrite with (old & new)
> MPOL_F_RELATIVE_NODES: overwrite with a "relative" nodemask (fold+onto?)
> Default: either does a remap or replaces old with new.
>
> My assumption based on this is that a hot-unplugged node would be
> removed completely. It doesn't look like hot-add is handled at all, so
> I can just drop that entirely for now (except adding a default weight
> of 1 in case it is ever added in the future).
>
> I've been pushing against the weights being in memory-tiers.c for this
> reason, as a weight set per-tier is meaningless if a node disappears.
>
> Example: a tier has 2 nodes with some weight N split between them, such
> that interleave gives each node N/2 pages. If 1 node is removed, the
> remaining node gets N pages, which is twice the allocation. Presumably
> a node is an abstraction of 1 or more devices, therefore if the node is
> removed, the weight should change.

The per-tier weight can be defined as the interleave weight of each node
of the tier. A tier just groups NUMA nodes with similar performance; the
performance (including bandwidth) is still per-node in the context of a
tier.
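Gregory's N/2 example can be made concrete with a small user-space
sketch (plain Python for illustration, nothing kernel-specific; the node
IDs, weights, and page counts are made up):

```python
# Hypothetical sketch of weighted interleave: each node receives
# weights[node] consecutive pages per round-robin cycle.
from itertools import cycle

def interleave(weights, npages):
    """Distribute npages across nodes according to per-node weights.

    weights: dict mapping node id -> interleave weight
    Returns a dict mapping node id -> number of pages placed there.
    """
    counts = {node: 0 for node in weights}
    # One full cycle emits each node id `weight` times.
    order = cycle(n for n, w in weights.items() for _ in range(w))
    for _ in range(npages):
        counts[next(order)] += 1
    return counts

# Tier weight N=4 split across nodes 0 and 1: each gets N/2 of the pages.
print(interleave({0: 2, 1: 2}, 400))   # {0: 200, 1: 200}

# Node 1 hot-removed with the tier weight left unchanged: node 0 now
# absorbs the full N pages per cycle, twice its previous share.
print(interleave({0: 4}, 400))         # {0: 400}
```

This is the hazard being described: with a per-tier weight, removing a
node silently doubles the remaining node's share, whereas a per-node
weight would simply drop out with the node.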
If a tier has multiple nodes, defining the weight per-tier this way
makes the weight definition easier.

> You could handle hotplug in tiers, but if a node being hotplugged
> forcibly removes the node from cpusets and mempolicy nodemasks, then
> it's irrelevant since the node can never get selected for allocation
> anyway.
>
> It's looking more like cgroups is the right place to put this.

Having a cgroup/task level interface doesn't prevent us from having a
system level interface that provides defaults for cgroups/tasks, where
performance information (e.g., from HMAT) can help define a reasonable
default automatically.

>> Moving the global policy to cgroups would make the main concern of
>> different workloads looking for different policies less problematic.
>> I didn't have much time to think that through, but the main question
>> is how to sanely define hierarchical properties of those weights.
>> This is more of a resource distribution than enforcement, so maybe a
>> simple inherit or overwrite (if you have more specific needs)
>> semantic makes sense and is sufficient.
>>
>
> As a user I would assume it would operate much the same way as other
> nested cgroups, which is inherit by default (with subsets) or an
> explicit overwrite that can't exceed the higher level settings.
>
> Weights could arguably allow different settings than capacity controls,
> but that could be an extension.
>
>> This is not as much about the code as it is about the proper
>> interface, because that will get cast in stone once introduced. It
>> would be really bad to realize that we have a global policy that
>> doesn't fit well and have a hard time working around it without
>> breaking anybody.
>
> o7 I concur now. I'll take some time to rework this into a
> cgroups+mempolicy proposal based on my earlier RFCs.

--
Best Regards,
Huang, Ying