From: "Huang, Ying"
To: Gregory Price
Cc: Michal Hocko, Johannes Weiner, Gregory Price
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
In-Reply-To: (Gregory Price's message of "Tue, 31 Oct 2023 00:27:04 -0400")
References: <20231031003810.4532-1-gregory.price@memverge.com>
 <20231031152142.GA3029315@cmpxchg.org>
Date: Thu, 02 Nov 2023 10:01:20 +0800
Message-ID: <87cyws3ocv.fsf@yhuang6-desk2.ccr.corp.intel.com>

Gregory Price writes:

> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>
>> > This hopefully also explains why it's a global setting. The usecase is
>> > different from conventional NUMA interleaving, which is used as a
>> > locality measure: spread shared data evenly between compute
>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>> > compute. Instead, the optimal spread is based on hardware parameters,
>> > which is a global property rather than a per-workload one.
>>
>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>> for this specific CXL usecase but it just doesn't fit into many others I
>> can think of - e.g. proportional use of those tiers based on the
>> workload - you get what you pay for.
>>
>> Is there any specific reason for not having a new interleave interface
>> which defines weights for the nodemask? Is this because the policy
>> itself is very dynamic or is this more driven by simplicity of use?
>>
>
> I had originally implemented it this way while experimenting with new
> mempolicies.
>
> https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@memverge.com/
>
> The downside of doing it in mempolicy is...
> 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
>    non-trivial task. It is very "current-task" centric.
>
> 2) Barring a change to mempolicy to be sysfs friendly, the options for
>    implementing weights in the mempolicy are either a) new flag and
>    setting every weight individually in many syscalls, or b) a new
>    syscall (set_mempolicy2), which is what I demonstrated in the RFC.
>
> 3) mempolicy is also subject to cgroup nodemasks, and as a result you
>    end up with a rat's nest of interactions between mempolicy nodemasks
>    changing as a result of cgroup migrations, nodes potentially coming
>    and going (hotplug under CXL), and others I'm probably forgetting.
>
>    Basically: If a node leaves the nodemask, should you retain the
>    weight, or should you reset it? If a new node comes into the node
>    mask... what weight should you set? I did not have answers to these
>    questions.
>
>
> It was recommended to explore placing it in tiers instead, so I took a
> crack at it here:
>
> https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/
>
> This had a similar issue with the idea of hotplug nodes: if you give a
> tier a weight, and one or more of the nodes goes away/comes back... what
> should you do with the weight? Split it up among the remaining nodes?
> Rebalance? Etc.

The weight of a tier can be defined as the weight of each individual
node in the tier, instead of the total weight shared by all nodes of
the tier. That is, for a system as follows,

tier 0: node 0, node 1; weight=4
tier 1: node 2, node 3; weight=1

If you run a workload with `numactl --weighted-interleave -n 0,2,3`,
the allocation proportion across nodes 0, 1, 2, 3 will be "4:0:1:1".

While for `numactl --weighted-interleave -n 0,2`, it will be "4:0:1:0".

With this definition, a node that goes away or comes back keeps its
tier's weight, so nothing needs to be split or rebalanced.

--
Best Regards,
Huang, Ying

> The result of this discussion led us to simply say "What if we place
> the weights directly in the node". And that led us to this RFC.
>
>
> I am not against implementing it in mempolicy (as proof: my first RFC).
> I am simply searching for the acceptable way to implement it.
>
> One of the benefits of having it set as a global setting is that weights
> can be automatically generated from HMAT/HMEM information (ACPI tables)
> and programs already using MPOL_INTERLEAVE will have a direct benefit.
>
> I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added
> alongside this patch so that MPOL_INTERLEAVE is left entirely alone.
>
> Happy to discuss more,
> ~Gregory
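
To illustrate the per-node tier weight suggested in the reply above,
here is a minimal sketch in Python. It is not kernel code and not taken
from the patch set: the names TIERS, node_weights() and place_pages()
are hypothetical, and the round-robin placement is only one assumed way
such weights could be consumed. It takes the two-tier layout from the
example, derives per-node weights for a given nodemask, and counts
where pages would land.

```python
from itertools import cycle

# Hypothetical tier layout, copied from the example in the reply:
#   tier 0: node 0, node 1; weight=4
#   tier 1: node 2, node 3; weight=1
TIERS = [
    {"nodes": [0, 1], "weight": 4},
    {"nodes": [2, 3], "weight": 1},
]


def node_weights(tiers, nodemask, nr_nodes):
    """Return one weight per node; nodes outside the nodemask get 0.

    Each node simply inherits the weight of its tier, so a node joining
    or leaving the nodemask never changes any other node's weight.
    """
    weights = [0] * nr_nodes
    for tier in tiers:
        for node in tier["nodes"]:
            if node in nodemask:
                weights[node] = tier["weight"]
    return weights


def place_pages(weights, nr_pages):
    """Count pages per node for a simple weighted round-robin (assumed)."""
    # Each node appears in the rotation as many times as its weight.
    order = [node for node, w in enumerate(weights) for _ in range(w)]
    if not order:
        raise ValueError("nodemask selects no node with a non-zero weight")
    counts = [0] * len(weights)
    for _, node in zip(range(nr_pages), cycle(order)):
        counts[node] += 1
    return counts


if __name__ == "__main__":
    for mask in ({0, 2, 3}, {0, 2}):
        w = node_weights(TIERS, mask, nr_nodes=4)
        print(sorted(mask), ":".join(map(str, w)), place_pages(w, 12))
```

For the nodemask 0,2,3 this yields the weights 4:0:1:1, and for 0,2 it
yields 4:0:1:0, matching the proportions given in the reply. Dropping
node 3 from the mask leaves the weights of nodes 0 and 2 untouched.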