Date: Wed, 1 Nov 2023 14:45:50 +0100
From: Michal Hocko
To: Gregory Price
Cc: Johannes Weiner, Gregory Price, linux-kernel@vger.kernel.org,
    linux-cxl@vger.kernel.org, linux-mm@kvack.org, ying.huang@intel.com,
    akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
    weixugc@google.com, apopple@nvidia.com, tim.c.chen@intel.com,
    dave.hansen@intel.com, shy828301@gmail.com,
    gregkh@linuxfoundation.org, rafael@kernel.org
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
References: <20231031003810.4532-1-gregory.price@memverge.com>
    <20231031152142.GA3029315@cmpxchg.org>

On Tue 31-10-23 00:27:04, Gregory Price wrote:
> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>
> > > This hopefully also explains why it's a global setting. The usecase is
> > > different from conventional NUMA interleaving, which is used as a
> > > locality measure: spread shared data evenly between compute
> > > nodes. This one isn't about locality - the CXL tier doesn't have local
> > > compute. Instead, the optimal spread is based on hardware parameters,
> > > which is a global property rather than a per-workload one.
> >
> > Well, I am not convinced about that TBH. Sure it is probably a good fit
> > for this specific CXL usecase but it just doesn't fit into many others I
> > can think of - e.g. proportional use of those tiers based on the
> > workload - you get what you pay for.
> >
> > Is there any specific reason for not having a new interleave interface
> > which defines weights for the nodemask? Is this because the policy
> > itself is very dynamic or is this more driven by simplicity of use?
> >
>
> I had originally implemented it this way while experimenting with new
> mempolicies.
>
> https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@memverge.com/
>
> The downside of doing it in mempolicy is...
> 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
>    non-trivial task. It is very "current-task" centric.

True.
Cpusets is the way to make it less process centric but that comes
with its own constraints (namely which NUMA policies are supported).

> 2) Barring a change to mempolicy to be sysfs friendly, the options for
>    implementing weights in the mempolicy are either a) new flag and
>    setting every weight individually in many syscalls, or b) a new
>    syscall (set_mempolicy2), which is what I demonstrated in the RFC.

Yes, that would likely require a new syscall.

> 3) mempolicy is also subject to cgroup nodemasks, and as a result you
>    end up with a rats nest of interactions between mempolicy nodemasks
>    changing as a result of cgroup migrations, nodes potentially coming
>    and going (hotplug under CXL), and others I'm probably forgetting.

Is this really any different from what you are proposing though?

> Basically: If a node leaves the nodemask, should you retain the
> weight, or should you reset it? If a new node comes into the node
> mask... what weight should you set? I did not have answers to these
> questions.

I am not really sure I follow you. Are you talking about cpuset nodemask
changes or memory hotplug here?

> It was recommended to explore placing it in tiers instead, so I took a
> crack at it here:
>
> https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/
>
> This had similar issue with the idea of hotplug nodes: if you give a
> tier a weight, and one or more of the nodes goes away/comes back... what
> should you do with the weight? Split it up among the remaining nodes?
> Rebalance? Etc.

How is this any different from a node becoming depleted? You cannot
really expect that you get the memory you are asking for and you can
easily end up getting memory from a different node instead.

> The result of this discussion lead us to simply say "What if we place
> the weights directly in the node". And that lead us to this RFC.

Maybe I am missing something really crucial here but I do not see how
this fundamentally changes anything.
Memory hotremove (or mere node memory depletion) is not really a thing
because interleaving is a best effort operation so you have to live with
memory not being strictly distributed per your preferences. Memory
hotadd will be easier to manage because you just update a single place
after a node is hotadded rather than gazillions of partial policies. But
that requires that the interleave policy nodemask anticipates future
nodes coming online and includes them in the mask.

> I am not against implementing it in mempolicy (as proof: my first RFC).
> I am simply searching for the acceptable way to implement it.
>
> One of the benefits of having it set as a global setting is that weights
> can be automatically generated from HMAT/HMEM information (ACPI tables)
> and programs already using MPOL_INTERLEAVE will have a direct benefit.

Right. This is understood. My main concern is whether this outweighs
the limitations of having a _global_ policy _only_. Historically a
single global policy usually led to finding ways to make it more
scoped (usually through cgroups).

> I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added
> along side this patch so that MPOL_INTERLEAVE is left entirely alone.
>
> Happy to discuss more,
> ~Gregory

-- 
Michal Hocko
SUSE Labs
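
For readers following the thread, the best-effort and hotadd points made
in the reply can be illustrated with a toy userspace model. This is only
a sketch and not kernel code: the function weighted_interleave(), the
global weights table and the free_pages map are all hypothetical names
invented for this illustration, and the fallback rule (take the first
node in the mask that still has memory) is an assumption, not what the
page allocator actually does.

```python
from collections import Counter


def weighted_interleave(nodemask, weights, free_pages, n_pages):
    """Toy model: spread n_pages over nodemask in proportion to weights.

    nodemask   -- set of node ids the (hypothetical) policy allows
    weights    -- global per-node weight table; missing nodes count as 1
    free_pages -- pages left per node; MUTATED as pages are placed
    Returns a Counter mapping node id -> number of pages placed there.
    """
    placed = Counter()
    nodes = sorted(nodemask)
    if not nodes:
        return placed

    idx = 0
    credit = max(1, weights.get(nodes[idx], 1))
    for _ in range(n_pages):
        # Move on to the next node once the current one used its weight.
        if credit <= 0:
            idx = (idx + 1) % len(nodes)
            credit = max(1, weights.get(nodes[idx], 1))
        target = nodes[idx]
        credit -= 1

        if free_pages.get(target, 0) <= 0:
            # Best effort: the preferred node is depleted (or gone), so
            # the page simply comes from another node in the mask.
            fallback = [n for n in nodes if free_pages.get(n, 0) > 0]
            if not fallback:
                break  # nothing left anywhere in the mask
            target = fallback[0]

        free_pages[target] -= 1
        placed[target] += 1
    return placed


# Global weight table, updated in one place when node 2 is hot-added.
weights = {0: 3, 1: 1, 2: 1}

# Plenty of memory: the 3:1 ratio is honoured.
print(weighted_interleave({0, 1}, weights, {0: 100, 1: 100}, 8))

# Node 0 nearly depleted: the ratio is not honoured, pages spill to node 1.
print(weighted_interleave({0, 1}, weights, {0: 2, 1: 100}, 8))

# The new node only takes part if the policy nodemask already includes it.
print(weighted_interleave({0, 1, 2}, weights, {0: 100, 1: 100, 2: 100}, 10))
```

In this model the first call places 6 pages on node 0 and 2 on node 1,
the second ends up with 2 and 6 because node 0 runs dry, and the third
uses node 2 only because the mask names it, which is the requirement
raised above about the nodemask having to account for future nodes.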