Date: Fri, 9 May 2025 12:31:31 +0100
From: Jonathan Cameron
To: Gregory Price
CC: Rakie Kim, "Keith Busch", Jerome Glisse
Subject: Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
Message-ID: <20250509123131.0000051b@huawei.com>
References: <20250508063042.210-1-rakie.kim@sk.com>

On Thu, 8 May 2025 11:12:35 -0400
Gregory Price wrote:

> On Thu, May 08, 2025 at 03:30:36PM +0900, Rakie Kim wrote:
> > On Wed, 7 May 2025 12:38:18 -0400 Gregory Price wrote:
> >
> > The proposed design is completely optional and isolated: it retains the
> > existing flat weight model as-is and activates the source-aware behavior only
> > when 'multi' mode is enabled.
> > The complexity is scoped entirely to users who
> > opt into this mode.
> >
>
> I get what you're going for, just expressing my experience around this
> issue specifically.
>
> The lack of enthusiasm for solving the cross-socket case, and thus
> reduction from a 2D array to a 1D array, was because reasoning about
> interleave w/ cross-socket interconnects is not really feasible with
> the NUMA abstraction.  Cross-socket interconnects are "Invisible" but
> have real performance implications.  Unless we have a way to:

Sort of invisible... We can't see what their topology is, but we do have
some info...

>
> 1) Represent the topology, AND
> 2) A way to get performance about that topology

There was some discussion on this at LSF-MM.

+CC Keith and Jerome who were once interested in this topic

It's not perfect, but ACPI HMAT does have what is probably sufficient info
for a simple case like this (2 socket server + Generic Ports and CXL
description of the rest of the path). It's just that today we aren't
exposing that to userspace (instead we expose only the BW / Latency from a
single selected nearest initiator / CPU node to any memory containing node).

That decision was much discussed back when Keith was adding HMAT support.
At that time the question was what workload needed the dense info (2D matrix)
and we didn't have one.  With weighted interleave I think we do.

As to the problems... We come unstuck badly in much more complex situations,
as that information is load free. So if we have heavy contention due to one
shared link between islands of nodes, it can give a very misleading idea.

[CXL Node 0]                         [CXL Node 2]
     |                                    |
 [NODE A]---\                    /----[NODE C]
             \___Shared link____/
             /                  \
 [NODE B]---/                    \----[NODE D]
     |                                    |
[CXL Node 1]                         [CXL Node 3]

From ACPI, this looks much like the following (a fully connected 4 socket
system).

[CXL Node 0]                             [CXL Node 2]
     |                                        |
  [NODE A]-----------------------------[NODE C]
     |    \___________________________ /      |
     |    ____________________________\/      |
     |   /                             \      |
  [NODE B]-----------------------------[NODE D]
     |                                        |
[CXL Node 1]                             [CXL Node 3]

In the first case we should probably halve the BW of the shared link or
something like that.  In the second case use the full version.  In general we
have no way to know which one we have, and it gets way more fun with 8+
sockets :)

SLIT is indeed useless for anything other than "what's nearest" decisions.

Anyhow, short term I'd like us to revisit what info we present from HMAT
(and what we get from CXL topology descriptions, which have pretty much
everything we might want).  That should put the info in userspace to tune
weighted interleave better anyway, and perhaps provide the info you need here.

So just all the other problems to solve ;)

J

>
> It's not useful.  So NUMA is an incomplete (if not wrong) tool for this.
>
> Additionally - reacting to task migration is not a real issue.  If
> you're deploying an allocation strategy, you probably don't want your
> task migrating away from the place where you just spent a bunch of time
> allocating based on some existing strategy.  So the solution is: don't
> migrate, and if you do - don't use cross-socket interleave.
>
> Maybe if we solve the first half of this we can take a look at the task
> migration piece again, but I wouldn't try to solve for migration.
>
> At the same time we were discussing this, we were also discussing how to
> do external task-mempolicy modifications - which seemed significantly
> more useful, but ultimately more complex and without sufficient
> interested parties / users.
>
> ~Gregory
>
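
To make the "halve the BW of the shared link" idea above a bit more concrete,
here is a rough userspace-style sketch, not a proposal. It assumes we had a
per-initiator bandwidth table (the dense 2D info discussed above) and shows how
one row of it might be turned into interleave weights, with and without
derating the nodes behind a shared link. Everything in it is an assumption for
illustration: the bandwidth numbers are made up, the node names are just labels
from the diagrams, interleave_weights() is a hypothetical helper, and reducing
by the greatest common divisor is only one simple way to get small integer
ratios, not how the kernel derives its defaults.

```python
from functools import reduce
from math import gcd

# Hypothetical bandwidths (GB/s) from socket A to a few memory nodes in the
# diagrams above. The numbers are invented purely for illustration.
BW_FROM_A = {"A": 300, "CXL0": 60, "C": 100, "CXL2": 40}

# Assumption: in the shared-link topology these nodes are only reachable
# from socket A over the one contended link.
VIA_SHARED_LINK = frozenset({"C", "CXL2"})


def interleave_weights(bw, shared=frozenset(), derate=2):
    """Reduce bandwidths to small integer ratios usable as weights.

    bw:     mapping of node name -> bandwidth (any consistent unit)
    shared: nodes whose path crosses the contended shared link
    derate: divisor applied to those nodes' bandwidth (2 means "halve it")
    """
    effective = {
        node: (value // derate if node in shared else value)
        for node, value in bw.items()
    }
    # Guard against a zero divisor if every effective bandwidth is zero.
    common = reduce(gcd, effective.values()) or 1
    # Never emit a zero weight; a node in the set should get some pages.
    return {node: max(1, value // common) for node, value in effective.items()}


if __name__ == "__main__":
    print("fully connected:", interleave_weights(BW_FROM_A))
    print("shared link:    ", interleave_weights(BW_FROM_A, VIA_SHARED_LINK))
```

With these made-up numbers the A : CXL0 : C : CXL2 ratio comes out as
15:3:5:2 when the full bandwidth is used and 30:6:5:2 when the shared-link
nodes are halved. The hard part remains what is described above: nothing in
the firmware tables tells us which of the two topologies we actually have.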