From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 10 Jun 2022 10:57:08 +0100
From: Jonathan Cameron
To: Johannes Weiner
CC: Aneesh Kumar K V, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
 Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
 Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
 Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
 Baolin Wang, David Rientjes
Subject: Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
Message-ID: <20220610105708.0000679b@Huawei.com>
References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com>
 <20220603134237.131362-2-aneesh.kumar@linux.ibm.com>
 <02ee2c97-3bca-8eb6-97d8-1f8743619453@linux.ibm.com>
 <20220609152243.00000332@Huawei.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

On Thu, 9 Jun 2022 16:41:04 -0400
Johannes Weiner wrote:

> On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> > I think discussion hinged on it making sense to be able to change
> > rank of a tier rather than create a new tier and move things one by one.
> > Example was wanting to change the rank of a tier that was created
> > either by core code or a subsystem.
> >
> > E.g. If GPU driver creates a tier, assumption is all similar GPUs will
> > default to the same tier (if hot plugged later for example) as the
> > driver subsystem will keep a reference to the created tier.
> > Hence if user wants to change the order of that relative to
> > other tiers, the option of creating a new tier and moving the
> > devices would then require us to have infrastructure to tell the GPU
> > driver to now use the new tier for additional devices.
>
> That's an interesting point, thanks for explaining.
>
> But that could still happen when two drivers report the same tier and
> one of them is wrong, right? You'd still need to separate out by hand
> to adjust rank, as well as handle hotplug events. Driver collisions
> are probable with coarse categories like gpu, dram, pmem.

There will always be cases that need hand tweaking. Also, I'd envision
some driver subsystems being clever enough to manage several tiers and
use the information available to them to assign devices appropriately.
This is definitely true for CXL 2.0+ devices, where we can have
radically different device types under the same driver (volatile,
persistent, direct connect, behind switches etc).

There will be some interesting choices to make on groupings in big
systems, as we don't want too many tiers unless we naturally demote
multiple levels in one go.

>
> Would it make more sense to have the platform/devicetree/driver
> provide more fine-grained distance values similar to NUMA distances,
> and have a driver-scope tunable to override/correct? And then have the
> distance value function as the unique tier ID and rank in one.

Absolutely a good thing to provide that information, but it's black
magic. There are too many contradicting metrics (latency vs bandwidth
etc), even without considering a more complex system model like the one
Jerome Glisse proposed a few years back.
https://lore.kernel.org/all/20190118174512.GA3060@redhat.com/

CXL 2.0 got this more right than anything else I've seen, as it provides
discoverable topology along with details like the latency to cross
between particular switch ports. Actually using that data (other than
by throwing it to userspace controls for HPC apps etc) is going to take
some figuring out. Even the question of what we expose to userspace,
and how, is non-obvious.

The 'right' decision is also use-case specific, so what you'd do for
particular memory characteristics on a GPU is not the same as what you'd
do for the same characteristics on a memory-only device.

>
> That would allow device class reassignments, too, and it would work
> with driver collisions where simple "tier stickiness" would
> not. (Although collisions would be less likely to begin with given a
> broader range of possible distance values.)

I think we definitely need the option to move individual nodes around
as well (in this case nodes map to individual devices if characteristics
vary between them), but I think that's somewhat orthogonal to a good
first guess.

>
> Going further, it could be useful to separate the business of hardware
> properties (and configuring quirks) from the business of configuring
> MM policies that should be applied to the resulting tier hierarchy.
> They're somewhat orthogonal tuning tasks, and one of them might become
> obsolete before the other (if the quality of distance values provided
> by drivers improves before the quality of MM heuristics ;). Separating
> them might help clarify the interface for both designers and users.
>
> E.g. a memdev class scope with a driver-wide distance value, and a
> memdev scope for per-device values that default to "inherit driver
> value". The memtier subtree would then have an r/o structure, but
> allow tuning per-tier interleaving ratio[1], demotion rules etc.

Ok, that makes sense. I'm not sure if that ends up as an implementation
detail, or affects the userspace interface of this particular element.

I'm not sure completely read-only is flexible enough (though mostly RO
is fine), as we keep sketching out cases where any attempt to do things
automatically does the wrong thing and where we need to add an extra
tier to get everything to work. Short of having a lot of tiers, I'm not
sure how we could have the default work well. Maybe a lot of "tiers" is
fine, though perhaps we would need to rename them if going this way, as
they would then no longer really match the current concept of a tier.

Imagine a system with subtle differences between memories, such as a
10% latency increase for the same bandwidth. Getting an advantage from
demoting to such a tier will require really stable usage and long run
times. Whilst you could design a demotion scheme that takes that into
account, I think we are a long way from that today.
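
To make the trade-off above concrete, here is a minimal userspace
sketch in Python (not kernel code, and not the interface proposed in
this series). It models the "driver-wide distance with per-device
override" idea from the quoted text, then groups nodes into tiers only
when the gap is large enough to matter. Every name (DriverClass, MemDev,
build_tiers, min_gap) and every distance value is hypothetical and
chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriverClass:
    """Hypothetical 'memdev class' scope: one distance per driver."""
    name: str
    distance: int


@dataclass
class MemDev:
    """Hypothetical per-device scope; distance=None means 'inherit driver value'."""
    node: int
    driver: DriverClass
    distance: Optional[int] = None

    def effective_distance(self) -> int:
        # A per-device override wins, otherwise fall back to the driver-wide value.
        return self.driver.distance if self.distance is None else self.distance


def build_tiers(devs, min_gap=0.25):
    """Group devices into tiers, nearest (fastest) first.

    A new tier starts only when a device is more than `min_gap` (relative)
    further away than the first device of the current tier, so a small
    difference such as 10% does not get a tier of its own.
    """
    tiers = []
    for dev in sorted(devs, key=lambda d: d.effective_distance()):
        dist = dev.effective_distance()
        if tiers and dist <= tiers[-1]["base"] * (1 + min_gap):
            tiers[-1]["nodes"].append(dev.node)
        else:
            tiers.append({"base": dist, "nodes": [dev.node]})
    return tiers


# Illustrative values only; real distances would come from firmware or drivers.
dram = DriverClass("dram", 100)
cxl = DriverClass("cxl", 200)
devs = [
    MemDev(0, dram),
    MemDev(1, dram, distance=110),  # 10% further away, set per device
    MemDev(2, cxl),                 # inherits the driver-wide value
    MemDev(3, cxl, distance=400),   # e.g. a device behind a switch
]
tiers = build_tiers(devs)
```

With these assumed values, nodes 0 and 1 share a tier while nodes 2 and
3 each get their own. Lowering min_gap yields more, finer-grained tiers,
which is exactly the case where demotion between neighbouring tiers
would rarely pay off.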

Jonathan

>
> [1] https://lore.kernel.org/linux-mm/20220607171949.85796-1-hannes@cmpxchg.org/#t