Date: Thu, 9 Jun 2022 15:22:43 +0100
From: Jonathan Cameron
To: Johannes Weiner
CC: Aneesh Kumar K V, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
 Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
 Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
 Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
 Baolin Wang, David Rientjes
Subject: Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
Message-ID: <20220609152243.00000332@Huawei.com>
References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com>
 <20220603134237.131362-2-aneesh.kumar@linux.ibm.com>
 <02ee2c97-3bca-8eb6-97d8-1f8743619453@linux.ibm.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Thu, 9 Jun 2022 09:55:45 -0400
Johannes Weiner wrote:

> On Thu, Jun 09, 2022 at 08:03:26AM +0530, Aneesh Kumar K V wrote:
> > On 6/8/22 11:46 PM, Johannes Weiner wrote:
> > > On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote:
> > > > On 6/8/22 9:25 PM, Johannes Weiner wrote:
> > > > > Hello,
> > > > >
> > > > > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
> > > > > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
> > > > > > > @@ -0,0 +1,20 @@
> > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > > > > > +#define _LINUX_MEMORY_TIERS_H
> > > > > > > +
> > > > > > > +#ifdef CONFIG_TIERED_MEMORY
> > > > > > > +
> > > > > > > +#define MEMORY_TIER_HBM_GPU 0
> > > > > > > +#define MEMORY_TIER_DRAM 1
> > > > > > > +#define MEMORY_TIER_PMEM 2
> > > > > > > +
> > > > > > > +#define MEMORY_RANK_HBM_GPU 300
> > > > > > > +#define MEMORY_RANK_DRAM 200
> > > > > > > +#define MEMORY_RANK_PMEM 100
> > > > > > > +
> > > > > > > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM
> > > > > > > +#define MAX_MEMORY_TIERS 3
> > > > > >
> > > > > > I understand the names are somewhat arbitrary, and the tier ID space
> > > > > > can be expanded down the line by bumping MAX_MEMORY_TIERS.
> > > > > >
> > > > > > But starting out with a packed ID space can get quite awkward for
> > > > > > users when new tiers - especially intermediate tiers - show up in
> > > > > > existing configurations. I mentioned in the other email that DRAM !=
> > > > > > DRAM, so new tiers seem inevitable already.
> > > > > >
> > > > > > It could make sense to start with a bigger address space and spread
> > > > > > out the list of kernel default tiers a bit within it:
> > > > > >
> > > > > > MEMORY_TIER_GPU 0
> > > > > > MEMORY_TIER_DRAM 10
> > > > > > MEMORY_TIER_PMEM 20
> > > > >
> > > > > Forgive me if I'm asking a question that has been answered. I went
> > > > > back to earlier threads and couldn't work it out - maybe there were
> > > > > some off-list discussions? Anyway...
> > > > >
> > > > > Why is there a distinction between tier ID and rank? I understand that
> > > > > rank was added because tier IDs were too few. But if rank determines
> > > > > ordering, what is the use of a separate tier ID? IOW, why not make the
> > > > > tier ID space wider and have the kernel pick a few spread out defaults
> > > > > based on known hardware, with plenty of headroom to be future proof.
> > > > >
> > > > > $ ls tiers
> > > > > 100 # DEFAULT_TIER
> > > > > $ cat tiers/100/nodelist
> > > > > 0-1 # conventional numa nodes
> > > > >
> > > > >
> > > > >
> > > > > $ grep . tiers/*/nodelist
> > > > > tiers/100/nodelist:0-1 # conventional numa
> > > > > tiers/200/nodelist:2 # pmem
> > > > >
> > > > > $ grep . nodes/*/tier
> > > > > nodes/0/tier:100
> > > > > nodes/1/tier:100
> > > > > nodes/2/tier:200
> > > > >
> > > > >
> > > > >
> > > > > $ grep . tiers/*/nodelist
> > > > > tiers/100/nodelist:0-1,3
> > > > > tiers/200/nodelist:2
> > > > >
> > > > > $ echo 300 >nodes/3/tier
> > > > > $ grep . tiers/*/nodelist
> > > > > tiers/100/nodelist:0-1
> > > > > tiers/200/nodelist:2
> > > > > tiers/300/nodelist:3
> > > > >
> > > > > $ echo 200 >nodes/3/tier
> > > > > $ grep . tiers/*/nodelist
> > > > > tiers/100/nodelist:0-1
> > > > > tiers/200/nodelist:2-3
> > > > >
> > > > > etc.
> > > >
> > > > tier ID is also used as device id memtier.dev.id. It was discussed that we
> > > > would need the ability to change the rank value of a memory tier. If we make
> > > > rank value same as tier ID or tier device id, we will not be able to support
> > > > that.
> > >
> > > Is the idea that you could change the rank of a collection of nodes in
> > > one go? Rather than moving the nodes one by one into a new tier?
> > >
> > > [ Sorry, I wasn't able to find this discussion. AFAICS the first
> > >   patches in RFC4 already had the struct device { .id = tier }
> > >   logic. Could you point me to it? In general it would be really
> > >   helpful to maintain summarized rationales for such decisions in the
> > >   coverletter to make sure things don't get lost over many, many
> > >   threads, conferences, and video calls. ]
> >
> > Most of the discussion happened not in the patch review email threads.
> >
> > RFC: Memory Tiering Kernel Interfaces (v2)
> > https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com
> >
> > RFC: Memory Tiering Kernel Interfaces (v4)
> > https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>
> I read the RFCs, the discussions and your code. It's still not clear
> why the tier/device ID and the rank need to be two separate,
> user-visible things. There is only one tier of a given rank, why can't
> the rank be the unique device id? dev->id = 100. One number. Or use a
> unique device id allocator if large numbers are causing problems
> internally. But I don't see an explanation why they need to be two
> different things, let alone two different things in the user ABI.

I think the discussion hinged on it making sense to be able to change
the rank of a tier, rather than create a new tier and move things one
by one. The example was wanting to change the rank of a tier that was
created either by core code or by a subsystem.

E.g. if a GPU driver creates a tier, the assumption is that all similar
GPUs will default to the same tier (if hot plugged later, for example),
as the driver subsystem will keep a reference to the created tier.
Hence, if the user wants to change the order of that tier relative to
other tiers, the option of creating a new tier and moving the devices
would then require us to have infrastructure to tell the GPU driver to
use the new tier for additional devices.

Or we could go with new nodes not being assigned to a tier, with
userspace always responsible for that assignment. That may be a problem
for anything relying on existing behavior. It also means that there
must always be a sensible userspace script...

Jonathan
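
P.S. To make the GPU example concrete, here is a rough userspace sketch
of why a rank that is separate from the tier's identity helps. This is
not code from the patch set: struct memtier, struct gpu_driver and all
three helpers are made-up names for illustration only. The id and rank
values used in the usage note are the HBM_GPU and DRAM defaults from the
quoted header.

```c
/* Userspace sketch only; every name here is hypothetical, not the patch's API. */
#include <assert.h>
#include <stddef.h>

struct memtier {
	int id;   /* stable identity, e.g. what a device id would carry */
	int rank; /* ordering between tiers; a higher rank sorts first */
};

/* A driver subsystem remembers which tier its devices default to. */
struct gpu_driver {
	struct memtier *default_tier;
};

/* Re-rank a tier in place: the object and its id stay the same. */
static void memtier_set_rank(struct memtier *tier, int rank)
{
	assert(tier != NULL);
	tier->rank = rank;
}

/* Tier that a later hot-plugged device of this driver would land in. */
static struct memtier *gpu_hotplug_tier(const struct gpu_driver *drv)
{
	return drv->default_tier;
}

/* Nonzero if tier a is ordered above tier b. */
static int memtier_above(const struct memtier *a, const struct memtier *b)
{
	return a->rank > b->rank;
}
```

With a GPU tier (id 0, rank 300) and a DRAM tier (id 1, rank 200),
calling memtier_set_rank() on the GPU tier with a value below 200 moves
it under DRAM, while gpu_hotplug_tier() still returns the same object
with the same id. If the rank were the id, the same reordering would
mean creating a second tier and somehow updating default_tier in the
driver, which is the extra infrastructure mentioned above.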