Date: Mon, 6 Jun 2022 15:59:20 +0100
From: Jonathan Cameron
To: Aneesh Kumar K V
CC: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
    Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
    Hesham Almatary, Dave Hansen, Alistair Popple, Dan Williams,
    Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes
Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
Message-ID: <20220606155920.00004ce9@Huawei.com>
References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>
    <20220527122528.129445-3-aneesh.kumar@linux.ibm.com>
    <20220527151531.00002a0c@Huawei.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Fri, 3 Jun 2022 14:10:47 +0530
Aneesh Kumar K V wrote:

> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
> > On Fri, 27 May 2022 17:55:23 +0530
> > "Aneesh Kumar K.V" wrote:
> >
> >> From: Jagdish Gediya
> >>
> >> Add support to read/write the memory tierindex for a NUMA node.
> >>
> >> /sys/devices/system/node/nodeN/memtier
> >>
> >> where N = node id
> >>
> >> When read, It list the memory tier that the node belongs to.
> >>
> >> When written, the kernel moves the node into the specified
> >> memory tier, the tier assignment of all other nodes are not
> >> affected.
> >>
> >> If the memory tier does not exist, writing to the above file
> >> create the tier and assign the NUMA node to that tier.
> > creates
> >
> > There was some discussion in v2 of Wei Xu's RFC that what matter
> > for creation is the rank, not the tier number.
> >
> > My suggestion is move to an explicit creation file such as
> > memtier/create_tier_from_rank
> > to which writing the rank gives results in a new tier
> > with the next device ID and requested rank.
>
> I think the below workflow is much simpler.
>
> :/sys/devices/system# cat memtier/memtier1/nodelist
> 1-3
> :/sys/devices/system# cat node/node1/memtier
> 1
> :/sys/devices/system# ls memtier/memtier*
> nodelist  power  rank  subsystem  uevent
> /sys/devices/system# ls memtier/
> default_rank  max_tier  memtier1  power  uevent
> :/sys/devices/system# echo 2 > node/node1/memtier
> :/sys/devices/system#
>
> :/sys/devices/system# ls memtier/
> default_rank  max_tier  memtier1  memtier2  power  uevent
> :/sys/devices/system# cat memtier/memtier1/nodelist
> 2-3
> :/sys/devices/system# cat memtier/memtier2/nodelist
> 1
> :/sys/devices/system#
>
> ie, to create a tier we just write the tier id/tier index to
> node/nodeN/memtier file. That will create a new memory tier if needed
> and add the node to that specific memory tier. Since for now we are
> having 1:1 mapping between tier index to rank value, we can derive the
> rank value from the memory tier index.
>
> For dynamic memory tier support, we can assign a rank value such that
> new memory tiers are always created such that it comes last in the
> demotion order.
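
For reference, the workflow quoted above can be mimicked with a small
toy model. This is only a sketch in Python, not kernel code: the class
and method names (MemoryTiers, node_tier, set_node_tier, nodelist) are
made up for illustration, and it assumes that a tier left with no
nodes simply stays around, which the quoted text does not specify.

```python
class MemoryTiers:
    """Toy model: tier index -> set of NUMA node ids (hypothetical names)."""

    def __init__(self, initial):
        # initial maps a tier index to an iterable of node ids
        self.tiers = {tier: set(nodes) for tier, nodes in initial.items()}

    def node_tier(self, node):
        # Models reading node/nodeN/memtier
        for tier, nodes in self.tiers.items():
            if node in nodes:
                return tier
        raise KeyError(f"node {node} is not in any tier")

    def set_node_tier(self, node, tier):
        # Models writing a tier index to node/nodeN/memtier:
        # the tier is created if needed and only this node moves.
        old = self.node_tier(node)
        self.tiers[old].discard(node)
        self.tiers.setdefault(tier, set()).add(node)

    def nodelist(self, tier):
        # Models reading memtier/memtierN/nodelist
        return sorted(self.tiers[tier])


tiers = MemoryTiers({1: [1, 2, 3]})
tiers.set_node_tier(1, 2)     # echo 2 > node/node1/memtier
print(tiers.nodelist(1))      # [2, 3]
print(tiers.nodelist(2))      # [1]
```

The point of the model is only that creation is a side effect of the
write; nothing in it decides what rank the new tier should get.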

I'm not keen on having to pass through an intermediate state where
the rank may well be wrong, but I guess it's not that harmful even
if it feels wrong ;)

Races are potentially a bit of a pain though depending on what we
expect the usage model to be.

There are patterns (CXL regions for example) of guaranteeing the
'right' device is created by doing something like

cat create_tier > temp.txt
#(temp gets 2 for example on first call then
# next read of this file gets 3 etc)

cat temp.txt > create_tier
# will fail if there hasn't been a read of the same value

Assuming all software keeps to the model, then there are no
race conditions over creation.  Otherwise we have two new
devices turn up very close to each other and userspace scripting
tries to create two new tiers - if it races they may end up in
the same tier when that wasn't the intent.  Then code to set
the rank also races and we get two potentially very different
memories in a tier with a randomly selected rank.

Fun and games...  And a fine illustration why sysfs based 'device'
creation is tricky to get right (and lots of cases in the kernel
don't).

Jonathan

>
> -aneesh
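
To make the create_tier handshake above a bit more concrete, here is a
rough Python model of it. It is a sketch under assumptions, not a
proposal for the actual interface: TierCreator and its read/write
methods are hypothetical names, the first ID of 2 just follows the
example in the comment above, and the rule "every read hands out a
fresh ID, a write only succeeds for an ID that was handed out and not
yet used" is my reading of that example.

```python
class TierCreator:
    """Toy model of a 'read to reserve, write back to create' file."""

    def __init__(self, first_id=2):
        self.next_id = first_id   # value the next read will return
        self.reserved = set()     # IDs handed out but not yet created
        self.created = []         # IDs of tiers that now exist

    def read(self):
        # Models: cat create_tier > temp.txt
        tier_id = self.next_id
        self.next_id += 1
        self.reserved.add(tier_id)
        return tier_id

    def write(self, tier_id):
        # Models: cat temp.txt > create_tier
        # Fails unless this exact value was previously read.
        if tier_id not in self.reserved:
            raise ValueError(f"tier {tier_id} was not reserved by a read")
        self.reserved.remove(tier_id)
        self.created.append(tier_id)


creator = TierCreator()

# Two scripts racing: each read returns a distinct ID, so the
# interleaving of the writes cannot put both devices in one tier.
a = creator.read()
b = creator.read()
creator.write(b)
creator.write(a)
print(a, b, creator.created)

# Without the handshake, both scripts would typically look at the
# same max_tier and derive the same "new" index from it.
max_tier = 1
naive_a = max_tier + 1
naive_b = max_tier + 1
print(naive_a == naive_b)
```

In this model a stale or guessed value is rejected on write, which is
what keeps two scripts from silently sharing a tier. It says nothing
about how the rank then gets set, so that race is still there.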