From: "Huang, Ying"
To: "Aneesh Kumar K.V", baolin.wang@linux.alibaba.com
Cc: Jagdish Gediya, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 akpm@linux-foundation.org, dave.hansen@linux.intel.com, Fan Du
Subject: Re: [PATCH] mm: migrate: set demotion targets differently
References: <20220329115222.8923-1-jvgediya@linux.ibm.com>
 <87pmm4c4ys.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87lewrxsv1.fsf@linux.ibm.com>
 <878rsrc672.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87ilruy5zt.fsf@linux.ibm.com>
 <87h77ebn6j.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87fsmyy1a0.fsf@linux.ibm.com>
Date: Thu, 31 Mar 2022 16:58:00 +0800
In-Reply-To: <87fsmyy1a0.fsf@linux.ibm.com> (Aneesh Kumar K. V.'s message of
 "Thu, 31 Mar 2022 13:57:51 +0530")
Message-ID: <8735iybisn.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Aneesh Kumar K.V" writes:

> "Huang, Ying" writes:
>
>> "Aneesh Kumar K.V" writes:
>>
>>> "Huang, Ying" writes:
>>>
>>>> "Aneesh Kumar K.V" writes:
>>>>
>>>>> "Huang, Ying" writes:
>>>>>
>>>>>> Hi, Jagdish,
>>>>>>
>>>>>> Jagdish Gediya writes:
>>>>>>
>>>>>
>>>>> ...
>>>>>
>>>>>>> e.g. with below NUMA topology, where node 0 & 1 are
>>>>>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>>>>>> only nodes, and node 4 is slowest memory only node,
>>>>>>>
>>>>>>> available: 5 nodes (0-4)
>>>>>>> node 0 cpus: 0 1
>>>>>>> node 0 size: n MB
>>>>>>> node 0 free: n MB
>>>>>>> node 1 cpus: 2 3
>>>>>>> node 1 size: n MB
>>>>>>> node 1 free: n MB
>>>>>>> node 2 cpus:
>>>>>>> node 2 size: n MB
>>>>>>> node 2 free: n MB
>>>>>>> node 3 cpus:
>>>>>>> node 3 size: n MB
>>>>>>> node 3 free: n MB
>>>>>>> node 4 cpus:
>>>>>>> node 4 size: n MB
>>>>>>> node 4 free: n MB
>>>>>>> node distances:
>>>>>>> node   0   1   2   3   4
>>>>>>>   0:  10  20  40  40  80
>>>>>>>   1:  20  10  40  40  80
>>>>>>>   2:  40  40  10  40  80
>>>>>>>   3:  40  40  40  10  80
>>>>>>>   4:  80  80  80  80  10
>>>>>>>
>>>>>>> The existing implementation gives below demotion targets,
>>>>>>>
>>>>>>> node    demotion_target
>>>>>>>  0              3, 2
>>>>>>>  1              4
>>>>>>>  2              X
>>>>>>>  3              X
>>>>>>>  4              X
>>>>>>>
>>>>>>> With this patch applied, below are the demotion targets,
>>>>>>>
>>>>>>> node    demotion_target
>>>>>>>  0              3, 2
>>>>>>>  1              3, 2
>>>>>>>  2              3
>>>>>>>  3              4
>>>>>>>  4              X
>>>>>>
>>>>>> For such machine, I think the perfect demotion order is,
>>>>>>
>>>>>> node    demotion_target
>>>>>>  0              2, 3
>>>>>>  1              2, 3
>>>>>>  2              4
>>>>>>  3              4
>>>>>>  4              X
>>>>>
>>>>> I guess the "equally slow nodes" is a confusing definition here. Now if the
>>>>> system consists of 2 1GB equally slow memory and the firmware doesn't want to
>>>>> differentiate between them, firmware can present a single NUMA node
>>>>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint
>>>>> that there is some difference between these two memory devices. This is
>>>>> also captured by the fact that the distance between 2 and 3 is 40 and not 10.
>>>>
>>>> Do you have more information about this?
>>>
>>> Not sure I follow the question there. I was checking shouldn't firmware
>>> do a single NUMA node if two memory devices are of the same type? How will
>>> optane present such a config? Both the DIMMs will have the same
>>> proximity domain value and hence dax kmem will add them to the same NUMA
>>> node?
>>
>> Sorry for confusing. I just wanted to check whether you have more
>> information about the machine configuration above. The machines in my
>> hand have no complex NUMA topology as in the patch description.
>
>
> Even with simple topologies like below
>
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: 4046 MB
> node 0 free: 3478 MB
> node 1 cpus: 2 3
> node 1 size: 4090 MB
> node 1 free: 3430 MB
> node 2 cpus:
> node 2 size: 4074 MB
> node 2 free: 4037 MB
> node distances:
> node   0   1   2
>   0:  10  20  40
>   1:  20  10  40
>   2:  40  40  10
>
> With current code we get demotion targets assigned as below
>
> [ 0.337307] Demotion nodes for Node 0: 2
> [ 0.337351] Demotion nodes for Node 1:
> [ 0.337380] Demotion nodes for Node 2:
>
> I guess we should fix that to be below?
>
> [ 0.344554] Demotion nodes for Node 0: 2
> [ 0.344605] Demotion nodes for Node 1: 2
> [ 0.344638] Demotion nodes for Node 2:

If the cross-socket link has enough bandwidth to accommodate the PMEM
throughput, the new one is better. If it hasn't, the old one may be
better. So, I think we need some kind of user-space override support
here. Right?

> Most of the tests we are doing are using Qemu to simulate this. We
> started looking at this to avoid using demotion completely when slow
> memory is not present. ie, we should have a different way to identify
> demotion targets other than node_states[N_MEMORY]. Virtualized platforms
> can have configs with memory only NUMA nodes with DRAM and we don't
> want to consider those as demotion targets.

Even if the demotion targets are set for some node, the demotion will
not work before enabling demotion via sysfs
(/sys/kernel/mm/numa/demotion_enabled). So for a system without slow
memory, just don't enable demotion.
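
To make the point about the sysfs knob concrete, here is a minimal
user-space sketch that reports whether demotion is currently enabled.
The path is the one quoted above; the helper name demotion_enabled()
and the set of accepted "on" strings are assumptions made for
illustration, and a kernel without this knob is simply treated as
"not enabled".

```python
from pathlib import Path

# Path quoted in the message above; it is absent on kernels without demotion support.
DEMOTION_KNOB = Path("/sys/kernel/mm/numa/demotion_enabled")


def demotion_enabled(knob: Path = DEMOTION_KNOB) -> bool:
    """Return True only if the knob is readable and reports an enabled state.

    Assumption: the file holds a single boolean-like token such as
    "true"/"false" or "1"/"0". Anything else is treated as disabled.
    """
    try:
        token = knob.read_text().strip().lower()
    except OSError:
        # Missing file or no permission: demotion cannot be confirmed as on.
        return False
    return token in {"1", "true"}


if __name__ == "__main__":
    print("demotion enabled:", demotion_enabled())
```

On a machine without slow memory an administrator would leave the knob
at its default, so the demotion targets computed at boot have no effect.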
> While we are at it can you let us know how topology will look on a
> system with two optane DIMMs? Do both appear with the same
> target_node?

In my test system, multiple optane DIMMs in one socket will be
represented as one NUMA node.

I remember Baolin has a different configuration.

Hi, Baolin, can you provide some information about this?

>>
>>> If you are suggesting that firmware doesn't do that, then I agree with you
>>> that a demotion target like the below is good.
>>>
>>> node    demotion_target
>>>  0              2, 3
>>>  1              2, 3
>>>  2              4
>>>  3              4
>>>  4              X
>>>
>>> We can also achieve that with a simple change as below.
>>
>> Glad to see the demotion order can be implemented in a simple way.
>>
>> My concern is that is it necessary to do this? If there are real
>> machines with the NUMA topology, then I think it's good to add the
>> support. But if not, why do we make the code complex unnecessarily?
>>
>> I don't have these kind of machines, do you have and will have?
>>
>
>
> Based on the above, we still need to get the simpler fix merged right?

Or user override support?

Best Regards,
Huang, Ying

[snip]
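
For reference, the demotion orders discussed in this thread can be
reproduced with a small user-space model. This is only a sketch and not
the kernel code or the patch under review. It assumes that a node's
tier is its smallest distance to any CPU node, and that each node
demotes to the nearest node(s) in the next slower tier. The function
name demotion_targets() and the override parameter (a stand-in for the
user override asked about above) are hypothetical.

```python
def demotion_targets(distance, cpu_nodes, override=None):
    """Map each node to its nearest node(s) in the next slower tier.

    distance  -- square NUMA distance matrix (list of lists)
    cpu_nodes -- node ids that have CPUs
    override  -- optional {node: [targets]} replacing the computed entry
                 (hypothetical user-space override)
    """
    nodes = range(len(distance))
    # Assumed tiering rule: smallest distance from any CPU node.
    tier_key = {n: min(distance[c][n] for c in cpu_nodes) for n in nodes}
    tiers = sorted(set(tier_key.values()))

    targets = {}
    for n in nodes:
        idx = tiers.index(tier_key[n])
        if idx + 1 == len(tiers):
            targets[n] = []  # slowest tier: nothing to demote to
            continue
        lower = [m for m in nodes if tier_key[m] == tiers[idx + 1]]
        best = min(distance[n][m] for m in lower)
        targets[n] = [m for m in lower if distance[n][m] == best]

    if override:
        targets.update(override)
    return targets


# Distance matrices copied from the two topologies quoted in this thread.
FIVE_NODE = [
    [10, 20, 40, 40, 80],
    [20, 10, 40, 40, 80],
    [40, 40, 10, 40, 80],
    [40, 40, 40, 10, 80],
    [80, 80, 80, 80, 10],
]
THREE_NODE = [
    [10, 20, 40],
    [20, 10, 40],
    [40, 40, 10],
]

if __name__ == "__main__":
    print(demotion_targets(FIVE_NODE, cpu_nodes=[0, 1]))
    print(demotion_targets(THREE_NODE, cpu_nodes=[0, 1]))
    # Keep node 1 without a target, e.g. if the cross-socket link is too slow.
    print(demotion_targets(THREE_NODE, cpu_nodes=[0, 1], override={1: []}))
```

Under these assumptions the five-node case yields 0 -> 2, 3; 1 -> 2, 3;
2 -> 4; 3 -> 4; 4 -> none, and the three-node case yields node 2 as the
target for both node 0 and node 1. A distance-only rule like this cannot
tell a memory-only DRAM node from slow memory, which is the virtualized
case raised earlier in the thread.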