From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price
Cc: Gregory Price, linux-mm@kvack.org, Aneesh Kumar K.V, Wei Xu,
    Alistair Popple, Dan Williams, Dave Hansen, Johannes Weiner,
    Jonathan Cameron, Michal Hocko, Tim Chen, Yang Shi
Subject: Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
In-Reply-To: <87pm1cwcz5.fsf@yhuang6-desk2.ccr.corp.intel.com> (Ying Huang's message of "Wed, 18 Oct 2023 16:29:02 +0800")
References: <20231009204259.875232-1-gregory.price@memverge.com>
    <87o7gzm22n.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87pm1cwcz5.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 18 Oct 2023 16:31:11 +0800
Message-ID: <87lec0wcvk.fsf@yhuang6-desk2.ccr.corp.intel.com>
Forgot to Cc more people.

"Huang, Ying" writes:

> Gregory Price writes:
>
>> On Mon, Oct 16, 2023 at 03:57:52PM +0800, Huang, Ying wrote:
>>> Gregory Price writes:
>>>
>>> > == Mutex to Semaphore change:
>>> >
>>> > Since it is expected that many threads will be accessing this data
>>> > during allocations, a mutex is not appropriate.
>>>
>>> IIUC, this is a change for performance. If so, please show some
>>> performance data.
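[Editor's note: the read-mostly access pattern under discussion here (many allocation-path readers, rare weight updates, for which the thread later settles on the existing RCU mechanism in memory-tiers.c) can be sketched in user space. This is an analogy only, not kernel code; all names are hypothetical.]

```python
# RCU-style copy/update/publish in user space: readers dereference a shared
# reference to an immutable snapshot with no lock on the hot path, while a
# writer builds a new snapshot and publishes it with one atomic reference
# swap. Hypothetical tier names and weights for illustration.

_weights = {"memory_tier4": 85, "memory_tier22": 15}

def read_weight(tier):
    snap = _weights              # one atomic reference read; no lock taken
    return snap.get(tier, 1)     # in-flight readers may still see the old snapshot

def update_weight(tier, value):
    global _weights
    new = dict(_weights)         # copy the current snapshot
    new[tier] = value            # update the copy
    _weights = new               # publish: subsequent readers switch over atomically
```

A mutex would serialize every allocation-path lookup; with this pattern only writers pay a cost, which matches the "many readers, rare updates" workload described above.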
>>>
>>
>> This change will be dropped in v3 in favor of the existing
>> RCU mechanism in memory-tiers.c, as pointed out by Matthew.
>>
>>> > == Source-node relative weighting:
>>> >
>>> > 1. Set weights for DDR (tier4) and CXL (tier22) tiers.
>>> >    echo source_node:weight > /path/to/interleave_weight
>>>
>>> If source_node is considered, why not consider target_node too? On a
>>> system with only 1 tier (DRAM), do you want weighted interleaving among
>>> NUMA nodes? If so, why tie weighted interleaving to memory tiers?
>>> Why not just introduce weighted interleaving for NUMA nodes?
>>>
>>
>> The short answer: practicality and ease of use.
>>
>> The long answer: we have been discussing how to make this more flexible.
>>
>> Personally, I agree with you. If Task A is on Socket 0, the weight on
>> Socket 0 DRAM should not be the same as the weight on Socket 1 DRAM.
>> However, right now, DRAM nodes are lumped into the same tier together,
>> resulting in them having the same weight.
>>
>> If you scroll back through the list, you'll find an RFC I posted for
>> set_mempolicy2 which implements weighted interleave in mm/mempolicy.
>> However, mm/mempolicy is extremely `current-centric` at the moment,
>> which makes changing weights at runtime (in response to a hotplug
>> event, for example) very difficult.
>>
>> I still think there is room to extend set_mempolicy to allow
>> task-defined weights to take preference over tier-defined weights.
>>
>> We have discussed adding the following features to memory-tiers:
>>
>> 1) breaking up tiers to allow 1 tier per node, as opposed to defaulting
>>    to lumping all nodes of a similar quality into the same tier
>>
>> 2) enabling movement of nodes between tiers (for the purpose of
>>    reconfiguring due to hotplug and other situations)
>>
>> For users that require fine-grained control over each individual node,
>> this would allow weights to be applied per-node, because a
>> node = tier.
>> For the majority of use cases, it would allow clumping of
>> nodes into tiers based on physical topology and performance class, and
>> then allow the general weighting to apply. This seems like the most
>> obvious use case for a majority of users, and also the
>> easiest to set up in the short term.
>>
>> That said, there are probably 3 or 4 different ways/places to implement
>> this feature. The question is: what is the clear and obvious way?
>> I don't have a definitive answer for that, hence the RFC.
>>
>> There are at least 5 proposals that I know of at the moment:
>>
>> 1) mempolicy
>> 2) memory-tiers
>> 3) memory-block interleaving (weighting among blocks inside a node)
>>    Maybe relevant if Dynamic Capacity devices arrive, but it seems
>>    like the wrong place to do this.
>> 4) multi-device nodes (e.g. cxl create-region ... mem0 mem1...)
>> 5) "just do it in hardware"
>
> It may be easier to start with the use case. What are the practical use
> cases in your mind that cannot be satisfied with a simple per-memory-tier
> weight? Can you compare the memory layout under the different proposals?
>
>>> > # Set tier4 weight from node 0 to 85
>>> > echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
>>> > # Set tier4 weight from node 1 to 65
>>> > echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
>>> > # Set tier22 weight from node 0 to 15
>>> > echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
>>> > # Set tier22 weight from node 1 to 10
>>> > echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
>
> --
> Best Regards,
> Huang, Ying
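[Editor's note: the weights in the echo examples above imply a placement ratio, e.g. 85:15 DDR:CXL for allocations sourced from node 0. As a rough sketch of how such weights could drive page placement, one classic approach is smooth weighted round-robin, which follows the ratio while keeping placements interleaved rather than batched. This is an illustration only, not the patch set's implementation.]

```python
def weighted_interleave(weights, npages):
    """Pick a target tier for each of npages via smooth weighted
    round-robin. `weights` maps tier name -> integer weight.
    A sketch under assumed semantics, not the in-kernel policy code."""
    total = sum(weights.values())
    current = {tier: 0 for tier in weights}
    placement = []
    for _ in range(npages):
        for tier, w in weights.items():
            current[tier] += w            # every tier earns its weight
        best = max(current, key=current.get)
        current[best] -= total            # the winner pays the total
        placement.append(best)
    return placement

# Node-0 weights from the example: 85 for DDR (tier4), 15 for CXL (tier22).
pages = weighted_interleave({"memory_tier4": 85, "memory_tier22": 15}, 100)
```

Over 100 placements this yields exactly 85 on memory_tier4 and 15 on memory_tier22, with the CXL picks spread out roughly every 6-7 pages instead of clustered at the end.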