From: "Huang, Ying"
To: Michal Hocko
Cc: Johannes Weiner, Gregory Price, linux-kernel@vger.kernel.org,
 linux-cxl@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org,
 aneesh.kumar@linux.ibm.com, weixugc@google.com, apopple@nvidia.com,
 tim.c.chen@intel.com, dave.hansen@intel.com, shy828301@gmail.com,
 gregkh@linuxfoundation.org, rafael@kernel.org, Gregory Price
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
In-Reply-To: (Michal Hocko's message of "Tue, 31 Oct 2023 16:56:27 +0100")
References: <20231031003810.4532-1-gregory.price@memverge.com>
 <20231031152142.GA3029315@cmpxchg.org>
Date: Wed, 01 Nov 2023 10:21:47 +0800
Message-ID: <87msvy6wn8.fsf@yhuang6-desk2.ccr.corp.intel.com>

Michal Hocko writes:

> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>> On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>> > On Mon 30-10-23 20:38:06, Gregory Price wrote:
>> > > This patchset implements weighted interleave and adds a new sysfs
>> > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
>> > >
>> > > The il_weight of a node is used by mempolicy to implement weighted
>> > > interleave when `numactl --interleave=...` is invoked. By default,
>> > > il_weight for a node is always 1, which preserves the default
>> > > round-robin interleave behavior.
>> > >
>> > > Interleave weights may be set from 0-100, and denote the number of
>> > > pages that should be allocated from the node when interleaving
>> > > occurs.
>> > >
>> > > For example, if a node's interleave weight is set to 5, 5 pages
>> > > will be allocated from that node before the next node is scheduled
>> > > for allocations.
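As an aside, the quoted semantics amount to a weighted round-robin over
the nodes in the interleave set. Below is a minimal userspace sketch of
that selection logic - illustrative only, not the code in the patchset,
with hard-coded example weights (node 0 at weight 5, the rest at the
default 1) instead of values read from the proposed il_weight sysfs
file:

/* Illustrative sketch only: the weighted round-robin selection
 * described above, in userspace C.  Weights are hard-coded example
 * values; the actual patchset reads them from nodeN/accessM/il_weight
 * in sysfs. */
#include <stdio.h>

#define NR_NODES 4

static const int il_weight[NR_NODES] = { 5, 1, 1, 1 };

/* Return the node for the next interleaved page: hand out
 * il_weight[n] pages from node n before advancing to node n+1. */
static int next_interleave_node(void)
{
    static int cur_node = NR_NODES - 1;
    static int pages_left;    /* pages still owed to cur_node */

    if (pages_left == 0) {
        cur_node = (cur_node + 1) % NR_NODES;
        pages_left = il_weight[cur_node];
    }
    pages_left--;
    return cur_node;
}

int main(void)
{
    int count[NR_NODES] = { 0 };

    for (int i = 0; i < 16; i++)    /* "allocate" 16 pages */
        count[next_interleave_node()]++;
    for (int n = 0; n < NR_NODES; n++)
        printf("node %d: %d pages\n", n, count[n]);
    return 0;
}

When every weight is 1 this degenerates to plain round-robin, which is
why a default of 1 preserves the current interleave behavior.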
>> > I find this semantic rather weird, TBH. First of all, why do you
>> > think it makes sense to have those weights global for all users?
>> > What if different applications have different views on how to
>> > spread their interleaved memory?
>> >
>> > I do get that you might have different tiers with largely different
>> > runtime characteristics, but why would you want to interleave them
>> > into a single mapping and have hard-to-predict runtime behavior?
>> >
>> > [...]
>> > > In this way it becomes possible to set an interleaving strategy
>> > > that fits the available bandwidth for the devices available on
>> > > the system. An example system:
>> > >
>> > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
>> > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
>> > > Node 2 - CXL Memory, 64GB/s BW, on Node 0 root complex
>> > > Node 3 - CXL Memory, 64GB/s BW, on Node 1 root complex
>> > >
>> > > In this setup, the effective weights for nodes 0-3 for a task
>> > > running on Node 0 may be [60, 20, 10, 10].
>> > >
>> > > This spreads memory out across devices which all have different
>> > > latency and bandwidth attributes in a way that can maximize the
>> > > available resources.
>> >
>> > OK, so why is this any better than not using any memory policy and
>> > relying on demotion to push cold memory down the tier hierarchy?
>> >
>> > What is the actual real-life use case, and what kind of benefits
>> > can you present?
>>
>> There are two things CXL gives you: additional capacity and
>> additional bus bandwidth.
>>
>> The promotion/demotion mechanism is good for the capacity use case,
>> where you have a nice hot/cold gradient in the working set and want
>> placement accordingly across faster and slower memory.
>>
>> The interleaving is useful when you have a flatter working-set
>> distribution and poorer access locality. In that case, the CPU caches
>> are less effective and the workload can be bus-bound. The workload
>> might fit entirely into DRAM, but concentrating it there is
>> suboptimal. Fanning it out in proportion to the relative performance
>> of each memory tier gives better results.
>>
>> We experimented with datacenter workloads on such machines last year
>> and found significant performance benefits:
>>
>> https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/
>
> Thanks, this is a useful insight.
>
>> This hopefully also explains why it's a global setting. The use case
>> is different from conventional NUMA interleaving, which is used as a
>> locality measure: spread shared data evenly between compute nodes.
>> This one isn't about locality - the CXL tier doesn't have local
>> compute. Instead, the optimal spread is based on hardware parameters,
>> which is a global property rather than a per-workload one.
>
> Well, I am not convinced of that, TBH. Sure, it is probably a good fit
> for this specific CXL use case, but it just doesn't fit many others I
> can think of - e.g. proportional use of those tiers based on the
> workload - you get what you pay for.

For "pay", per my understanding, we need some cgroup-based
per-memory-tier (or per-node) usage limit. The following patchset is
the first step toward that:

https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/
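Purely to illustrate the direction (none of the names below come from
that patchset or from the kernel; they are made up for this sketch),
the idea is per-cgroup, per-tier charging against a limit that reflects
what the cgroup paid for:

/* Illustrative sketch only: hypothetical per-cgroup, per-memory-tier
 * usage accounting.  These structures and names do not exist in the
 * kernel; they only sketch the "you get what you pay for" idea. */
#include <stdio.h>

#define NR_TIERS 2            /* tier 0 = DRAM, tier 1 = CXL */

struct tier_counter {
    unsigned long usage;      /* pages currently charged */
    unsigned long limit;      /* pages this cgroup paid for */
};

/* Charge one page against a tier; fail, as a memory.max-style limit
 * would, once the cgroup has used up its quota on that tier. */
static int tier_charge(struct tier_counter *tc)
{
    if (tc->usage >= tc->limit)
        return -1;
    tc->usage++;
    return 0;
}

int main(void)
{
    struct tier_counter tier[NR_TIERS] = {
        { .usage = 0, .limit = 3 },    /* small DRAM quota */
        { .usage = 0, .limit = 8 },    /* larger CXL quota */
    };

    /* Charge 5 pages; once the DRAM quota is exhausted, fall back
     * to (i.e. get charged on) the CXL tier instead. */
    for (int i = 0; i < 5; i++) {
        int t = (tier_charge(&tier[0]) == 0) ? 0 : 1;
        if (t == 1 && tier_charge(&tier[1]) != 0) {
            printf("page %d: over all limits\n", i);
            continue;
        }
        printf("page %d charged to tier %d\n", i, t);
    }
    return 0;
}

Whether the right granularity is per-tier or per-node, and whether an
over-limit charge should spill to the next tier, trigger reclaim, or
fail outright, are open questions.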
For "pay", per my understanding, we need some cgroup based per-memory-tier (or per-node) usage limit. The following patchset is the first step for that. https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/ -- Best Regards, Huang, Ying