From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Bharata B Rao <bharata@amd.com>
Cc: <linux-mm@kvack.org>,  <linux-kernel@vger.kernel.org>,
  <akpm@linux-foundation.org>,  <mingo@redhat.com>,
  <peterz@infradead.org>,  <mgorman@techsingularity.net>,
  <raghavendra.kt@amd.com>,  <dave.hansen@linux.intel.com>,
  <hannes@cmpxchg.org>
Subject: Re: [RFC PATCH 0/2] Hot page promotion optimization for large
 address space
In-Reply-To: <dd2bc563-7654-4d83-896e-49a7291dd1aa@amd.com> (Bharata B. Rao's
	message of "Thu, 28 Mar 2024 11:19:40 +0530")
References: <20240327160237.2355-1-bharata@amd.com>
	<87il16lxzl.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<dd2bc563-7654-4d83-896e-49a7291dd1aa@amd.com>
Date: Thu, 28 Mar 2024 14:03:53 +0800
Message-ID: <87edbulwom.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Bharata B Rao <bharata@amd.com> writes:

> On 28-Mar-24 11:05 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>> 
>>> In order to check how efficiently the existing NUMA balancing
>>> based hot page promotion mechanism can detect hot regions and
>>> promote pages for workloads with large memory footprints, I
>>> wrote and tested a program that allocates a huge amount of
>>> memory but routinely touches only small parts of it.
>>>
>>> This microbenchmark provisions memory on both a DRAM node and a CXL
>>> node. It then divides the entire allocated memory into smaller
>>> chunks and randomly chooses a chunk for generating memory accesses.
>>> Each chunk is then accessed for a fixed number of iterations to
>>> create the notion of hotness. Within each chunk, the individual
>>> pages at 4K granularity are again accessed in random fashion.
>>>
>>> When a chunk is taken up for access in this manner, its pages
>>> can either be residing on DRAM or CXL. In the latter case, the NUMA
>>> balancing driven hot page promotion logic is expected to detect and
>>> promote the hot pages that reside on CXL.
>>>
>>> The experiment was conducted on a 2P AMD Bergamo system that has
>>> CXL as the 3rd node.
>>>
>>> $ numactl -H
>>> available: 3 nodes (0-2)
>>> node 0 cpus: 0-127,256-383
>>> node 0 size: 128054 MB
>>> node 1 cpus: 128-255,384-511
>>> node 1 size: 128880 MB
>>> node 2 cpus:
>>> node 2 size: 129024 MB
>>> node distances:
>>> node     0    1    2
>>>   0:    10   32   60
>>>   1:    32   10   50
>>>   2:   255  255   10
>>>
>>> It is seen that the number of pages that get promoted is really low.
>>> The reason is that the NUMA hint fault latency turns out to be much
>>> higher than the hot threshold most of the time. Here are a few
>>> latency and threshold sample values captured from the
>>> should_numa_migrate_memory() routine when the benchmark was run:
>>>
>>> latency (ms)   threshold (ms)
>>>        20620             1125
>>>        56185             1125
>>>        98710             1250
>>>       148871             1375
>>>       182891             1625
>>>       369415             1875
>>>       630745             2000
>> 
>> The access latency of your workload is 20s to 630s, which appears too
>> long.  Can you try to increase the range of threshold to deal with that?
>> For example,
>> 
>> echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms
>
> That of course should help. But I was exploring alternatives where the
> notion of hotness can be de-linked from the absolute scanning time to

In fact, only the relative time from scan to hint fault is recorded and
calculated; we have only a limited number of bits.

> the extent possible. For large memory workloads where only parts of memory
> get accessed at once, the scanning time can lag behind the actual access
> time significantly, as the data above shows. Wondering if such cases can
> be addressed without having to be workload-specific.

Does it really matter to promote pages that are quite cold (accessed
less often than once every 20s)?  If so, how can we adjust the current
algorithm to cover that?  I think that may be possible by extending the
threshold range, and we can find some way to extend the range by default
if necessary.
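
To make the effect of a wider threshold concrete, here is a rough sketch
that replays the latency/threshold samples quoted above.  It assumes the
check reduces to "latency < threshold", and it treats the 100000 ms value
from the echo example as a fixed threshold, ignoring any dynamic
adjustment.  The function names are made up for illustration; this is not
the kernel code.

```python
# (hint fault latency, threshold) pairs in ms, copied from the samples
# quoted earlier in this thread.
SAMPLES_MS = [
    (20620, 1125),
    (56185, 1125),
    (98710, 1250),
    (148871, 1375),
    (182891, 1625),
    (369415, 1875),
    (630745, 2000),
]


def is_promotion_candidate(latency_ms, threshold_ms):
    """Assumed rule: a page counts as hot only if its hint fault
    latency is below the hot threshold."""
    return latency_ms < threshold_ms


def count_candidates(samples, override_threshold_ms=None):
    """Count samples that pass the check.  If override_threshold_ms is
    given, it replaces the per-sample threshold (a simplification that
    ignores dynamic threshold adjustment)."""
    count = 0
    for latency_ms, threshold_ms in samples:
        th = threshold_ms if override_threshold_ms is None else override_threshold_ms
        if is_promotion_candidate(latency_ms, th):
            count += 1
    return count


if __name__ == "__main__":
    # With the sampled thresholds (1125 to 2000 ms).
    print(count_candidates(SAMPLES_MS))
    # With a hypothetical fixed threshold of 100000 ms.
    print(count_candidates(SAMPLES_MS, 100000))
```

Under these assumptions, none of the seven samples pass with the sampled
thresholds, and three pass with a 100000 ms threshold.  The remaining four
have latencies above 100s, so they would need an even wider range.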

--
Best Regards,
Huang, Ying