From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 41D7ACD1284
	for <linux-mm@archiver.kernel.org>; Fri, 29 Mar 2024 01:16:04 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id B456F6B0098; Thu, 28 Mar 2024 21:16:03 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id AF5106B0099; Thu, 28 Mar 2024 21:16:03 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 9BC946B009A; Thu, 28 Mar 2024 21:16:03 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12])
	by kanga.kvack.org (Postfix) with ESMTP id 813CF6B0098
	for <linux-mm@kvack.org>; Thu, 28 Mar 2024 21:16:03 -0400 (EDT)
Received: from smtpin18.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay03.hostedemail.com (Postfix) with ESMTP id C3FFDA01E1
	for <linux-mm@kvack.org>; Fri, 29 Mar 2024 01:16:02 +0000 (UTC)
X-FDA: 81948310164.18.2A8F89E
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.21])
	by imf10.hostedemail.com (Postfix) with ESMTP id BB377C0003
	for <linux-mm@kvack.org>; Fri, 29 Mar 2024 01:15:59 +0000 (UTC)
Authentication-Results: imf10.hostedemail.com;
	dkim=pass header.d=intel.com header.s=Intel header.b=Z90zFWCP;
	spf=pass (imf10.hostedemail.com: domain of ying.huang@intel.com designates 198.175.65.21 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1711674961; a=rsa-sha256;
	cv=none;
	b=z4ZHDRy14mNiF4dxSQAW/sKC2pfyommSQ02KzdUHMZi8DWeLzwqRpJiGM38xhXujpeblYs
	BYOIhS04yynYpiokUY3xeXYzw1lmrmi6Ys+EXbuZ70VC4FDwCP+QaPvfB9KsQyW08P8oYc
	XNg8UdKaM0HoGHq1cl4mptq2f48jnGU=
ARC-Authentication-Results: i=1;
	imf10.hostedemail.com;
	dkim=pass header.d=intel.com header.s=Intel header.b=Z90zFWCP;
	spf=pass (imf10.hostedemail.com: domain of ying.huang@intel.com designates 198.175.65.21 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1711674960;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=nYYebEQDsMi0zuU9qNxkdxYSoceaN0oV1wKigop/DjQ=;
	b=c+e7NKaJTROZICmqYZlM1rIkDIHgGAKFvmzc971gBo7XHiImiMZIZDjIbMkqpNDcOoGYbo
	IHKfaztLjYBlO3wxXqfObxwTuQIQo++7+vAtL44n13K5mDwMuW0aY7w3SuEJpp+b7RXQ3i
	gfKQj7PWLgFw7f+zKEIhVxYprJNGJzo=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1711674960; x=1743210960;
  h=from:to:cc:subject:in-reply-to:references:date:
   message-id:mime-version;
  bh=cGgwB0n8cSC0WP9JjZ7Cf0+zxSHRWczsVZP28R7/w6E=;
  b=Z90zFWCP8HWaiZ7lFbWn87+fxJwGM+AJvRK/+nR7QuxgNVcUmMdvIRml
   P2ZdK6x23Mgf0MzDWYsRTHmqLm4LGSNckO6s68xiE6B3+LB9gn61tb76X
   3urAvmxwX11whukIVdUD3Vyml5QzKWpolxoHNgFdsNf3nnbDvEnhIYVEe
   f2kPiGb27WZVEP2ZZwR8vz8cpBQ5F7Ti8B4RhMz+WeXngNIUS/LiiJDZP
   LqA6vwCXF8YPRg0mILk1F4GIOnI8HRFQEOWjNZq32NMrkWlPQLISFRu7d
   O7mGL7eY9cKpDzA9FoTk/CnHF6KFEeBE6vnldfCHT08ElMlFDsozIaHeA
   Q==;
X-CSE-ConnectionGUID: EECniNZSTl2eaB1OJnybUw==
X-CSE-MsgGUID: Y8dIT3DMQNabauAaphbgIA==
X-IronPort-AV: E=McAfee;i="6600,9927,11027"; a="6800284"
X-IronPort-AV: E=Sophos;i="6.07,162,1708416000"; 
   d="scan'208";a="6800284"
Received: from fmviesa005.fm.intel.com ([10.60.135.145])
  by orvoesa113.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Mar 2024 18:15:59 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.07,162,1708416000"; 
   d="scan'208";a="21290431"
Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55])
  by fmviesa005-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Mar 2024 18:15:55 -0700
From: "Huang, Ying" <ying.huang@intel.com>
To: Bharata B Rao <bharata@amd.com>
Cc: <linux-mm@kvack.org>,  <linux-kernel@vger.kernel.org>,
  <akpm@linux-foundation.org>,  <mingo@redhat.com>,
  <peterz@infradead.org>,  <mgorman@techsingularity.net>,
  <raghavendra.kt@amd.com>,  <dave.hansen@linux.intel.com>,
  <hannes@cmpxchg.org>
Subject: Re: [RFC PATCH 0/2] Hot page promotion optimization for large
 address space
In-Reply-To: <929b22ca-bb51-4307-855f-9b4ae0a102e3@amd.com> (Bharata B. Rao's
	message of "Thu, 28 Mar 2024 11:59:15 +0530")
References: <20240327160237.2355-1-bharata@amd.com>
	<87il16lxzl.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<dd2bc563-7654-4d83-896e-49a7291dd1aa@amd.com>
	<87edbulwom.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<929b22ca-bb51-4307-855f-9b4ae0a102e3@amd.com>
Date: Fri, 29 Mar 2024 09:14:02 +0800
Message-ID: <875xx5lu05.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
X-Rspamd-Server: rspam08
X-Rspamd-Queue-Id: BB377C0003
X-Stat-Signature: qyxx8sxoubztj31p8kp1hxgzxispkoot
X-Rspam-User: 
X-HE-Tag: 1711674959-641158
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Bharata B Rao <bharata@amd.com> writes:

> On 28-Mar-24 11:33 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>> 
>>> On 28-Mar-24 11:05 AM, Huang, Ying wrote:
>>>> Bharata B Rao <bharata@amd.com> writes:
>>>>
>>>>> In order to check how efficiently the existing NUMA balancing
>>>>> based hot page promotion mechanism can detect hot regions and
>>>>> promote pages for workloads with large memory footprints, I
>>>>> wrote and tested a program that allocates huge amount of
>>>>> memory but routinely touches only small parts of it.
>>>>>
>>>>> This microbenchmark provisions memory both on DRAM node and CXL node.
>>>>> It then divides the entire allocated memory into chunks of smaller
>>>>> size and randomly choses a chunk for generating memory accesses.
>>>>> Each chunk is then accessed for a fixed number of iterations to
>>>>> create the notion of hotness. Within each chunk, the individual
>>>>> pages at 4K granularity are again accessed in random fashion.
>>>>>
>>>>> When a chunk is taken up for access in this manner, its pages
>>>>> can either be residing on DRAM or CXL. In the latter case, the NUMA
>>>>> balancing driven hot page promotion logic is expected to detect and
>>>>> promote the hot pages that reside on CXL.
>>>>>
>>>>> The experiment was conducted on a 2P AMD Bergamo system that has
>>>>> CXL as the 3rd node.
>>>>>
>>>>> $ numactl -H
>>>>> available: 3 nodes (0-2)
>>>>> node 0 cpus: 0-127,256-383
>>>>> node 0 size: 128054 MB
>>>>> node 1 cpus: 128-255,384-511
>>>>> node 1 size: 128880 MB
>>>>> node 2 cpus:
>>>>> node 2 size: 129024 MB
>>>>> node distances:
>>>>> node   0   1   2 
>>>>>   0:  10  32  60 
>>>>>   1:  32  10  50 
>>>>>   2:  255  255  10
>>>>>
>>>>> It is seen that number of pages that get promoted is really low and
>>>>> the reason for it happens to be that the NUMA hint fault latency turns
>>>>> out to be much higher than the hot threshold most of the times. Here
>>>>> are a few latency and threshold sample values captured from
>>>>> should_numa_migrate_memory() routine when the benchmark was run:
>>>>>
>>>>> latency	threshold (in ms)
>>>>> 20620	1125
>>>>> 56185	1125
>>>>> 98710	1250
>>>>> 148871	1375
>>>>> 182891	1625
>>>>> 369415	1875
>>>>> 630745	2000
>>>>
>>>> The access latency of your workload is 20s to 630s, which appears too
>>>> long.  Can you try to increase the range of threshold to deal with that?
>>>> For example,
>>>>
>>>> echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms
>>>
>>> That of course should help. But I was exploring alternatives where the
>>> notion of hotness can be de-linked from the absolute scanning time to
>> 
>> In fact, only relative time from scan to hint fault is recorded and
>> calculated, we have only limited bits.
>> 
>>> the extent possible. For large memory workloads where only parts of memory
>>> get accessed at once, the scanning time can lag from the actual access
>>> time significantly as the data above shows. Wondering if such cases can
>>> be addressed without having to be workload-specific.
>> 
>> Does it really matter to promote the quite cold pages (accessed every
>> more than 20s)?  And if so, how can we adjust the current algorithm to
>> cover that?  I think that may be possible via extending the threshold
>> range.  And I think that we can find some way to extending the range by
>> default if necessary.
>
> I don't think the pages are cold but rather the existing mechanism fails
> to categorize them as hot. This is because the pages were scanned way
> before the accesses start happening. When repeated accesses are made to
> a chunk of memory that has been scanned a while back, none of those
> accesses get classified as hot because the scan time is way behind
> the current access time. That's the reason we are seeing the value
> of latency ranging from 20s to 630s as shown above.

If the repeated accesses continue, the page will be identified as hot the
next time it is scanned, even if we don't expand the threshold range.  If
the repeated accesses last only a very short time, it makes little sense
to identify the pages as hot.  Right?
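
A rough user-space sketch of that argument (not kernel code; the function
names and the 60000ms/60005ms times are made up for illustration, only the
20620ms latency and 1125ms threshold come from the samples quoted above).
It assumes a page counts as hot when the time from the scan that marked it
to the hint fault is below the threshold:

```python
# Illustrative model only, not kernel code.
# Assumption: a page is treated as hot when the time from the scan that
# marked it to the following hint fault is below the hot threshold.

def hint_fault_latency_ms(scan_time_ms, fault_time_ms):
    """Time elapsed between the scan and the hint fault."""
    return fault_time_ms - scan_time_ms


def is_hot(scan_time_ms, fault_time_ms, threshold_ms):
    """True when the hint fault latency is below the threshold."""
    return hint_fault_latency_ms(scan_time_ms, fault_time_ms) < threshold_ms


THRESHOLD_MS = 1125  # first threshold sample quoted above

# The scan happened long before the accesses started (20620ms sample).
first_fault_hot = is_hot(0, 20620, THRESHOLD_MS)

# The accesses continue, the page is rescanned at a hypothetical 60000ms
# and touched again 5ms later.
second_fault_hot = is_hot(60000, 60005, THRESHOLD_MS)

print(first_fault_hot, second_fault_hot)
```

The first fault is rejected because the scan is stale, but the fault after
the next scan passes with the unchanged threshold.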

The number of bits available to record the scan time or the hint page
fault count is limited, so it's possible for it to overflow anyway.  We
can scale the time stamp if necessary (for example, from 1ms to 10ms).
But it's hard to scale the fault counter.  And nobody can guarantee that
the frequency of hint page faults is less than 1/ms; if it's 10/ms, the
counter can only record an even shorter interval.
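
To make the overflow point concrete, here is a small user-space sketch
(again not kernel code).  The 10-bit width, the function names and the
example times are assumptions picked only to keep the numbers small, and
the fault counter part is just my reading of "count faults instead of
time":

```python
# Illustrative model only, not kernel code.
TIME_BITS = 10  # assumed field width, chosen to keep the numbers small


def store_time(now_ms, unit_ms=1, bits=TIME_BITS):
    """Keep only the low `bits` bits of the time counted in `unit_ms` units."""
    return (now_ms // unit_ms) & ((1 << bits) - 1)


def measured_latency_ms(scan_ms, fault_ms, unit_ms=1, bits=TIME_BITS):
    """Latency as reconstructed from the two truncated time stamps."""
    mask = (1 << bits) - 1
    scan = store_time(scan_ms, unit_ms, bits)
    fault = store_time(fault_ms, unit_ms, bits)
    return ((fault - scan) & mask) * unit_ms


def counter_range_ms(faults_per_ms, bits=TIME_BITS):
    """Whole milliseconds a `bits`-wide fault counter covers before it wraps."""
    return (1 << bits) // faults_per_ms


# Real latency is 5000ms (scan at 3000ms, fault at 8000ms).
wrapped = measured_latency_ms(3000, 8000, unit_ms=1)   # 1ms units overflow
scaled = measured_latency_ms(3000, 8000, unit_ms=10)   # 10ms units still fit

print(wrapped, scaled, counter_range_ms(1), counter_range_ms(10))
```

With 1ms units the 5000ms latency wraps and reads as 904ms, while 10ms
units recover it.  A counter of the same width covers 1024ms at 1
fault/ms but only about 102ms at 10 faults/ms, and there is no unit to
scale there.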

--
Best Regards,
Huang, Ying