From: Raghavendra K T <rkodsara@amd.com>
To: David Rientjes <rientjes@google.com>
Cc: Aneesh Kumar <AneeshKumar.KizhakeVeetil@arm.com>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Kirill Shutemov <k.shutemov@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Mel Gorman <mel.gorman@gmail.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
Rik van Riel <riel@surriel.com>,
RaghavendraKT <Raghavendra.KodsaraThimmappa@amd.com>,
Wei Xu <weixugc@google.com>, Suyeon Lee <leesuyeon0506@gmail.com>,
Lei Chen <leillc@google.com>,
"Shukla, Santosh" <santosh.shukla@amd.com>,
"Grimm, Jon" <jon.grimm@amd.com>,
sj@kernel.org, shy828301@gmail.com, Zi Yan <ziy@nvidia.com>,
Liam Howlett <liam.howlett@oracle.com>,
Gregory Price <gregory.price@memverge.com>,
linux-mm@kvack.org
Subject: Re: Slow-tier Page Promotion discussion recap and open questions
Date: Mon, 6 Jan 2025 11:59:03 +0530
Message-ID: <c0475d17-c558-4851-ae07-c835a57ecb76@amd.com>
In-Reply-To: <abffa289-b627-173e-ed09-f94d6169f8e6@google.com>
On 1/2/2025 10:14 AM, David Rientjes wrote:
> On Fri, 20 Dec 2024, Raghavendra K T wrote:
>
>>> I asked if this was really done single threaded, which was confirmed. If
>>> only a single process has pages on a slow memory tier, for example, then
>>> flexible tuning of the scan period and size ensures we do not scan
>>> needlessly. The scan period can be tuned to be more responsive (down to
>>> 400ms in this proposal) depending on how many accesses we have on the
>>> last scan; similarly, it can be much less responsive (up to 5s) if memory
>>> is not found to be accessed.
>>>
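A minimal sketch of the adaptive scan period described above; the
halving/doubling policy and the names here are illustrative, not the exact
logic in the series:

#define KMMSCAND_PERIOD_MIN_MS   400
#define KMMSCAND_PERIOD_MAX_MS  5000

/*
 * Scan sooner when the last pass found accessed pages, back off when it
 * did not, clamped to the 400ms..5s window mentioned above.
 */
static unsigned int next_scan_period_ms(unsigned int cur_ms,
                                        unsigned int accesses_found)
{
        if (accesses_found)
                cur_ms /= 2;
        else
                cur_ms *= 2;

        if (cur_ms < KMMSCAND_PERIOD_MIN_MS)
                cur_ms = KMMSCAND_PERIOD_MIN_MS;
        if (cur_ms > KMMSCAND_PERIOD_MAX_MS)
                cur_ms = KMMSCAND_PERIOD_MAX_MS;

        return cur_ms;
}
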
>>> I also asked if scanning can be disabled entirely, Raghu clarified that
>>> it cannot be.
>>>
>>
>> We have a sysfs tunable (kmmscand/scan_enabled) to enable/disable
>> scanning as a whole at a global level, but not at per-process granularity.
>>
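(For reference, a rough sketch of how such a global knob is typically wired
up; the attribute name matches the tunable above, everything else here,
including the sysfs path, is illustrative:)

#include <linux/kobject.h>
#include <linux/kstrtox.h>
#include <linux/sysfs.h>

static bool scan_enabled = true;

static ssize_t scan_enabled_show(struct kobject *kobj,
                                 struct kobj_attribute *attr, char *buf)
{
        return sysfs_emit(buf, "%d\n", scan_enabled);
}

static ssize_t scan_enabled_store(struct kobject *kobj,
                                  struct kobj_attribute *attr,
                                  const char *buf, size_t count)
{
        int err = kstrtobool(buf, &scan_enabled);

        return err ? err : count;
}

/* Exposed as /sys/kernel/mm/kmmscand/scan_enabled in this sketch. */
static struct kobj_attribute scan_enabled_attr = __ATTR_RW(scan_enabled);
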
>
> Thanks Raghu for the clarification. I think during discussion that there
> was a preference to make this multi-threaded so we didn't rely on a single
> kmmscand thread, perhaps this would be (at minimum) one kmmscand thread
> per NUMA node?
>
Correct. From my side, a bit more thought is needed on:
- whether we need a kthread for the CXL nodes too, and
- how to share the scanning work between the kthreads (a rough per-node
  sketch follows).
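A rough sketch of the per-node direction, loosely modeled on how kswapd is
started per node; the names and the 400ms sleep are placeholders, and the
work-sharing policy is exactly the open question:

#include <linux/err.h>
#include <linux/jiffies.h>
#include <linux/kthread.h>
#include <linux/nodemask.h>
#include <linux/printk.h>
#include <linux/sched.h>

static int kmmscand_node_fn(void *data)
{
        int nid = (long)data;

        while (!kthread_should_stop()) {
                /*
                 * Scan only the mm_structs that currently have pages on
                 * slow-tier memory local to this node (how the work is
                 * split between the threads is still to be decided).
                 */
                pr_debug("kmmscand: scan pass on node %d\n", nid);
                schedule_timeout_interruptible(msecs_to_jiffies(400));
        }
        return 0;
}

static int __init kmmscand_start_per_node(void)
{
        int nid;

        for_each_node_state(nid, N_MEMORY) {
                struct task_struct *t;

                t = kthread_run(kmmscand_node_fn, (void *)(long)nid,
                                "kmmscand%d", nid);
                if (IS_ERR(t))
                        return PTR_ERR(t);
        }
        return 0;
}
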
>>> Wei Xu asked if the scan period should be interpreted as the minimal
>>> interval between scans because kmmscand is single threaded and there are
>>> many processes. Raghu confirmed this is correct, the minimal delay.
>>> Even if the scan period is 400ms, in reality it could be multiple seconds
>>> based on load.
>>>
>>> Liam Howlett asked how we could have two scans colliding in a time
>>> segment. Raghu noted if we are able to complete the last scan in less
>>> time than 400ms, then we have this delay to avoid continuously scanning
>>> that results in increased cpu overhead. Liam further asked if processes
>>> opt into a scan or out of the scan, Raghu noted we always scan every
>>> process. John Hubbard suggested that we have per-process control.
>>
>> +1 for prctl()
>>
>> Also, I want to add that I will get data on:
>>
>> the min and max time required to finish the entire scan for the
>> current micro-benchmark and one of the real workloads (such as Redis/
>> RocksDB ...), so that we can check whether we are meeting the scanning
>> deadline with a single kthread.
>>
>
> Do we want more fine-grained per-process control other than just the
> ability to opt out entire processes? There may be situations where we
> want to always serve latency tolerant jobs from CXL extended memory, we
> don't care to ever promote its memory, but I also think there will be
> processes that are between the two extremes (latency critical and latency
> tolerant).
>
> I think careful consideration needs to be given to how we handle
> per-process policy for multi-tenant systems that have different levels of
> latency sensitivity. If kmmscand becomes the standard way of doing page
> promotion in the kernel, the userspace API to inform it of these policy
> decisions is going to be key. There have been approaches where this was
> primarily driven by BPF that has to solve the same challenge.
>
Very good point.
How should a process defer or skip promotion? Should we provide a numeric
value to express a latency-sensitivity range?
This could perhaps be provided along with a plain enable/disable (see the
prctl() sketch below).
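For example (note that PR_SET_MEM_PROMOTION, its value, and the 0..100
range are purely hypothetical placeholders for the discussion, not an
existing or proposed uapi):

#include <stdio.h>
#include <sys/prctl.h>

/* Made-up request number and semantics, for illustration only. */
#define PR_SET_MEM_PROMOTION    0x4d454d50
/* arg2: 0 = never promote this process, 100 = most latency sensitive. */

int main(void)
{
        /* A moderately latency-tolerant job opts for a low sensitivity. */
        if (prctl(PR_SET_MEM_PROMOTION, 25, 0, 0, 0))
                perror("prctl(PR_SET_MEM_PROMOTION)");
        return 0;
}
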
>>> Wei noted an important point about separating hot page detection and
>>> promotion, which don't actually need to be coupled at all. This uses
>>> page table scanning while future support may not need to leverage this at
>>> all. We'd very much like to avoid multiple promotion solutions for
>>> different ways to track page hotness.
>>>
>>> I strongly supported this because I believe for CXL, at least within the
>>> next three years, that memory hotness will likely not be derived from
>>> page table Accessed bit scanning. Zi Yan agreed.
>>>
>>> The promotion path may also want to be much less aggressive than on first
>>> access. Raghu showed many improvements, including handling short lived
>>> processes, more accurate hot page detection using timestamp, etc.
>>
>> Some of these TODOs can be implemented in next version.
>>
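On the timestamp-based hot page detection mentioned above, the rough idea
is along these lines (purely illustrative; the field names and the two-hit
threshold are assumptions, not taken from the series):

#include <linux/jiffies.h>
#include <linux/types.h>

struct kmmscand_page_info {
        unsigned long last_seen;        /* jiffies when an access was last observed */
        unsigned int nr_hits;           /* accesses seen across recent scan passes  */
};

/*
 * Treat a page as promotion-worthy only if it was seen accessed in at
 * least two scan passes close together in time, instead of promoting it
 * on the first observed access.
 */
static bool kmmscand_page_is_hot(struct kmmscand_page_info *pi,
                                 unsigned long now, unsigned long window)
{
        if (time_before(now, pi->last_seen + window))
                pi->nr_hits++;
        else
                pi->nr_hits = 1;

        pi->last_seen = now;
        return pi->nr_hits >= 2;
}
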
>
> Thanks! Are you planning on sending out another RFC patch series soon or
> are you interested in publishing this on git.kernel.org or github? There
> may be an opportunity for others to send you pull requests into the series
> of patches while we discuss.
>
Good idea, will do. Perhaps the simple changes that are needed immediately
can go out sooner as the next RFC, plus a github tree (I will explore
internally how that is done here).
>>> ----->o-----
>>> I followed up on a discussion point early in the talk about whether this
>>> should be virtual address scanning like the current approach, walking
>>> mm_struct's, or the alternative approach which would be physical address
>>> scanning.
>>>
>>> Raghu sees this as a fully alternative approach such as what DAMON uses
>>> that is based on rmap. The only advantage appears to be avoiding
>>> scanning on top tier memory completely.
>>
>> Having clarity here would help. Both approaches have their own pros
>> and cons.
>>
>> We also need to explore using/reusing DAMON/MGLRU etc. to the extent
>> possible, based on the approach.
>>
>
> Yeah, I definitely think this is a key point to discuss early on. Gregory
> had indicated that unmapped file cache is one of the key downsides to
> using only virtual memory scanning.
>
> While things like the CHMU are still on the way, I think there's benefit
> to making incremental progress from what we currently have available (NUMA
> Balancing) before we get there.
Agree.
Regards
- Raghu