From: Gregory Price <gourry@gourry.net>
To: David Rientjes <rientjes@google.com>
Cc: Aneesh Kumar <AneeshKumar.KizhakeVeetil@arm.com>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Kirill Shutemov <k.shutemov@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Mel Gorman <mel.gorman@gmail.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
Rik van Riel <riel@surriel.com>,
RaghavendraKT <Raghavendra.KodsaraThimmappa@amd.com>,
Wei Xu <weixugc@google.com>, Suyeon Lee <leesuyeon0506@gmail.com>,
Lei Chen <leillc@google.com>,
"Shukla, Santosh" <santosh.shukla@amd.com>,
"Grimm, Jon" <jon.grimm@amd.com>,
sj@kernel.org, shy828301@gmail.com, Zi Yan <ziy@nvidia.com>,
Liam Howlett <liam.howlett@oracle.com>,
Gregory Price <gregory.price@memverge.com>,
linux-mm@kvack.org
Subject: Re: Slow-tier Page Promotion discussion recap and open questions
Date: Wed, 18 Dec 2024 19:56:19 -0500
Message-ID: <Z2NvM06hMi1MLBbn@gourry-fedora-PF4VCD3F>
In-Reply-To: <6d582bb6-3ba5-1768-92f2-6025340a3cd4@google.com>

On Tue, Dec 17, 2024 at 08:19:56PM -0800, David Rientjes wrote:
> ----->o-----
> Raghu noted the current promotion destination is node 0 by default. Wei
> noted we could get some page owner information to determine things like
> mempolicies or compute the distance between nodes and, if multiple nodes
> have the same distance, choose one of them just as we do for demotions.
>
> Gregory Price noted some downsides to using mempolicies for this based on
> per-task, per-vma, and cross socket policies, so using the kernel's
> memory tiering policies is probably the best way to go about it.
>
Slightly elaborating here:
- In an async context, associating a page with a specific task is not
  presently possible (that I know of). The most we know is the last
  accessing CPU - maybe - in the page/folio struct, and right now even
  that is disabled in favor of a timestamp when tiering is enabled.
  A process with two tasks that access the page may not run those tasks
  on the same socket, so we run the risk of migrating to a bad target.
  Best effort would suggest either socket is fine - since they're
  both "fast nodes" - but this requires that we record the last
  accessing CPU for a page at identification time.
- Even if you could associate the page with a particular task, the task
  and/or cgroup is not guaranteed to have a socket affinity. Obviously
  if it does, that can be used (it just doesn't cover the default
  behavior). Basically, we shouldn't depend on this.
- Per-VMA mempolicies are a potential solution, but they're not very
  common in the wild - software would have to become NUMA-aware and
  use mbind() on particular memory regions. Likewise, we shouldn't
  depend on this either.
- This all holds for future mechanisms like the CXL Hotness Monitoring
  Unit (CHMU), whose access data is even more abstract (no concept of
  an accessing task / cpu / owner at all).
More generally: in an async scanning context it's presently not
possible to identify the optimal promotion node, and it likely is
not possible without userland hints. So we should probably just
leverage static configuration data (HMAT) and some basic math to
pick a promotion target the same way we calculate a demotion target.
Long-winded way of saying I don't think an optimal solution is
possible, so let's start with suboptimal and get data.
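
As a strawman - node_distance(), node_is_toptier(), and the node
iterators are real, everything else here is an invented name - the
"basic math" could look something like:

    /*
     * Sketch only: a hypothetical inverse of next_demotion_node().
     * Pick the nearest CPU-bearing node (by HMAT/SLIT-derived
     * node_distance()) as the promotion target for @node.
     */
    static int next_promotion_node(int node)
    {
            int nid, best = NUMA_NO_NODE, best_dist = INT_MAX;

            if (node_is_toptier(node))
                    return NUMA_NO_NODE;  /* already in the fastest tier */

            for_each_node_state(nid, N_CPU) {
                    int dist = node_distance(node, nid);

                    /* nearest fast node wins; ties go to the first hit */
                    if (dist < best_dist) {
                            best_dist = dist;
                            best = nid;
                    }
            }
            return best;
    }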
> ----->o-----
> My takeaways:
>
> - there is a definite need to separate hot page detection and the
> promotion path since hot pages may be derived from multiple sources,
> including hardware assists in the future
>
> - for the hot page tracking itself, a common abstraction to be used that
> can effectively describe hotness regardless of the backend it is
> deriving its information from would likely be quite useful
>
In a synchronous context (accessing task), something like:

    target_node = numa_node_id();  /* node we're currently running on */
    promote_pagevec(vec, target_node, PROMOTE_DEFER);

where the promotion logic then does something like:

    promote_batch(pagevec, target_node);

In an asynchronous context (scanning task), something like:

    promote_pagevec(vec, NUMA_NO_NODE, PROMOTE_DEFER);

where the promotion logic then does something like:

    for each folio in pagevec:
        target = memory_tiers_promotion_target(folio_nid(folio));
        promote(folio, target);

Plumbing-wise this can be optimized to gather similarly-located
folios into a sub-pagevec and use promote_batch() semantics.
My gut says this is the best we're going to get, since async contexts
can't identify accessor locations easily (especially CHMU).
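
In (equally hypothetical) C, the API shape would be roughly as
follows - struct folio_batch, folio_batch_count(), folio_nid(), and
NUMA_NO_NODE are real; the rest is invented for illustration:

    /*
     * Illustrative sketch only: promote_folio(), next_promotion_node(),
     * and PROMOTE_DEFER are made-up names for the shape described above.
     */
    void promote_pagevec(struct folio_batch *fbatch, int target, int flags)
    {
            unsigned int i;

            for (i = 0; i < folio_batch_count(fbatch); i++) {
                    struct folio *folio = fbatch->folios[i];
                    int nid = target;

                    /* async callers pass NUMA_NO_NODE: derive from tiers */
                    if (nid == NUMA_NO_NODE)
                            nid = next_promotion_node(folio_nid(folio));

                    if (nid != NUMA_NO_NODE)
                            promote_folio(folio, nid);  /* or batch per-nid */
            }
    }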
> - I think virtual memory scanning is likely the only viable approach for
Hard disagree. Virtual memory scanning misses an entire class of
memory: unmapped file cache.
https://lore.kernel.org/linux-mm/20241210213744.2968-1-gourry@gourry.net/
> this purpose and we could store state in the underlying struct page,
This is contentious. Look at folio->_last_cpupid for context: we're
already overloading fields in subtle ways to steal a 32-bit area.
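
Roughly (simplified from the real struct definitions):

    /*
     * Simplified from struct page/folio: one 32-bit slot, two meanings.
     * (In the real kernel this is only a separate field when it doesn't
     * fit in page->flags - see LAST_CPUPID_NOT_IN_PAGE_FLAGS.)
     */
    int _last_cpupid;   /* last accessing cpu+pid - OR, when
                         * NUMA_BALANCING_MEMORY_TIERING is enabled,
                         * a coarse access timestamp instead (see
                         * folio_xchg_access_time()).
                         */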
> similar to NUMA Balancing, but that all scanning should be driven by
> walking the mm_struct's to harvest the Accessed bit
>
If the goal is to do multi-tenant tiering (i.e. many mm_struct's), then
this scales poorly by design.
Elsewhere, folks agreed that CXL-memory will have HMU-driven hotness
data as the primary mechanism. This is a physical-memory hotness tracking
mechanism that avoids scanning page tables or page structs.
If we think that's the direction things are going, then we shouldn't
invest a ton of effort into a virtual-memory-driven design as the
primary user. (Sure, support it, but don't dive too much further in.)
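
To make that concrete - everything below is invented, since no CHMU
driver exists yet, though pfn_folio() and folio_isolate_lru() are real:

    /*
     * Invented sketch of consuming physical-address hotness records:
     * there is no task, cpu, or mm in sight - just a pfn and a count.
     * Refcounting and validity checks elided for brevity.
     */
    struct hotness_record {
            unsigned long pfn;
            unsigned int count;     /* device-reported access count */
    };

    static void consume_hotness(struct hotness_record *recs, int nr)
    {
            LIST_HEAD(promote_list);
            int i;

            for (i = 0; i < nr; i++) {
                    struct folio *folio = pfn_folio(recs[i].pfn);

                    if (folio_isolate_lru(folio))
                            list_add(&folio->lru, &promote_list);
            }
            /* then migrate_pages() toward a per-folio promotion target */
    }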
> - if there is any general pushback on leveraging a kthread for this,
> this would be very good feedback to have early
>
I think for the promotion system, having one or more kthreads based on
promotion pressure is a good idea.
I'm not sure how well this will scale for many-process, high-memory
systems: at 256MB scanned per interval, a 1TB+ system needs ~4096
intervals for one full pass, so the resulting accuracy is very low.
Need more data!
~Gregory