From: Gregory Price <gourry@gourry.net>
To: David Rientjes <rientjes@google.com>
Cc: Aneesh Kumar <AneeshKumar.KizhakeVeetil@arm.com>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Kirill Shutemov <k.shutemov@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Mel Gorman <mel.gorman@gmail.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
Rik van Riel <riel@surriel.com>,
RaghavendraKT <Raghavendra.KodsaraThimmappa@amd.com>,
Wei Xu <weixugc@google.com>, Suyeon Lee <leesuyeon0506@gmail.com>,
Lei Chen <leillc@google.com>,
"Shukla, Santosh" <santosh.shukla@amd.com>,
"Grimm, Jon" <jon.grimm@amd.com>,
sj@kernel.org, shy828301@gmail.com, Zi Yan <ziy@nvidia.com>,
Liam Howlett <liam.howlett@oracle.com>,
Gregory Price <gregory.price@memverge.com>,
linux-mm@kvack.org
Subject: Re: Slow-tier Page Promotion discussion recap and open questions
Date: Wed, 18 Dec 2024 19:56:19 -0500
Message-ID: <Z2NvM06hMi1MLBbn@gourry-fedora-PF4VCD3F>
In-Reply-To: <6d582bb6-3ba5-1768-92f2-6025340a3cd4@google.com>

On Tue, Dec 17, 2024 at 08:19:56PM -0800, David Rientjes wrote:
> ----->o-----
> Raghu noted the current promotion destination is node 0 by default. Wei
> noted we could get some page owner information to determine things like
> mempolicies or compute the distance between nodes and, if multiple nodes
> have the same distance, choose one of them just as we do for demotions.
>
> Gregory Price noted some downsides to using mempolicies for this based on
> per-task, per-vma, and cross socket policies, so using the kernel's
> memory tiering policies is probably the best way to go about it.
>
Slightly elaborating here:
- In an async context, associating a page with a specific task is not
  presently possible (that I know of). The most we know is the last
  accessing CPU - maybe - in the page/folio struct, and right now even
  that is disabled in favor of a timestamp when tiering is enabled.
  A process with two tasks that access the page may not run those tasks
  on the same socket, so we run the risk of migrating to a bad target.
  Best effort would suggest either socket is fine - since they're
  both "fast nodes" - but this requires that we record the last
  accessing CPU for a page at identification time.
- Even if you could associate the page with a particular task, the task
  and/or cgroup is not guaranteed to have a socket affinity. Obviously
  if it does, that can be used (it just doesn't cover the default
  behavior). Basically, we shouldn't depend on this.
- Per-VMA mempolicies are a potential solution, but they're not very
  common in the wild - software would have to become NUMA-aware and
  use mbind() on particular memory regions. Likewise, we shouldn't
  depend on this either.
- This all holds for future mechanisms like the CXL Hotness Monitoring
  Unit (CHMU), whose access data is even more abstract (no concept of
  an accessing task / cpu / owner at all).
More generally: in an async scanning context it's presently not
possible to identify the optimal promotion node, and it likely is
not possible without userland hints. So we should probably just
leverage static configuration data (HMAT) and some basic math to
pick a promotion target the same way we calculate a demotion target.
Long-winded way of saying I don't think an optimal solution is
possible, so let's start with suboptimal and get data.
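
As a strawman - node_distance(), node_is_toptier(), and the node
iterators are real, everything else here is an invented name - the
"basic math" could look something like:

    /*
     * Sketch only: a hypothetical inverse of next_demotion_node().
     * Pick the nearest CPU-bearing node (by HMAT/SLIT-derived
     * node_distance()) as the promotion target for @node.
     */
    static int next_promotion_node(int node)
    {
            int nid, best = NUMA_NO_NODE, best_dist = INT_MAX;

            if (node_is_toptier(node))
                    return NUMA_NO_NODE;  /* already in the fastest tier */

            for_each_node_state(nid, N_CPU) {
                    int dist = node_distance(node, nid);

                    /* nearest fast node wins; ties go to the first hit */
                    if (dist < best_dist) {
                            best_dist = dist;
                            best = nid;
                    }
            }
            return best;
    }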
> ----->o-----
> My takeaways:
>
> - there is a definite need to separate hot page detection and the
> promotion path since hot pages may be derived from multiple sources,
> including hardware assists in the future
>
> - for the hot page tracking itself, a common abstraction to be used that
> can effectively describe hotness regardless of the backend it is
> deriving its information from would likely be quite useful
>
In a synchronous context (accessing task), something like:

    target_node = numa_node_id();  /* node we're currently running on */
    promote_pagevec(vec, target_node, PROMOTE_DEFER);

where the promotion logic then does something like:

    promote_batch(pagevec, target_node);

In an asynchronous context (scanning task), something like:

    promote_pagevec(vec, NUMA_NO_NODE, PROMOTE_DEFER);

where the promotion logic then does something like:

    for each folio in pagevec:
        target = memory_tiers_promotion_target(folio_nid(folio));
        promote(folio, target);

Plumbing-wise this can be optimized to gather similarly-located
folios into a sub-pagevec and use promote_batch() semantics.
My gut says this is the best we're going to get, since async contexts
can't identify accessor locations easily (especially CHMU).
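
In (equally hypothetical) C, the API shape would be roughly as
follows - struct folio_batch, folio_batch_count(), folio_nid(), and
NUMA_NO_NODE are real; the rest is invented for illustration:

    /*
     * Illustrative sketch only: promote_folio(), next_promotion_node(),
     * and PROMOTE_DEFER are made-up names for the shape described above.
     */
    void promote_pagevec(struct folio_batch *fbatch, int target, int flags)
    {
            unsigned int i;

            for (i = 0; i < folio_batch_count(fbatch); i++) {
                    struct folio *folio = fbatch->folios[i];
                    int nid = target;

                    /* async callers pass NUMA_NO_NODE: derive from tiers */
                    if (nid == NUMA_NO_NODE)
                            nid = next_promotion_node(folio_nid(folio));

                    if (nid != NUMA_NO_NODE)
                            promote_folio(folio, nid);  /* or batch per-nid */
            }
    }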
> - I think virtual memory scanning is likely the only viable approach for
Hard disagree. Virtual memory scanning misses an entire class of
memory: unmapped file cache.
https://lore.kernel.org/linux-mm/20241210213744.2968-1-gourry@gourry.net/
> this purpose and we could store state in the underlying struct page,
This is contentious. Look at folio->_last_cpupid for context: we're
already overloading fields in subtle ways to steal a 32-bit area.
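
Roughly (simplified from the real struct definitions):

    /*
     * Simplified from struct page/folio: one 32-bit slot, two meanings.
     * (In the real kernel this is only a separate field when it doesn't
     * fit in page->flags - see LAST_CPUPID_NOT_IN_PAGE_FLAGS.)
     */
    int _last_cpupid;   /* last accessing cpu+pid - OR, when
                         * NUMA_BALANCING_MEMORY_TIERING is enabled,
                         * a coarse access timestamp instead (see
                         * folio_xchg_access_time()).
                         */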
> similar to NUMA Balancing, but that all scanning should be driven by
> walking the mm_struct's to harvest the Accessed bit
>
If the goal is to do multi-tenant tiering (i.e. many mm_struct's), then
this scales poorly by design.
Elsewhere, folks agreed that CXL-memory will have HMU-driven hotness
data as the primary mechanism. This is a physical-memory hotness tracking
mechanism that avoids scanning page tables or page structs.
If we think that's the direction things are going, then we shouldn't
invest a ton of effort into a virtual-memory-driven design as the
primary user. (Sure, support it, but don't dive too much further in.)
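
To make that concrete - everything below is invented, since no CHMU
driver exists yet, though pfn_folio() and folio_isolate_lru() are real:

    /*
     * Invented sketch of consuming physical-address hotness records:
     * there is no task, cpu, or mm in sight - just a pfn and a count.
     * Refcounting and validity checks elided for brevity.
     */
    struct hotness_record {
            unsigned long pfn;
            unsigned int count;     /* device-reported access count */
    };

    static void consume_hotness(struct hotness_record *recs, int nr)
    {
            LIST_HEAD(promote_list);
            int i;

            for (i = 0; i < nr; i++) {
                    struct folio *folio = pfn_folio(recs[i].pfn);

                    if (folio_isolate_lru(folio))
                            list_add(&folio->lru, &promote_list);
            }
            /* then migrate_pages() toward a per-folio promotion target */
    }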
> - if there is any general pushback on leveraging a kthread for this,
> this would be very good feedback to have early
>
I think for the promotion system, having one or more kthreads based on
promotion pressure is a good idea.
I'm not sure how well this will scale for many-process, high-memory
systems: at 256MB scanned per interval, a 1TB+ system needs ~4096
intervals for one full pass, so the resulting accuracy is very low.
Need more data!
~Gregory