* [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
@ 2025-02-01 13:29 Hyeonggon Yoo
2025-02-01 14:04 ` Matthew Wilcox
2025-02-04 9:59 ` David Hildenbrand
0 siblings, 2 replies; 27+ messages in thread
From: Hyeonggon Yoo @ 2025-02-01 13:29 UTC (permalink / raw)
To: lsf-pc, linux-mm; +Cc: linux-cxl, Byungchul Park, Honggyu Kim
Hi,

Byungchul and I would like to propose a topic on the performance impact of
kernel allocations placed on CXL memory.

As CXL-enabled servers and memory devices are being developed, CXL-capable
hardware is expected to keep emerging over the coming years.

The Linux kernel supports hot-plugging CXL memory via the dax/kmem
functionality. Depending on the hot-plug policy, the hot-plugged memory
either admits unmovable kernel allocations (ZONE_NORMAL) or is restricted
to movable allocations only (ZONE_MOVABLE).

Recently, Byungchul and I observed a measurable performance degradation with
memhp_default_state=online compared to memhp_default_state=online_movable
on a server with a 1:2 DRAM-to-CXL memory capacity ratio, running the
llama.cpp workload with the default mempolicy. The workload performs LLM
inference and pressures the memory subsystem due to its large working set.

Obviously, allowing kernel allocations from CXL memory degrades performance,
because kernel memory such as page tables, kernel stacks, and slab objects
is accessed frequently and may end up in physical memory with significantly
higher access latency.

However, as far as I can tell there are at least two reasons why we need to
support ZONE_NORMAL for CXL memory (please add more if there are any):

1. When hot-plugging a huge amount of CXL memory, the struct page array for
   it might not fit into DRAM (see the rough estimate after this list)
   -> This could be relaxed with memmap_on_memory

2. To hot-unplug CXL memory, pages in CXL memory have to be migrated to DRAM,
   which means that sometimes some portion of CXL memory should be
   ZONE_NORMAL.
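
To give a sense of the scale behind reason 1, here is a back-of-the-envelope
estimate of the memmap (struct page array) footprint as a small userspace
program. The 64-byte struct page size and 4 KiB base page size are
assumptions that hold on typical 64-bit configurations, not guarantees:

    #include <stdio.h>

    int main(void)
    {
            /* Assumed values: 4 KiB base pages, 64-byte struct page. */
            const unsigned long long page_size   = 4096;
            const unsigned long long memmap_each = 64;
            const unsigned long long cxl_bytes   = 1ULL << 40; /* 1 TiB of CXL */

            unsigned long long pages      = cxl_bytes / page_size;
            unsigned long long memmap_mib = pages * memmap_each >> 20;

            /* Prints 16384 MiB, i.e. roughly 1.6% of the hot-plugged capacity. */
            printf("memmap for 1 TiB of CXL: %llu MiB\n", memmap_mib);
            return 0;
    }

At that rate, each terabyte of hot-plugged CXL memory needs roughly 16 GiB
of struct page, which is the overhead that memmap_on_memory places on the
hot-plugged range itself instead of DRAM.
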
So there are certain cases where we want CXL memory to include ZONE_NORMAL,
but doing so also degrades performance if we allow _all_ kinds of kernel
allocations to be served from CXL memory.

For ideal performance, it would be beneficial to either:

1) Restrict certain types of kernel memory (e.g. page tables, kernel
   stacks, slabs) from being allocated on the slow tier, or
2) Allow migrating certain types of kernel memory from the slow tier to
   the fast tier.

At LSF/MM/BPF, I would like to discuss potential directions for addressing
this problem, enabling CXL memory while minimizing its performance
degradation.

Restricting certain types of kernel allocations from slow tier
==============================================================
We could restrict some kernel allocations to the fast tier by passing a
nodemask to __alloc_pages() (with only fast-tier nodes set) or by using a
GFP flag like __GFP_FAST_TIER that does the same thing. This prevents
kernel allocations from being served from the slow tier and thus avoids
the performance degradation caused by the high access latency of CXL.
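
As a rough illustration of how such a restriction could look, the fragment
below builds a nodemask of fast-tier nodes and hands it to the page
allocator. This is a sketch only: __GFP_FAST_TIER does not exist,
node_is_toptier() is the existing memory-tiering helper, and the
__alloc_pages() signature shown here varies between kernel versions.

    /* Sketch: restrict an allocation to top-tier (DRAM) nodes. */
    static struct page *alloc_pages_fast_tier(gfp_t gfp, unsigned int order,
                                              int preferred_nid)
    {
            nodemask_t fast_tier = NODE_MASK_NONE;
            int nid;

            /* Collect the memory nodes that belong to the top (fastest) tier. */
            for_each_node_state(nid, N_MEMORY)
                    if (node_is_toptier(nid))
                            node_set(nid, &fast_tier);

            /* Fall back to an unrestricted allocation if nothing qualifies. */
            if (nodes_empty(fast_tier))
                    return __alloc_pages(gfp, order, preferred_nid, NULL);

            return __alloc_pages(gfp, order, preferred_nid, &fast_tier);
    }

A __GFP_FAST_TIER flag would essentially push the same nodemask restriction
into the allocator itself, so callers such as page table or kernel stack
allocation would only need to pass one extra flag.
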
However, binding all leaf page tables to the fast tier might not be ideal
due to 1) increased latency from premature reclamation and 2) premature
OOM kills [1].

Migrating certain types of kernel allocations from slow to fast tier
====================================================================
Rather than binding kernel allocations to fast tier and causing premature
reclamation & OOM kill, policies for migrating kernel pages may be more
effective, such as:
- Migrating page tables to fast tier,
  triggered by data-page promotion [1]
- Migrating to fast tier when there is low memory pressure:
  - Migrating slab movable objects [2]
  - Migrating kernel stacks (if that's feasible)
although this sounds more intrusive, and we would need to think about
robust policies that do not degrade existing traditional memory systems.
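
For reference, non-LRU kernel pages can already be migrated through the
movable_operations interface (used today by zsmalloc, z3fold and the balloon
driver), so "migratable kernel memory" would mean adding new users of an
existing mechanism rather than inventing a new one. The fragment below is a
minimal sketch of what such a user provides; the callback bodies are
placeholders, and the exact signatures may differ between kernel versions.

    /* Sketch of a movable_operations user for some kernel object. */
    static bool kobj_isolate_page(struct page *page, isolate_mode_t mode)
    {
            /* Block new references to the object while it is being moved. */
            return true;
    }

    static int kobj_migrate_page(struct page *dst, struct page *src,
                                 enum migrate_mode mode)
    {
            /* Copy the contents and repoint every reference from src to dst. */
            return MIGRATEPAGE_SUCCESS;
    }

    static void kobj_putback_page(struct page *page)
    {
            /* Migration failed or was aborted; make the object usable again. */
    }

    static const struct movable_operations kobj_mops = {
            .isolate_page   = kobj_isolate_page,
            .migrate_page   = kobj_migrate_page,
            .putback_page   = kobj_putback_page,
    };

    /* At allocation time: __SetPageMovable(page, &kobj_mops); */

The hard part is entirely in the callbacks: every reference to the old page
has to be found and updated during migration, which is what makes objects
like page tables or struct page itself so much harder than zsmalloc's
handle-indirected allocations.
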
Any opinions will be appreciated.
Thanks!
[1] https://dl.acm.org/doi/10.1145/3459898.3463907
[2] https://lore.kernel.org/linux-mm/20190411013441.5415-1-tobin@kernel.org

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Matthew Wilcox @ 2025-02-01 14:04 UTC (permalink / raw)
To: Hyeonggon Yoo; +Cc: lsf-pc, linux-mm, linux-cxl, Byungchul Park, Honggyu Kim

On Sat, Feb 01, 2025 at 10:29:23PM +0900, Hyeonggon Yoo wrote:
> The Linux kernel supports hot-plugging CXL memory via the dax/kmem
> functionality. Depending on the hot-plug policy, the hot-plugged memory
> either admits unmovable kernel allocations (ZONE_NORMAL) or is restricted
> to movable allocations only (ZONE_MOVABLE).

This all seems like a grand waste of time.  Don't do that.  Don't allow
kernel allocations from CXL at all.  Don't build systems that have
vast quantities of CXL memory (or if you do, expose it as really fast
swap, not as memory).

All of the CXL topics I see this year are "It really hurts performance
when ..." and my reaction is "Yes, I told you it would hurt and you did
it anyway".  Just stop doing it.  CXL is this decade's Infiniband / ATM /
(name your favourite misguided dead technology here).  You can't stop
other people from doing foolish things, but you don't have to join in.
And we don't have to take stupid patches.

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Hyeonggon Yoo @ 2025-02-01 15:13 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: lsf-pc, linux-mm, linux-cxl, Byungchul Park, Honggyu Kim

On Sat, Feb 1, 2025 at 11:04 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Sat, Feb 01, 2025 at 10:29:23PM +0900, Hyeonggon Yoo wrote:
> [...snip...]
>
> This all seems like a grand waste of time.  Don't do that.  Don't allow
> kernel allocations from CXL at all.  Don't build systems that have
> vast quantities of CXL memory (or if you do, expose it as really fast
> swap, not as memory).
>
> All of the CXL topics I see this year are "It really hurts performance
> when ..." and my reaction is "Yes, I told you it would hurt and you did
> it anyway".  Just stop doing it.  CXL is this decade's Infiniband / ATM
> / (name your favourite misguided dead technology here).

Hi, Matthew. Thank you for sharing your opinion.

I don't want to introduce too much complexity to MM due to CXL madness either,
but I think at least we need to guide users who buy CXL hardware to avoid
doing stupid things.

My initial subject was "Clearly documenting the use cases of
memhp_default_state=online{,_kernel}" because at first glance it seemed
usable for allowing kernel allocations from CXL, which turned out not to be
the case after some evaluation.

So there are a few questions from my side:
- Why do we support onlining CXL memory as ZONE_NORMAL then?
- Can we remove the feature completely?
- Or shouldn't we at least warn users adequately about it in the
  documentation?

I genuinely don't want to see users misusing it either.

Best,
Hyeonggon

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-01 16:30 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: Matthew Wilcox, lsf-pc, linux-mm, linux-cxl, Byungchul Park, Honggyu Kim

On Sun, Feb 02, 2025 at 12:13:23AM +0900, Hyeonggon Yoo wrote:
> On Sat, Feb 1, 2025 at 11:04 PM Matthew Wilcox <willy@infradead.org> wrote:
> > This all seems like a grand waste of time.  Don't do that.  Don't allow
> > kernel allocations from CXL at all.  Don't build systems that have
> > vast quantities of CXL memory (or if you do, expose it as really fast
> > swap, not as memory).
>
> [...snip...]
>
> My initial subject was "Clearly documenting the use cases of
> memhp_default_state=online{,_kernel}" because at first glance it seemed
> usable for allowing kernel allocations from CXL, which turned out not to be
> the case after some evaluation.

This was the motivation for implementing the build-time switch for
memhp_default_state.  Distros and builders can now have flexibility
to make this their default policy for hotplug memory blocks.

https://lore.kernel.org/linux-mm/20241226182918.648799-1-gourry@gourry.net/

I don't normally agree with Willy's hard takes on CXL, but I do agree
that it's generally not fit for kernel use - and I share general skepticism
that movement-based tiering is fundamentally better than reclaim/swap
semantics (though I have been convinced otherwise in some scenarios,
and I think some clear performance benefits in many scenarios are lost
by treating it as super-fast-swap).

Rather than ask whether we can make portions of the kernel more amenable
to movable allocations, I think it's more beneficial to focus on whether
we can reduce the ZONE_NORMAL cost of ZONE_MOVABLE capacity.  That seems
(to me) like the actual crux of this particular issue.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Matthew Wilcox @ 2025-02-01 18:48 UTC (permalink / raw)
To: Gregory Price
Cc: Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Byungchul Park, Honggyu Kim

On Sat, Feb 01, 2025 at 11:30:24AM -0500, Gregory Price wrote:
> Rather than ask whether we can make portions of the kernel more amenable
> to movable allocations, I think it's more beneficial to focus on whether
> we can reduce the ZONE_NORMAL cost of ZONE_MOVABLE capacity.  That seems
> (to me) like the actual crux of this particular issue.

We can!  This is actually the topic of the talk I'm giving at FOSDEM in
about 15 hours' time.

https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/

I'm just going to run through my slides and upload them in an hour or so.
My motivation isn't CXL related, but it's the sign of a good project that
it solves some unrelated problems.

Short version: we can halve the cost this year, halve it again in 2026
with a fairly manageable amount of work, and maybe halve it a third time
(for a total reduction of 7/8) with a lot more work in 2027.  Further
reductions beyond that are possible, but will need a lot more work.  Some
of that work we want to do anyway, regardless of whether the reduction in
overhead from 16MB/GB to 2MB/GB is sufficient.

... or we'll discover the performance effect is negative and shelve the
reduction in memmap size, having only accomplished a massive cleanup of
kernel data structures.  Which would be sad.

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Dan Williams @ 2025-02-03 22:09 UTC (permalink / raw)
To: Gregory Price, Hyeonggon Yoo
Cc: Matthew Wilcox, lsf-pc, linux-mm, linux-cxl, Byungchul Park, Honggyu Kim

Gregory Price wrote:
> [...snip...]
>
> I don't normally agree with Willy's hard takes on CXL, but I do agree
> that it's generally not fit for kernel use - and I share general skepticism
> that movement-based tiering is fundamentally better than reclaim/swap
> semantics (though I have been convinced otherwise in some scenarios,
> and I think some clear performance benefits in many scenarios are lost
> by treating it as super-fast-swap).

It is also the case that CXL topologies enumerate their performance
characteristics; "CXL" is not a latency characteristic unto itself.  For
example, like "PCI", "CXL" by itself does not imply a performance
profile.  You could have CPU attached DDR that presents as a "CXL"
enumerated device just to take advantage of now standardized RAS
interfaces.

Unless and until this whole heterogeneous memory experiment fails, all
the kernel can do is give userspace the ability to include/exclude
memory ranges that are marked as outside the default pool.  That is what
EFI_MEMORY_SP is all about, to set aside: too precious for the default
pool => HBM, or too slow for the default pool => potentially CXL and
PMEM.

A kernel default policy, or better yet distribution policy, that more
aggressively excludes CXL memory based on its relative performance to
the default pool would be a welcome improvement.

> Rather than ask whether we can make portions of the kernel more amenable
> to movable allocations, I think it's more beneficial to focus on whether
> we can reduce the ZONE_NORMAL cost of ZONE_MOVABLE capacity.  That seems
> (to me) like the actual crux of this particular issue.

Yes, I like this line of thinking.  Even if CXL attached memory struggles
to graduate out of cold-memory tier use cases, that struggle can yield
other general improvements that are welcome independent of CXL.

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-07 7:20 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Honggyu Kim, kernel_team

On Sat, Feb 01, 2025 at 02:04:17PM +0000, Matthew Wilcox wrote:
> This all seems like a grand waste of time.  Don't do that.  Don't allow
> kernel allocations from CXL at all.  Don't build systems that have
> vast quantities of CXL memory (or if you do, expose it as really fast
> swap, not as memory).
>
> [...snip...]
>
> You can't stop other people from doing foolish things, but you don't
> have to join in.  And we don't have to take stupid patches.

Hyeonggon and I described the topic based on what we observed in a CXL
memory environment, but fundamentally it is not only a CXL memory issue;
it is also a heterogeneous memory or ZONE_NORMAL cost issue, as you and
others mentioned.  Lemme clarify it.

<general mm issue>

1. Allow kernel objects to be movable:
   a. ZONE_NORMAL cost will be reduced. (less reclaim and oom)
   b. ZONE_NORMAL covers a bigger whole memory.
   c. A smaller ZONE_NORMAL is sufficient.
   d. Need additional consideration about when (or what) to move.

2. Never allow kernel objects to be movable:
   a. ZONE_NORMAL cost stays high. (premature reclaim and oom)
   b. ZONE_NORMAL covers a smaller whole memory.
   c. A bigger ZONE_NORMAL is required.

<heterogeneous memory specific issue>

3. Allow ZONE_NORMAL in non-DRAM:
   a. Mitigates ZONE_NORMAL cost. (less reclaim and oom)
   b. Followed by e.g. the hot-unplug issue.
   c. Option 1: No restriction on the ZONE_NORMAL size.
   d. Option 2: Restrict the size as a budget to cover its capacity.
   e. Option 3: ?

4. Never allow ZONE_NORMAL in non-DRAM:
   a. ZONE_NORMAL cost should be low enough to cover non-DRAM too.
   b. Any efforts to reduce ZONE_NORMAL cost should be welcome.
   c. Matthew's work would mitigate the cost.
   d. Allowing kernel objects to be movable would work for it too.

Plus, I think Matthew's effort to reduce ZONE_NORMAL cost is amazing and
I hope it succeeds.  However, ZONE_NORMAL cost can be reduced in many
ways and all the efforts can be considered meaningful.  We can start with
the easiest objects, e.g. page tables, struct page, and kernel stacks,
and move on to harder ones, while the struct page cost is being reduced
by Matthew's work at the same time.

When it comes to this topic, the most important thing is a collective
*direction* from the community so that we can start the work under that
*direction*.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-07 8:57 UTC (permalink / raw)
To: Byungchul Park
Cc: Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Honggyu Kim, kernel_team

On Fri, Feb 07, 2025 at 04:20:24PM +0900, Byungchul Park wrote:
> We can start with the easiest objects, e.g. page tables,

It's more efficient and easier to change page sizes than it is to make
page tables migratable.

It's also easier to reclaim cold pages eating up significantly more
memory than the page table (which describes pages at ~8 bytes per page).

Also, there's quite a bit of literature that shows page tables landing
on remote nodes (cross-socket) has negative performance impacts.
Putting them on CXL makes the problem worse.

> struct page,

`struct page` is a structure that describes a physically addressed page.

It is common to access it by simply doing `pfn_to_page()`, which is a
fairly simple conversion (a bit more complex in sparsemem w/ sections).

This is used in a lockless manner to acquire page references all over
the kernel.

Making that migratable is... ambitious, to say the least.

> and kernel stacks,

The default kernel stack size is like 16kb.  You'd need like 100,000
threads to eat up 1.5GB, and 2048 threads only eats like 32MB.

It's not an interesting amount of memory if you have a 20TB system.

> When it comes to this topic, the most important thing is a collective
> *direction* from the community so that we can start the work under that
> *direction*.

My thoughts here are that memory tiering is the wrong tool for the
problem you are trying to solve.

Maybe there's a world in which we propose a ZONE_MEMDESC which is
exclusively used for `struct page` for a node.

At least then you could design CXL capacities *around* that.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-07 9:27 UTC (permalink / raw)
To: Byungchul Park
Cc: Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Honggyu Kim, kernel_team

On Fri, Feb 07, 2025 at 03:57:45AM -0500, Gregory Price wrote:
> My thoughts here are that memory tiering is the wrong tool for the
> problem you are trying to solve.
>
> Maybe there's a world in which we propose a ZONE_MEMDESC which is
> exclusively used for `struct page` for a node.
>
> At least then you could design CXL capacities *around* that.

Dumb question time:

Is this maybe not an entirely horrible idea?  Even at 16-byte page
structs we use 4GB-per-1TB of capacity.

Maybe a memory device providing additional capacity SHOULD be made
(given the option?) to service its own struct pages - but maintain some
control over hot-plug-ability?  At least it could tear down all the
ZONE_MOVABLE regions and finally release the MEMDESC region when
finished.

Seems too obvious to have not been proposed already. :shrug:

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Honggyu Kim @ 2025-02-07 9:34 UTC (permalink / raw)
To: Gregory Price, Byungchul Park
Cc: kernel_team, Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl

Hi Gregory,

On 2/7/2025 5:57 PM, Gregory Price wrote:
[...snip...]
>> and kernel stacks,
>
> The default kernel stack size is like 16kb.  You'd need like 100,000
> threads to eat up 1.5GB, and 2048 threads only eats like 32MB.
>
> It's not an interesting amount of memory if you have a 20TB system.

The amount might be small, but having that data in the slow tier can
cause performance degradation if it is heavily accessed.

The number of accesses isn't linearly correlated to the size of the
memory region.

Thanks,
Honggyu

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-07 9:54 UTC (permalink / raw)
To: Honggyu Kim
Cc: Byungchul Park, kernel_team, Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl

On Fri, Feb 07, 2025 at 06:34:43PM +0900, Honggyu Kim wrote:
> On 2/7/2025 5:57 PM, Gregory Price wrote:
> > The default kernel stack size is like 16kb.  You'd need like 100,000
> > threads to eat up 1.5GB, and 2048 threads only eats like 32MB.
> >
> > It's not an interesting amount of memory if you have a 20TB system.
>
> The amount might be small, but having that data in the slow tier can
> cause performance degradation if it is heavily accessed.
>
> The number of accesses isn't linearly correlated to the size of the
> memory region.

Right, I started by saying:

  [CXL is] "generally not fit for kernel use"

I have the opinion that CXL memory should be defaulted to ZONE_MOVABLE,
but I understand the pressure on ZONE_NORMAL means this may not be
possible for large capacities.

I don't think the solution is to make kernel memory migratable and allow
kernel allocations on CXL.

There's a reason most kernel allocations are not swappable.

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-07 10:49 UTC (permalink / raw)
To: Gregory Price
Cc: Honggyu Kim, kernel_team, Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl

On Fri, Feb 07, 2025 at 04:54:10AM -0500, Gregory Price wrote:
> Right, I started by saying:
>
>   [CXL is] "generally not fit for kernel use"
>
> I have the opinion that CXL memory should be defaulted to ZONE_MOVABLE,
> but I understand the pressure on ZONE_NORMAL means this may not be
> possible for large capacities.

Just to clarify, for moderate capacities where ZONE_NORMAL in DRAM covers
the whole memory, it's not a big issue, since the easiest solution would
be to place kernel objects in DRAM's ZONE_NORMAL and not allow
ZONE_NORMAL in non-DRAM.  No objection to that.

For large capacities, kernel object migratability might be a must.

For capacities between moderate and large, kernel object migratability,
allowing ZONE_NORMAL in non-DRAM, or some better idea, would help to
reduce the ZONE_NORMAL cost.

I'm adding my opinion with the last two cases in mind.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Harry (Hyeonggon) Yoo @ 2025-02-10 2:33 UTC (permalink / raw)
To: Gregory Price
Cc: Honggyu Kim, Byungchul Park, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Fri, Feb 07, 2025 at 04:54:10AM -0500, Gregory Price wrote:
> Right, I started by saying:
>
>   [CXL is] "generally not fit for kernel use"
>
> I have the opinion that CXL memory should be defaulted to ZONE_MOVABLE,

Agreed, when the ratio of slow to fast capacity makes it feasible.

> but I understand the pressure on ZONE_NORMAL means this may not be
> possible for large capacities.

Yes, this is when we start to consider some ZONE_NORMAL capacity on CXL
memory.

> I don't think the solution is to make kernel memory migratable and allow
> kernel allocations on CXL.

IMHO the relevant questions here are:

Premise: Some ZONE_NORMAL capacity exists on CXL memory
due to its large capacity.

Q1. How aggressively should the kernel avoid serving kernel allocations
    from ZONE_NORMAL in the slow tier (and instead reclaim pages in the
    fast tier)?  e.g.:
    - Only when there's no easily reclaimable memory?
    - Or as a last resort before OOM?
    - Or should certain types of kernel allocations simply not be
      allowed from the slow tier?

Q2. If kernel allocations are made from the slow tier anyway, would it
    be worthwhile to migrate _certain types_ of kernel memory back to
    the fast tier later when free space becomes available?
    (sounds like a promotion policy)

> There's a reason most kernel allocations are not swappable.

Because most kernel allocations cannot be swapped, with a few exceptions.

However, there's non-LRU page migration functionality where kernel
allocations can be migrated.

I don't understand why we shouldn't introduce more kernel movable memory
if that turns out to be beneficial?

--
Harry

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Matthew Wilcox @ 2025-02-10 3:19 UTC (permalink / raw)
To: Harry (Hyeonggon) Yoo
Cc: Gregory Price, Honggyu Kim, Byungchul Park, kernel_team, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 11:33:47AM +0900, Harry (Hyeonggon) Yoo wrote:
> Premise: Some ZONE_NORMAL capacity exists on CXL memory
> due to its large capacity.

I reject your premise.  None of this is inevitable.  Infiniband and ATM
did not become dominant networking technologies.  SOP did not dominate
the storage industry.  Itanium did not become the only CPU architecture
that mattered.

Similarly, CXL is a technically flawed protocol.  Lots of money is being
thrown at making it look inevitable, but fundamentally PCIe is a
high-bandwidth protocol, not a low-latency protocol, and it can't do the
job.

> > There's a reason most kernel allocations are not swappable.
>
> Because most kernel allocations cannot be swapped, with a few exceptions.
>
> However, there's non-LRU page migration functionality where kernel
> allocations can be migrated.
>
> I don't understand why we shouldn't introduce more kernel movable memory
> if that turns out to be beneficial?

Because it's adding complexity for a stupid use-case.  If you can make
the case for making something migratable that's not currently without
using CXL as a justification, then sure, let's do it.  zsmalloc is
migratable, and that makes a lot of sense.  But there's a reason we only
have three movable_operations structs defined in the kernel today.

(also the whole non-LRU page migration needs overhauling to not use
page->lru, but that's a separate matter.  except it's not a separate
matter because that's needed in order to shrink struct page.)

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-10 6:00 UTC (permalink / raw)
To: Harry (Hyeonggon) Yoo
Cc: Honggyu Kim, Byungchul Park, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 11:33:47AM +0900, Harry (Hyeonggon) Yoo wrote:
> Premise: Some ZONE_NORMAL capacity exists on CXL memory
> due to its large capacity.

What you actually need to show to justify increasing the complexity is
(at least - but not limited to)

1) structures you want to migrate are harmful when placed on slow memory
   ex) Is `struct page` on slow mem actually harmful? - no data?
   ex) Are page tables on slow mem actually harmful? - known, yes.

2) The structures cannot be made to take up less space on local tier
   ex) struct page can be shrunk - do that first
   ex) huge-pages can be deployed - do that first

3) the structures take up sufficient space that it matters
   ex) struct page after shrunk might not - do that first
   ex) page tables with multi-sized huge pages may not - do that first

4) Making the structures migratable actually does something useful
   are `struct page` or page tables after #2 and #3 both:
   a) going through hot/cold phases enough to warrant being tiered
   b) hot enough for long enough that migration matters?

You can probably actually (maybe?) collect data on this today - but
you still have to contend with #2 and #3.

> I don't understand why we shouldn't introduce more kernel movable memory
> if that turns out to be beneficial?

No one is going to stop research you want to do.  I'm simply expressing
that I think it's an ill-advised path to take.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-10 7:17 UTC (permalink / raw)
To: Gregory Price
Cc: Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 01:00:02AM -0500, Gregory Price wrote:
> What you actually need to show to justify increasing the complexity is
> (at least - but not limited to)
>
> 1) structures you want to migrate are harmful when placed on slow memory
>    ex) Is `struct page` on slow mem actually harmful? - no data?

Then we can hold this one until it turns out to be harmful, or give up.

>    ex) Are page tables on slow mem actually harmful? - known, yes.

Definitely yes.  What could be next?

> 2) The structures cannot be made to take up less space on local tier
>    ex) struct page can be shrunk - do that first
>    ex) huge-pages can be deployed - do that first

I'm really curious about this.  Is there any reason that we should work
on these in a serialized manner?

> 3) the structures take up sufficient space that it matters
>    ex) struct page after shrunk might not - do that first
>    ex) page tables with multi-sized huge pages may not - do that first

Same.  Should it be serialized?

> 4) Making the structures migratable actually does something useful
>    are `struct page` or page tables after #2 and #3 both:
>    a) going through hot/cold phases enough to warrant being tiered
>    b) hot enough for long enough that migration matters?
>
> You can probably actually (maybe?) collect data on this today - but
> you still have to contend with #2 and #3.

Ah.  You seem to mean those works should be serialized.  Right?  If it
should be for some reason, then it could be sensible.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-10 15:47 UTC (permalink / raw)
To: Byungchul Park
Cc: Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 04:17:41PM +0900, Byungchul Park wrote:
> On Mon, Feb 10, 2025 at 01:00:02AM -0500, Gregory Price wrote:
> > You can probably actually (maybe?) collect data on this today - but
> > you still have to contend with #2 and #3.
>
> Ah.  You seem to mean those works should be serialized.  Right?  If it
> should be for some reason, then it could be sensible.

I'm suggesting that there isn't a strong reason (yet) to consider such a
complicated change.  As Willy has said, it's a fairly fundamental change
for a single reason (CXL), which does not bode well for its acceptance.

Honestly trying to save you some frustration.  It would behoove you to
find stronger reasons (w/ data) or consider different solutions.  Right
now there are stronger, simpler solutions to the ZONE_NORMAL capacity
issue (struct page resize, huge pages) for possible capacities.

I also think someone should actively ask whether `struct page` can be
hosted on remote memory without performance loss.  I may look into this.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Matthew Wilcox @ 2025-02-10 15:55 UTC (permalink / raw)
To: Gregory Price
Cc: Byungchul Park, Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 10:47:58AM -0500, Gregory Price wrote:
> I also think someone should actively ask whether `struct page` can be
> hosted on remote memory without performance loss.  I may look into this.

Given that it contains a refcount and various flags, some of which
are quite hot, I would expect performance to suffer.  It also suffers
contention between different CPUs, so depending on your cache protocol
(can it do cache-to-cache transfers or does it have to be written back
to memory first?) it may perform quite poorly.  But this is something
that can be measured.

Of course, the question must be asked whether we care.  Certainly Intel's
Apache Pass and similar Optane RAM products put the memmap on the 3DXP
because there wasn't enough DRAM to put it there.  So the pages are
slower, but they were slower anyway!  What I always wondered was what
effect it would have on wear.  But that's not a consideration for DRAM
attached via CXL.

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-10 16:06 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Byungchul Park, Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 03:55:47PM +0000, Matthew Wilcox wrote:
> Given that it contains a refcount and various flags, some of which
> are quite hot, I would expect performance to suffer.  It also suffers
> contention between different CPUs, so depending on your cache protocol
> (can it do cache-to-cache transfers or does it have to be written back
> to memory first?) it may perform quite poorly.  But this is something
> that can be measured.
>
> [...snip...]

Well, *if* said memory is intended to host cold(er) data, then we may
find the structures to describe those pages aren't particularly hot or
contended.

This is my suspicion - and I'd rather limit kernel resource allocation
on remote memory than try to move kernel resources around.

Plus this would still enable hot-unplug.  Once all the zone movable
regions are clicked off, the page-desc regions are unused... probably.

Would just be nice to have some concrete data on when greater zone
movable capacity becomes a net-negative.  We're making the assumption
that this occurs fairly early.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-11 1:53 UTC (permalink / raw)
To: Gregory Price
Cc: Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 10:47:58AM -0500, Gregory Price wrote:
> I'm suggesting that there isn't a strong reason (yet) to consider such a
> complicated change.  As Willy has said, it's a fairly fundamental change
> for a single reason (CXL), which does not bode well for its acceptance.

I have observed a performance difference depending on whether page
tables are placed in DRAM or in the slow tier, and that doesn't have to
be CXL memory.  We should place page tables in DRAM as long as possible,
but when that's not possible, we could either reclaim DRAM for them or
temporarily place them in the slow tier and move them to DRAM later for
better performance.

But yes.  If the slow tier is *NEVER* allowed to be huge, then reclaiming
DRAM would always work.  This topic is valid only for the other case.

> Honestly trying to save you some frustration.  It would behoove you to
> find stronger reasons (w/ data) or consider different solutions.  Right
> now there are stronger, simpler solutions to the ZONE_NORMAL capacity
> issue (struct page resize, huge pages) for possible capacities.
>
> I also think someone should actively ask whether `struct page` can be
> hosted on remote memory without performance loss.  I may look into this.

JFYI, struct page, page tables, and kernel stacks were just examples.
Let's exclude the ones that you don't think are feasible.  However, I'd
like to say that at least page tables are an interesting kernel object
for this topic.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Harry Yoo @ 2025-02-21 1:52 UTC (permalink / raw)
To: Gregory Price
Cc: Byungchul Park, Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 10:47:58AM -0500, Gregory Price wrote:
> I'm suggesting that there isn't a strong reason (yet) to consider such a
> complicated change.  As Willy has said, it's a fairly fundamental change
> for a single reason (CXL), which does not bode well for its acceptance.
>
> Honestly trying to save you some frustration.  It would behoove you to
> find stronger reasons (w/ data) or consider different solutions.  Right
> now there are stronger, simpler solutions to the ZONE_NORMAL capacity
> issue (struct page resize, huge pages) for possible capacities.

Hi, apologies for my late reply.  I recently went through a career change.

I truly appreciate your and Matthew's feedback and thank you for saving us
from frustration.  I agree that we need stronger motivation and data to
introduce such a fundamental change.  And I also agree that it's more
appropriate to pursue what can be useful for general MM users rather than
introducing MM changes just for CXL.

With that context, Byungchul and I agree it's a better direction:
reducing the ZONE_NORMAL cost of ZONE_MOVABLE capacity, which is
beneficial for ZONE_MOVABLE users in general, regardless of whether the
user is using CXL memory or not.

Let me organize a few steps to pursue:

- Willy's shrinking struct page project
  - https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/
  - https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
  - Side note: Byungchul started working on separating the descriptor
    of the pagepool bump allocator

- Slab Movable Objects: This makes sense even without CXL, as migrating
  unreclaimable slab will improve the compaction success rate.
  It has also been tried in the past by others, but was suspended due to
  lack of data.

  I'm looking for workloads that allocate a decent amount of unreclaimable
  slab AND perform migration frequently - for evaluation.

I might be missing some projects that could be useful,
please feel free to add if there are any.

And for page table migration, while it might be doable even without CXL,
we need strong data suggesting that it actually makes MM better before
pursuing it.

> I also think someone should actively ask whether `struct page` can be
> hosted on remote memory without performance loss.  I may look into this.

Did you have a chance to look at this?

--
Cheers,
Harry

* [LSF/MM/BPF TOPIC] Gathering ideas to reduce ZONE_NORMAL cost
From: Byungchul Park @ 2025-02-25 4:54 UTC (permalink / raw)
To: Harry Yoo
Cc: Gregory Price, Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Fri, Feb 21, 2025 at 10:52:09AM +0900, Harry Yoo wrote:
> With that context, Byungchul and I agree it's a better direction:
> reducing the ZONE_NORMAL cost of ZONE_MOVABLE capacity, which is
> beneficial for ZONE_MOVABLE users in general, regardless of whether the
> user is using CXL memory or not.
>
> Let me organize a few steps to pursue:
>
> [...snip...]
>
> I might be missing some projects that could be useful,
> please feel free to add if there are any.

So.. Let's change the LSF/MM/BPF topic slightly.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-25 5:06 UTC (permalink / raw)
To: Gregory Price
Cc: Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 10:47:58AM -0500, Gregory Price wrote:
> I also think someone should actively ask whether `struct page` can be
> hosted on remote memory without performance loss.  I may look into this.

Could you share your plan, or what you have been thinking about it?

We'd be happy to discuss this topic together, and furthermore, it'd be
even better to work on it together.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-03-03 15:55 UTC (permalink / raw)
To: Byungchul Park
Cc: Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Tue, Feb 25, 2025 at 02:06:43PM +0900, Byungchul Park wrote:
> On Mon, Feb 10, 2025 at 10:47:58AM -0500, Gregory Price wrote:
> > I also think someone should actively ask whether `struct page` can be
> > hosted on remote memory without performance loss.  I may look into this.
>
> Could you share your plan, or what you have been thinking about it?
>
> We'd be happy to discuss this topic together, and furthermore, it'd be
> even better to work on it together.

Apologies for the late reply, I've been on some R&R.

I haven't written up any specific plan for this, but it is *reasonably*
simple to test with memmap_on_memory.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-07 10:14 UTC (permalink / raw)
To: Gregory Price
Cc: Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Honggyu Kim, kernel_team

On Fri, Feb 07, 2025 at 03:57:45AM -0500, Gregory Price wrote:
> It's more efficient and easier to change page sizes than it is to make
> page tables migratable.

You are misunderstanding.  I didn't say 'do not change page sizes'.  I
didn't say it's easier than changing page sizes.  I said *both* changing
page sizes and making them migratable could reduce the ZONE_NORMAL cost.

> It's also easier to reclaim cold pages eating up significantly more
> memory than the page table (which describes pages at ~8 bytes per page).

Same.  We should keep reclaiming cold pages eating up memory.  Why would
we give up reclaiming cold pages if page tables become migratable?  I
really don't understand why you are trying to exclusively pick up only
one effort for that purpose.

> Also, there's quite a bit of literature that shows page tables landing
> on remote nodes (cross-socket) has negative performance impacts.

Exactly.  That's the motivation to suggest this topic.  That's why we
are asking about kernel object migratability.  Of course, we try our
best to place kernel objects in DRAM in the first place.  However, the
issue arises when that becomes impossible.  It's about the comparison
between 'premature reclaim and die (= oom)' and 'slight degradation of
performance'.

> Putting them on CXL makes the problem worse.

No.  A higher chance to die is worse.

> `struct page` is a structure that describes a physically addressed page.
>
> [...snip...]
>
> Making that migratable is... ambitious, to say the least.

Yes.  I don't think it's easy.

> The default kernel stack size is like 16kb.  You'd need like 100,000
> threads to eat up 1.5GB, and 2048 threads only eats like 32MB.
>
> It's not an interesting amount of memory if you have a 20TB system.

Kernel stacks are just an example.  We can skip them and look for a
better candidate.

> My thoughts here are that memory tiering is the wrong tool for the
> problem you are trying to solve.

I think any valid efforts can be considered at the same time.  Is there
any reason that efforts in a tiering environment should be excluded?

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-10 7:02 UTC (permalink / raw)
To: Gregory Price
Cc: Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Honggyu Kim, kernel_team

On Fri, Feb 07, 2025 at 03:57:45AM -0500, Gregory Price wrote:
> It's more efficient and easier to change page sizes than it is to make
> page tables migratable.
>
> It's also easier to reclaim cold pages eating up significantly more
> memory than the page table (which describes pages at ~8 bytes per page).

Sorry for leaving comments in an excited manner last time.  Lemme focus
on what to consider and how to resolve it:

Case 1. A system with no or little non-DRAM capacity

   You are right.  It'd be easier to reclaim cold pages eating up
   ZONE_NORMAL.  ZONE_NORMAL in DRAM can probably cover the whole memory.

Case 2. A system with a very huge non-DRAM capacity

   ZONE_NORMAL in DRAM might not be able to cover the whole memory.  So
   either allowing ZONE_NORMAL in non-DRAM or allowing some kernel
   objects to be placed in ZONE_MOVABLE would be required.

   If everyone agrees with Matthew - that a system should never be
   equipped with very huge non-DRAM memory - then yes, we might not need
   the discussion.

Case 3. A system with a non-DRAM capacity between little and huge

   ZONE_NORMAL in DRAM might or might not be able to cover the whole
   memory.  A fairly large amount of kernel memory would still be
   required.  Of course, properly reclaiming cold pages eating up
   ZONE_NORMAL in DRAM might work for the purpose.  At the same time,
   any efforts to reduce the ZONE_NORMAL cost would help and mitigate
   the pressure on ZONE_NORMAL in DRAM.  Here, the efforts include e.g.
   reducing the size of kernel objects, making some kernel objects
   migratable, and so on.

   Same.  If this case is also one that Matthew and others think is not
   realistic, then yes, we might not need the discussion.  If not, we
   need to consider the issue.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: David Hildenbrand @ 2025-02-04 9:59 UTC (permalink / raw)
To: Hyeonggon Yoo, lsf-pc, linux-mm; +Cc: linux-cxl, Byungchul Park, Honggyu Kim

On 01.02.25 14:29, Hyeonggon Yoo wrote:
> Hi,
>
> Byungchul and I would like to propose a topic on the performance impact of
> kernel allocations placed on CXL memory.
>
> [...snip...]
>
> However, as far as I can tell there are at least two reasons why we need to
> support ZONE_NORMAL for CXL memory (please add more if there are any):
>
> 1. When hot-plugging a huge amount of CXL memory, the struct page array for
>    it might not fit into DRAM
>    -> This could be relaxed with memmap_on_memory

There are some others, although most are less significant, and I tried
documenting them here:

https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html#zone-movable-sizing-considerations

E.g., a 4 KiB page requires a single PTE (8 bytes) to be mapped into
user space, corresponding to 0.2 %.  At least for anonymous memory,
PMD-sized THPs don't help, because we still have to allocate the page
table to be prepared for a PMD->PTE remapping.

In the worst case, the directmap requires another 0.2 % (but usually, we
rely on PMD mappings).

So that usage depends on how you are intending to use the CXL memory
(e.g., pagecache vs. anonymous memory).

> 2. To hot-unplug CXL memory, pages in CXL memory have to be migrated to DRAM,
>    which means that sometimes some portion of CXL memory should be
>    ZONE_NORMAL.

I don't quite understand that argument for ZONE_NORMAL.

--
Cheers,

David / dhildenb

end of thread, other threads: [~2025-03-03 15:55 UTC | newest]

Thread overview: 27+ messages
2025-02-01 13:29 [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier Hyeonggon Yoo
2025-02-01 14:04 ` Matthew Wilcox
2025-02-01 15:13 ` Hyeonggon Yoo
2025-02-01 16:30 ` Gregory Price
2025-02-01 18:48 ` Matthew Wilcox
2025-02-03 22:09 ` Dan Williams
2025-02-07 7:20 ` Byungchul Park
2025-02-07 8:57 ` Gregory Price
2025-02-07 9:27 ` Gregory Price
2025-02-07 9:34 ` Honggyu Kim
2025-02-07 9:54 ` Gregory Price
2025-02-07 10:49 ` Byungchul Park
2025-02-10 2:33 ` Harry (Hyeonggon) Yoo
2025-02-10 3:19 ` Matthew Wilcox
2025-02-10 6:00 ` Gregory Price
2025-02-10 7:17 ` Byungchul Park
2025-02-10 15:47 ` Gregory Price
2025-02-10 15:55 ` Matthew Wilcox
2025-02-10 16:06 ` Gregory Price
2025-02-11 1:53 ` Byungchul Park
2025-02-21 1:52 ` Harry Yoo
2025-02-25 4:54 ` [LSF/MM/BPF TOPIC] Gathering ideas to reduce ZONE_NORMAL cost Byungchul Park
2025-02-25 5:06 ` [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier Byungchul Park
2025-03-03 15:55 ` Gregory Price
2025-02-07 10:14 ` Byungchul Park
2025-02-10 7:02 ` Byungchul Park
2025-02-04 9:59 ` David Hildenbrand