* [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
@ 2025-02-01 13:29 Hyeonggon Yoo
2025-02-01 14:04 ` Matthew Wilcox
2025-02-04 9:59 ` David Hildenbrand
0 siblings, 2 replies; 27+ messages in thread
From: Hyeonggon Yoo @ 2025-02-01 13:29 UTC (permalink / raw)
To: lsf-pc, linux-mm; +Cc: linux-cxl, Byungchul Park, Honggyu Kim
Hi,

Byungchul and I would like to propose a topic on the performance impact of
kernel allocations placed on CXL memory.

As CXL-enabled servers and memory devices are being developed, CXL-capable
hardware is expected to keep emerging over the coming years.

The Linux kernel supports hot-plugging CXL memory via the dax/kmem
functionality. Depending on the hot-plug policy, the hot-plugged memory
either admits unmovable kernel allocations (ZONE_NORMAL) or is restricted
to movable allocations only (ZONE_MOVABLE).

Recently, Byungchul and I observed a measurable performance degradation with
memhp_default_state=online compared to memhp_default_state=online_movable
on a server with a 1:2 DRAM-to-CXL memory capacity ratio, running the
llama.cpp workload with the default mempolicy. The workload performs LLM
inference and pressures the memory subsystem due to its large working set.

Obviously, allowing kernel allocations from CXL memory degrades performance,
because kernel memory such as page tables, kernel stacks, and slab objects
is accessed frequently and may end up in physical memory with significantly
higher access latency.

However, as far as I can tell there are at least two reasons why we need to
support ZONE_NORMAL for CXL memory (please add more if there are any):

1. When hot-plugging a huge amount of CXL memory, the struct page array for
   it might not fit into DRAM (see the rough estimate after this list)
   -> This could be relaxed with memmap_on_memory

2. To hot-unplug CXL memory, pages in CXL memory have to be migrated to DRAM,
   which means that sometimes some portion of CXL memory should be
   ZONE_NORMAL.
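
To give a sense of the scale behind reason 1, here is a back-of-the-envelope
estimate of the memmap (struct page array) footprint as a small userspace
program. The 64-byte struct page size and 4 KiB base page size are
assumptions that hold on typical 64-bit configurations, not guarantees:

    #include <stdio.h>

    int main(void)
    {
            /* Assumed values: 4 KiB base pages, 64-byte struct page. */
            const unsigned long long page_size   = 4096;
            const unsigned long long memmap_each = 64;
            const unsigned long long cxl_bytes   = 1ULL << 40; /* 1 TiB of CXL */

            unsigned long long pages      = cxl_bytes / page_size;
            unsigned long long memmap_mib = pages * memmap_each >> 20;

            /* Prints 16384 MiB, i.e. roughly 1.6% of the hot-plugged capacity. */
            printf("memmap for 1 TiB of CXL: %llu MiB\n", memmap_mib);
            return 0;
    }

At that rate, each terabyte of hot-plugged CXL memory needs roughly 16 GiB
of struct page, which is the overhead that memmap_on_memory places on the
hot-plugged range itself instead of DRAM.
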
So there are certain cases where we want CXL memory to include ZONE_NORMAL,
but doing so also degrades performance if we allow _all_ kinds of kernel
allocations to be served from CXL memory.

For ideal performance, it would be beneficial to either:

1) Restrict certain types of kernel memory (e.g. page tables, kernel
   stacks, slabs) from being allocated on the slow tier, or
2) Allow migrating certain types of kernel memory from the slow tier to
   the fast tier.

At LSF/MM/BPF, I would like to discuss potential directions for addressing
this problem, enabling CXL memory while minimizing its performance
degradation.

Restricting certain types of kernel allocations from slow tier
==============================================================
We could restrict some kernel allocations to the fast tier by passing a
nodemask to __alloc_pages() (with only fast-tier nodes set) or by using a
GFP flag like __GFP_FAST_TIER that does the same thing. This prevents
kernel allocations from being served from the slow tier and thus avoids
the performance degradation caused by the high access latency of CXL.
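
As a rough illustration of how such a restriction could look, the fragment
below builds a nodemask of fast-tier nodes and hands it to the page
allocator. This is a sketch only: __GFP_FAST_TIER does not exist,
node_is_toptier() is the existing memory-tiering helper, and the
__alloc_pages() signature shown here varies between kernel versions.

    /* Sketch: restrict an allocation to top-tier (DRAM) nodes. */
    static struct page *alloc_pages_fast_tier(gfp_t gfp, unsigned int order,
                                              int preferred_nid)
    {
            nodemask_t fast_tier = NODE_MASK_NONE;
            int nid;

            /* Collect the memory nodes that belong to the top (fastest) tier. */
            for_each_node_state(nid, N_MEMORY)
                    if (node_is_toptier(nid))
                            node_set(nid, &fast_tier);

            /* Fall back to an unrestricted allocation if nothing qualifies. */
            if (nodes_empty(fast_tier))
                    return __alloc_pages(gfp, order, preferred_nid, NULL);

            return __alloc_pages(gfp, order, preferred_nid, &fast_tier);
    }

A __GFP_FAST_TIER flag would essentially push the same nodemask restriction
into the allocator itself, so callers such as page table or kernel stack
allocation would only need to pass one extra flag.
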
However, binding all leaf page tables to the fast tier might not be ideal
due to 1) increased latency from premature reclamation and 2) premature
OOM kills [1].

Migrating certain types of kernel allocations from slow to fast tier
====================================================================
Rather than binding kernel allocations to fast tier and causing premature
reclamation & OOM kill, policies for migrating kernel pages may be more
effective, such as:
- Migrating page tables to fast tier,
  triggered by data-page promotion [1]
- Migrating to fast tier when there is low memory pressure:
  - Migrating slab movable objects [2]
  - Migrating kernel stacks (if that's feasible)
although this sounds more intrusive, and we would need to think about
robust policies that do not degrade existing traditional memory systems.
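
For reference, non-LRU kernel pages can already be migrated through the
movable_operations interface (used today by zsmalloc, z3fold and the balloon
driver), so "migratable kernel memory" would mean adding new users of an
existing mechanism rather than inventing a new one. The fragment below is a
minimal sketch of what such a user provides; the callback bodies are
placeholders, and the exact signatures may differ between kernel versions.

    /* Sketch of a movable_operations user for some kernel object. */
    static bool kobj_isolate_page(struct page *page, isolate_mode_t mode)
    {
            /* Block new references to the object while it is being moved. */
            return true;
    }

    static int kobj_migrate_page(struct page *dst, struct page *src,
                                 enum migrate_mode mode)
    {
            /* Copy the contents and repoint every reference from src to dst. */
            return MIGRATEPAGE_SUCCESS;
    }

    static void kobj_putback_page(struct page *page)
    {
            /* Migration failed or was aborted; make the object usable again. */
    }

    static const struct movable_operations kobj_mops = {
            .isolate_page   = kobj_isolate_page,
            .migrate_page   = kobj_migrate_page,
            .putback_page   = kobj_putback_page,
    };

    /* At allocation time: __SetPageMovable(page, &kobj_mops); */

The hard part is entirely in the callbacks: every reference to the old page
has to be found and updated during migration, which is what makes objects
like page tables or struct page itself so much harder than zsmalloc's
handle-indirected allocations.
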
Any opinions will be appreciated.
Thanks!
[1] https://dl.acm.org/doi/10.1145/3459898.3463907
[2] https://lore.kernel.org/linux-mm/20190411013441.5415-1-tobin@kernel.org

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Matthew Wilcox @ 2025-02-01 14:04 UTC (permalink / raw)
To: Hyeonggon Yoo; +Cc: lsf-pc, linux-mm, linux-cxl, Byungchul Park, Honggyu Kim

On Sat, Feb 01, 2025 at 10:29:23PM +0900, Hyeonggon Yoo wrote:
> The Linux kernel supports hot-plugging CXL memory via the dax/kmem
> functionality. Depending on the hot-plug policy, the hot-plugged memory
> either admits unmovable kernel allocations (ZONE_NORMAL) or is restricted
> to movable allocations only (ZONE_MOVABLE).

This all seems like a grand waste of time.  Don't do that.  Don't allow
kernel allocations from CXL at all.  Don't build systems that have
vast quantities of CXL memory (or if you do, expose it as really fast
swap, not as memory).

All of the CXL topics I see this year are "It really hurts performance
when ..." and my reaction is "Yes, I told you it would hurt and you did
it anyway".  Just stop doing it.  CXL is this decade's Infiniband / ATM /
(name your favourite misguided dead technology here).  You can't stop
other people from doing foolish things, but you don't have to join in.
And we don't have to take stupid patches.

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Hyeonggon Yoo @ 2025-02-01 15:13 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: lsf-pc, linux-mm, linux-cxl, Byungchul Park, Honggyu Kim

On Sat, Feb 1, 2025 at 11:04 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Sat, Feb 01, 2025 at 10:29:23PM +0900, Hyeonggon Yoo wrote:
> [...snip...]
>
> This all seems like a grand waste of time.  Don't do that.  Don't allow
> kernel allocations from CXL at all.  Don't build systems that have
> vast quantities of CXL memory (or if you do, expose it as really fast
> swap, not as memory).
>
> All of the CXL topics I see this year are "It really hurts performance
> when ..." and my reaction is "Yes, I told you it would hurt and you did
> it anyway".  Just stop doing it.  CXL is this decade's Infiniband / ATM
> / (name your favourite misguided dead technology here).

Hi, Matthew. Thank you for sharing your opinion.

I don't want to introduce too much complexity to MM due to CXL madness either,
but I think at least we need to guide users who buy CXL hardware to avoid
doing stupid things.

My initial subject was "Clearly documenting the use cases of
memhp_default_state=online{,_kernel}" because at first glance it seemed
usable for allowing kernel allocations from CXL, which turned out not to be
the case after some evaluation.

So there are a few questions from my side:
- Why do we support onlining CXL memory as ZONE_NORMAL then?
- Can we remove the feature completely?
- Or shouldn't we at least warn users adequately about it in the
  documentation?

I genuinely don't want to see users misusing it either.

Best,
Hyeonggon

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-01 16:30 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: Matthew Wilcox, lsf-pc, linux-mm, linux-cxl, Byungchul Park, Honggyu Kim

On Sun, Feb 02, 2025 at 12:13:23AM +0900, Hyeonggon Yoo wrote:
> On Sat, Feb 1, 2025 at 11:04 PM Matthew Wilcox <willy@infradead.org> wrote:
> > This all seems like a grand waste of time.  Don't do that.  Don't allow
> > kernel allocations from CXL at all.  Don't build systems that have
> > vast quantities of CXL memory (or if you do, expose it as really fast
> > swap, not as memory).
>
> [...snip...]
>
> My initial subject was "Clearly documenting the use cases of
> memhp_default_state=online{,_kernel}" because at first glance it seemed
> usable for allowing kernel allocations from CXL, which turned out not to be
> the case after some evaluation.

This was the motivation for implementing the build-time switch for
memhp_default_state.  Distros and builders can now have flexibility
to make this their default policy for hotplug memory blocks.

https://lore.kernel.org/linux-mm/20241226182918.648799-1-gourry@gourry.net/

I don't normally agree with Willy's hard takes on CXL, but I do agree
that it's generally not fit for kernel use - and I share general skepticism
that movement-based tiering is fundamentally better than reclaim/swap
semantics (though I have been convinced otherwise in some scenarios,
and I think some clear performance benefits in many scenarios are lost
by treating it as super-fast-swap).

Rather than ask whether we can make portions of the kernel more amenable
to movable allocations, I think it's more beneficial to focus on whether
we can reduce the ZONE_NORMAL cost of ZONE_MOVABLE capacity.  That seems
(to me) like the actual crux of this particular issue.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Matthew Wilcox @ 2025-02-01 18:48 UTC (permalink / raw)
To: Gregory Price
Cc: Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Byungchul Park, Honggyu Kim

On Sat, Feb 01, 2025 at 11:30:24AM -0500, Gregory Price wrote:
> Rather than ask whether we can make portions of the kernel more amenable
> to movable allocations, I think it's more beneficial to focus on whether
> we can reduce the ZONE_NORMAL cost of ZONE_MOVABLE capacity.  That seems
> (to me) like the actual crux of this particular issue.

We can!  This is actually the topic of the talk I'm giving at FOSDEM in
about 15 hours' time.

https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/

I'm just going to run through my slides and upload them in an hour or so.
My motivation isn't CXL related, but it's the sign of a good project that
it solves some unrelated problems.

Short version: we can halve the cost this year, halve it again in 2026
with a fairly manageable amount of work, and maybe halve it a third time
(for a total reduction of 7/8) with a lot more work in 2027.  Further
reductions beyond that are possible, but will need a lot more work.  Some
of that work we want to do anyway, regardless of whether the reduction in
overhead from 16MB/GB to 2MB/GB is sufficient.

... or we'll discover the performance effect is negative and shelve the
reduction in memmap size, having only accomplished a massive cleanup of
kernel data structures.  Which would be sad.

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Dan Williams @ 2025-02-03 22:09 UTC (permalink / raw)
To: Gregory Price, Hyeonggon Yoo
Cc: Matthew Wilcox, lsf-pc, linux-mm, linux-cxl, Byungchul Park, Honggyu Kim

Gregory Price wrote:
> [...snip...]
>
> I don't normally agree with Willy's hard takes on CXL, but I do agree
> that it's generally not fit for kernel use - and I share general skepticism
> that movement-based tiering is fundamentally better than reclaim/swap
> semantics (though I have been convinced otherwise in some scenarios,
> and I think some clear performance benefits in many scenarios are lost
> by treating it as super-fast-swap).

It is also the case that CXL topologies enumerate their performance
characteristics; "CXL" is not a latency characteristic unto itself.  For
example, like "PCI", "CXL" by itself does not imply a performance
profile.  You could have CPU attached DDR that presents as a "CXL"
enumerated device just to take advantage of now standardized RAS
interfaces.

Unless and until this whole heterogeneous memory experiment fails, all
the kernel can do is give userspace the ability to include/exclude
memory ranges that are marked as outside the default pool.  That is what
EFI_MEMORY_SP is all about, to set aside: too precious for the default
pool => HBM, or too slow for the default pool => potentially CXL and
PMEM.

A kernel default policy, or better yet distribution policy, that more
aggressively excludes CXL memory based on its relative performance to
the default pool would be a welcome improvement.

> Rather than ask whether we can make portions of the kernel more amenable
> to movable allocations, I think it's more beneficial to focus on whether
> we can reduce the ZONE_NORMAL cost of ZONE_MOVABLE capacity.  That seems
> (to me) like the actual crux of this particular issue.

Yes, I like this line of thinking.  Even if CXL attached memory struggles
to graduate out of cold-memory tier use cases, that struggle can yield
other general improvements that are welcome independent of CXL.

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-07 7:20 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Honggyu Kim, kernel_team

On Sat, Feb 01, 2025 at 02:04:17PM +0000, Matthew Wilcox wrote:
> This all seems like a grand waste of time.  Don't do that.  Don't allow
> kernel allocations from CXL at all.  Don't build systems that have
> vast quantities of CXL memory (or if you do, expose it as really fast
> swap, not as memory).
>
> [...snip...]
>
> You can't stop other people from doing foolish things, but you don't
> have to join in.  And we don't have to take stupid patches.

Hyeonggon and I described the topic based on what we observed in a CXL
memory environment, but fundamentally it is not only a CXL memory issue;
it is also a heterogeneous memory or ZONE_NORMAL cost issue, as you and
others mentioned.  Lemme clarify it.

<general mm issue>

1. Allow kernel objects to be movable:
   a. ZONE_NORMAL cost will be reduced. (less reclaim and oom)
   b. ZONE_NORMAL covers a bigger whole memory.
   c. A smaller ZONE_NORMAL is sufficient.
   d. Need additional consideration about when (or what) to move.

2. Never allow kernel objects to be movable:
   a. ZONE_NORMAL cost stays high. (premature reclaim and oom)
   b. ZONE_NORMAL covers a smaller whole memory.
   c. A bigger ZONE_NORMAL is required.

<heterogeneous memory specific issue>

3. Allow ZONE_NORMAL in non-DRAM:
   a. Mitigates ZONE_NORMAL cost. (less reclaim and oom)
   b. Followed by e.g. the hot-unplug issue.
   c. Option 1: No restriction on the ZONE_NORMAL size.
   d. Option 2: Restrict the size as a budget to cover its capacity.
   e. Option 3: ?

4. Never allow ZONE_NORMAL in non-DRAM:
   a. ZONE_NORMAL cost should be low enough to cover non-DRAM too.
   b. Any efforts to reduce ZONE_NORMAL cost should be welcome.
   c. Matthew's work would mitigate the cost.
   d. Allowing kernel objects to be movable would work for it too.

Plus, I think Matthew's effort to reduce ZONE_NORMAL cost is amazing and
I hope it succeeds.  However, ZONE_NORMAL cost can be reduced in many
ways and all the efforts can be considered meaningful.  We can start with
the easiest objects, e.g. page tables, struct page, and kernel stacks,
and move on to harder ones, while the struct page cost is being reduced
by Matthew's work at the same time.

When it comes to this topic, the most important thing is a collective
*direction* from the community so that we can start the work under that
*direction*.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-07 8:57 UTC (permalink / raw)
To: Byungchul Park
Cc: Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Honggyu Kim, kernel_team

On Fri, Feb 07, 2025 at 04:20:24PM +0900, Byungchul Park wrote:
> We can start with the easiest objects, e.g. page tables,

It's more efficient and easier to change page sizes than it is to make
page tables migratable.

It's also easier to reclaim cold pages eating up significantly more
memory than the page table (which describes pages at ~8 bytes per page).

Also, there's quite a bit of literature that shows page tables landing
on remote nodes (cross-socket) has negative performance impacts.
Putting them on CXL makes the problem worse.

> struct page,

`struct page` is a structure that describes a physically addressed page.

It is common to access it by simply doing `pfn_to_page()`, which is a
fairly simple conversion (a bit more complex in sparsemem w/ sections).

This is used in a lockless manner to acquire page references all over
the kernel.

Making that migratable is... ambitious, to say the least.

> and kernel stacks,

The default kernel stack size is like 16kb.  You'd need like 100,000
threads to eat up 1.5GB, and 2048 threads only eats like 32MB.

It's not an interesting amount of memory if you have a 20TB system.

> When it comes to this topic, the most important thing is a collective
> *direction* from the community so that we can start the work under that
> *direction*.

My thoughts here are that memory tiering is the wrong tool for the
problem you are trying to solve.

Maybe there's a world in which we propose a ZONE_MEMDESC which is
exclusively used for `struct page` for a node.

At least then you could design CXL capacities *around* that.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-07 9:27 UTC (permalink / raw)
To: Byungchul Park
Cc: Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Honggyu Kim, kernel_team

On Fri, Feb 07, 2025 at 03:57:45AM -0500, Gregory Price wrote:
> My thoughts here are that memory tiering is the wrong tool for the
> problem you are trying to solve.
>
> Maybe there's a world in which we propose a ZONE_MEMDESC which is
> exclusively used for `struct page` for a node.
>
> At least then you could design CXL capacities *around* that.

Dumb question time:

Is this maybe not an entirely horrible idea?  Even at 16-byte page
structs we use 4GB-per-1TB of capacity.

Maybe a memory device providing additional capacity SHOULD be made
(given the option?) to service its own struct pages - but maintain some
control over hot-plug-ability?  At least it could tear down all the
ZONE_MOVABLE regions and finally release the MEMDESC region when
finished.

Seems too obvious to have not been proposed already. :shrug:

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Honggyu Kim @ 2025-02-07 9:34 UTC (permalink / raw)
To: Gregory Price, Byungchul Park
Cc: kernel_team, Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl

Hi Gregory,

On 2/7/2025 5:57 PM, Gregory Price wrote:
[...snip...]
>> and kernel stacks,
>
> The default kernel stack size is like 16kb.  You'd need like 100,000
> threads to eat up 1.5GB, and 2048 threads only eats like 32MB.
>
> It's not an interesting amount of memory if you have a 20TB system.

The amount might be small, but having that data in the slow tier can
cause performance degradation if it is heavily accessed.

The number of accesses isn't linearly correlated to the size of the
memory region.

Thanks,
Honggyu

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-07 9:54 UTC (permalink / raw)
To: Honggyu Kim
Cc: Byungchul Park, kernel_team, Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl

On Fri, Feb 07, 2025 at 06:34:43PM +0900, Honggyu Kim wrote:
> On 2/7/2025 5:57 PM, Gregory Price wrote:
> > The default kernel stack size is like 16kb.  You'd need like 100,000
> > threads to eat up 1.5GB, and 2048 threads only eats like 32MB.
> >
> > It's not an interesting amount of memory if you have a 20TB system.
>
> The amount might be small, but having that data in the slow tier can
> cause performance degradation if it is heavily accessed.
>
> The number of accesses isn't linearly correlated to the size of the
> memory region.

Right, I started by saying:

  [CXL is] "generally not fit for kernel use"

I have the opinion that CXL memory should be defaulted to ZONE_MOVABLE,
but I understand the pressure on ZONE_NORMAL means this may not be
possible for large capacities.

I don't think the solution is to make kernel memory migratable and allow
kernel allocations on CXL.

There's a reason most kernel allocations are not swappable.

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-07 10:49 UTC (permalink / raw)
To: Gregory Price
Cc: Honggyu Kim, kernel_team, Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl

On Fri, Feb 07, 2025 at 04:54:10AM -0500, Gregory Price wrote:
> Right, I started by saying:
>
>   [CXL is] "generally not fit for kernel use"
>
> I have the opinion that CXL memory should be defaulted to ZONE_MOVABLE,
> but I understand the pressure on ZONE_NORMAL means this may not be
> possible for large capacities.

Just to clarify, for moderate capacities where ZONE_NORMAL in DRAM covers
the whole memory, it's not a big issue, since the easiest solution would
be to place kernel objects in DRAM's ZONE_NORMAL and not allow
ZONE_NORMAL in non-DRAM.  No objection to that.

For large capacities, kernel object migratability might be a must.

For capacities between moderate and large, kernel object migratability,
allowing ZONE_NORMAL in non-DRAM, or some better idea, would help to
reduce the ZONE_NORMAL cost.

I'm adding my opinion with the last two cases in mind.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Harry (Hyeonggon) Yoo @ 2025-02-10 2:33 UTC (permalink / raw)
To: Gregory Price
Cc: Honggyu Kim, Byungchul Park, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Fri, Feb 07, 2025 at 04:54:10AM -0500, Gregory Price wrote:
> Right, I started by saying:
>
>   [CXL is] "generally not fit for kernel use"
>
> I have the opinion that CXL memory should be defaulted to ZONE_MOVABLE,

Agreed, when the ratio of slow to fast capacity makes it feasible.

> but I understand the pressure on ZONE_NORMAL means this may not be
> possible for large capacities.

Yes, this is when we start to consider some ZONE_NORMAL capacity on CXL
memory.

> I don't think the solution is to make kernel memory migratable and allow
> kernel allocations on CXL.

IMHO the relevant questions here are:

Premise: Some ZONE_NORMAL capacity exists on CXL memory
due to its large capacity.

Q1. How aggressively should the kernel avoid serving kernel allocations
    from ZONE_NORMAL in the slow tier (and instead reclaim pages in the
    fast tier)?  e.g.:
    - Only when there's no easily reclaimable memory?
    - Or as a last resort before OOM?
    - Or should certain types of kernel allocations simply not be
      allowed from the slow tier?

Q2. If kernel allocations are made from the slow tier anyway, would it
    be worthwhile to migrate _certain types_ of kernel memory back to
    the fast tier later when free space becomes available?
    (sounds like a promotion policy)

> There's a reason most kernel allocations are not swappable.

Because most kernel allocations cannot be swapped, with a few exceptions.

However, there's non-LRU page migration functionality where kernel
allocations can be migrated.

I don't understand why we shouldn't introduce more kernel movable memory
if that turns out to be beneficial?

--
Harry

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Matthew Wilcox @ 2025-02-10 3:19 UTC (permalink / raw)
To: Harry (Hyeonggon) Yoo
Cc: Gregory Price, Honggyu Kim, Byungchul Park, kernel_team, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 11:33:47AM +0900, Harry (Hyeonggon) Yoo wrote:
> Premise: Some ZONE_NORMAL capacity exists on CXL memory
> due to its large capacity.

I reject your premise.  None of this is inevitable.  Infiniband and ATM
did not become dominant networking technologies.  SOP did not dominate
the storage industry.  Itanium did not become the only CPU architecture
that mattered.

Similarly, CXL is a technically flawed protocol.  Lots of money is being
thrown at making it look inevitable, but fundamentally PCIe is a
high-bandwidth protocol, not a low-latency protocol, and it can't do the
job.

> > There's a reason most kernel allocations are not swappable.
>
> Because most kernel allocations cannot be swapped, with a few exceptions.
>
> However, there's non-LRU page migration functionality where kernel
> allocations can be migrated.
>
> I don't understand why we shouldn't introduce more kernel movable memory
> if that turns out to be beneficial?

Because it's adding complexity for a stupid use-case.  If you can make
the case for making something migratable that's not currently without
using CXL as a justification, then sure, let's do it.  zsmalloc is
migratable, and that makes a lot of sense.  But there's a reason we only
have three movable_operations structs defined in the kernel today.

(also the whole non-LRU page migration needs overhauling to not use
page->lru, but that's a separate matter.  except it's not a separate
matter because that's needed in order to shrink struct page.)

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-10 6:00 UTC (permalink / raw)
To: Harry (Hyeonggon) Yoo
Cc: Honggyu Kim, Byungchul Park, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 11:33:47AM +0900, Harry (Hyeonggon) Yoo wrote:
> Premise: Some ZONE_NORMAL capacity exists on CXL memory
> due to its large capacity.

What you actually need to show to justify increasing the complexity is
(at least - but not limited to)

1) structures you want to migrate are harmful when placed on slow memory
   ex) Is `struct page` on slow mem actually harmful? - no data?
   ex) Are page tables on slow mem actually harmful? - known, yes.

2) The structures cannot be made to take up less space on local tier
   ex) struct page can be shrunk - do that first
   ex) huge-pages can be deployed - do that first

3) the structures take up sufficient space that it matters
   ex) struct page after shrunk might not - do that first
   ex) page tables with multi-sized huge pages may not - do that first

4) Making the structures migratable actually does something useful
   are `struct page` or page tables after #2 and #3 both:
   a) going through hot/cold phases enough to warrant being tiered
   b) hot enough for long enough that migration matters?

You can probably actually (maybe?) collect data on this today - but
you still have to contend with #2 and #3.

> I don't understand why we shouldn't introduce more kernel movable memory
> if that turns out to be beneficial?

No one is going to stop research you want to do.  I'm simply expressing
that I think it's an ill-advised path to take.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-10 7:17 UTC (permalink / raw)
To: Gregory Price
Cc: Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 01:00:02AM -0500, Gregory Price wrote:
> What you actually need to show to justify increasing the complexity is
> (at least - but not limited to)
>
> 1) structures you want to migrate are harmful when placed on slow memory
>    ex) Is `struct page` on slow mem actually harmful? - no data?

Then we can hold this one until it turns out to be harmful, or give up.

>    ex) Are page tables on slow mem actually harmful? - known, yes.

Definitely yes.  What could be next?

> 2) The structures cannot be made to take up less space on local tier
>    ex) struct page can be shrunk - do that first
>    ex) huge-pages can be deployed - do that first

I'm really curious about this.  Is there any reason that we should work
on these in a serialized manner?

> 3) the structures take up sufficient space that it matters
>    ex) struct page after shrunk might not - do that first
>    ex) page tables with multi-sized huge pages may not - do that first

Same.  Should it be serialized?

> 4) Making the structures migratable actually does something useful
>    are `struct page` or page tables after #2 and #3 both:
>    a) going through hot/cold phases enough to warrant being tiered
>    b) hot enough for long enough that migration matters?
>
> You can probably actually (maybe?) collect data on this today - but
> you still have to contend with #2 and #3.

Ah.  You seem to mean those works should be serialized.  Right?  If it
should be for some reason, then it could be sensible.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-10 15:47 UTC (permalink / raw)
To: Byungchul Park
Cc: Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 04:17:41PM +0900, Byungchul Park wrote:
> On Mon, Feb 10, 2025 at 01:00:02AM -0500, Gregory Price wrote:
> > You can probably actually (maybe?) collect data on this today - but
> > you still have to contend with #2 and #3.
>
> Ah.  You seem to mean those works should be serialized.  Right?  If it
> should be for some reason, then it could be sensible.

I'm suggesting that there isn't a strong reason (yet) to consider such a
complicated change.  As Willy has said, it's a fairly fundamental change
for a single reason (CXL), which does not bode well for its acceptance.

Honestly trying to save you some frustration.  It would behoove you to
find stronger reasons (w/ data) or consider different solutions.  Right
now there are stronger, simpler solutions to the ZONE_NORMAL capacity
issue (struct page resize, huge pages) for possible capacities.

I also think someone should actively ask whether `struct page` can be
hosted on remote memory without performance loss.  I may look into this.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Matthew Wilcox @ 2025-02-10 15:55 UTC (permalink / raw)
To: Gregory Price
Cc: Byungchul Park, Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 10:47:58AM -0500, Gregory Price wrote:
> I also think someone should actively ask whether `struct page` can be
> hosted on remote memory without performance loss.  I may look into this.

Given that it contains a refcount and various flags, some of which
are quite hot, I would expect performance to suffer.  It also suffers
contention between different CPUs, so depending on your cache protocol
(can it do cache-to-cache transfers or does it have to be written back
to memory first?) it may perform quite poorly.  But this is something
that can be measured.

Of course, the question must be asked whether we care.  Certainly Intel's
Apache Pass and similar Optane RAM products put the memmap on the 3DXP
because there wasn't enough DRAM to put it there.  So the pages are
slower, but they were slower anyway!  What I always wondered was what
effect it would have on wear.  But that's not a consideration for DRAM
attached via CXL.

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-02-10 16:06 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Byungchul Park, Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 03:55:47PM +0000, Matthew Wilcox wrote:
> Given that it contains a refcount and various flags, some of which
> are quite hot, I would expect performance to suffer.  It also suffers
> contention between different CPUs, so depending on your cache protocol
> (can it do cache-to-cache transfers or does it have to be written back
> to memory first?) it may perform quite poorly.  But this is something
> that can be measured.
>
> [...snip...]

Well, *if* said memory is intended to host cold(er) data, then we may
find the structures to describe those pages aren't particularly hot or
contended.

This is my suspicion - and I'd rather limit kernel resource allocation
on remote memory than try to move kernel resources around.

Plus this would still enable hot-unplug.  Once all the zone movable
regions are clicked off, the page-desc regions are unused... probably.

Would just be nice to have some concrete data on when greater zone
movable capacity becomes a net-negative.  We're making the assumption
that this occurs fairly early.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-11 1:53 UTC (permalink / raw)
To: Gregory Price
Cc: Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 10:47:58AM -0500, Gregory Price wrote:
> I'm suggesting that there isn't a strong reason (yet) to consider such a
> complicated change.  As Willy has said, it's a fairly fundamental change
> for a single reason (CXL), which does not bode well for its acceptance.

I have observed a performance difference depending on whether page
tables are placed in DRAM or in the slow tier, and that doesn't have to
be CXL memory.  We should place page tables in DRAM as long as possible,
but when that's not possible, we could either reclaim DRAM for them or
temporarily place them in the slow tier and move them to DRAM later for
better performance.

But yes.  If the slow tier is *NEVER* allowed to be huge, then reclaiming
DRAM would always work.  This topic is valid only for the other case.

> Honestly trying to save you some frustration.  It would behoove you to
> find stronger reasons (w/ data) or consider different solutions.  Right
> now there are stronger, simpler solutions to the ZONE_NORMAL capacity
> issue (struct page resize, huge pages) for possible capacities.
>
> I also think someone should actively ask whether `struct page` can be
> hosted on remote memory without performance loss.  I may look into this.

JFYI, struct page, page tables, and kernel stacks were just examples.
Let's exclude the ones that you don't think are feasible.  However, I'd
like to say that at least page tables are an interesting kernel object
for this topic.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Harry Yoo @ 2025-02-21 1:52 UTC (permalink / raw)
To: Gregory Price
Cc: Byungchul Park, Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 10:47:58AM -0500, Gregory Price wrote:
> I'm suggesting that there isn't a strong reason (yet) to consider such a
> complicated change.  As Willy has said, it's a fairly fundamental change
> for a single reason (CXL), which does not bode well for its acceptance.
>
> Honestly trying to save you some frustration.  It would behoove you to
> find stronger reasons (w/ data) or consider different solutions.  Right
> now there are stronger, simpler solutions to the ZONE_NORMAL capacity
> issue (struct page resize, huge pages) for possible capacities.

Hi, apologies for my late reply.  I recently went through a career change.

I truly appreciate your and Matthew's feedback and thank you for saving us
from frustration.  I agree that we need stronger motivation and data to
introduce such a fundamental change.  And I also agree that it's more
appropriate to pursue what can be useful for general MM users rather than
introducing MM changes just for CXL.

With that context, Byungchul and I agree it's a better direction:
reducing the ZONE_NORMAL cost of ZONE_MOVABLE capacity, which is
beneficial for ZONE_MOVABLE users in general, regardless of whether the
user is using CXL memory or not.

Let me organize a few steps to pursue:

- Willy's shrinking struct page project
  - https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/
  - https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
  - Side note: Byungchul started working on separating the descriptor
    of the pagepool bump allocator

- Slab Movable Objects: This makes sense even without CXL, as migrating
  unreclaimable slab will improve the compaction success rate.
  It has also been tried in the past by others, but was suspended due to
  lack of data.

  I'm looking for workloads that allocate a decent amount of unreclaimable
  slab AND perform migration frequently - for evaluation.

I might be missing some projects that could be useful,
please feel free to add if there are any.

And for page table migration, while it might be doable even without CXL,
we need strong data suggesting that it actually makes MM better before
pursuing it.

> I also think someone should actively ask whether `struct page` can be
> hosted on remote memory without performance loss.  I may look into this.

Did you have a chance to look at this?

--
Cheers,
Harry

* [LSF/MM/BPF TOPIC] Gathering ideas to reduce ZONE_NORMAL cost
From: Byungchul Park @ 2025-02-25 4:54 UTC (permalink / raw)
To: Harry Yoo
Cc: Gregory Price, Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Fri, Feb 21, 2025 at 10:52:09AM +0900, Harry Yoo wrote:
> With that context, Byungchul and I agree it's a better direction:
> reducing the ZONE_NORMAL cost of ZONE_MOVABLE capacity, which is
> beneficial for ZONE_MOVABLE users in general, regardless of whether the
> user is using CXL memory or not.
>
> Let me organize a few steps to pursue:
>
> [...snip...]
>
> I might be missing some projects that could be useful,
> please feel free to add if there are any.

So.. Let's change the LSF/MM/BPF topic slightly.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-25 5:06 UTC (permalink / raw)
To: Gregory Price
Cc: Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Mon, Feb 10, 2025 at 10:47:58AM -0500, Gregory Price wrote:
> I also think someone should actively ask whether `struct page` can be
> hosted on remote memory without performance loss.  I may look into this.

Could you share your plan, or what you have been thinking about it?

We'd be happy to discuss this topic together, and furthermore, it'd be
even better to work on it together.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Gregory Price @ 2025-03-03 15:55 UTC (permalink / raw)
To: Byungchul Park
Cc: Harry (Hyeonggon) Yoo, Honggyu Kim, kernel_team, Matthew Wilcox, lsf-pc, linux-mm, linux-cxl

On Tue, Feb 25, 2025 at 02:06:43PM +0900, Byungchul Park wrote:
> On Mon, Feb 10, 2025 at 10:47:58AM -0500, Gregory Price wrote:
> > I also think someone should actively ask whether `struct page` can be
> > hosted on remote memory without performance loss.  I may look into this.
>
> Could you share your plan, or what you have been thinking about it?
>
> We'd be happy to discuss this topic together, and furthermore, it'd be
> even better to work on it together.

Apologies for the late reply, I've been on some R&R.

I haven't written up any specific plan for this, but it is *reasonably*
simple to test with memmap_on_memory.

~Gregory

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-07 10:14 UTC (permalink / raw)
To: Gregory Price
Cc: Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Honggyu Kim, kernel_team

On Fri, Feb 07, 2025 at 03:57:45AM -0500, Gregory Price wrote:
> It's more efficient and easier to change page sizes than it is to make
> page tables migratable.

You are misunderstanding.  I didn't say 'do not change page sizes'.  I
didn't say it's easier than changing page sizes.  I said *both* changing
page sizes and making them migratable could reduce the ZONE_NORMAL cost.

> It's also easier to reclaim cold pages eating up significantly more
> memory than the page table (which describes pages at ~8 bytes per page).

Same.  We should keep reclaiming cold pages eating up memory.  Why would
we give up reclaiming cold pages if page tables become migratable?  I
really don't understand why you are trying to exclusively pick up only
one effort for that purpose.

> Also, there's quite a bit of literature that shows page tables landing
> on remote nodes (cross-socket) has negative performance impacts.

Exactly.  That's the motivation to suggest this topic.  That's why we
are asking about kernel object migratability.  Of course, we try our
best to place kernel objects in DRAM in the first place.  However, the
issue arises when that becomes impossible.  It's about the comparison
between 'premature reclaim and die (= oom)' and 'slight degradation of
performance'.

> Putting them on CXL makes the problem worse.

No.  A higher chance to die is worse.

> `struct page` is a structure that describes a physically addressed page.
>
> [...snip...]
>
> Making that migratable is... ambitious, to say the least.

Yes.  I don't think it's easy.

> The default kernel stack size is like 16kb.  You'd need like 100,000
> threads to eat up 1.5GB, and 2048 threads only eats like 32MB.
>
> It's not an interesting amount of memory if you have a 20TB system.

Kernel stacks are just an example.  We can skip them and look for a
better candidate.

> My thoughts here are that memory tiering is the wrong tool for the
> problem you are trying to solve.

I think any valid efforts can be considered at the same time.  Is there
any reason that efforts in a tiering environment should be excluded?

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: Byungchul Park @ 2025-02-10 7:02 UTC (permalink / raw)
To: Gregory Price
Cc: Matthew Wilcox, Hyeonggon Yoo, lsf-pc, linux-mm, linux-cxl, Honggyu Kim, kernel_team

On Fri, Feb 07, 2025 at 03:57:45AM -0500, Gregory Price wrote:
> It's more efficient and easier to change page sizes than it is to make
> page tables migratable.
>
> It's also easier to reclaim cold pages eating up significantly more
> memory than the page table (which describes pages at ~8 bytes per page).

Sorry for leaving comments in an excited manner last time.  Lemme focus
on what to consider and how to resolve it:

Case 1. A system with no or little non-DRAM capacity

   You are right.  It'd be easier to reclaim cold pages eating up
   ZONE_NORMAL.  ZONE_NORMAL in DRAM can probably cover the whole memory.

Case 2. A system with a very huge non-DRAM capacity

   ZONE_NORMAL in DRAM might not be able to cover the whole memory.  So
   either allowing ZONE_NORMAL in non-DRAM or allowing some kernel
   objects to be placed in ZONE_MOVABLE would be required.

   If everyone agrees with Matthew - that a system should never be
   equipped with very huge non-DRAM memory - then yes, we might not need
   the discussion.

Case 3. A system with a non-DRAM capacity between little and huge

   ZONE_NORMAL in DRAM might or might not be able to cover the whole
   memory.  A fairly large amount of kernel memory would still be
   required.  Of course, properly reclaiming cold pages eating up
   ZONE_NORMAL in DRAM might work for the purpose.  At the same time,
   any efforts to reduce the ZONE_NORMAL cost would help and mitigate
   the pressure on ZONE_NORMAL in DRAM.  Here, the efforts include e.g.
   reducing the size of kernel objects, making some kernel objects
   migratable, and so on.

   Same.  If this case is also one that Matthew and others think is not
   realistic, then yes, we might not need the discussion.  If not, we
   need to consider the issue.

Byungchul

* Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier
From: David Hildenbrand @ 2025-02-04 9:59 UTC (permalink / raw)
To: Hyeonggon Yoo, lsf-pc, linux-mm; +Cc: linux-cxl, Byungchul Park, Honggyu Kim

On 01.02.25 14:29, Hyeonggon Yoo wrote:
> Hi,
>
> Byungchul and I would like to propose a topic on the performance impact of
> kernel allocations placed on CXL memory.
>
> [...snip...]
>
> However, as far as I can tell there are at least two reasons why we need to
> support ZONE_NORMAL for CXL memory (please add more if there are any):
>
> 1. When hot-plugging a huge amount of CXL memory, the struct page array for
>    it might not fit into DRAM
>    -> This could be relaxed with memmap_on_memory

There are some others, although most are less significant, and I tried
documenting them here:

https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html#zone-movable-sizing-considerations

E.g., a 4 KiB page requires a single PTE (8 bytes) to be mapped into
user space, corresponding to 0.2 %.  At least for anonymous memory,
PMD-sized THPs don't help, because we still have to allocate the page
table to be prepared for a PMD->PTE remapping.

In the worst case, the directmap requires another 0.2 % (but usually, we
rely on PMD mappings).

So that usage depends on how you are intending to use the CXL memory
(e.g., pagecache vs. anonymous memory).

> 2. To hot-unplug CXL memory, pages in CXL memory have to be migrated to DRAM,
>    which means that sometimes some portion of CXL memory should be
>    ZONE_NORMAL.

I don't quite understand that argument for ZONE_NORMAL.

--
Cheers,

David / dhildenb

end of thread, other threads: [~2025-03-03 15:55 UTC | newest]

Thread overview: 27+ messages
2025-02-01 13:29 [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier Hyeonggon Yoo
2025-02-01 14:04 ` Matthew Wilcox
2025-02-01 15:13 ` Hyeonggon Yoo
2025-02-01 16:30 ` Gregory Price
2025-02-01 18:48 ` Matthew Wilcox
2025-02-03 22:09 ` Dan Williams
2025-02-07 7:20 ` Byungchul Park
2025-02-07 8:57 ` Gregory Price
2025-02-07 9:27 ` Gregory Price
2025-02-07 9:34 ` Honggyu Kim
2025-02-07 9:54 ` Gregory Price
2025-02-07 10:49 ` Byungchul Park
2025-02-10 2:33 ` Harry (Hyeonggon) Yoo
2025-02-10 3:19 ` Matthew Wilcox
2025-02-10 6:00 ` Gregory Price
2025-02-10 7:17 ` Byungchul Park
2025-02-10 15:47 ` Gregory Price
2025-02-10 15:55 ` Matthew Wilcox
2025-02-10 16:06 ` Gregory Price
2025-02-11 1:53 ` Byungchul Park
2025-02-21 1:52 ` Harry Yoo
2025-02-25 4:54 ` [LSF/MM/BPF TOPIC] Gathering ideas to reduce ZONE_NORMAL cost Byungchul Park
2025-02-25 5:06 ` [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier Byungchul Park
2025-03-03 15:55 ` Gregory Price
2025-02-07 10:14 ` Byungchul Park
2025-02-10 7:02 ` Byungchul Park
2025-02-04 9:59 ` David Hildenbrand