From: David Hildenbrand <david@redhat.com>
To: Gregory Price <gourry@gourry.net>
Cc: Yang Shi <shy828301@gmail.com>,
lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
Date: Wed, 19 Feb 2025 09:53:07 +0100
Message-ID: <e332391c-30fb-49c3-9c05-574b0c486a81@redhat.com>
In-Reply-To: <Z7UvchoiRUg_cnhh@gourry-fedora-PF4VCD3F>
> What's mildly confusing is for pages used for altmap to be accounted for
> as if it's an allocation in vmstat - but for that capacity to be chopped
> out of the memory-block (it "makes sense" it's just subtly misleading).
Would the following make it better or worse?
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4765f2928725c..17a4432427051 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -237,9 +237,12 @@ static int memory_block_online(struct memory_block *mem)
 	 * Account once onlining succeeded. If the zone was unpopulated, it is
 	 * now already properly populated.
 	 */
-	if (nr_vmemmap_pages)
+	if (nr_vmemmap_pages) {
 		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  nr_vmemmap_pages);
+		adjust_managed_page_count(pfn_to_page(start_pfn),
+					  nr_vmemmap_pages);
+	}
 
 	mem->zone = zone;
 	mem_hotplug_done();
@@ -273,17 +276,23 @@ static int memory_block_offline(struct memory_block *mem)
 		nr_vmemmap_pages = mem->altmap->free;
 
 	mem_hotplug_begin();
-	if (nr_vmemmap_pages)
+	if (nr_vmemmap_pages) {
 		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  -nr_vmemmap_pages);
+		adjust_managed_page_count(pfn_to_page(start_pfn),
+					  -nr_vmemmap_pages);
+	}
 
 	ret = offline_pages(start_pfn + nr_vmemmap_pages,
 			    nr_pages - nr_vmemmap_pages, mem->zone, mem->group);
 	if (ret) {
 		/* offline_pages() failed. Account back. */
-		if (nr_vmemmap_pages)
+		if (nr_vmemmap_pages) {
 			adjust_present_page_count(pfn_to_page(start_pfn),
 						  mem->group, nr_vmemmap_pages);
+			adjust_managed_page_count(pfn_to_page(start_pfn),
+						  nr_vmemmap_pages);
+		}
 		goto out;
 	}
 
Then it would look "just like allocated memory" from that node/zone, as if
the memmap had been allocated immediately when we onlined the memory
(see below).
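For anyone following along, that distinction is visible in /proc/zoneinfo and
/proc/meminfo (a quick sketch, assuming node 1 is the hotplugged node):

# "present" counts pages spanned and backed in the zone; "managed" counts
# pages handed to the page allocator, and MemTotal moves together with the
# managed counts. Today the memmap carveout is present but not managed
# (hidden from MemTotal); with the diff above it would also be managed,
# i.e., visible in MemTotal but never free.
grep -E 'Node|present|managed' /proc/zoneinfo
grep MemTotal /proc/meminfo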
>
> I thought the system was saying I'd allocated memory (from the 'free'
> capacity) instead of just reducing capacity.
The question is whether you want that memory to be hidden from MemTotal
(carveout?) or treated just like allocated memory (allocation?).
If you treat the memmap as "just a memory allocation after early boot",
with "memmap_on_memory" telling you to allocate that memory from the
hotplugged memory instead of from the buddy, then the "carveout"
might be more of an internal implementation detail used to achieve that
memory allocation.
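To put a number on what gets carved out / allocated (a back-of-the-envelope
sketch, assuming 4 KiB base pages and a 64-byte struct page):

# memmap overhead for a 2 GiB memory block: one struct page per 4 KiB page
echo $(( (2 * 1024 * 1024 * 1024 / 4096) * 64 / (1024 * 1024) ))   # -> 32 (MiB)

So with memmap_on_memory each 2 GiB block gives up roughly 32 MiB (about
1.6%) of its own capacity for the memmap, whichever way we account for it.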
>>> stupid question - it sorta seems like you'd want this as the default
>>> setting for driver-managed hotplug memory blocks, but I suppose for
>>> very small blocks there's problems (as described in the docs).
>>
>> The issue is that it is per-memblock. So you'll never have 1 GiB ranges
>> of consecutive usable memory (e.g., 1 GiB hugetlb page).
>>
>
> That makes sense, I had not considered this. Although it only applies
> for small blocks - which is basically an indictment of this suggestion:
>
> https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@gourry.net/
>
> So I'll have to consider this and whether this should be a default.
> This is probably enough to nak this entirely.
>
>
> ... that said ....
>
> Interestingly, when I tried allocating 1GiB hugetlb pages on a dax device
> in ZONE_MOVABLE (without memmap_on_memory) - the allocation fails silently
> regardless of block size (tried both 2GB and 256MB). I can't find a reason
> why this would be the case in the existing documentation.
Right, it only currently works with ZONE_NORMAL, because 1 GiB pages are
considered unmovable in practice (try reliably finding a 1 GiB area to
migrate the memory to during memory unplug ... when these hugetlb things are
unswappable etc.).
I documented it under https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html
"Gigantic pages are unmovable, resulting in user space consuming a lot of unmovable memory."
If we ever support THP in that size range, we might consider them movable,
because we can just split them or swap them out when allocating a migration
target fails.
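So what you observed is expected, and onlining to ZONE_NORMAL is the way to
get there today. Roughly (a sketch; the memory block number and the node are
placeholders, and an already-online movable block has to be offlined first):

# online the dax/CXL memory blocks as ZONE_NORMAL instead of ZONE_MOVABLE
echo online_kernel > /sys/devices/system/memory/memoryNNN/state
# then 1 GiB hugetlb pages can be allocated on that node
echo 4 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages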
>
> (note: hugepage migration is enabled in build config, so it's not that)
>
> If I enable one block (256MB) into ZONE_NORMAL, and the remainder in
> movable (with memmap_on_memory=n) the allocation still fails, and:
>
> nr_slab_unreclaimable 43
>
> in node1/vmstat - where previously there was nothing.
>
> Onlining the dax devices into ZONE_NORMAL successfully allowed 1GiB huge
> pages to allocate.
>
> This used the /sys/bus/node/devices/node1/hugepages/* interfaces to test
>
> Using the /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages with
> interleave mempolicy - all hugepages end up on ZONE_NORMAL.
>
> (v6.13 base kernel)
>
> This behavior is *curious* to say the least. Not sure if bug, or some
> nuance missing from the documentation - but certainly glad I caught it.
See above :)
>
>
>> I thought we had that? See MHP_MEMMAP_ON_MEMORY set by dax/kmem.
>>
>> IIRC, the global toggle must be enabled for the driver option to be considered.
>
> Oh, well, that's an extra layer I missed. So there's:
>
> build:
> CONFIG_MHP_MEMMAP_ON_MEMORY=y
> CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
> global:
> /sys/module/memory_hotplug/parameters/memmap_on_memory
> device:
> /sys/bus/dax/devices/dax0.0/memmap_on_memory
>
> And looking at it - this does seem to be the default for dax.
>
> So I can drop the existing `nuance movable/memmap` section and just
> replace it with the hugetlb subtleties x_x.
>
> I appreciate the clarifications here, sorry for the incorrect info and
> the increasing confusion.
No worries. If you have ideas on what to improve in the memory hotplug
docs, please let me know!
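For completeness, the three memmap_on_memory layers you listed can be checked
quickly (a sketch; dax0.0 as in your example, and the config path assumes a
distro-style /boot/config):

grep -E 'CONFIG_(ARCH_)?MHP_MEMMAP_ON_MEMORY' /boot/config-"$(uname -r)"
cat /sys/module/memory_hotplug/parameters/memmap_on_memory
cat /sys/bus/dax/devices/dax0.0/memmap_on_memory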
--
Cheers,
David / dhildenb