linux-mm.kvack.org archive mirror
* [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
@ 2025-10-07 21:44 Gregory Price
  2025-10-07 21:59 ` Andrew Morton
  2025-10-08  8:58 ` David Hildenbrand
  0 siblings, 2 replies; 26+ messages in thread
From: Gregory Price @ 2025-10-07 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: corbet, muchun.song, osalvador, david, akpm, hannes, laoar.shao,
	gourry, brauner, mclapinski, joel.granados, linux-doc,
	linux-kernel, Mel Gorman, Michal Hocko, Alexandru Moise,
	Mike Kravetz, David Rientjes

This reverts commit d6cb41cc44c63492702281b1d329955ca767d399.

This sysctl provides some flexibility between multiple requirements which
are difficult to square without adding significantly more complexity.

1) onlining memory in ZONE_MOVABLE to maintain hotplug compatibility
2) onlining memory in ZONE_MOVABLE to prevent GFP_KERNEL usage
3) passing NUMA structure through to a virtual machine (node0=vnode0,
   node1=vnode1) so a guest can make good placement decisions.
4) utilizing 1GB hugepages for VM host memory to reduce TLB pressure
5) Managing device memory after init-time to avoid incidental usage
   at boot (due to being placed in ZONE_NORMAL), or to provide users
   configuration flexibility.

When device-hotplugged memory does not require hot-unplug assurances,
there is no reason to avoid allowing otherwise non-migratable hugepages
in this zone.  This allows for allocation of 1GB gigantic pages for VMs
with existing mechanisms.

Boot-time CMA is not possible for driver-managed hotplug memory, as CMA
requires the memory to be registered as SystemRAM at boot time.

Updated the code to land in appropriate locations since it all moved.
Updated the documentation to add more context when this is useful.

Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
Link: https://lore.kernel.org/all/20180201193132.Hk7vI_xaU%25akpm@linux-foundation.org/
---
 Documentation/admin-guide/sysctl/vm.rst | 31 +++++++++++++++++++++++++
 include/linux/hugetlb.h                 |  4 +++-
 mm/hugetlb.c                            |  9 +++++++
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 4d71211fdad8..c9f26cd447d7 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -40,6 +40,7 @@ Currently, these files are in /proc/sys/vm:
 - enable_soft_offline
 - extfrag_threshold
 - highmem_is_dirtyable
+- hugepages_treat_as_movable
 - hugetlb_shm_group
 - laptop_mode
 - legacy_va_layout
@@ -356,6 +357,36 @@ only use the low memory and they can fill it up with dirty data without
 any throttling.
 
 
+hugepages_treat_as_movable
+==========================
+
+This parameter controls whether otherwise immovable hugepages (e.g. 1GB
+gigantic pages) may be allocated from ZONE_MOVABLE. If set to non-zero,
+gigantic hugepages can be allocated from ZONE_MOVABLE. ZONE_MOVABLE memory
+may be created via the kernel boot parameter `kernelcore` or via memory
+hotplug as discussed in Documentation/admin-guide/mm/memory-hotplug.rst.
+
+Support may depend on specific architecture and/or the hugepage size. If
+a hugepage supports migration, allocation from ZONE_MOVABLE is always
+enabled (for example 2MB on x86) for the hugepage regardless of the value
+of this parameter. IOW, this parameter affects only non-migratable hugepages.
+
+Assuming that hugepages are not migratable on your system, one use case of
+this parameter is to make the hugepage pool more extensible by enabling
+allocation from ZONE_MOVABLE. Page reclaim, migration, and compaction are
+more effective on ZONE_MOVABLE, so contiguous memory is more likely to be
+available. Note that using ZONE_MOVABLE for non-migratable hugepages can
+harm other features such as memory hotremove (which expects that memory
+blocks in ZONE_MOVABLE are always removable), so the trade-off is the
+user's responsibility.
+
+One common use case of this feature is to allocate 1GB gigantic pages for
+virtual machines from otherwise not-hotplugged memory which has been
+isolated from kernel allocations by being onlined into ZONE_MOVABLE.
+These pages tend to be allocated and released more explicitly, and so
+hotplug can still be achieved with appropriate orchestration.
+
+
 hugetlb_shm_group
 =================
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 526d27e88b3b..bbaa1b4908b6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -172,6 +172,7 @@ bool hugetlbfs_pagecache_present(struct hstate *h,
 
 struct address_space *hugetlb_folio_mapping_lock_write(struct folio *folio);
 
+extern int hugepages_treat_as_movable;
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages[MAX_NUMNODES];
 
@@ -926,7 +927,8 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h)
 {
 	gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
 
-	gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
+	gfp |= (hugepage_movable_supported(h) || hugepages_treat_as_movable) ?
+	       GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
 
 	return gfp;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 753f99b4c718..4b2213ccbb29 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -55,6 +55,8 @@
 #include "hugetlb_cma.h"
 #include <linux/page-isolation.h>
 
+int hugepages_treat_as_movable;
+
 int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
@@ -5195,6 +5197,13 @@ static const struct ctl_table hugetlb_table[] = {
 		.mode		= 0644,
 		.proc_handler	= hugetlb_overcommit_handler,
 	},
+	{
+		.procname	= "hugepages_treat_as_movable",
+		.data		= &hugepages_treat_as_movable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 };
 
 static void __init hugetlb_sysctl_init(void)
-- 
2.51.0




* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-07 21:44 [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl" Gregory Price
@ 2025-10-07 21:59 ` Andrew Morton
  2025-10-07 22:12   ` Gregory Price
  2025-10-08  8:58 ` David Hildenbrand
  1 sibling, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2025-10-07 21:59 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, corbet, muchun.song, osalvador, david, hannes,
	laoar.shao, brauner, mclapinski, joel.granados, linux-doc,
	linux-kernel, Mel Gorman, Michal Hocko, Alexandru Moise,
	Mike Kravetz, David Rientjes

On Tue,  7 Oct 2025 17:44:12 -0400 Gregory Price <gourry@gourry.net> wrote:

> This reverts commit d6cb41cc44c63492702281b1d329955ca767d399.

It's been seven years.  Perhaps "reintroduce hugepages_treat_as_movable
sysctl" would be a better way of presenting this.  Not very important.

> This sysctl provides some flexibility between multiple requirements which
> are difficult to square without adding significantly more complexity.
> 
> 1) onlining memory in ZONE_MOVABLE to maintain hotplug compatibility
> 2) onlining memory in ZONE_MOVABLE to prevent GFP_KERNEL usage
> 3) passing NUMA structure through to a virtual machine (node0=vnode0,
>    node1=vnode1) so a guest can make good placement decisions.
> 4) utilizing 1GB hugepages for VM host memory to reduce TLB pressure
> 5) Managing device memory after init-time to avoid incidental usage
>    at boot (due to being placed in ZONE_NORMAL), or to provide users
>    configuration flexibility.
> 
> When device-hotplugged memory does not require hot-unplug assurances,
> there is no reason to avoid allowing otherwise non-migratable hugepages
> in this zone.  This allows for allocation of 1GB gigantic pages for VMs
> with existing mechanisms.
> 
> Boot-time CMA is not possible for driver-managed hotplug memory, as CMA
> requires the memory to be registered as SystemRAM at boot time.
> 
> Updated the code to land in appropriate locations since it all moved.
> Updated the documentation to add more context when this is useful.

I'll duck the patch for now, see what people have to say.

> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -55,6 +55,8 @@
>  #include "hugetlb_cma.h"
>  #include <linux/page-isolation.h>
>  
> +int hugepages_treat_as_movable;
> +
>  int hugetlb_max_hstate __read_mostly;
>  unsigned int default_hstate_idx;
>  struct hstate hstates[HUGE_MAX_HSTATE];

Could sprinkle some more __read_mostlys around here?

>
> ...
>



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-07 21:59 ` Andrew Morton
@ 2025-10-07 22:12   ` Gregory Price
  0 siblings, 0 replies; 26+ messages in thread
From: Gregory Price @ 2025-10-07 22:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, corbet, muchun.song, osalvador, david, hannes,
	laoar.shao, brauner, mclapinski, joel.granados, linux-doc,
	linux-kernel, Mel Gorman, Michal Hocko, Alexandru Moise,
	Mike Kravetz, David Rientjes

On Tue, Oct 07, 2025 at 02:59:55PM -0700, Andrew Morton wrote:
> On Tue,  7 Oct 2025 17:44:12 -0400 Gregory Price <gourry@gourry.net> wrote:
> 
> > This reverts commit d6cb41cc44c63492702281b1d329955ca767d399.
> 
> It's been seven years.  Perhaps "reintroduce hugepages_treat_as_movable
> sysctl" would be a better way of presenting this.  Not very important.
>

But a blink of an eye! Will fix it up if feedback is positive.

> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -55,6 +55,8 @@
> >  #include "hugetlb_cma.h"
> >  #include <linux/page-isolation.h>
> >  
> > +int hugepages_treat_as_movable;
> > +
> >  int hugetlb_max_hstate __read_mostly;
> >  unsigned int default_hstate_idx;
> >  struct hstate hstates[HUGE_MAX_HSTATE];
> 
> Could sprinkle some more __read_mostlys around here?
> 

Makes sense, will take a look while I'm poking around.

Thanks Andrew!
~Gregory



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-07 21:44 [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl" Gregory Price
  2025-10-07 21:59 ` Andrew Morton
@ 2025-10-08  8:58 ` David Hildenbrand
  2025-10-08 14:18   ` Gregory Price
  2025-10-08 14:59   ` Michal Hocko
  1 sibling, 2 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-10-08  8:58 UTC (permalink / raw)
  To: Gregory Price, linux-mm
  Cc: corbet, muchun.song, osalvador, akpm, hannes, laoar.shao,
	brauner, mclapinski, joel.granados, linux-doc, linux-kernel,
	Mel Gorman, Michal Hocko, Alexandru Moise, Mike Kravetz,
	David Rientjes

On 07.10.25 23:44, Gregory Price wrote:
> This reverts commit d6cb41cc44c63492702281b1d329955ca767d399.
> 
> This sysctl provides some flexibility between multiple requirements which
> are difficult to square without adding significantly more complexity.
> 
> 1) onlining memory in ZONE_MOVABLE to maintain hotplug compatibility
> 2) onlining memory in ZONE_MOVABLE to prevent GFP_KERNEL usage
> 3) passing NUMA structure through to a virtual machine (node0=vnode0,
>     node1=vnode1) so a guest can make good placement decisions.
> 4) utilizing 1GB hugepages for VM host memory to reduce TLB pressure
> 5) Managing device memory after init-time to avoid incidental usage
>     at boot (due to being placed in ZONE_NORMAL), or to provide users
>     configuration flexibility.
> 
> When device-hotplugged memory does not require hot-unplug assurances,
> there is no reason to avoid allowing otherwise non-migratable hugepages
> in this zone.  This allows for allocation of 1GB gigantic pages for VMs
> with existing mechanisms.
> 
> Boot-time CMA is not possible for driver-managed hotplug memory, as CMA
> requires the memory to be registered as SystemRAM at boot time.
> 
> Updated the code to land in appropriate locations since it all moved.
> Updated the documentation to add more context when this is useful.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Alexandru Moise <00moses.alexander00@gmail.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Link: https://lore.kernel.org/all/20180201193132.Hk7vI_xaU%25akpm@linux-foundation.org/
> ---
>   Documentation/admin-guide/sysctl/vm.rst | 31 +++++++++++++++++++++++++
>   include/linux/hugetlb.h                 |  4 +++-
>   mm/hugetlb.c                            |  9 +++++++
>   3 files changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 4d71211fdad8..c9f26cd447d7 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -40,6 +40,7 @@ Currently, these files are in /proc/sys/vm:
>   - enable_soft_offline
>   - extfrag_threshold
>   - highmem_is_dirtyable
> +- hugepages_treat_as_movable
>   - hugetlb_shm_group
>   - laptop_mode
>   - legacy_va_layout
> @@ -356,6 +357,36 @@ only use the low memory and they can fill it up with dirty data without
>   any throttling.
>   
>   
> +hugepages_treat_as_movable
> +==========================
> +
> +This parameter controls whether otherwise immovable hugepages (e.g. 1GB
> +gigantic pages) may be allocated from ZONE_MOVABLE. If set to non-zero,
> +gigantic hugepages can be allocated from ZONE_MOVABLE. ZONE_MOVABLE memory
> +may be created via the kernel boot parameter `kernelcore` or via memory
> +hotplug as discussed in Documentation/admin-guide/mm/memory-hotplug.rst.
> +
> +Support may depend on specific architecture and/or the hugepage size. If
> +a hugepage supports migration, allocation from ZONE_MOVABLE is always
> +enabled (for example 2MB on x86) for the hugepage regardless of the value
> +of this parameter. IOW, this parameter affects only non-migratable hugepages.
> +
> +Assuming that hugepages are not migratable on your system, one use case of
> +this parameter is to make the hugepage pool more extensible by enabling
> +allocation from ZONE_MOVABLE. Page reclaim, migration, and compaction are
> +more effective on ZONE_MOVABLE, so contiguous memory is more likely to be
> +available. Note that using ZONE_MOVABLE for non-migratable hugepages can
> +harm other features such as memory hotremove (which expects that memory
> +blocks in ZONE_MOVABLE are always removable), so the trade-off is the
> +user's responsibility.
> +
> +One common use case of this feature is to allocate 1GB gigantic pages for
> +virtual machines from otherwise not-hotplugged memory which has been
> +isolated from kernel allocations by being onlined into ZONE_MOVABLE.
> +These pages tend to be allocated and released more explicitly, and so
> +hotplug can still be achieved with appropriate orchestration.
> +
> +
>   hugetlb_shm_group
>   =================
>   
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 526d27e88b3b..bbaa1b4908b6 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -172,6 +172,7 @@ bool hugetlbfs_pagecache_present(struct hstate *h,
>   
>   struct address_space *hugetlb_folio_mapping_lock_write(struct folio *folio);
>   
> +extern int hugepages_treat_as_movable;
>   extern int sysctl_hugetlb_shm_group;
>   extern struct list_head huge_boot_pages[MAX_NUMNODES];
>   
> @@ -926,7 +927,8 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h)
>   {
>   	gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
>   
> -	gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> +	gfp |= (hugepage_movable_supported(h) || hugepages_treat_as_movable) ?
> +	       GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;

I mean, this is as ugly as it gets.

Can't we just let that old approach RIP where it belongs? :)

If something is unmovable, it does not belong on ZONE_MOVABLE, as simple as that.

Something I could sympathize with is treating gigantic pages that are actually
migratable as movable.


Like

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 526d27e88b3b2..78da85b1308dd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -896,37 +896,12 @@ static inline bool hugepage_migration_supported(struct hstate *h)
         return arch_hugetlb_migration_supported(h);
  }
  
-/*
- * Movability check is different as compared to migration check.
- * It determines whether or not a huge page should be placed on
- * movable zone or not. Movability of any huge page should be
- * required only if huge page size is supported for migration.
- * There won't be any reason for the huge page to be movable if
- * it is not migratable to start with. Also the size of the huge
- * page should be large enough to be placed under a movable zone
- * and still feasible enough to be migratable. Just the presence
- * in movable zone does not make the migration feasible.
- *
- * So even though large huge page sizes like the gigantic ones
- * are migratable they should not be movable because its not
- * feasible to migrate them from movable zone.
- */
-static inline bool hugepage_movable_supported(struct hstate *h)
-{
-       if (!hugepage_migration_supported(h))
-               return false;
-
-       if (hstate_is_gigantic(h))
-               return false;
-       return true;
-}
-
  /* Movability of hugepages depends on migration support. */
  static inline gfp_t htlb_alloc_mask(struct hstate *h)
  {
         gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
  
-       gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
+       gfp |= hugepage_migration_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
  
         return gfp;
  }


Assume you want to offline part of ZONE_MOVABLE: there might still be sufficient
space to possibly allocate a 1 GiB area elsewhere and actually move the gigantic page.

IIRC, we do the same for memory offlining already.


Now, maybe we want to make that configurable. But then, I would much rather tweak the
hstate_is_gigantic() check in hugepage_movable_supported(). And the parameter
would need a much better name than some "treat as movable".

-- 
Cheers

David / dhildenb




* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08  8:58 ` David Hildenbrand
@ 2025-10-08 14:18   ` Gregory Price
  2025-10-08 14:44     ` David Hildenbrand
  2025-10-08 14:59   ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Gregory Price @ 2025-10-08 14:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, corbet, muchun.song, osalvador, akpm, hannes,
	laoar.shao, brauner, mclapinski, joel.granados, linux-doc,
	linux-kernel, Mel Gorman, Michal Hocko, Alexandru Moise,
	David Rientjes

On Wed, Oct 08, 2025 at 10:58:23AM +0200, David Hildenbrand wrote:
> On 07.10.25 23:44, Gregory Price wrote:
> I mean, this is as ugly as it gets.
> 
> Can't we just let that old approach RIP where it belongs? :)
> 

Definitely - just found this previously existed and wanted to probe for
how offensive reintroducing it would be. Seems the answer is essentially
"let's do it a little differently".

> Something I could sympathize with is treating gigantic pages that are actually
> migratable as movable.
> 
...
> -       gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> +       gfp |= hugepage_migration_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> 
> Assume you want to offline part of ZONE_MOVABLE: there might still be sufficient
> space to possibly allocate a 1 GiB area elsewhere and actually move the gigantic page.
> 
> IIRC, we do the same for memory offlining already.
> 

This is generally true of other page sizes as well, though, isn't it?
If the system is truly so pressured that it can't successfully move a
2MB page - offline may still fail.  So allowing 1GB pages is only a risk
in the sense that new migration targets for them are harder to allocate.

It matters more if your system has 64GB than it does if it has 4TB.

> Now, maybe we want to make that configurable. But then, I would much rather tweak the
> hstate_is_gigantic() check in hugepage_movable_supported(). And the parameter
> would need a much better name than some "treat as movable".
> 

Makes sense - I think the change is logically equivalent.

So it would look like...

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 42f374e828a2..36b1eec58e6f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -924,7 +924,7 @@ static inline bool hugepage_movable_supported(struct hstate *h)
        if (!hugepage_migration_supported(h))
                return false;

-       if (hstate_is_gigantic(h))
+       if (hstate_is_gigantic(h) && !movable_gigantic_pages)
                return false;
        return true;
 }

And adjust documentation accordingly.
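
For reference, the resulting decision can be modeled in userspace. This is
only a sketch under stated assumptions: struct hstate_model and its fields
are hypothetical stand-ins for struct hstate and the results of
hugepage_migration_supported() and hstate_is_gigantic(), and
movable_gigantic_pages is just the placeholder toggle name used in the
diff above, not a settled interface.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct hstate. */
struct hstate_model {
	bool arch_migratable;	/* models hugepage_migration_supported(h) */
	bool gigantic;		/* models hstate_is_gigantic(h) */
};

/* Placeholder toggle modeling the proposed sysctl (name not final). */
static bool movable_gigantic_pages;

/* Mirrors the control flow of the proposed hugepage_movable_supported(). */
static bool hugepage_movable_supported_model(const struct hstate_model *h)
{
	/* A size the architecture cannot migrate is never movable. */
	if (!h->arch_migratable)
		return false;

	/* Gigantic pages are movable only when the toggle is set. */
	if (h->gigantic && !movable_gigantic_pages)
		return false;

	return true;
}
```

Note that in this form the toggle only affects sizes the architecture can
actually migrate; a non-migratable size stays out of ZONE_MOVABLE even with
the toggle set, which differs from the OR in the original patch.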

I'm running some tests in QEMU atm, but it's taking a bit.  Will report
back if I see issues with migration when this is turned on.

If that's acceptable, I'll hack this up.

Thanks David,
~Gregory



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 14:18   ` Gregory Price
@ 2025-10-08 14:44     ` David Hildenbrand
  2025-10-08 18:58       ` Gregory Price
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-10-08 14:44 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, corbet, muchun.song, osalvador, akpm, hannes,
	laoar.shao, brauner, mclapinski, joel.granados, linux-doc,
	linux-kernel, Mel Gorman, Michal Hocko, Alexandru Moise,
	David Rientjes

On 08.10.25 16:18, Gregory Price wrote:
> On Wed, Oct 08, 2025 at 10:58:23AM +0200, David Hildenbrand wrote:
>> On 07.10.25 23:44, Gregory Price wrote:
>> I mean, this is as ugly as it gets.
>>
>> Can't we just let that old approach RIP where it belongs? :)
>>
> 
> Definitely - just found this previously existed and wanted to probe for
> how offensive reintroducing it would be. Seems the answer is essentially
> "let's do it a little differently".
> 
>> Something I could sympathize with is treating gigantic pages that are actually
>> migratable as movable.
>>
> ...
>> -       gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
>> +       gfp |= hugepage_migration_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
>>
>> Assume you want to offline part of ZONE_MOVABLE: there might still be sufficient
>> space to possibly allocate a 1 GiB area elsewhere and actually move the gigantic page.
>>
>> IIRC, we do the same for memory offlining already.
>>
> 
> This is generally true of other page sizes as well, though, isn't it?
> If the system is truly so pressured that it can't successfully move a
> 2MB page - offline may still fail.  So allowing 1GB pages is only a risk
> in the sense that new migration targets for them are harder to allocate.

Right, but memory defragmentation works on pageblock level, so 2 MiB is 
much MUCH more reliable :)
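
To put rough numbers on that, here is a small sketch. It assumes 4 KiB
base pages and a pageblock order of 9 (2 MiB pageblocks, as is typical on
x86-64); both constants are assumptions and other configurations differ.
Under those assumptions a 2 MiB page fits in one pageblock, while a 1 GiB
migration target needs 512 contiguous pageblocks.

```c
#include <stddef.h>

#define BASE_PAGE_SHIFT	12	/* assumed: 4 KiB base pages */
#define PAGEBLOCK_ORDER	9	/* assumed: 2 MiB pageblocks */

/* Number of pageblocks a naturally aligned region of 'bytes' spans. */
static unsigned long long pageblocks_spanned(unsigned long long bytes)
{
	unsigned long long pageblock_bytes =
		1ULL << (BASE_PAGE_SHIFT + PAGEBLOCK_ORDER);

	/* Round up so partial pageblocks count as occupied. */
	return (bytes + pageblock_bytes - 1) / pageblock_bytes;
}
```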

> 
> It matters more if your system has 64GB than it does if it has 4TB.
> 
>> Now, maybe we want to make that configurable. But then, I would much rather tweak the
>> hstate_is_gigantic() check in hugepage_movable_supported(). And the parameter
>> would need a much better name than some "treat as movable".
>>
> 
> Makes sense - I think the change is logically equivalent.
> 
> So it would look like...
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 42f374e828a2..36b1eec58e6f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -924,7 +924,7 @@ static inline bool hugepage_movable_supported(struct hstate *h)
>          if (!hugepage_migration_supported(h))
>                  return false;
> 
> -       if (hstate_is_gigantic(h))
> +       if (hstate_is_gigantic(h) && !movable_gigantic_pages)
>                  return false;
>          return true;
>   }
> 
> And adjust documentation accordingly.
> 
> I'm running some tests in QEMU atm, but it's taking a bit.  Will report
> back if I see issues with migration when this is turned on.
> 
> If that's acceptable, I'll hack this up.

That looks better to me indeed.

Maybe we can export this toggle only if the arch supports migration? 
Then there is also nothing odd to document.

-- 
Cheers

David / dhildenb




* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08  8:58 ` David Hildenbrand
  2025-10-08 14:18   ` Gregory Price
@ 2025-10-08 14:59   ` Michal Hocko
  2025-10-08 15:14     ` David Hildenbrand
  2025-10-08 16:08     ` Frank van der Linden
  1 sibling, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2025-10-08 14:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Gregory Price, linux-mm, corbet, muchun.song, osalvador, akpm,
	hannes, laoar.shao, brauner, mclapinski, joel.granados,
	linux-doc, linux-kernel, Mel Gorman, Alexandru Moise,
	Mike Kravetz, David Rientjes

On Wed 08-10-25 10:58:23, David Hildenbrand wrote:
> On 07.10.25 23:44, Gregory Price wrote:
[...]
> > @@ -926,7 +927,8 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h)
> >   {
> >   	gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
> > -	gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> > +	gfp |= (hugepage_movable_supported(h) || hugepages_treat_as_movable) ?
> > +	       GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> 
> I mean, this is as ugly as it gets.
> 
> Can't we just let that old approach RIP where it belongs? :)
> 
> If something is unmovable, it does not belong on ZONE_MOVABLE, as simple as that.

Yes, I do agree. This just muddies the semantics of the zone.

Maybe what we really want is a configurable zone rather than a
very specific consumer of it. What do I mean by that? We clearly
have physically (DMA, DMA32) and usability (NORMAL, MOVABLE) constrained
zones. So rather than having a MOVABLE zone we could have a single
$FOO_NAME zone with configurable attributes, like allocation
constraints (kernel, user, movable, etc). Now that we can overlap zones
this should allow for quite a lot of flexibility. Implementation-wise this
would require some tricks as we have 2 zone types for potentially 3
different major use cases (kernel allocations, userspace reserved ranges
without movability, and movable allocations). I haven't thought this
through completely and am mostly throwing this out as an idea (it may not
work). Does that make sense?
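
A rough userspace sketch of what "configurable attributes" could mean,
purely as an illustration: every name below (alloc_class, zone_model, the
"User" zone) is hypothetical and nothing like it exists in the kernel
today. Each zone carries a mask of the allocation classes it admits, and
the three use cases above map to three classes.

```c
#include <stdbool.h>

/* Hypothetical allocation classes matching the three use cases above. */
enum alloc_class {
	ALLOC_KERNEL       = 1u << 0,	/* kernel allocations */
	ALLOC_USER_PINNED  = 1u << 1,	/* userspace, not movable */
	ALLOC_USER_MOVABLE = 1u << 2,	/* movable allocations */
};

/* Hypothetical zone description: a name plus the classes it admits. */
struct zone_model {
	const char *name;
	unsigned int allowed;
};

static bool zone_admits(const struct zone_model *z, enum alloc_class c)
{
	return (z->allowed & c) != 0;
}

/* Today's zones expressed in this model, plus one configurable zone. */
static const struct zone_model zone_normal_model = {
	"Normal", ALLOC_KERNEL | ALLOC_USER_PINNED | ALLOC_USER_MOVABLE
};
static const struct zone_model zone_movable_model = {
	"Movable", ALLOC_USER_MOVABLE
};
static const struct zone_model zone_user_model = {
	"User", ALLOC_USER_PINNED | ALLOC_USER_MOVABLE
};
```

In this model the "User" zone would take gigantic pages without admitting
kernel allocations, while the "Movable" zone keeps its current meaning.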
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 14:59   ` Michal Hocko
@ 2025-10-08 15:14     ` David Hildenbrand
  2025-10-08 15:23       ` Michal Hocko
  2025-10-08 16:08     ` Frank van der Linden
  1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-10-08 15:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Gregory Price, linux-mm, corbet, muchun.song, osalvador, akpm,
	hannes, laoar.shao, brauner, mclapinski, joel.granados,
	linux-doc, linux-kernel, Mel Gorman, Alexandru Moise,
	Mike Kravetz, David Rientjes

On 08.10.25 16:59, Michal Hocko wrote:
> On Wed 08-10-25 10:58:23, David Hildenbrand wrote:
>> On 07.10.25 23:44, Gregory Price wrote:
> [...]
>>> @@ -926,7 +927,8 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h)
>>>    {
>>>    	gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
>>> -	gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
>>> +	gfp |= (hugepage_movable_supported(h) || hugepages_treat_as_movable) ?
>>> +	       GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
>>
>> I mean, this is as ugly as it gets.
>>
>> Can't we just let that old approach RIP where it belongs? :)
>>
>> If something is unmovable, it does not belong on ZONE_MOVABLE, as simple as that.
> 
> Yes, I do agree. This just muddies the semantics of the zone.
> 
> Maybe what we really want is a configurable zone rather than a
> very specific consumer of it. What do I mean by that? We clearly
> have physically (DMA, DMA32) and usability (NORMAL, MOVABLE) constrained
> zones. So rather than having a MOVABLE zone we could have a single
> $FOO_NAME zone with configurable attributes, like allocation
> constraints (kernel, user, movable, etc). Now that we can overlap zones
> this should allow for quite a lot of flexibility. Implementation-wise this
> would require some tricks as we have 2 zone types for potentially 3
> different major use cases (kernel allocations, userspace reserved ranges
> without movability, and movable allocations). I haven't thought this
> through completely and am mostly throwing this out as an idea (it may not
> work). Does that make sense?

I suggested something called PREFER_MOVABLE in the past, which would
prefer movable allocations, but nothing would stop unmovable allocations
from ending up on it, though only as a last resort or when explicitly
requested (e.g., gigantic pages).

Maybe that's similar to what you have in mind?
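
One possible reading of that policy, sketched in userspace with entirely
hypothetical names (zone_pick, struct request_model, pick_zone): movable
and explicitly requested allocations try the PREFER_MOVABLE zone first,
and ordinary unmovable allocations reach it only when the normal zone
has no room.

```c
#include <stdbool.h>

/* Hypothetical result of zone selection. */
enum zone_pick { PICK_NORMAL, PICK_PREFER_MOVABLE, PICK_NONE };

/* Hypothetical allocation request. */
struct request_model {
	bool movable;		/* allocation can be migrated later */
	bool explicit_request;	/* caller asked for this zone, e.g. gigantic page */
};

static enum zone_pick pick_zone(const struct request_model *r,
				bool normal_has_room,
				bool prefer_movable_has_room)
{
	/* Movable or explicitly requested: PREFER_MOVABLE comes first. */
	if (r->movable || r->explicit_request) {
		if (prefer_movable_has_room)
			return PICK_PREFER_MOVABLE;
		return normal_has_room ? PICK_NORMAL : PICK_NONE;
	}

	/* Unmovable: normal zone first, PREFER_MOVABLE as a last resort. */
	if (normal_has_room)
		return PICK_NORMAL;
	return prefer_movable_has_room ? PICK_PREFER_MOVABLE : PICK_NONE;
}
```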

-- 
Cheers

David / dhildenb




* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 15:14     ` David Hildenbrand
@ 2025-10-08 15:23       ` Michal Hocko
  2025-10-08 15:43         ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2025-10-08 15:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Gregory Price, linux-mm, corbet, muchun.song, osalvador, akpm,
	hannes, laoar.shao, brauner, mclapinski, joel.granados,
	linux-doc, linux-kernel, Mel Gorman, Alexandru Moise,
	Mike Kravetz, David Rientjes

On Wed 08-10-25 17:14:26, David Hildenbrand wrote:
> On 08.10.25 16:59, Michal Hocko wrote:
> > On Wed 08-10-25 10:58:23, David Hildenbrand wrote:
> > > On 07.10.25 23:44, Gregory Price wrote:
> > [...]
> > > > @@ -926,7 +927,8 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h)
> > > >    {
> > > >    	gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
> > > > -	gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> > > > +	gfp |= (hugepage_movable_supported(h) || hugepages_treat_as_movable) ?
> > > > +	       GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> > > 
> > > I mean, this is as ugly as it gets.
> > > 
> > > Can't we just let that old approach RIP where it belongs? :)
> > > 
> > > If something is unmovable, it does not belong on ZONE_MOVABLE, as simple as that.
> > 
> > Yes, I do agree. This just muddies the semantics of the zone.
> > 
> > Maybe what we really want is a configurable zone rather than a
> > very specific consumer of it. What do I mean by that? We clearly
> > have physically (DMA, DMA32) and usability (NORMAL, MOVABLE) constrained
> > zones. So rather than having a MOVABLE zone we could have a single
> > $FOO_NAME zone with configurable attributes, like allocation
> > constraints (kernel, user, movable, etc). Now that we can overlap zones
> > this should allow for quite a lot of flexibility. Implementation-wise this
> > would require some tricks as we have 2 zone types for potentially 3
> > different major use cases (kernel allocations, userspace reserved ranges
> > without movability, and movable allocations). I haven't thought this
> > through completely and am mostly throwing this out as an idea (it may not
> > work). Does that make sense?
> 
> > I suggested something called PREFER_MOVABLE in the past, which would prefer
> > movable allocations, but nothing would stop unmovable allocations from ending
> > up on it, though only as a last resort or when explicitly requested (e.g.,
> > gigantic pages).
> 
> Maybe that's similar to what you have in mind?

Slightly different because what I was thinking about was more towards
guarantee/predictability. Last resort is quite hard to plan around. A
peak in memory pressure might eat up a portion of a memory block and
then fragment it, preventing other uses planned for that memory block.
That is why I called it user allocations because those are supposed to
be configured for userspace consumption and planned for that use. So you
would get pretty much a guarantee that no kernel allocations will fall
there.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 15:23       ` Michal Hocko
@ 2025-10-08 15:43         ` David Hildenbrand
  2025-10-08 16:31           ` Gregory Price
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-10-08 15:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Gregory Price, linux-mm, corbet, muchun.song, osalvador, akpm,
	hannes, laoar.shao, brauner, mclapinski, joel.granados,
	linux-doc, linux-kernel, Mel Gorman, Alexandru Moise,
	Mike Kravetz, David Rientjes

On 08.10.25 17:23, Michal Hocko wrote:
> On Wed 08-10-25 17:14:26, David Hildenbrand wrote:
>> On 08.10.25 16:59, Michal Hocko wrote:
>>> On Wed 08-10-25 10:58:23, David Hildenbrand wrote:
>>>> On 07.10.25 23:44, Gregory Price wrote:
>>> [...]
>>>>> @@ -926,7 +927,8 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h)
>>>>>     {
>>>>>     	gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
>>>>> -	gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
>>>>> +	gfp |= (hugepage_movable_supported(h) || hugepages_treat_as_movable) ?
>>>>> +	       GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
>>>>
>>>> I mean, this is as ugly as it gets.
>>>>
>>>> Can't we just let that old approach RIP where it belongs? :)
>>>>
>>>> If something is unmovable, it does not belong on ZONE_MOVABLE, as simple as that.
>>>
>>> yes, I do agree. This is just muddying the semantic of the zone.
>>>
>>> Maybe what we really want is to have a configurable zone rather than a
>>> very specific consumer of it instead. What do I mean by that? We clearly
>>> have physically (DMA, DMA32) and usability (NORMAL, MOVABLE) constrained
>>> zones. So rather than having a MOVABLE zone we can have a single zone
>>> $FOO_NAME zone with configurable attributes - like allocation
>>> constraints (kernel, user, movable, etc). Now that we can overlap zones
>>> this should allow for quite a lot flexibility. Implementation wise this
>>> would require some tricks as we have 2 zone types for potentially 3
>>> different major usecases (kernel allocations, userspace reserved ranges
>>> without movability and movable allocations). I haven't thought this
>>> through completely and mostly throwing this as an idea (maybe won't
>>> work). Does that make sense?
>>
>> I suggested something called PREFER_MOVABLE in the past, that would prefer
>> movable allocations but nothing would stop unmovable allocations to end up
>> on it. But only as a last resort or when explicitly requested (e.g.,
>> gigantic pages).
>>
>> Maybe that's similar to what you have in mind?
> 
> Slightly different because what I was thinking about was more towards
> guarantee/predictability. Last resort is quite hard to plan around. A
> peak in memory pressure might eat up a portion of a memory block and
> then fragment it, preventing other uses planned for that memory block.
> That is why I called it user allocations because those are supposed to
> be configured for userspace consumption and planned for that use. So you
> would get pretty much a guarantee that no kernel allocations will fall
> there.

What could end up on it that would not already end up on ZONE_MOVABLE? I 
guess long-term pinned pages, secretmem, guest_memfd, gigantic pages.

Anything else?

I'm not quite clear yet on the use case, though. If all the user 
allocations end up fragmenting the memory, there is also not a lot of 
benefit to be had from that zone long term.

-- 
Cheers

David / dhildenb




* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 14:59   ` Michal Hocko
  2025-10-08 15:14     ` David Hildenbrand
@ 2025-10-08 16:08     ` Frank van der Linden
  2025-10-08 16:39       ` Gregory Price
  2025-10-08 17:05       ` Gregory Price
  1 sibling, 2 replies; 26+ messages in thread
From: Frank van der Linden @ 2025-10-08 16:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Gregory Price, linux-mm, corbet, muchun.song,
	osalvador, akpm, hannes, laoar.shao, brauner, mclapinski,
	joel.granados, linux-doc, linux-kernel, Mel Gorman,
	Alexandru Moise, Mike Kravetz, David Rientjes

On Wed, Oct 8, 2025 at 7:59 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 08-10-25 10:58:23, David Hildenbrand wrote:
> > On 07.10.25 23:44, Gregory Price wrote:
> [...]
> > > @@ -926,7 +927,8 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h)
> > >   {
> > >     gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
> > > -   gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> > > +   gfp |= (hugepage_movable_supported(h) || hugepages_treat_as_movable) ?
> > > +          GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> >
> > I mean, this is as ugly as it gets.
> >
> > Can't we just let that old approach RIP where it belongs? :)
> >
> > If something is unmovable, it does not belong on ZONE_MOVABLE, as simple as that.
>
> yes, I do agree. This is just muddying the semantic of the zone.
>
> Maybe what we really want is to have a configurable zone rather than a
> very specific consumer of it instead. What do I mean by that? We clearly
> have physically (DMA, DMA32) and usability (NORMAL, MOVABLE) constrained
> zones. So rather than having a MOVABLE zone we can have a single zone
> $FOO_NAME zone with configurable attributes - like allocation
> constraints (kernel, user, movable, etc). Now that we can overlap zones
> this should allow for quite a lot flexibility. Implementation wise this
> would require some tricks as we have 2 zone types for potentially 3
> different major usecases (kernel allocations, userspace reserved ranges
> without movability and movable allocations). I haven't thought this
> through completely and mostly throwing this as an idea (maybe won't
> work). Does that make sense?
> --
> Michal Hocko
> SUSE Labs
>

Right, it's all about what the intended goal is. There are two
different goals here. If the goal is hotremove, then no, you don't
want anything in ZONE_MOVABLE that is not migratable. But if the goal
is to have normal allocations always be migratable so that you can get
'gigantic' hugepages, then it is fine to have those gigantic hugepages
not be migratable. They are the goal, after all, and won't get in the
way of other gigantic hugepage allocations.

Somewhat similar situation with CMA (currently only hugetlb_cma). It
is never OK to pin something in MIGRATE_CMA pageblocks. But what if
you have allocated a hugetlb page through cma_alloc? There is no point
in disallowing pinning for it. That 1G page is the goal, and pinning
it won't get in the way of other cma_alloc calls for 1G pages.

I agree that having multiple zone properties is probably the way to go.

At one point, I implemented something like this, a minimum pinnable
page order for ZONE_MOVABLE. E.g. if you set it to 9, then anything >=
2M can be pinned, so ZONE_MOVABLE will help get you THPs, but not
necessarily anything else. There is also the use case for CXL memory,
where you don't want any kernel allocations to come from zones on that
node.
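
A minimal user-space sketch of what such a minimum-pinnable-order check
could look like. Everything here is an assumption made for illustration:
the names zone_policy, min_pin_order and may_longterm_pin are
hypothetical and are not taken from any posted patch or from the kernel.

```c
#include <stdbool.h>

/* Hypothetical per-zone policy; not an existing kernel structure. */
struct zone_policy {
    bool movable_only;          /* zone gives ZONE_MOVABLE-style guarantees */
    unsigned int min_pin_order; /* smallest folio order that may be pinned */
};

/*
 * Return true if a folio of the given order may be long-term pinned in
 * place. Smaller folios would have to be migrated out of the zone first.
 */
bool may_longterm_pin(const struct zone_policy *zp, unsigned int order)
{
    if (!zp->movable_only)
        return true;            /* no movability promise to protect */
    return order >= zp->min_pin_order;
}
```

With min_pin_order set to 9 and 4KiB base pages, a 2M folio (order 9)
passes the check while an order-0 page does not, which matches the
"ZONE_MOVABLE helps you get THPs, but not necessarily anything else"
behaviour described above.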

- Frank



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 15:43         ` David Hildenbrand
@ 2025-10-08 16:31           ` Gregory Price
  2025-10-09  6:14             ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Gregory Price @ 2025-10-08 16:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, linux-mm, corbet, muchun.song, osalvador, akpm,
	hannes, laoar.shao, brauner, mclapinski, joel.granados,
	linux-doc, linux-kernel, Mel Gorman, Alexandru Moise,
	Mike Kravetz, David Rientjes

On Wed, Oct 08, 2025 at 05:43:23PM +0200, David Hildenbrand wrote:
> On 08.10.25 17:23, Michal Hocko wrote:
> > On Wed 08-10-25 17:14:26, David Hildenbrand wrote:
> > > On 08.10.25 16:59, Michal Hocko wrote:
> > > > yes, I do agree. This is just muddying the semantic of the zone.
> > > > 
> > > > Maybe what we really want is to have a configurable zone rather than a
> > > > very specific consumer of it instead. What do I mean by that? We clearly
> > > > have physically (DMA, DMA32) and usability (NORMAL, MOVABLE) constrained
> > > > zones. So rather than having a MOVABLE zone we can have a single zone
> > > > $FOO_NAME zone with configurable attributes - like allocation
> > > > > constraints (kernel, user, movable, etc). Now that we can overlap zones
> > > > this should allow for quite a lot flexibility. Implementation wise this
> > > > would require some tricks as we have 2 zone types for potentially 3
> > > > different major usecases (kernel allocations, userspace reserved ranges
> > > > without movability and movable allocations). I haven't thought this
> > > > through completely and mostly throwing this as an idea (maybe won't
> > > > work). Does that make sense?
> > >

I'd also considered something between NORMAL and MOVABLE, something like
ZONE_NOKERNEL or ZONE_USER. But that seemed excessive.

> > That is why I called it user allocations because those are supposed to
> > be configured for userspace consumption and planned for that use. So you
> > would get pretty much a guarantee that no kernel allocations will fall
> > there.
> 
> What could end up on it that would not already end up on ZONE_MOVABLE? I
> guess long-term pinned pages, secretmem, guest_memfd, gigantic pages.
> 
> Anything else?
> 
> I'm not quite clear yet on the use case, though. If all the user allocations
> end up fragmenting the memory, there is also not a lot of benefit to be had
> from that zone long term.
>

The only real use case I've seen is exactly:
 - Don't want random GFP_KERNEL to land there
 - Might want it to be pinnable

I think that covers what you've described above.

But adding an entire zone felt a bit heavy-handed.  Allowing gigantic in
movable seemed less - immediately - offensive.

~Gregory



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 16:08     ` Frank van der Linden
@ 2025-10-08 16:39       ` Gregory Price
  2025-10-08 17:05       ` Gregory Price
  1 sibling, 0 replies; 26+ messages in thread
From: Gregory Price @ 2025-10-08 16:39 UTC (permalink / raw)
  To: Frank van der Linden
  Cc: Michal Hocko, David Hildenbrand, linux-mm, corbet, muchun.song,
	osalvador, akpm, hannes, laoar.shao, brauner, mclapinski,
	joel.granados, linux-doc, linux-kernel, Mel Gorman,
	Alexandru Moise, Mike Kravetz, David Rientjes

On Wed, Oct 08, 2025 at 09:08:01AM -0700, Frank van der Linden wrote:
> On Wed, Oct 8, 2025 at 7:59 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > Maybe what we really want is to have a configurable zone rather than a
> > very specific consumer of it instead. What do I mean by that? We clearly
> > have physically (DMA, DMA32) and usability (NORMAL, MOVABLE) constrained
> > zones. So rather than having a MOVABLE zone we can have a single zone
> > $FOO_NAME zone with configurable attributes - like allocation
> > constraints (kernel, user, movable, etc).
...
> 
> I agree that having multiple zone properties is probably the way to go.
> 

This I imagine would need to be a build-time configuration, as you'd run
into issues flipping these bits if the memory is already in use.

This of course raises the question - if one configurable zone, why not N
configurable zones?

~Gregory



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 16:08     ` Frank van der Linden
  2025-10-08 16:39       ` Gregory Price
@ 2025-10-08 17:05       ` Gregory Price
  1 sibling, 0 replies; 26+ messages in thread
From: Gregory Price @ 2025-10-08 17:05 UTC (permalink / raw)
  To: Frank van der Linden
  Cc: Michal Hocko, David Hildenbrand, linux-mm, corbet, muchun.song,
	osalvador, akpm, hannes, laoar.shao, brauner, mclapinski,
	joel.granados, linux-doc, linux-kernel, Mel Gorman,
	Alexandru Moise, Mike Kravetz, David Rientjes

On Wed, Oct 08, 2025 at 09:08:01AM -0700, Frank van der Linden wrote:
> On Wed, Oct 8, 2025 at 7:59 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > Maybe what we really want is to have a configurable zone rather than a
> > very specific consumer of it instead. What do I mean by that? We clearly
> > have physically (DMA, DMA32) and usability (NORMAL, MOVABLE) constrained
> > zones. So rather than having a MOVABLE zone we can have a single zone
> > $FOO_NAME zone with configurable attributes - like allocation
> > constraints (kernel, user, movable, etc). Now that we can overlap zones
...
> 
> I agree that having multiple zone properties is probably the way to go.
> 

Ah, I should also mention that I've been kicking around the idea of a
ZONE_DEVICE allocator - but this blows up pretty quickly into
maintaining an entirely separate page allocator for non-general-use
memory, so I didn't want to start off with that until later.

tl;dr: pgmap->alloc_folio(gfp, order)

Then allow driver managed memory to "online" this capacity via
ZONE_DEVICE and integrate *specific* areas of the kernel to use it -
rather than everything.  The device's driver is then responsible for
implementing alloc_folio(gfp, order), and a zone_device_alloc() is
responsible for hitting all the relevant devices for a compatible
allocation.
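
As a rough user-space model of that dispatch (a sketch only, nothing
like this exists in the kernel today): dev_pagemap_model, the extra
pgmap argument to the hook, and the two example driver hooks are all
hypothetical, and gfp_t / struct folio are stand-ins for the kernel
types.

```c
#include <stddef.h>

typedef unsigned int gfp_t;             /* stand-in for the kernel type */
struct folio { unsigned int order; };   /* stand-in, not the real struct folio */

/* Hypothetical model of a pgmap that carries a driver allocation hook. */
struct dev_pagemap_model {
    struct folio *(*alloc_folio)(struct dev_pagemap_model *pgmap,
                                 gfp_t gfp, unsigned int order);
    struct dev_pagemap_model *next;     /* next registered device */
};

/* Example driver hook: never satisfies a request. */
struct folio *deny_alloc(struct dev_pagemap_model *pgmap, gfp_t gfp,
                         unsigned int order)
{
    (void)pgmap; (void)gfp; (void)order;
    return NULL;
}

/* Example driver hook: serves orders up to 9 from a single static folio. */
struct folio *small_alloc(struct dev_pagemap_model *pgmap, gfp_t gfp,
                          unsigned int order)
{
    static struct folio f;

    (void)pgmap; (void)gfp;
    if (order > 9)
        return NULL;
    f.order = order;
    return &f;
}

/* Walk the registered devices and return the first compatible allocation. */
struct folio *zone_device_alloc(struct dev_pagemap_model *devices,
                                gfp_t gfp, unsigned int order)
{
    for (struct dev_pagemap_model *p = devices; p; p = p->next) {
        struct folio *f;

        if (!p->alloc_folio)
            continue;                   /* device opted out of allocation */
        f = p->alloc_folio(p, gfp, order);
        if (f)
            return f;
    }
    return NULL;                        /* no device could satisfy the request */
}
```

The model tries devices in list order and stops at the first success; a
real implementation would also need a policy for which devices are
eligible for a given caller, which is the "specific areas of the kernel"
part above.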

I alluded to this in the hotness/compression discussions - where there
is some compressed memory you want to draw hard boundaries around how
it is accessed/mapped, but want it available as a demotion source.

https://lore.kernel.org/linux-mm/aNzWwz5OYLOjwjLv@gourry-fedora-PF4VCD3F/

Not sure if I'm just overcomplicating the discussion here, but if we're
talking about new ZONEs then maybe it's worth considering something like
this as well.

~Gregory



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 14:44     ` David Hildenbrand
@ 2025-10-08 18:58       ` Gregory Price
  2025-10-08 19:01         ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Gregory Price @ 2025-10-08 18:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, corbet, muchun.song, osalvador, akpm, hannes,
	laoar.shao, brauner, mclapinski, joel.granados, linux-doc,
	linux-kernel, Mel Gorman, Michal Hocko, Alexandru Moise,
	David Rientjes

On Wed, Oct 08, 2025 at 04:44:22PM +0200, David Hildenbrand wrote:
> On 08.10.25 16:18, Gregory Price wrote:
> > On Wed, Oct 08, 2025 at 10:58:23AM +0200, David Hildenbrand wrote:
> > > On 07.10.25 23:44, Gregory Price wrote:
> > > I mean, this is as ugly as it gets.
> > > 
> > > Can't we just let that old approach RIP where it belongs? :)
> > > 
> > 
> > Definitely - just found this previously existed and wanted to probe for
> > how offensive reintroducing it would be. Seems the answer is essentially
> > "lets do it a little differently".
> > 
> > > Something I could sympathize with is treating gigantic pages that are actually
> > > migratable as movable.
> > > 
> > ...
> > > -       gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> > > +       gfp |= hugepage_migration_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> > > 
> > > Assume you want to offline part of the ZONE_MOVABLE there might still be sufficient
> > > space to possibly allocate a 1 GiB area elsewhere and actually move the gigantic page.
> > > 
> > > IIRC, we do the same for memory offlining already.
> > > 
> > 
> > This is generally true of other page sizes as well, though, isn't it?
> > If the system is truly so pressured that it can't successfully move a
> > 2MB page - offline may still fail.  So allowing 1GB pages is only a risk
> > in the sense that they're harder to allocate new targets.
> 
> Right, but memory defragmentation works on pageblock level, so 2 MiB is much
> MUCH more reliable :)
> 

fwiw this works cleanly.  Just dropping this here, but should continue
the zone conversation.  I need to check, but does this actually allow
pinnable allocations?  I thought pinning kicked off migration.

================== test =======================

# echo 1 > /proc/sys/vm/movable_gigantic_pages
# echo 1 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# echo 1 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
# ./huge
Allocating 1GB hugepage
Binding hugepage to NUMA node 1
Faulting page in
Resetting mbind policy to MPOL_DEFAULT (local policy)
Migrating
Migrated pages from node 1 to node 0, pages not moved: 0

================== patch  =======================

commit 395988dc319771db980dab3f95ed9ec8f0b74945
Author: Gregory Price <gourry@gourry.net>
Date:   Tue Oct 7 10:11:51 2025 -0700

    mm, hugetlb: introduce movable_gigantic_pages

    Signed-off-by: Gregory Price <gourry@gourry.net>

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 9bef46151d53..1535c9a964dc 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -54,6 +54,7 @@ Currently, these files are in /proc/sys/vm:
 - mmap_min_addr
 - mmap_rnd_bits
 - mmap_rnd_compat_bits
+- movable_gigantic_pages
 - nr_hugepages
 - nr_hugepages_mempolicy
 - nr_overcommit_hugepages
@@ -624,6 +625,22 @@ This value can be changed after boot using the
 /proc/sys/vm/mmap_rnd_compat_bits tunable


+movable_gigantic_pages
+======================
+
+This parameter controls whether gigantic pages may be allocated from
+ZONE_MOVABLE. If set to non-zero, gigantic hugepages can be allocated
+from ZONE_MOVABLE. ZONE_MOVABLE memory may be created via the kernel
+boot parameter `kernelcore` or via memory hotplug as discussed in
+Documentation/admin-guide/mm/memory-hotplug.rst.
+
+Support may depend on specific architecture.
+
+Note that using ZONE_MOVABLE gigantic pages may make features like
+memory hotremove more unreliable, as migrating gigantic pages is more
+difficult due to needing larger amounts of physically contiguous memory.
+
+
 nr_hugepages
 ============

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 42f374e828a2..834061eb2ddd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -172,6 +172,7 @@ bool hugetlbfs_pagecache_present(struct hstate *h,

 struct address_space *hugetlb_folio_mapping_lock_write(struct folio *folio);

+extern int movable_gigantic_pages __read_mostly;
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages[MAX_NUMNODES];

@@ -924,7 +925,7 @@ static inline bool hugepage_movable_supported(struct hstate *h)
        if (!hugepage_migration_supported(h))
                return false;

-       if (hstate_is_gigantic(h))
+       if (hstate_is_gigantic(h) && !movable_gigantic_pages)
                return false;
        return true;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a0d285d20992..3f8f3d6f2d60 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -54,6 +54,8 @@
 #include "hugetlb_cma.h"
 #include <linux/page-isolation.h>

+int movable_gigantic_pages;
+
 int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
@@ -5199,6 +5201,13 @@ static const struct ctl_table hugetlb_table[] = {
                .mode           = 0644,
                .proc_handler   = hugetlb_overcommit_handler,
        },
+       {
+               .procname       = "movable_gigantic_pages",
+               .data           = &movable_gigantic_pages,
+               .maxlen         = sizeof(int),
+               .mode           = 0644,
+               .proc_handler   = proc_dointvec,
+       },
 };

 static void __init hugetlb_sysctl_init(void)


================== huge.c =======================
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>
#include <stdint.h>
#include <time.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

static long mbind_syscall(void *addr, unsigned long len, int mode,
                          const unsigned long *nodemask, unsigned long maxnode, unsigned flags) {
    return syscall(__NR_mbind, addr, len, mode, nodemask, maxnode, flags);
}

static long migrate_pages_syscall(pid_t pid, unsigned long maxnode,
                                  const unsigned long *from, const unsigned long *to) {
    return syscall(__NR_migrate_pages, pid, maxnode, from, to);
}

int main() {
    size_t size = 1UL << 30; // 1GB
    int node_from = 1;
    int node_to = 0;

    printf("Allocating 1GB hugepage\n");
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap hugepage");
        return 1;
    }
    printf("Binding hugepage to NUMA node %d\n", node_from);
    unsigned long nodemask = 1UL << node_from;
    if (mbind_syscall(addr, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        munmap(addr, size);
        return 1;
    }
    printf("Faulting page in\n");
    ((volatile char *)addr)[0] = 0;
    printf("Resetting mbind policy to MPOL_DEFAULT (local policy)\n");
    if (mbind_syscall(addr, size, MPOL_DEFAULT, NULL, 0, 0) != 0) {
        perror("mbind failed to reset");
        munmap(addr, size);
        return 1;
    }
    printf("Migrating\n");
    unsigned long from_mask = 1UL << node_from;
    unsigned long to_mask = 1UL << node_to;
    long ret = migrate_pages_syscall(0, sizeof(unsigned long) * 8, &from_mask, &to_mask);
    if (ret < 0) {
        perror("migrate_pages");
        munmap(addr, size);
        return 1;
    }
    printf("Migrated pages from node %d to node %d, pages not moved: %ld\n", node_from, node_to, ret);
    munmap(addr, size);
    return 0;
}



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 18:58       ` Gregory Price
@ 2025-10-08 19:01         ` David Hildenbrand
  2025-10-08 19:44           ` Gregory Price
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-10-08 19:01 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, corbet, muchun.song, osalvador, akpm, hannes,
	laoar.shao, brauner, mclapinski, joel.granados, linux-doc,
	linux-kernel, Mel Gorman, Michal Hocko, Alexandru Moise,
	David Rientjes

On 08.10.25 20:58, Gregory Price wrote:
> On Wed, Oct 08, 2025 at 04:44:22PM +0200, David Hildenbrand wrote:
>> On 08.10.25 16:18, Gregory Price wrote:
>>> On Wed, Oct 08, 2025 at 10:58:23AM +0200, David Hildenbrand wrote:
>>>> On 07.10.25 23:44, Gregory Price wrote:
>>>> I mean, this is as ugly as it gets.
>>>>
>>>> Can't we just let that old approach RIP where it belongs? :)
>>>>
>>>
>>> Definitely - just found this previously existed and wanted to probe for
>>> how offensive reintroducing it would be. Seems the answer is essentially
>>> "lets do it a little differently".
>>>
>>>> Something I could sympathize with is treating gigantic pages that are actually
>>>> migratable as movable.
>>>>
>>> ...
>>>> -       gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
>>>> +       gfp |= hugepage_migration_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
>>>>
>>>> Assume you want to offline part of the ZONE_MOVABLE there might still be sufficient
>>>> space to possibly allocate a 1 GiB area elsewhere and actually move the gigantic page.
>>>>
>>>> IIRC, we do the same for memory offlining already.
>>>>
>>>
>>> This is generally true of other page sizes as well, though, isn't it?
>>> If the system is truly so pressured that it can't successfully move a
>>> 2MB page - offline may still fail.  So allowing 1GB pages is only a risk
>>> in the sense that they're harder to allocate new targets.
>>
>> Right, but memory defragmentation works on pageblock level, so 2 MiB is much
>> MUCH more reliable :)
>>
> 
> fwiw this works cleanly.  Just dropping this here, but should continue
> the zone conversation.  I need to check, but does this actually allow
> pinnable allocations?  I thought pinning kicked off migration.

Yes, it should because longterm pinning -> unmovable.

-- 
Cheers

David / dhildenb




* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 19:01         ` David Hildenbrand
@ 2025-10-08 19:44           ` Gregory Price
  2025-10-08 19:52             ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Gregory Price @ 2025-10-08 19:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, corbet, muchun.song, osalvador, akpm, hannes,
	laoar.shao, brauner, mclapinski, joel.granados, linux-doc,
	linux-kernel, Mel Gorman, Michal Hocko, Alexandru Moise,
	David Rientjes

On Wed, Oct 08, 2025 at 09:01:09PM +0200, David Hildenbrand wrote:
> > 
> > fwiw this works cleanly.  Just dropping this here, but should continue
> > the zone conversation.  I need to check, but does this actually allow
> > pinnable allocations?  I thought pinning kicked off migration.
> 
> Yes, it should because longterm pinning -> unmovable.
> 

You know, I just realized my test here only works because I allocated 1GB
pages on both node0 and node1.  If I only allocate a 1GB hugetlb page on
node1, then the migrate_pages call fails - because there are no 1GB pages
available on node0.

I imagine this would cause hot-unplug/offline to fail since it uses the
same migration mechanisms.

Worse, I would imagine this would also fail for 2MB.

Seems like the 1GB limitation is arbitrary if 2MB causes the same issue.
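
A test could detect this precondition up front by reading the
destination node's free gigantic page count from sysfs before calling
migrate_pages. The sketch below assumes the per-node hugepages
directory used in the test setup above also exposes free_hugepages; the
helper names are made up for illustration.

```c
#include <stdio.h>

/*
 * Build the sysfs path of a node's free 1GB hugepage counter.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int free_gigantic_path(int node, char *buf, size_t len)
{
    int n = snprintf(buf, len,
                     "/sys/devices/system/node/node%d/hugepages/"
                     "hugepages-1048576kB/free_hugepages", node);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read a single counter from a file; returns -1 if missing or unparsable. */
long read_counter(const char *path)
{
    FILE *f = fopen(path, "r");
    long val = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

/* Return 1 if the node has a free 1GB hugepage, 0 if not, -1 on error. */
int node_has_free_gigantic(int node)
{
    char path[128];
    long free_pages;

    if (free_gigantic_path(node, path, sizeof(path)) != 0)
        return -1;
    free_pages = read_counter(path);
    if (free_pages < 0)
        return -1;
    return free_pages > 0;
}
```

Calling node_has_free_gigantic(node_to) before the migration step would
let the test report "no free 1GB page on the destination" instead of a
bare migrate_pages failure.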

~Gregory



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 19:44           ` Gregory Price
@ 2025-10-08 19:52             ` David Hildenbrand
  2025-10-08 19:59               ` Gregory Price
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-10-08 19:52 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, corbet, muchun.song, osalvador, akpm, hannes,
	laoar.shao, brauner, mclapinski, joel.granados, linux-doc,
	linux-kernel, Mel Gorman, Michal Hocko, Alexandru Moise,
	David Rientjes

On 08.10.25 21:44, Gregory Price wrote:
> On Wed, Oct 08, 2025 at 09:01:09PM +0200, David Hildenbrand wrote:
>>>
>>> fwiw this works cleanly.  Just dropping this here, but should continue
>>> the zone conversation.  I need to check, but does this actually allow
>>> pinnable allocations?  I thought pinning kicked off migration.
>>
>> Yes, it should because longterm pinning -> unmovable.
>>
> 
> You know, I just realized my test here only works because I allocated 1GB
> pages on both node0 and node1.  If I only allocate a 1GB hugetlb page on
> node1, then the migrate_pages call fails - because there are no 1GB pages
> available on node0.
> 
> I imagine this would cause hot-unplug/offline to fail since it uses the
> same migration mechanisms.
> 
> Worse, I would imagine this would also fail for 2MB.
> 
> Seems like the 1GB limitation is arbitrary if 2MB causes the same issue.

Yeah, with hugetlb allocations there are no guarantees either. It's just 
that page compaction / defragmentation makes it much less likely to fail 
in many scenarios.

-- 
Cheers

David / dhildenb




* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 19:52             ` David Hildenbrand
@ 2025-10-08 19:59               ` Gregory Price
  0 siblings, 0 replies; 26+ messages in thread
From: Gregory Price @ 2025-10-08 19:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, corbet, muchun.song, osalvador, akpm, hannes,
	laoar.shao, brauner, mclapinski, joel.granados, linux-doc,
	linux-kernel, Mel Gorman, Michal Hocko, Alexandru Moise,
	David Rientjes, Joshua Hahn

On Wed, Oct 08, 2025 at 09:52:09PM +0200, David Hildenbrand wrote:
> On 08.10.25 21:44, Gregory Price wrote:
> > On Wed, Oct 08, 2025 at 09:01:09PM +0200, David Hildenbrand wrote:
> > > > 
> > > > fwiw this works cleanly.  Just dropping this here, but should continue
> > > > the zone conversation.  I need to check, but does this actually allow
> > > > pinnable allocations?  I thought pinning kicked off migration.
> > > 
> > > Yes, it should because longterm pinning -> unmovable.
> > > 
> > 
> > > You know, I just realized my test here only works because I allocated 1GB
> > > pages on both node0 and node1.  If I only allocate a 1GB hugetlb page on
> > > node1, then the migrate_pages call fails - because there are no 1GB pages
> > > available on node0.
> > 
> > I imagine this would cause hot-unplug/offline to fail since it uses the
> > same migration mechanisms.
> > 
> > > Worse, I would imagine this would also fail for 2MB.
> > 
> > Seems like the 1GB limitation is arbitrary if 2MB causes the same issue.
> 
> Yeah, with hugetlb allocations there are no guarantees either. It's just
> that page compaction / defragmentation makes it much less likely to fail in
> many scenarios.
> 

Gotcha, well I am open to suggestions.  This chicken bit here feels like
a sufficient guardrail, but I'm happy to explore the ZONE discussion
further if we think that's fruitful.

Joshua Hahn (cc) did privately question whether zonelist ordering breaks
for such a configurable zone.  If memory can't live in ZONE_NORMAL or
ZONE_MOVABLE, but you want it to have some combination of attributes
between the two, it can't also live above ZONE_MOVABLE I don't think.

~Gregory



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-08 16:31           ` Gregory Price
@ 2025-10-09  6:14             ` Michal Hocko
  2025-10-09 15:29               ` Gregory Price
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2025-10-09  6:14 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand, linux-mm, corbet, muchun.song, osalvador,
	akpm, hannes, laoar.shao, brauner, mclapinski, joel.granados,
	linux-doc, linux-kernel, Mel Gorman, Alexandru Moise,
	Mike Kravetz, David Rientjes

On Wed 08-10-25 12:31:22, Gregory Price wrote:
> On Wed, Oct 08, 2025 at 05:43:23PM +0200, David Hildenbrand wrote:
> > On 08.10.25 17:23, Michal Hocko wrote:
> > > On Wed 08-10-25 17:14:26, David Hildenbrand wrote:
> > > > On 08.10.25 16:59, Michal Hocko wrote:
> > > > > yes, I do agree. This is just muddying the semantic of the zone.
> > > > > 
> > > > > Maybe what we really want is to have a configurable zone rather than a
> > > > > very specific consumer of it instead. What do I mean by that? We clearly
> > > > > have physically (DMA, DMA32) and usability (NORMAL, MOVABLE) constrained
> > > > > zones. So rather than having a MOVABLE zone we can have a single zone
> > > > > $FOO_NAME zone with configurable attributes - like allocation
> > > > > > constraints (kernel, user, movable, etc). Now that we can overlap zones
> > > > > this should allow for quite a lot flexibility. Implementation wise this
> > > > > would require some tricks as we have 2 zone types for potentially 3
> > > > > different major usecases (kernel allocations, userspace reserved ranges
> > > > > without movability and movable allocations). I haven't thought this
> > > > > through completely and mostly throwing this as an idea (maybe won't
> > > > > work). Does that make sense?
> > > >
> 
> I'd also considered something between NORMAL and MOVABLE, something like
> ZONE_NOKERNEL or ZONE_USER. But that seemed excessive.
> 
> > > That is why I called it user allocations because those are supposed to
> > > be configured for userspace consumption and planned for that use. So you
> > > would get pretty much a guarantee that no kernel allocations will fall
> > > there.
> > 
> > What could end up on it that would not already end up on ZONE_MOVABLE? I
> > guess long-term pinned pages, secretmem, guest_memfd, gigantic pages.
> > 
> > Anything else?
> > 
> > I'm not quite clear yet on the use case, though. If all the user allocations
> > end up fragmenting the memory, there is also not a lot of benefit to be had
> > from that zone long term.
> >
> 
> The only real use case i've seen is exactly: 
>  - Don't want random GFP_KERNEL to land there
>  - Might want it to be pinnable
> 
> I think that covers what you've described above.
> 
> But adding an entire zone felt a bit heavy handed.  Allowing gigantic in
> movable seemed less - immediately - offensive.

The question is whether we need a full zone for that, or whether we can
control those allocation constraints on a per-memory-block basis to
override the default. So it wouldn't be MOVABLE but rather something
like a USER zone.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-09  6:14             ` Michal Hocko
@ 2025-10-09 15:29               ` Gregory Price
  2025-10-09 18:47                 ` Michal Hocko
  2025-10-09 18:51                 ` David Hildenbrand
  0 siblings, 2 replies; 26+ messages in thread
From: Gregory Price @ 2025-10-09 15:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, linux-mm, corbet, muchun.song, osalvador,
	akpm, hannes, laoar.shao, brauner, mclapinski, joel.granados,
	linux-doc, linux-kernel, Mel Gorman, Alexandru Moise,
	Mike Kravetz, David Rientjes

On Thu, Oct 09, 2025 at 08:14:22AM +0200, Michal Hocko wrote:
> On Wed 08-10-25 12:31:22, Gregory Price wrote:
> > > I'm not quite clear yet on the use case, though. If all the user allocations
> > > end up fragmenting the memory, there is also not a lot of benefit to be had
> > > from that zone long term.
> > >
> > 
> > The only real use case i've seen is exactly: 
> >  - Don't want random GFP_KERNEL to land there
> >  - Might want it to be pinnable
> > 
> > I think that covers what you've described above.
> > 
> > But adding an entire zone felt a bit heavy handed.  Allowing gigantic in
> > movable seemed less - immediately - offensive.
> 
> The question is whether we need a full zone for that or we can control
> those allocation constraints on a per memory block basis to override
> otherwise default. So it wouldn't be MOVABLE but rather something like
> USER zone.


Mild ignorance here - but I don't think the buddy allocator currently
differentiates chunks of memory based on block membership, it just eats
folios from certain zones/nodes.

I'm scratching my head trying to think of the discrete mechanism to do
this that doesn't inject significantly more complexity into the buddy
allocator.

Looking at the recent[1] THP support for ZONE_DEVICE, I wonder if we end
up with something more along these lines?  But this eschews the other
requirement of wanting the memory to be otherwise general purpose.

[1] https://lore.kernel.org/linux-mm/20251001065707.920170-1-balbirs@nvidia.com/

ZONE_USER does feel like the most natural solution.  Literally just
(ZONE_NORMAL - GFP_KERNEL).  This might need a new GFP flag for certain
use cases like KVM (GFP_USER) to denote certain "This isn't technically
kernel memory, but it needs to be pinnable".  That would slot right
between ZONE_NORMAL and ZONE_MOVABLE.

Alternatively we could go the opposite way and introduce ZONE_KERNEL
below ZONE_NORMAL and disallow GFP_KERNEL from ZONE_NORMAL - then have
strict watermarks on ZONE_KERNEL to ensure the kernel is always able
to get memory. 
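
A rough way to picture the ordering argument above is a toy model of zone
fallback.  This is only an illustrative sketch in Python, not kernel code:
ZONE_USER is the hypothetical zone proposed in this thread, and the request
kinds ("kernel", "user_pinnable", "movable") are made-up labels rather than
real GFP flags.  It assumes the usual rule that a request may fall back from
its highest permitted zone to any lower one.

```python
# Toy model only: zone names ordered from lowest to highest.
# "USER" is the hypothetical zone discussed in this thread.
ZONES = ["DMA", "DMA32", "NORMAL", "USER", "MOVABLE"]

# Hypothetical request kinds mapped to the highest zone each may use.
HIGHEST_ZONE = {
    "kernel": "NORMAL",         # GFP_KERNEL-like: must stay out of USER/MOVABLE
    "user_pinnable": "USER",    # user memory that may be pinned long term
    "movable": "MOVABLE",       # ordinary migratable user memory
}


def fallback_order(kind):
    """Return the zones a request may try, highest first."""
    top = ZONES.index(HIGHEST_ZONE[kind])
    return ZONES[top::-1]


if __name__ == "__main__":
    for kind in HIGHEST_ZONE:
        print(kind, "->", fallback_order(kind))
```

Under this model a kernel request never reaches USER or MOVABLE, while a
pinnable user request can use USER but not MOVABLE.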

~Gregory



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-09 15:29               ` Gregory Price
@ 2025-10-09 18:47                 ` Michal Hocko
  2025-10-09 18:51                 ` David Hildenbrand
  1 sibling, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2025-10-09 18:47 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand, linux-mm, corbet, muchun.song, osalvador,
	akpm, hannes, laoar.shao, brauner, mclapinski, joel.granados,
	linux-doc, linux-kernel, Mel Gorman, Alexandru Moise,
	Mike Kravetz, David Rientjes

On Thu 09-10-25 11:29:57, Gregory Price wrote:
> On Thu, Oct 09, 2025 at 08:14:22AM +0200, Michal Hocko wrote:
> > On Wed 08-10-25 12:31:22, Gregory Price wrote:
> > > > I'm not quite clear yet on the use case, though. If all the user allocations
> > > > end up fragmenting the memory, there is also not a lot of benefit to be had
> > > > from that zone long term.
> > > >
> > > 
> > > The only real use case i've seen is exactly: 
> > >  - Don't want random GFP_KERNEL to land there
> > >  - Might want it to be pinnable
> > > 
> > > I think that covers what you've described above.
> > > 
> > > But adding an entire zone felt a bit heavy handed.  Allowing gigantic in
> > > movable seemed less - immediately - offensive.
> > 
> > The question is whether we need a full zone for that or we can control
> > those allocation constraints on a per memory block basis to override
> > otherwise default. So it wouldn't be MOVABLE but rather something like
> > USER zone.
> 
> 
> Mild ignorance here - but I don't think the buddy allocator currently
> differentiates chunks of memory based on block membership, it just eats
> folios from certain zones/nodes.

No ignorance on your end. As I've said, this is not a fully thought-through
idea. The memory block was meant to be the userspace-configurable unit.
Internally this would need to be mapped to a migrate type or something
like that.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-09 15:29               ` Gregory Price
  2025-10-09 18:47                 ` Michal Hocko
@ 2025-10-09 18:51                 ` David Hildenbrand
  2025-10-09 21:31                   ` Gregory Price
  1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-10-09 18:51 UTC (permalink / raw)
  To: Gregory Price, Michal Hocko
  Cc: linux-mm, corbet, muchun.song, osalvador, akpm, hannes,
	laoar.shao, brauner, mclapinski, joel.granados, linux-doc,
	linux-kernel, Mel Gorman, Alexandru Moise, Mike Kravetz,
	David Rientjes

On 09.10.25 17:29, Gregory Price wrote:
> On Thu, Oct 09, 2025 at 08:14:22AM +0200, Michal Hocko wrote:
>> On Wed 08-10-25 12:31:22, Gregory Price wrote:
>>>> I'm not quite clear yet on the use case, though. If all the user allocations
>>>> end up fragmenting the memory, there is also not a lot of benefit to be had
>>>> from that zone long term.
>>>>
>>>
>>> The only real use case i've seen is exactly:
>>>   - Don't want random GFP_KERNEL to land there
>>>   - Might want it to be pinnable
>>>
>>> I think that covers what you've described above.
>>>
>>> But adding an entire zone felt a bit heavy handed.  Allowing gigantic in
>>> movable seemed less - immediately - offensive.
>>
>> The question is whether we need a full zone for that or we can control
>> those allocation constraints on a per memory block basis to override
>> otherwise default. So it wouldn't be MOVABLE but rather something like
>> USER zone.
> 
> 
> Mild ignorance here - but I don't think the buddy allocator currently
> differentiates chunks of memory based on block membership, it just eats
> folios from certain zones/nodes.
> 
> I'm scratching my head trying to think of the discrete mechanism to do
> this that doesn't inject significantly more complexity into the buddy
> allocator.
> 
> Looking at the recent[1] THP support for ZONE_DEVICE, I wonder if we end
> up with something more along these lines?  But this eschews the other
> requirement of wanting the memory to be otherwise general purpose.
> 
> https://lore.kernel.org/linux-mm/20251001065707.920170-1-balbirs@nvidia.com/
> 
> ZONE_USER does feel like the most natural solution.  Literally just
> (ZONE_NORMAL - GFP_KERNEL).  This might need a new GFP flag for certain
> use cases like KVM (GFP_USER) to denote certain "This isn't technically
> kernel memory, but it needs to be pinnable".  That would slot right
> between ZONE_NORMAL and ZONE_MOVABLE.
> 
> Alternatively we could go the opposite way and introduce ZONE_KERNEL
> below ZONE_NORMAL and disallow GFP_KERNEL from ZONE_NORMAL - then have
> strict watermarks on ZONE_KERNEL to ensure the kernel is always able
> to get memory.

I'm afraid any new zone will be highly controversial and take a long 
time to get accepted, if ever :)

The real question is: would we really need a system where we mix e.g., 
ZONE_USER with ZONE_MOVABLE?

Or would it be sufficient to selectively enable (explicit opt-in) some 
user pages to end up on ZONE_MOVABLE? IOW, change the semantics of the 
zone by an admin.

Like, allowing longterm pinning on ZONE_MOVABLE.

Sure, it would degrade memory hotunplug (until the relevant applications 
are shut down) and probably some other things.

Further, I am not so sure about the value of having ZONE_MOVABLE 
sprinkled with small unmovable allocations (same concern regarding any 
such zone that allows for unmovable things). Kind of against the whole 
concept.

But I mean, if the admin decides to do that (opt in), then he is to blame.

-- 
Cheers

David / dhildenb




* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-09 18:51                 ` David Hildenbrand
@ 2025-10-09 21:31                   ` Gregory Price
  2025-10-10  7:40                     ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Gregory Price @ 2025-10-09 21:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, linux-mm, corbet, muchun.song, osalvador, akpm,
	hannes, laoar.shao, brauner, mclapinski, joel.granados,
	linux-doc, linux-kernel, Mel Gorman, Alexandru Moise,
	Mike Kravetz, David Rientjes

On Thu, Oct 09, 2025 at 08:51:54PM +0200, David Hildenbrand wrote:
> On 09.10.25 17:29, Gregory Price wrote:
> Or would it be sufficient to selectively enable (explicit opt-in) some user
> pages to end up on ZONE_MOVABLE? IOW, change the semantics of the zone by an
> admin.
> 
> Like, allowing longterm pinning on ZONE_MOVABLE.
> 
> Sure, it would degrade memory hotunplug (until the relevant applications are
> shut down) and probably some other things.
> 
> Further, I am not so sure about the value of having ZONE_MOVABLE sprinkled
> with small unmovable allocations (same concern regarding any such zone that
> allows for unmovable things). Kind of against the whole concept.
> 
> But I mean, if the admin decides to do that (opt in), so he is to blame.
> 

For what it's worth, with this patch (or the new one I posted as an RFC), I
was able to allocate gigantic pages and migrate them back and forth
between nodes even after they were allocated for KVM instances.

I was surprised this did not cause pinning.

This was all while running the QEMU machine actively eating ~2GB of
memory.  So this seems... acceptable?  My primary use case was VM
hugepages, but it doesn't even seem like these have been pinned.

I think the confidential-compute / guest_memfd path would have an
issue, because those are pinned and/or entirely unmapped from the
host, but that just seems like a known quantity and a reason to leave
this off by default (make them read the docs :]).

Seems like this is pretty stable tbh.  Obviously if you hack off the
node0 hugepages, migration fails - but I feel like you're signing up for
that when you turn the bit on.

Test I ran is below.

~Gregory

---
Host allocates hugepages, runs a qemu image with numa structure, and
migrates the huge pages back and forth

cat /proc/sys/vm/movable_gigantic_pages 
1
cat .../node0/hugepages/hugepages-1048576kB/nr_hugepages
24
cat .../node1/hugepages/hugepages-1048576kB/nr_hugepages
12

qemu-system-x86_64 \
  -machine q35,accel=kvm \
  -cpu host \
  -smp 8,sockets=1,cores=8,threads=1 \
  -m 2G \
  -mem-prealloc \
  -object memory-backend-file,id=mem0,mem-path=/dev/hugepages,prealloc=on,size=1G,host-nodes=0,policy=bind \
  -object memory-backend-file,id=mem1,mem-path=/dev/hugepages,prealloc=on,size=1G,host-nodes=1,policy=bind \
  -numa node,nodeid=0,cpus=0-7,memdev=mem0 \
  -numa node,nodeid=1,memdev=mem1 \
  -nographic \
  -drive file=fedora/hdd.qcow2 \
  -cdrom fedora/seedci.iso

grep bind /proc/1041805/numa_maps
7efc80000000 bind:1 file=/dev/hugepages/qemu_back_mem.mem1.xKNv1N\040(deleted) huge anon=1 dirty=1 N1=1 kernelpagesize_kB=1048576
7efd00000000 bind:0 file=/dev/hugepages/qemu_back_mem.mem0.E19dYs\040(deleted) huge anon=1 dirty=1 N0=1 kernelpagesize_kB=1048576

# Move both to node 0 (uses move_pages(pid, ...))
./move.sh 0
Pages migrated successfully
status[0]: 0
Pages migrated successfully
status[0]: 0

grep bind /proc/1041805/numa_maps
7efc80000000 bind:1 file=/dev/hugepages/qemu_back_mem.mem1.xKNv1N\040(deleted) huge anon=1 dirty=1 N0=1 kernelpagesize_kB=1048576
7efd00000000 bind:0 file=/dev/hugepages/qemu_back_mem.mem0.E19dYs\040(deleted) huge anon=1 dirty=1 N0=1 kernelpagesize_kB=1048576

# Move both to node 1
./move.sh 1
Pages migrated successfully
status[0]: 1
Pages migrated successfully
status[0]: 1

grep bind /proc/1041805/numa_maps 
7efc80000000 bind:1 file=/dev/hugepages/qemu_back_mem.mem1.xKNv1N\040(deleted) huge anon=1 dirty=1 N1=1 kernelpagesize_kB=1048576
7efd00000000 bind:0 file=/dev/hugepages/qemu_back_mem.mem0.E19dYs\040(deleted) huge anon=1 dirty=1 N1=1 kernelpagesize_kB=1048576
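
The before/after numa_maps checks above could also be scripted.  The
following is a minimal sketch, assuming only the N<node>=<pages> fields
visible in the output above; the helper names are hypothetical and the
sample lines are copied from this test run.

```python
import re

# Matches a whole whitespace-separated token such as "N0=1" (node 0, 1 page).
NODE_FIELD = re.compile(r"N(\d+)=(\d+)")


def pages_per_node(numa_maps_line):
    """Return {node_id: page_count} for one /proc/<pid>/numa_maps line."""
    counts = {}
    for token in numa_maps_line.split():
        match = NODE_FIELD.fullmatch(token)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


def all_on_node(lines, node):
    """True if every given mapping has pages on `node` and nowhere else."""
    return all(set(pages_per_node(line)) == {node} for line in lines)


# Sample lines taken from the output after "./move.sh 0".
AFTER_MOVE_TO_NODE0 = [
    r"7efc80000000 bind:1 file=/dev/hugepages/qemu_back_mem.mem1.xKNv1N\040(deleted) huge anon=1 dirty=1 N0=1 kernelpagesize_kB=1048576",
    r"7efd00000000 bind:0 file=/dev/hugepages/qemu_back_mem.mem0.E19dYs\040(deleted) huge anon=1 dirty=1 N0=1 kernelpagesize_kB=1048576",
]

if __name__ == "__main__":
    print(all_on_node(AFTER_MOVE_TO_NODE0, 0))
```

Matching whole tokens rather than searching the line keeps the backing file
name from being mistaken for a node field.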

---
Guest
Running python script that eats 1.7GB of memory

import time

# 1.7 GB in bytes
size_in_bytes = int(1.7 * 1024 * 1024 * 1024)

# Allocate memory
data = bytearray(size_in_bytes)

print(f"Allocated {len(data) / (1024 * 1024 * 1024):.2f} GB of memory.")

# Keep the process alive so you can inspect it (e.g., with top or htop)
try:
    while True:
        print("nom")
        time.sleep(10)
except KeyboardInterrupt:
    print("Exiting.")



* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-09 21:31                   ` Gregory Price
@ 2025-10-10  7:40                     ` David Hildenbrand
  2025-10-10 18:53                       ` Gregory Price
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-10-10  7:40 UTC (permalink / raw)
  To: Gregory Price
  Cc: Michal Hocko, linux-mm, corbet, muchun.song, osalvador, akpm,
	hannes, laoar.shao, brauner, mclapinski, joel.granados,
	linux-doc, linux-kernel, Mel Gorman, Alexandru Moise,
	Mike Kravetz, David Rientjes

On 09.10.25 23:31, Gregory Price wrote:
> On Thu, Oct 09, 2025 at 08:51:54PM +0200, David Hildenbrand wrote:
>> On 09.10.25 17:29, Gregory Price wrote:
>> Or would it be sufficient to selectively enable (explicit opt-in) some user
>> pages to end up on ZONE_MOVABLE? IOW, change the semantics of the zone by an
>> admin.
>>
>> Like, allowing longterm pinning on ZONE_MOVABLE.
>>
>> Sure, it would degrade memory hotunplug (until the relevant applications are
>> shut down) and probably some other things.
>>
>> Further, I am not so sure about the value of having ZONE_MOVABLE sprinkled
>> with small unmovable allocations (same concern regarding any such zone that
>> allows for unmovable things). Kind of against the whole concept.
>>
>> But I mean, if the admin decides to do that (opt in), so he is to blame.
>>
> 
> For what it's worth, this patch (or the new one i posted as an RFC), I
> was able to allocate gigantic pages and migrate them back and forth
> between nodes even after they were allocated for KVM instances.
> 
> I was surprised this did not cause pinning.

KVM does not end up longterm-pinning pages (what we care about regarding 
migration) when mapping stuff into the guest MMU, so KVM in general is 
not a problem.

The problem shows up once you would try to use something like vfio, 
liburing fixed buffers etc, where we will longterm-pin pages.

> 
> This was all while running the QEMU machine actively eating ~2GB of
> memory.  So this seems... acceptable?  My primary use case was VM
> hugepages, but it doesn't even seem like these have been pinned.
> 
> I think the confidential-compute / guest_memfd path would have an
> issue, because those are pinned and/or entirely unmapped from the
> host, but that just seems like a known quantity and a reason to leave
> this off by default (make them read the docs :]).

guest_memfd allocates folios without GFP_MOVABLE, because they are ... 
unmovable. So they would never end up on ZONE_MOVABLE.

There are prototypes / ideas to support migration of guest_memfd pages, 
so it would be solvable. At least for some scenarios.

> 
> Seems like this is pretty stable tbh.  Obviously if you hack off the
> node0 hugepages migration fails - but I feel like you're signing up for
> that when you turn the bit on.

Right, just needs to be documented thoroughly IMHO.

-- 
Cheers

David / dhildenb




* Re: [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl"
  2025-10-10  7:40                     ` David Hildenbrand
@ 2025-10-10 18:53                       ` Gregory Price
  0 siblings, 0 replies; 26+ messages in thread
From: Gregory Price @ 2025-10-10 18:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, linux-mm, corbet, muchun.song, osalvador, akpm,
	hannes, laoar.shao, brauner, mclapinski, joel.granados,
	linux-doc, linux-kernel, Mel Gorman, Alexandru Moise,
	Mike Kravetz, David Rientjes

On Fri, Oct 10, 2025 at 09:40:29AM +0200, David Hildenbrand wrote:
> > 
> > Seems like this is pretty stable tbh.  Obviously if you hack off the
> > node0 hugepages migration fails - but I feel like you're signing up for
> > that when you turn the bit on.
> 
> Right, just needs to be documented thoroughly IMHO.
> 

happy to make any doc adjustments.

Better to move any notes over here though:

https://lore.kernel.org/linux-mm/20251009161515.422292-1-gourry@gourry.net/T/#u

---

Maybe worth finding some time at plumbers to discuss block-level
allocation twiddly bits.  Still not quite clear how this would pan
out.  Maybe it's as simple as adding GFP flags to blocks and having
something like:

echo NO_KERNEL > sys/bus/node/devices/node0/memory1234/eligibility
echo PINNABLE > sys/bus/node/devices/node0/memory1234/eligibility

folio_to_block(folio)->eligible(gfp)

The issue here is obviously that it's clearly racy, in that a bit
twiddle can change the eligibility mid-allocation.  
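
To make the race concrete, here is a toy Python model of the per-block check
sketched above.  Every name in it (Elig, Req, MemoryBlock, eligible) is
hypothetical, and so is the eligibility interface itself; it only shows how
two reads of a mutable mask within one allocation can disagree, and how
reading the mask once avoids that.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Elig(Flag):
    """Hypothetical per-block eligibility bits."""
    NONE = 0
    NO_KERNEL = auto()   # block refuses kernel allocations
    PINNABLE = auto()    # block accepts long-term pinned pages


class Req(Flag):
    """Hypothetical properties of an allocation request."""
    NONE = 0
    KERNEL = auto()
    PINNED = auto()


@dataclass
class MemoryBlock:
    eligibility: Elig = Elig.NONE

    def eligible(self, req, mask=None):
        """Check a request against the live mask, or a caller-held snapshot."""
        mask = self.eligibility if mask is None else mask
        if Req.KERNEL in req and Elig.NO_KERNEL in mask:
            return False
        if Req.PINNED in req and Elig.PINNABLE not in mask:
            return False
        return True


def demo_race():
    """Return (check before twiddle, check after twiddle, check on snapshot)."""
    block = MemoryBlock(Elig.PINNABLE)
    snapshot = block.eligibility             # read once at allocation start
    before = block.eligible(Req.KERNEL)      # live read: allowed
    block.eligibility |= Elig.NO_KERNEL      # admin flips the bit mid-allocation
    after = block.eligible(Req.KERNEL)       # live read: now refused
    stable = block.eligible(Req.KERNEL, mask=snapshot)
    return before, after, stable
```

A snapshot only makes a single allocation self-consistent; it does not say
what should happen to pages already placed in the block when the bit flips.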

I'll think a bit more about this.

~Gregory



Thread overview: 26+ messages
2025-10-07 21:44 [PATCH] Revert "mm, hugetlb: remove hugepages_treat_as_movable sysctl" Gregory Price
2025-10-07 21:59 ` Andrew Morton
2025-10-07 22:12   ` Gregory Price
2025-10-08  8:58 ` David Hildenbrand
2025-10-08 14:18   ` Gregory Price
2025-10-08 14:44     ` David Hildenbrand
2025-10-08 18:58       ` Gregory Price
2025-10-08 19:01         ` David Hildenbrand
2025-10-08 19:44           ` Gregory Price
2025-10-08 19:52             ` David Hildenbrand
2025-10-08 19:59               ` Gregory Price
2025-10-08 14:59   ` Michal Hocko
2025-10-08 15:14     ` David Hildenbrand
2025-10-08 15:23       ` Michal Hocko
2025-10-08 15:43         ` David Hildenbrand
2025-10-08 16:31           ` Gregory Price
2025-10-09  6:14             ` Michal Hocko
2025-10-09 15:29               ` Gregory Price
2025-10-09 18:47                 ` Michal Hocko
2025-10-09 18:51                 ` David Hildenbrand
2025-10-09 21:31                   ` Gregory Price
2025-10-10  7:40                     ` David Hildenbrand
2025-10-10 18:53                       ` Gregory Price
2025-10-08 16:08     ` Frank van der Linden
2025-10-08 16:39       ` Gregory Price
2025-10-08 17:05       ` Gregory Price
