* [PATCH v4 1/4] mm: defer THP insertion to khugepaged
2025-04-17 0:18 [PATCH v4 0/4] mm: introduce THP deferred setting Nico Pache
@ 2025-04-17 0:18 ` Nico Pache
2025-04-17 0:18 ` [PATCH v4 2/4] mm: document (m)THP defer usage Nico Pache
` (3 subsequent siblings)
4 siblings, 0 replies; 8+ messages in thread
From: Nico Pache @ 2025-04-17 0:18 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-kselftest
Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david,
baohua, baolin.wang, ryan.roberts, willy, peterx, shuah, ziy,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
dev.jain, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
rientjes, mhocko, rdunlap
Setting /transparent_hugepages/enabled=always allows applications
to benefit from THPs without having to madvise. However, the page fault
handler takes very few considerations into account when deciding whether
to use a THP. This can lead to a lot of wasted memory. khugepaged only
operates on memory that was either allocated with enabled=always or
MADV_HUGEPAGE.
Introduce the ability to set enabled=defer, which will prevent THPs from
being allocated by the page fault handler unless madvise is set,
leaving it up to khugepaged to decide which allocations will collapse to a
THP. This should allow applications to benefit from THPs, while curbing
some of the memory waste.
Co-developed-by: Rafael Aquini <raquini@redhat.com>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
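A rough user-space sketch of the policy described in the changelog above may
help reviewers. It is a simplified model, not kernel code: the enum and
function names are hypothetical, and it ignores per-size (mTHP) settings,
VM_NOHUGEPAGE and every other check that thp_vma_allowable_orders() performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; none of them are kernel symbols. */
enum thp_mode { MODE_NEVER, MODE_MADVISE, MODE_DEFER, MODE_ALWAYS };

/* May the page fault handler install a THP for this mapping? */
bool pf_may_use_thp(enum thp_mode mode, bool madv_hugepage)
{
	switch (mode) {
	case MODE_ALWAYS:
		return true;
	case MODE_MADVISE:
	case MODE_DEFER:
		/* defer behaves like madvise at fault time */
		return madv_hugepage;
	default:
		return false;
	}
}

/* May khugepaged later collapse this mapping into a THP? */
bool khugepaged_may_collapse(enum thp_mode mode, bool madv_hugepage)
{
	switch (mode) {
	case MODE_ALWAYS:
	case MODE_DEFER:
		/* defer behaves like always for khugepaged */
		return true;
	case MODE_MADVISE:
		return madv_hugepage;
	default:
		return false;
	}
}

/* The behaviour the changelog describes for enabled=defer. */
void check_defer_semantics(void)
{
	assert(!pf_may_use_thp(MODE_DEFER, false));
	assert(pf_may_use_thp(MODE_DEFER, true));
	assert(khugepaged_may_collapse(MODE_DEFER, false));
}
```

In this model defer is the only mode where the two functions disagree for a
mapping without MADV_HUGEPAGE, which is the middle ground the series aims for.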
include/linux/huge_mm.h | 15 +++++++++++++--
mm/huge_memory.c | 31 +++++++++++++++++++++++++++----
2 files changed, 40 insertions(+), 6 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 782d3a7854b4..b88cc3154ec0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -48,6 +48,7 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_UNSUPPORTED,
TRANSPARENT_HUGEPAGE_FLAG,
TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
@@ -186,6 +187,7 @@ static inline bool hugepage_global_enabled(void)
{
return transparent_hugepage_flags &
((1<<TRANSPARENT_HUGEPAGE_FLAG) |
+ (1<<TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG) |
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
}
@@ -195,6 +197,12 @@ static inline bool hugepage_global_always(void)
(1<<TRANSPARENT_HUGEPAGE_FLAG);
}
+static inline bool hugepage_global_defer(void)
+{
+ return transparent_hugepage_flags &
+ (1<<TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG);
+}
+
static inline int highest_order(unsigned long orders)
{
return fls_long(orders) - 1;
@@ -291,13 +299,16 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long tva_flags,
unsigned long orders)
{
+ if ((tva_flags & TVA_IN_PF) && hugepage_global_defer() &&
+ !(vm_flags & VM_HUGEPAGE))
+ return 0;
+
/* Optimization to check if required orders are enabled early. */
if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
unsigned long mask = READ_ONCE(huge_anon_orders_always);
-
if (vm_flags & VM_HUGEPAGE)
mask |= READ_ONCE(huge_anon_orders_madvise);
- if (hugepage_global_always() ||
+ if (hugepage_global_always() || hugepage_global_defer() ||
((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
mask |= READ_ONCE(huge_anon_orders_inherit);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de4704af0022..568ae2363959 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -297,12 +297,15 @@ static ssize_t enabled_show(struct kobject *kobj,
const char *output;
if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags))
- output = "[always] madvise never";
+ output = "[always] madvise defer never";
else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
&transparent_hugepage_flags))
- output = "always [madvise] never";
+ output = "always [madvise] defer never";
+ else if (test_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
+ &transparent_hugepage_flags))
+ output = "always madvise [defer] never";
else
- output = "always madvise [never]";
+ output = "always madvise defer [never]";
return sysfs_emit(buf, "%s\n", output);
}
@@ -315,13 +318,20 @@ static ssize_t enabled_store(struct kobject *kobj,
if (sysfs_streq(buf, "always")) {
clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG, &transparent_hugepage_flags);
set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+ } else if (sysfs_streq(buf, "defer")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+ set_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG, &transparent_hugepage_flags);
} else if (sysfs_streq(buf, "madvise")) {
clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG, &transparent_hugepage_flags);
set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
} else if (sysfs_streq(buf, "never")) {
clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG, &transparent_hugepage_flags);
} else
ret = -EINVAL;
@@ -954,18 +964,31 @@ static int __init setup_transparent_hugepage(char *str)
&transparent_hugepage_flags);
clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
&transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
+ &transparent_hugepage_flags);
ret = 1;
+ } else if (!strcmp(str, "defer")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
+ &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+ &transparent_hugepage_flags);
+ set_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
+ &transparent_hugepage_flags);
} else if (!strcmp(str, "madvise")) {
clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
&transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
+ &transparent_hugepage_flags);
set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
- &transparent_hugepage_flags);
+ &transparent_hugepage_flags);
ret = 1;
} else if (!strcmp(str, "never")) {
clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
&transparent_hugepage_flags);
clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
&transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
+ &transparent_hugepage_flags);
ret = 1;
}
out:
--
2.48.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v4 2/4] mm: document (m)THP defer usage
2025-04-17 0:18 [PATCH v4 0/4] mm: introduce THP deferred setting Nico Pache
2025-04-17 0:18 ` [PATCH v4 1/4] mm: defer THP insertion to khugepaged Nico Pache
@ 2025-04-17 0:18 ` Nico Pache
2025-04-18 1:02 ` Bagas Sanjaya
2025-04-17 0:18 ` [PATCH v4 3/4] khugepaged: add defer option to mTHP options Nico Pache
` (2 subsequent siblings)
4 siblings, 1 reply; 8+ messages in thread
From: Nico Pache @ 2025-04-17 0:18 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-kselftest
Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david,
baohua, baolin.wang, ryan.roberts, willy, peterx, shuah, ziy,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
dev.jain, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
rientjes, mhocko, rdunlap
The new defer option for (m)THPs allows for a more conservative
approach to (m)THPs. Document its usage in the transhuge admin-guide.
Signed-off-by: Nico Pache <npache@redhat.com>
---
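To make the max_ptes_none advice in this patch concrete, here is a small
arithmetic sketch. It assumes 4 KiB base pages and a 2 MiB PMD (512 PTEs per
PMD); other configurations differ. The helper name is hypothetical, and the
sketch ignores the mTHP scaling from the dependent khugepaged series as well as
the swap and shared-PTE limits.

```c
#include <assert.h>

/* Assumption: 4 KiB base pages and a 2 MiB PMD, i.e. 512 PTEs per PMD. */
#define PTES_PER_PMD 512u

/*
 * Fewest populated PTEs a PMD-sized range needs before khugepaged may
 * collapse it, given that at most max_ptes_none entries may be empty.
 */
unsigned int min_populated_ptes(unsigned int max_ptes_none)
{
	/* sysfs accepts values from 0 to PTES_PER_PMD - 1 */
	assert(max_ptes_none < PTES_PER_PMD);
	return PTES_PER_PMD - max_ptes_none;
}
```

With the default of 511 a single touched page is enough to qualify a range for
collapse, while max_ptes_none=64 requires at least 448 populated PTEs.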
Documentation/admin-guide/mm/transhuge.rst | 31 ++++++++++++++++------
1 file changed, 23 insertions(+), 8 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 06814e05e1d5..38e1778d9eaa 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -88,8 +88,9 @@ In certain cases when hugepages are enabled system wide, application
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it, in that case a 2M page might
be allocated instead of a 4k page for no good. This is why it's
-possible to disable hugepages system-wide and to only have them inside
-MADV_HUGEPAGE madvise regions.
+possible to disable hugepages system-wide, only have them inside
+MADV_HUGEPAGE madvise regions, or defer them away from the page fault
+handler to khugepaged.
Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting any precious byte of memory and to
@@ -99,6 +100,15 @@ Applications that gets a lot of benefit from hugepages and that don't
risk to lose memory by using hugepages, should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
+Applications that would like to benefit from THPs but would still like a
+more memory conservative approach can choose 'defer'. This avoids
+inserting THPs at the page fault handler unless they are MADV_HUGEPAGE.
+Khugepaged will then scan the mappings for potential collapses into (m)THP
+pages. Admins using the 'defer' setting should consider
+tweaking khugepaged/max_ptes_none. The current default of 511 may
+aggressively collapse your PTEs into PMDs. Lower this value to conserve
+more memory (e.g., max_ptes_none=64).
+
.. _thp_sysfs:
sysfs
@@ -109,11 +119,14 @@ Global THP controls
Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
-regions (to avoid the risk of consuming more memory resources) or enabled
-system wide. This can be achieved per-supported-THP-size with one of::
+regions (to avoid the risk of consuming more memory resources), deferred to
+khugepaged, or enabled system wide.
+
+This can be achieved per-supported-THP-size with one of::
echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+ echo defer >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
where <size> is the hugepage size being addressed, the available sizes
@@ -136,6 +149,7 @@ The top-level setting (for use with "inherit") can be set by issuing
one of the following commands::
echo always >/sys/kernel/mm/transparent_hugepage/enabled
+ echo defer >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled
@@ -282,7 +296,8 @@ of small pages into one large page::
A higher value leads to use additional memory for programs.
A lower value leads to gain less thp performance. Value of
max_ptes_none can waste cpu time very little, you can
-ignore it.
+ignore it. Consider lowering this value when using
+``transparent_hugepage=defer``
``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::
@@ -307,14 +322,14 @@ Boot parameters
You can change the sysfs boot time default for the top-level "enabled"
control by passing the parameter ``transparent_hugepage=always`` or
-``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
-kernel command line.
+``transparent_hugepage=madvise`` or ``transparent_hugepage=defer`` or
+``transparent_hugepage=never`` to the kernel command line.
Alternatively, each supported anonymous THP size can be controlled by
passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``,
where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and
supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``,
-``never`` or ``inherit``.
+``defer``, ``never`` or ``inherit``.
For example, the following will set 16K, 32K, 64K THP to ``always``,
set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M
--
2.48.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v4 2/4] mm: document (m)THP defer usage
2025-04-17 0:18 ` [PATCH v4 2/4] mm: document (m)THP defer usage Nico Pache
@ 2025-04-18 1:02 ` Bagas Sanjaya
0 siblings, 0 replies; 8+ messages in thread
From: Bagas Sanjaya @ 2025-04-18 1:02 UTC (permalink / raw)
To: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-kselftest
Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david,
baohua, baolin.wang, ryan.roberts, willy, peterx, shuah, ziy,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
dev.jain, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
rientjes, mhocko, rdunlap
On Wed, Apr 16, 2025 at 06:18:44PM -0600, Nico Pache wrote:
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 06814e05e1d5..38e1778d9eaa 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -88,8 +88,9 @@ In certain cases when hugepages are enabled system wide, application
> may end up allocating more memory resources. An application may mmap a
> large region but only touch 1 byte of it, in that case a 2M page might
> be allocated instead of a 4k page for no good. This is why it's
> -possible to disable hugepages system-wide and to only have them inside
> -MADV_HUGEPAGE madvise regions.
> +possible to disable hugepages system-wide, only have them inside
> +MADV_HUGEPAGE madvise regions, or defer them away from the page fault
> +handler to khugepaged.
>
> Embedded systems should enable hugepages only inside madvise regions
> to eliminate any risk of wasting any precious byte of memory and to
> @@ -99,6 +100,15 @@ Applications that gets a lot of benefit from hugepages and that don't
> risk to lose memory by using hugepages, should use
> madvise(MADV_HUGEPAGE) on their critical mmapped regions.
>
> +Applications that would like to benefit from THPs but would still like a
> +more memory conservative approach can choose 'defer'. This avoids
> +inserting THPs at the page fault handler unless they are MADV_HUGEPAGE.
> +Khugepaged will then scan the mappings for potential collapses into (m)THP
> +pages. Admins using the 'defer' setting should consider
> +tweaking khugepaged/max_ptes_none. The current default of 511 may
> +aggressively collapse your PTEs into PMDs. Lower this value to conserve
> +more memory (e.g., max_ptes_none=64).
> +
> .. _thp_sysfs:
>
> sysfs
> @@ -109,11 +119,14 @@ Global THP controls
>
> Transparent Hugepage Support for anonymous memory can be entirely disabled
> (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
> -regions (to avoid the risk of consuming more memory resources) or enabled
> -system wide. This can be achieved per-supported-THP-size with one of::
> +regions (to avoid the risk of consuming more memory resources), deferred to
> +khugepaged, or enabled system wide.
> +
> +This can be achieved per-supported-THP-size with one of::
>
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
> echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
> + echo defer >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
> echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
>
> where <size> is the hugepage size being addressed, the available sizes
> @@ -136,6 +149,7 @@ The top-level setting (for use with "inherit") can be set by issuing
> one of the following commands::
>
> echo always >/sys/kernel/mm/transparent_hugepage/enabled
> + echo defer >/sys/kernel/mm/transparent_hugepage/enabled
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo never >/sys/kernel/mm/transparent_hugepage/enabled
>
> @@ -282,7 +296,8 @@ of small pages into one large page::
> A higher value leads to use additional memory for programs.
> A lower value leads to gain less thp performance. Value of
> max_ptes_none can waste cpu time very little, you can
> -ignore it.
> +ignore it. Consider lowering this value when using
> +``transparent_hugepage=defer``
>
> ``max_ptes_swap`` specifies how many pages can be brought in from
> swap when collapsing a group of pages into a transparent huge page::
> @@ -307,14 +322,14 @@ Boot parameters
>
> You can change the sysfs boot time default for the top-level "enabled"
> control by passing the parameter ``transparent_hugepage=always`` or
> -``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
> -kernel command line.
> +``transparent_hugepage=madvise`` or ``transparent_hugepage=defer`` or
> +``transparent_hugepage=never`` to the kernel command line.
>
> Alternatively, each supported anonymous THP size can be controlled by
> passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``,
> where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and
> supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``,
> -``never`` or ``inherit``.
> +``defer``, ``never`` or ``inherit``.
>
> For example, the following will set 16K, 32K, 64K THP to ``always``,
> set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M
Looks good, thanks!
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
--
An old man doll... just what I always wanted! - Clara
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v4 3/4] khugepaged: add defer option to mTHP options
2025-04-17 0:18 [PATCH v4 0/4] mm: introduce THP deferred setting Nico Pache
2025-04-17 0:18 ` [PATCH v4 1/4] mm: defer THP insertion to khugepaged Nico Pache
2025-04-17 0:18 ` [PATCH v4 2/4] mm: document (m)THP defer usage Nico Pache
@ 2025-04-17 0:18 ` Nico Pache
2025-04-17 23:09 ` Andrew Morton
2025-04-17 0:18 ` [PATCH v4 4/4] selftests: mm: add defer to thp setting parser Nico Pache
2025-04-17 23:11 ` [PATCH v4 0/4] mm: introduce THP deferred setting Andrew Morton
4 siblings, 1 reply; 8+ messages in thread
From: Nico Pache @ 2025-04-17 0:18 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-kselftest
Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david,
baohua, baolin.wang, ryan.roberts, willy, peterx, shuah, ziy,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
dev.jain, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
rientjes, mhocko, rdunlap
Now that we have defer to globally disable THPs at fault time, let's add
a defer setting to the mTHP options. This will allow khugepaged to
operate at that order, while avoiding it at PF time.
Signed-off-by: Nico Pache <npache@redhat.com>
---
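A note on the flag encoding, as a stand-alone sketch: TVA_IN_KHUGEPAGE reuses
the TVA_ENFORCE_SYSFS bit and adds bit 3, so existing sysfs checks keep
working for khugepaged callers, while the new check has to compare against
the full mask. The macro values are copied from the huge_mm.h hunk in this
patch; the two helper functions are hypothetical and exist only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the include/linux/huge_mm.h hunk in this patch. */
#define TVA_SMAPS		(1 << 0)
#define TVA_IN_PF		(1 << 1)
#define TVA_ENFORCE_SYSFS	(1 << 2)
#define TVA_IN_KHUGEPAGE	((1 << 2) | (1 << 3))

/* khugepaged callers implicitly ask for sysfs enforcement as well. */
static_assert((TVA_IN_KHUGEPAGE & TVA_ENFORCE_SYSFS) != 0,
	      "TVA_IN_KHUGEPAGE must include TVA_ENFORCE_SYSFS");

/* True for any caller that must obey sysfs, khugepaged included. */
bool enforces_sysfs(unsigned long tva_flags)
{
	return (tva_flags & TVA_ENFORCE_SYSFS) != 0;
}

/* True only when both bits of TVA_IN_KHUGEPAGE are set. */
bool is_khugepaged_caller(unsigned long tva_flags)
{
	return (tva_flags & TVA_IN_KHUGEPAGE) == TVA_IN_KHUGEPAGE;
}
```

A plain `tva_flags & TVA_IN_KHUGEPAGE` test would also be true for callers
passing only TVA_ENFORCE_SYSFS, which is why the equality comparison is used.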
include/linux/huge_mm.h | 5 +++++
mm/huge_memory.c | 38 +++++++++++++++++++++++++++++++++-----
mm/khugepaged.c | 10 +++++-----
3 files changed, 43 insertions(+), 10 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b88cc3154ec0..a4c87d80badc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,6 +96,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
#define TVA_SMAPS (1 << 0) /* Will be used for procfs */
#define TVA_IN_PF (1 << 1) /* Page fault handler */
#define TVA_ENFORCE_SYSFS (1 << 2) /* Obey sysfs configuration */
+#define TVA_IN_KHUGEPAGE ((1 << 2) | (1 << 3)) /* Khugepaged defer support */
#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
@@ -182,6 +183,7 @@ extern unsigned long transparent_hugepage_flags;
extern unsigned long huge_anon_orders_always;
extern unsigned long huge_anon_orders_madvise;
extern unsigned long huge_anon_orders_inherit;
+extern unsigned long huge_anon_orders_defer;
static inline bool hugepage_global_enabled(void)
{
@@ -306,6 +308,9 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
/* Optimization to check if required orders are enabled early. */
if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
unsigned long mask = READ_ONCE(huge_anon_orders_always);
+
+ if ((tva_flags & TVA_IN_KHUGEPAGE) == TVA_IN_KHUGEPAGE)
+ mask |= READ_ONCE(huge_anon_orders_defer);
if (vm_flags & VM_HUGEPAGE)
mask |= READ_ONCE(huge_anon_orders_madvise);
if (hugepage_global_always() || hugepage_global_defer() ||
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 568ae2363959..f10d307091d8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -81,6 +81,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL;
unsigned long huge_anon_orders_always __read_mostly;
unsigned long huge_anon_orders_madvise __read_mostly;
unsigned long huge_anon_orders_inherit __read_mostly;
+unsigned long huge_anon_orders_defer __read_mostly;
static bool anon_orders_configured __initdata;
static inline bool file_thp_enabled(struct vm_area_struct *vma)
@@ -505,13 +506,15 @@ static ssize_t anon_enabled_show(struct kobject *kobj,
const char *output;
if (test_bit(order, &huge_anon_orders_always))
- output = "[always] inherit madvise never";
+ output = "[always] inherit madvise defer never";
else if (test_bit(order, &huge_anon_orders_inherit))
- output = "always [inherit] madvise never";
+ output = "always [inherit] madvise defer never";
else if (test_bit(order, &huge_anon_orders_madvise))
- output = "always inherit [madvise] never";
+ output = "always inherit [madvise] defer never";
+ else if (test_bit(order, &huge_anon_orders_defer))
+ output = "always inherit madvise [defer] never";
else
- output = "always inherit madvise [never]";
+ output = "always inherit madvise defer [never]";
return sysfs_emit(buf, "%s\n", output);
}
@@ -527,25 +530,36 @@ static ssize_t anon_enabled_store(struct kobject *kobj,
spin_lock(&huge_anon_orders_lock);
clear_bit(order, &huge_anon_orders_inherit);
clear_bit(order, &huge_anon_orders_madvise);
+ clear_bit(order, &huge_anon_orders_defer);
set_bit(order, &huge_anon_orders_always);
spin_unlock(&huge_anon_orders_lock);
} else if (sysfs_streq(buf, "inherit")) {
spin_lock(&huge_anon_orders_lock);
clear_bit(order, &huge_anon_orders_always);
clear_bit(order, &huge_anon_orders_madvise);
+ clear_bit(order, &huge_anon_orders_defer);
set_bit(order, &huge_anon_orders_inherit);
spin_unlock(&huge_anon_orders_lock);
} else if (sysfs_streq(buf, "madvise")) {
spin_lock(&huge_anon_orders_lock);
clear_bit(order, &huge_anon_orders_always);
clear_bit(order, &huge_anon_orders_inherit);
+ clear_bit(order, &huge_anon_orders_defer);
set_bit(order, &huge_anon_orders_madvise);
spin_unlock(&huge_anon_orders_lock);
+ } else if (sysfs_streq(buf, "defer")) {
+ spin_lock(&huge_anon_orders_lock);
+ clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_inherit);
+ clear_bit(order, &huge_anon_orders_madvise);
+ set_bit(order, &huge_anon_orders_defer);
+ spin_unlock(&huge_anon_orders_lock);
} else if (sysfs_streq(buf, "never")) {
spin_lock(&huge_anon_orders_lock);
clear_bit(order, &huge_anon_orders_always);
clear_bit(order, &huge_anon_orders_inherit);
clear_bit(order, &huge_anon_orders_madvise);
+ clear_bit(order, &huge_anon_orders_defer);
spin_unlock(&huge_anon_orders_lock);
} else
ret = -EINVAL;
@@ -1002,7 +1016,7 @@ static char str_dup[PAGE_SIZE] __initdata;
static int __init setup_thp_anon(char *str)
{
char *token, *range, *policy, *subtoken;
- unsigned long always, inherit, madvise;
+ unsigned long always, inherit, madvise, defer;
char *start_size, *end_size;
int start, end, nr;
char *p;
@@ -1014,6 +1028,8 @@ static int __init setup_thp_anon(char *str)
always = huge_anon_orders_always;
madvise = huge_anon_orders_madvise;
inherit = huge_anon_orders_inherit;
+ defer = huge_anon_orders_defer;
+
p = str_dup;
while ((token = strsep(&p, ";")) != NULL) {
range = strsep(&token, ":");
@@ -1053,18 +1069,28 @@ static int __init setup_thp_anon(char *str)
bitmap_set(&always, start, nr);
bitmap_clear(&inherit, start, nr);
bitmap_clear(&madvise, start, nr);
+ bitmap_clear(&defer, start, nr);
} else if (!strcmp(policy, "madvise")) {
bitmap_set(&madvise, start, nr);
bitmap_clear(&inherit, start, nr);
bitmap_clear(&always, start, nr);
+ bitmap_clear(&defer, start, nr);
} else if (!strcmp(policy, "inherit")) {
bitmap_set(&inherit, start, nr);
bitmap_clear(&madvise, start, nr);
bitmap_clear(&always, start, nr);
+ bitmap_clear(&defer, start, nr);
+ } else if (!strcmp(policy, "defer")) {
+ bitmap_set(&defer, start, nr);
+ bitmap_clear(&madvise, start, nr);
+ bitmap_clear(&always, start, nr);
+ bitmap_clear(&inherit, start, nr);
} else if (!strcmp(policy, "never")) {
bitmap_clear(&inherit, start, nr);
bitmap_clear(&madvise, start, nr);
bitmap_clear(&always, start, nr);
+ bitmap_clear(&defer, start, nr);
+
} else {
pr_err("invalid policy %s in thp_anon boot parameter\n", policy);
goto err;
@@ -1075,6 +1101,8 @@ static int __init setup_thp_anon(char *str)
huge_anon_orders_always = always;
huge_anon_orders_madvise = madvise;
huge_anon_orders_inherit = inherit;
+ huge_anon_orders_defer = defer;
+
anon_orders_configured = true;
return 1;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 38643a681ba5..f9faff6917d3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -491,7 +491,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
{
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
hugepage_pmd_enabled()) {
- if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
+ if (thp_vma_allowable_order(vma, vm_flags, TVA_IN_KHUGEPAGE,
PMD_ORDER))
__khugepaged_enter(vma->vm_mm);
}
@@ -955,7 +955,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
struct collapse_control *cc, int order)
{
struct vm_area_struct *vma;
- unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
+ unsigned long tva_flags = cc->is_khugepaged ? TVA_IN_KHUGEPAGE : 0;
if (unlikely(khugepaged_test_exit_or_disable(mm)))
return SCAN_ANY_PROCESS;
@@ -1430,7 +1430,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
bool writable = false;
int chunk_none_count = 0;
int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
- unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
+ unsigned long tva_flags = cc->is_khugepaged ? TVA_IN_KHUGEPAGE : 0;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
result = find_pmd_or_thp_or_none(mm, address, &pmd);
@@ -2550,7 +2550,7 @@ static int khugepaged_collapse_single_pmd(unsigned long addr,
{
int result = SCAN_FAIL;
struct mm_struct *mm = vma->vm_mm;
- unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
+ unsigned long tva_flags = cc->is_khugepaged ? TVA_IN_KHUGEPAGE : 0;
if (thp_vma_allowable_order(vma, vma->vm_flags,
tva_flags, PMD_ORDER)) {
@@ -2635,7 +2635,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
break;
}
if (!thp_vma_allowable_order(vma, vma->vm_flags,
- TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+ TVA_IN_KHUGEPAGE, PMD_ORDER)) {
skip:
progress++;
continue;
--
2.48.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v4 3/4] khugepaged: add defer option to mTHP options
2025-04-17 0:18 ` [PATCH v4 3/4] khugepaged: add defer option to mTHP options Nico Pache
@ 2025-04-17 23:09 ` Andrew Morton
0 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2025-04-17 23:09 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, corbet,
rostedt, mhiramat, mathieu.desnoyers, david, baohua, baolin.wang,
ryan.roberts, willy, peterx, shuah, ziy, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
On Wed, 16 Apr 2025 18:18:45 -0600 Nico Pache <npache@redhat.com> wrote:
> Now that we have defer to globally disable THPs at fault time, let's add
> a defer setting to the mTHP options. This will allow khugepaged to
> operate at that order, while avoiding it at PF time.
khugepaged.c has changed somewhat in mm.git's mm-new branch. Can you
please take a look?
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v4 4/4] selftests: mm: add defer to thp setting parser
2025-04-17 0:18 [PATCH v4 0/4] mm: introduce THP deferred setting Nico Pache
` (2 preceding siblings ...)
2025-04-17 0:18 ` [PATCH v4 3/4] khugepaged: add defer option to mTHP options Nico Pache
@ 2025-04-17 0:18 ` Nico Pache
2025-04-17 23:11 ` [PATCH v4 0/4] mm: introduce THP deferred setting Andrew Morton
4 siblings, 0 replies; 8+ messages in thread
From: Nico Pache @ 2025-04-17 0:18 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-kselftest
Cc: akpm, corbet, rostedt, mhiramat, mathieu.desnoyers, david,
baohua, baolin.wang, ryan.roberts, willy, peterx, shuah, ziy,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kirill.shutemov, aarcange, raquini,
dev.jain, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, surenb, zokeefe, hannes,
rientjes, mhocko, rdunlap
Add the defer setting to the selftests library for reading THP settings.
Signed-off-by: Nico Pache <npache@redhat.com>
---
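For readers unfamiliar with how the sysfs "enabled" files are consumed, the
sketch below shows one way to find the active (bracketed) value in a line
such as "always inherit madvise [defer] never". It is an illustration only:
the function name and the order of the name table are hypothetical and do not
mirror enum thp_enabled or the selftest helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Candidate names; the order is illustrative, not the selftest's enum. */
static const char * const enabled_names[] = {
	"always", "inherit", "madvise", "defer", "never", NULL
};

/*
 * Return the index in enabled_names of the value shown in brackets,
 * or -1 if the line has no recognised bracketed value.
 */
int active_enabled_index(const char *line)
{
	const char *lb, *rb;
	size_t len;
	int i;

	assert(line != NULL);

	lb = strchr(line, '[');
	rb = lb ? strchr(lb, ']') : NULL;
	if (!rb)
		return -1;

	lb++;				/* skip the opening bracket */
	len = (size_t)(rb - lb);
	for (i = 0; enabled_names[i]; i++) {
		if (strlen(enabled_names[i]) == len &&
		    !strncmp(enabled_names[i], lb, len))
			return i;
	}
	return -1;
}
```

Without "defer" in the name table, a line reporting "[defer]" would be
treated as unrecognised, which is the gap this patch closes in the selftest
string table.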
tools/testing/selftests/mm/thp_settings.c | 1 +
tools/testing/selftests/mm/thp_settings.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/tools/testing/selftests/mm/thp_settings.c b/tools/testing/selftests/mm/thp_settings.c
index ad872af1c81a..b2f9f62b302a 100644
--- a/tools/testing/selftests/mm/thp_settings.c
+++ b/tools/testing/selftests/mm/thp_settings.c
@@ -20,6 +20,7 @@ static const char * const thp_enabled_strings[] = {
"always",
"inherit",
"madvise",
+ "defer",
NULL
};
diff --git a/tools/testing/selftests/mm/thp_settings.h b/tools/testing/selftests/mm/thp_settings.h
index fc131d23d593..0d52e6d4f754 100644
--- a/tools/testing/selftests/mm/thp_settings.h
+++ b/tools/testing/selftests/mm/thp_settings.h
@@ -11,6 +11,7 @@ enum thp_enabled {
THP_ALWAYS,
THP_INHERIT,
THP_MADVISE,
+ THP_DEFER,
};
enum thp_defrag {
--
2.48.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v4 0/4] mm: introduce THP deferred setting
2025-04-17 0:18 [PATCH v4 0/4] mm: introduce THP deferred setting Nico Pache
` (3 preceding siblings ...)
2025-04-17 0:18 ` [PATCH v4 4/4] selftests: mm: add defer to thp setting parser Nico Pache
@ 2025-04-17 23:11 ` Andrew Morton
4 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2025-04-17 23:11 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, corbet,
rostedt, mhiramat, mathieu.desnoyers, david, baohua, baolin.wang,
ryan.roberts, willy, peterx, shuah, ziy, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kirill.shutemov, aarcange, raquini, dev.jain, anshuman.khandual,
catalin.marinas, tiwai, will, dave.hansen, jack, cl, jglisse,
surenb, zokeefe, hannes, rientjes, mhocko, rdunlap
On Wed, 16 Apr 2025 18:18:42 -0600 Nico Pache <npache@redhat.com> wrote:
> This series is a follow-up to [1], which adds mTHP support to khugepaged.
> mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
> configs to make sense. Without it, the global="defer" and mTHP="inherit"
> case is "undefined" behavior.
>
> We've seen cases where customers switching from RHEL7 to RHEL8 see a
> significant increase in the memory footprint for the same workloads.
>
> Through our investigations we found that a large contributing factor to
> the increase in RSS was an increase in THP usage.
>
> For workloads like MySQL, or when using allocators like jemalloc, it is
> often recommended to set /transparent_hugepages/enabled=never. This is
> in part due to performance degradations and increased memory waste.
>
> This series introduces enabled=defer. This setting acts as a middle
> ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
> page fault handler will act normally, making a hugepage if possible. If
> the allocation is not MADV_HUGEPAGE, then the page fault handler will
> default to the base size allocation. The caveat is that khugepaged can
> still operate on pages that are not MADV_HUGEPAGE.
>
> This allows for three things. First, applications specifically designed
> to use hugepages will get them. Second, applications that don't use
> hugepages can still benefit from them without aggressively inserting
> THPs at every possible chance. This curbs the memory waste, and defers
> the use of hugepages to khugepaged. Khugepaged can then scan the memory
> for eligible collapsing. Lastly, there is the added benefit for those who
> want THPs but experience higher latency PFs. Now you can get base page
> performance at the PF handler and hugepage performance for those mappings
> after they collapse.
>
> Admins may want to lower max_ptes_none; if not, khugepaged may
> aggressively collapse single allocations into hugepages.
>
> TESTING:
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - In [1] I provided a script [2] that has multiple access patterns
Namely https://gitlab.com/npache/khugepaged_mthp_test?
Looks useful and could perhaps be directly linked to from this
patchset's [0/N] changelog?
^ permalink raw reply [flat|nested] 8+ messages in thread