* [PATCH v2 0/4] Improve folio split related functions
@ 2025-11-22 2:55 Zi Yan
2025-11-22 2:55 ` [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable() Zi Yan
` (3 more replies)
0 siblings, 4 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-22 2:55 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes
Cc: Andrew Morton, Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
Hi all,
This patchset improves several folio split related functions to avoid
future misuse. The changes are:
1. Consolidated the folio splittable checks by moving the truncated folio
check, huge zero folio check, and writeback folio check into
folio_split_supported(). Changed the function's return type from bool to
int and renamed it to folio_check_splittable() for clarity.
2. Replaced can_split_folio() with an open-coded comparison of
folio_expected_ref_count() and folio_ref_count().
3. Changed min_order_for_split() to always return an order.
4. Fixed folio split stats counting.
Motivation
===
This is based on Wei's observation[1] and solves several potential
issues:
1. Dereferencing a NULL folio->mapping in try_folio_split_to_order() if it
is called on a truncated folio.
2. The negative return value of min_order_for_split() not being handled in
mm/memory-failure.c.
There is no bug in the current code.
Changelog
===
From RFC[2]
1. Renamed folio_split_supported() to folio_check_splittable(), changed
its return type from bool to int to return an error code directly, and
added kernel-doc.
2. Moved the truncated folio check, huge zero folio check, and writeback
check into folio_check_splittable().
3. Changed the huge zero folio check's error code from -EBUSY to -EINVAL.
4. Replaced can_split_folio() with open-coded refcount checks.
5. Changed min_order_for_split() to return 0 for truncated folio instead
of -EBUSY and added kernel-doc.
6. Fixed folio split stats counting.
Comments and feedback are welcome.
Link: https://lore.kernel.org/all/20251120004735.52z7r4xmogw7mbsj@master/ [1]
Link: https://lore.kernel.org/all/20251120035953.1115736-1-ziy@nvidia.com/ [2]
Zi Yan (4):
mm/huge_memory: change folio_split_supported() to
folio_check_splittable()
mm/huge_memory: replace can_split_folio() with direct refcount
calculation
mm/huge_memory: make min_order_for_split() always return an order
mm/huge_memory: fix folio split stats counting
include/linux/huge_mm.h | 17 +++--
mm/huge_memory.c | 152 +++++++++++++++++++++++-----------------
mm/vmscan.c | 3 +-
3 files changed, 101 insertions(+), 71 deletions(-)
--
2.51.0
* [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable()
2025-11-22 2:55 [PATCH v2 0/4] Improve folio split related functions Zi Yan
@ 2025-11-22 2:55 ` Zi Yan
2025-11-23 1:50 ` Wei Yang
` (2 more replies)
2025-11-22 2:55 ` [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation Zi Yan
` (2 subsequent siblings)
3 siblings, 3 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-22 2:55 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes
Cc: Andrew Morton, Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
folio_split_supported() used in try_folio_split_to_order() requires
folio->mapping to be non NULL, but current try_folio_split_to_order() does
not check it. There is no issue in the current code, since
try_folio_split_to_order() is only used in truncate_inode_partial_folio(),
where folio->mapping is not NULL.
To prevent future misuse, move folio->mapping NULL check (i.e., folio is
truncated) into folio_split_supported(). Since folio->mapping NULL check
returns -EBUSY and folio_split_supported() == false means -EINVAL, change
folio_split_supported() return type from bool to int and return error
numbers accordingly. Rename folio_split_supported() to
folio_check_splittable() to match the return type change.
While at it, move is_huge_zero_folio() check and folio_test_writeback()
check into folio_check_splittable() and add kernel-doc.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
include/linux/huge_mm.h | 10 ++++--
mm/huge_memory.c | 74 +++++++++++++++++++++++++----------------
2 files changed, 53 insertions(+), 31 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1d439de1ca2c..97686fb46e30 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -375,8 +375,8 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
int folio_split_unmapped(struct folio *folio, unsigned int new_order);
int min_order_for_split(struct folio *folio);
int split_folio_to_list(struct folio *folio, struct list_head *list);
-bool folio_split_supported(struct folio *folio, unsigned int new_order,
- enum split_type split_type, bool warns);
+int folio_check_splittable(struct folio *folio, unsigned int new_order,
+ enum split_type split_type, bool warns);
int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
struct list_head *list);
@@ -407,7 +407,11 @@ static inline int split_huge_page_to_order(struct page *page, unsigned int new_o
static inline int try_folio_split_to_order(struct folio *folio,
struct page *page, unsigned int new_order)
{
- if (!folio_split_supported(folio, new_order, SPLIT_TYPE_NON_UNIFORM, /* warns= */ false))
+ int ret;
+
+ ret = folio_check_splittable(folio, new_order, SPLIT_TYPE_NON_UNIFORM,
+ /* warns= */ false);
+ if (ret)
return split_huge_page_to_order(&folio->page, new_order);
return folio_split(folio, new_order, page, NULL);
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 041b554c7115..c1f1055165dd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3688,15 +3688,43 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
return 0;
}
-bool folio_split_supported(struct folio *folio, unsigned int new_order,
- enum split_type split_type, bool warns)
+/**
+ * folio_check_splittable() - check if a folio can be split to a given order
+ * @folio: folio to be split
+ * @new_order: the smallest order of the folios after the split (since a
+ * buddy allocator like split generates folios with orders from @folio's
+ * order - 1 down to @new_order).
+ * @split_type: uniform or non-uniform split
+ * @warns: whether to warn when one of the checks fails
+ *
+ * folio_check_splittable() checks if @folio can be split to @new_order using
+ * @split_type method. The truncated folio check must come first.
+ *
+ * Context: folio must be locked.
+ *
+ * Return: 0 - @folio can be split to @new_order, otherwise an error number is
+ * returned.
+ */
+int folio_check_splittable(struct folio *folio, unsigned int new_order,
+ enum split_type split_type, bool warns)
{
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+ /*
+ * Folios that just got truncated cannot get split. Signal to the
+ * caller that there was a race.
+ *
+ * TODO: this will also currently refuse shmem folios that are in the
+ * swapcache.
+ */
+ if (!folio_test_anon(folio) && !folio->mapping)
+ return -EBUSY;
+
if (folio_test_anon(folio)) {
/* order-1 is not supported for anonymous THP. */
VM_WARN_ONCE(warns && new_order == 1,
"Cannot split to order-1 folio");
if (new_order == 1)
- return false;
+ return -EINVAL;
} else if (split_type == SPLIT_TYPE_NON_UNIFORM || new_order) {
if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
!mapping_large_folio_support(folio->mapping)) {
@@ -3719,7 +3747,7 @@ bool folio_split_supported(struct folio *folio, unsigned int new_order,
*/
VM_WARN_ONCE(warns,
"Cannot split file folio to non-0 order");
- return false;
+ return -EINVAL;
}
}
@@ -3734,10 +3762,18 @@ bool folio_split_supported(struct folio *folio, unsigned int new_order,
if ((split_type == SPLIT_TYPE_NON_UNIFORM || new_order) && folio_test_swapcache(folio)) {
VM_WARN_ONCE(warns,
"Cannot split swapcache folio to non-0 order");
- return false;
+ return -EINVAL;
}
- return true;
+ if (is_huge_zero_folio(folio)) {
+ pr_warn_ratelimited("Called split_huge_page for huge zero page\n");
+ return -EINVAL;
+ }
+
+ if (folio_test_writeback(folio))
+ return -EBUSY;
+
+ return 0;
}
static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int new_order,
@@ -3922,7 +3958,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
int remap_flags = 0;
int extra_pins, ret;
pgoff_t end = 0;
- bool is_hzp;
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
@@ -3930,30 +3965,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
if (folio != page_folio(split_at) || folio != page_folio(lock_at))
return -EINVAL;
- /*
- * Folios that just got truncated cannot get split. Signal to the
- * caller that there was a race.
- *
- * TODO: this will also currently refuse shmem folios that are in the
- * swapcache.
- */
- if (!is_anon && !folio->mapping)
- return -EBUSY;
-
if (new_order >= old_order)
return -EINVAL;
- if (!folio_split_supported(folio, new_order, split_type, /* warn = */ true))
- return -EINVAL;
-
- is_hzp = is_huge_zero_folio(folio);
- if (is_hzp) {
- pr_warn_ratelimited("Called split_huge_page for huge zero page\n");
- return -EBUSY;
- }
-
- if (folio_test_writeback(folio))
- return -EBUSY;
+ ret = folio_check_splittable(folio, new_order, split_type,
+ /* warn = */ true);
+ if (ret)
+ return ret;
if (is_anon) {
/*
--
2.51.0
* [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-22 2:55 [PATCH v2 0/4] Improve folio split related functions Zi Yan
2025-11-22 2:55 ` [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable() Zi Yan
@ 2025-11-22 2:55 ` Zi Yan
2025-11-23 1:51 ` Wei Yang
` (2 more replies)
2025-11-22 2:55 ` [PATCH v2 3/4] mm/huge_memory: make min_order_for_split() always return an order Zi Yan
2025-11-22 2:55 ` [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting Zi Yan
3 siblings, 3 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-22 2:55 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes
Cc: Andrew Morton, Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
can_split_folio() is just a refcount comparison, making sure only the
split caller holds an extra pin. Open code it with
folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
used by folio_ref_freeze(), add folio_cache_references() to calculate it.
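(Illustrative only, not part of this patch: a userspace toy model of the check
being open coded here. The struct and numbers are made up and PG_private plus
other corner cases are ignored; the point is that the split may proceed only
when the caller's single pin is the only reference beyond the mapcount plus
the page/swap cache references.)

#include <stdbool.h>
#include <stdio.h>

struct toy_folio {
        unsigned int nr_pages;
        unsigned int mapcount;  /* references from page table mappings */
        bool in_cache;          /* pagecache folio, or anon folio in swapcache */
        unsigned int refcount;  /* total references currently held */
};

/* rough analogue of folio_expected_ref_count(): mapcount plus cache references */
static unsigned int expected_ref_count(const struct toy_folio *f)
{
        return f->mapcount + (f->in_cache ? f->nr_pages : 0);
}

int main(void)
{
        /* the caller's pin is the only extra reference -> splittable */
        struct toy_folio ok = { .nr_pages = 512, .mapcount = 3,
                                .in_cache = true, .refcount = 3 + 512 + 1 };
        /* one more reference (e.g. a GUP pin) -> not splittable */
        struct toy_folio busy = ok;

        busy.refcount++;
        printf("ok:   %s\n", expected_ref_count(&ok) == ok.refcount - 1 ?
               "can split" : "-EAGAIN");
        printf("busy: %s\n", expected_ref_count(&busy) == busy.refcount - 1 ?
               "can split" : "-EAGAIN");
        return 0;
}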
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
include/linux/huge_mm.h | 1 -
mm/huge_memory.c | 43 ++++++++++++++++-------------------------
mm/vmscan.c | 3 ++-
3 files changed, 19 insertions(+), 28 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 97686fb46e30..1ecaeccf39c9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -369,7 +369,6 @@ enum split_type {
SPLIT_TYPE_NON_UNIFORM,
};
-bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order);
int folio_split_unmapped(struct folio *folio, unsigned int new_order);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c1f1055165dd..6c821c1c0ac3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3455,23 +3455,6 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
}
}
-/* Racy check whether the huge page can be split */
-bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
-{
- int extra_pins;
-
- /* Additional pins from page cache */
- if (folio_test_anon(folio))
- extra_pins = folio_test_swapcache(folio) ?
- folio_nr_pages(folio) : 0;
- else
- extra_pins = folio_nr_pages(folio);
- if (pextra_pins)
- *pextra_pins = extra_pins;
- return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
- caller_pins;
-}
-
static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
{
for (; nr_pages; page++, nr_pages--)
@@ -3776,17 +3759,26 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
return 0;
}
+/* Number of folio references from the pagecache or the swapcache. */
+static unsigned int folio_cache_references(const struct folio *folio)
+{
+ if (folio_test_anon(folio) && !folio_test_swapcache(folio))
+ return 0;
+ return folio_nr_pages(folio);
+}
+
static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int new_order,
struct page *split_at, struct xa_state *xas,
struct address_space *mapping, bool do_lru,
struct list_head *list, enum split_type split_type,
- pgoff_t end, int *nr_shmem_dropped, int extra_pins)
+ pgoff_t end, int *nr_shmem_dropped)
{
struct folio *end_folio = folio_next(folio);
struct folio *new_folio, *next;
int old_order = folio_order(folio);
int ret = 0;
struct deferred_split *ds_queue;
+ int extra_pins = folio_cache_references(folio);
VM_WARN_ON_ONCE(!mapping && end);
/* Prevent deferred_split_scan() touching ->_refcount */
@@ -3956,7 +3948,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
struct folio *new_folio, *next;
int nr_shmem_dropped = 0;
int remap_flags = 0;
- int extra_pins, ret;
+ int ret;
pgoff_t end = 0;
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
@@ -4036,7 +4028,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
* Racy check if we can split the page, before unmap_folio() will
* split PMDs
*/
- if (!can_split_folio(folio, 1, &extra_pins)) {
+ if (folio_expected_ref_count(folio) != folio_ref_count(folio) - 1) {
ret = -EAGAIN;
goto out_unlock;
}
@@ -4059,8 +4051,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
}
ret = __folio_freeze_and_split_unmapped(folio, new_order, split_at, &xas, mapping,
- true, list, split_type, end, &nr_shmem_dropped,
- extra_pins);
+ true, list, split_type, end, &nr_shmem_dropped);
fail:
if (mapping)
xas_unlock(&xas);
@@ -4134,20 +4125,20 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
*/
int folio_split_unmapped(struct folio *folio, unsigned int new_order)
{
- int extra_pins, ret = 0;
+ int ret = 0;
VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_anon(folio), folio);
- if (!can_split_folio(folio, 1, &extra_pins))
+ if (folio_expected_ref_count(folio) != folio_ref_count(folio) - 1)
return -EAGAIN;
local_irq_disable();
ret = __folio_freeze_and_split_unmapped(folio, new_order, &folio->page, NULL,
NULL, false, NULL, SPLIT_TYPE_UNIFORM,
- 0, NULL, extra_pins);
+ 0, NULL);
local_irq_enable();
return ret;
}
@@ -4640,7 +4631,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
* can be split or not. So skip the check here.
*/
if (!folio_test_private(folio) &&
- !can_split_folio(folio, 0, NULL))
+ folio_expected_ref_count(folio) != folio_ref_count(folio))
goto next;
if (!folio_trylock(folio))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 92980b072121..3b85652a42b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1284,7 +1284,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
goto keep_locked;
if (folio_test_large(folio)) {
/* cannot split folio, skip it */
- if (!can_split_folio(folio, 1, NULL))
+ if (folio_expected_ref_count(folio) !=
+ folio_ref_count(folio) - 1)
goto activate_locked;
/*
* Split partially mapped folios right away.
--
2.51.0
* [PATCH v2 3/4] mm/huge_memory: make min_order_for_split() always return an order
2025-11-22 2:55 [PATCH v2 0/4] Improve folio split related functions Zi Yan
2025-11-22 2:55 ` [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable() Zi Yan
2025-11-22 2:55 ` [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation Zi Yan
@ 2025-11-22 2:55 ` Zi Yan
2025-11-23 1:53 ` Wei Yang
` (2 more replies)
2025-11-22 2:55 ` [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting Zi Yan
3 siblings, 3 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-22 2:55 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes
Cc: Andrew Morton, Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
min_order_for_split() returns -EBUSY when the folio is truncated and cannot
be split. In commit 77008e1b2ef7 ("mm/huge_memory: do not change
split_huge_page*() target order silently"), memory_failure() does not
handle it and passes -EBUSY to try_to_split_thp_page() directly.
try_to_split_thp_page() returns -EINVAL since -EBUSY becomes 0xfffffff0 as
new_order is unsigned int in __folio_split() and this large new_order is
rejected as an invalid input. The code does not cause a bug.
soft_offline_in_use_page() also uses min_order_for_split() but it always
passes 0 as new_order for split.
Fix it by making min_order_for_split() always return an order. When the
given folio is truncated, namely folio->mapping == NULL, return 0 and let
a subsequent split function handle the situation and return -EBUSY.
Add kernel-doc to min_order_for_split() to clarify its use.
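(Illustrative only, not part of this patch: a minimal userspace sketch of the
conversion described above. On Linux, EBUSY is 16, so the old -EBUSY return
value, once assigned to the unsigned int new_order, becomes 0xfffffff0 and is
then rejected because new_order >= old_order.)

#include <errno.h>
#include <stdio.h>

int main(void)
{
        int ret = -EBUSY;               /* what min_order_for_split() used to return */
        unsigned int new_order = ret;   /* implicit signed-to-unsigned conversion */

        printf("new_order = %#x\n", new_order); /* prints new_order = 0xfffffff0 */
        return 0;
}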
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
include/linux/huge_mm.h | 6 +++---
mm/huge_memory.c | 25 +++++++++++++++++++------
2 files changed, 22 insertions(+), 9 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1ecaeccf39c9..9b3a4e2b0668 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -372,7 +372,7 @@ enum split_type {
int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order);
int folio_split_unmapped(struct folio *folio, unsigned int new_order);
-int min_order_for_split(struct folio *folio);
+unsigned int min_order_for_split(struct folio *folio);
int split_folio_to_list(struct folio *folio, struct list_head *list);
int folio_check_splittable(struct folio *folio, unsigned int new_order,
enum split_type split_type, bool warns);
@@ -634,10 +634,10 @@ static inline int split_huge_page(struct page *page)
return -EINVAL;
}
-static inline int min_order_for_split(struct folio *folio)
+static inline unsigned int min_order_for_split(struct folio *folio)
{
VM_WARN_ON_ONCE_FOLIO(1, folio);
- return -EINVAL;
+ return 0;
}
static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6c821c1c0ac3..ebc3ba0907fd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4230,16 +4230,29 @@ int folio_split(struct folio *folio, unsigned int new_order,
SPLIT_TYPE_NON_UNIFORM);
}
-int min_order_for_split(struct folio *folio)
+/**
+ * min_order_for_split() - get the minimum order @folio can be split to
+ * @folio: folio to split
+ *
+ * min_order_for_split() tells the minimum order @folio can be split to.
+ * If a file-backed folio is truncated, 0 will be returned. Any subsequent
+ * split attempt should get -EBUSY from split checking code.
+ *
+ * Return: @folio's minimum order for split
+ */
+unsigned int min_order_for_split(struct folio *folio)
{
if (folio_test_anon(folio))
return 0;
- if (!folio->mapping) {
- if (folio_test_pmd_mappable(folio))
- count_vm_event(THP_SPLIT_PAGE_FAILED);
- return -EBUSY;
- }
+ /*
+ * If the folio got truncated, we don't know the previous mapping and
+ * consequently the old min order. But it doesn't matter, as any split
+ * attempt will immediately fail with -EBUSY as the folio cannot get
+ * split until freed.
+ */
+ if (!folio->mapping)
+ return 0;
return mapping_min_folio_order(folio->mapping);
}
--
2.51.0
* [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting
2025-11-22 2:55 [PATCH v2 0/4] Improve folio split related functions Zi Yan
` (2 preceding siblings ...)
2025-11-22 2:55 ` [PATCH v2 3/4] mm/huge_memory: make min_order_for_split() always return an order Zi Yan
@ 2025-11-22 2:55 ` Zi Yan
2025-11-23 1:56 ` Wei Yang
` (2 more replies)
3 siblings, 3 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-22 2:55 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes
Cc: Andrew Morton, Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
The "return <error code>" statements for error checks at the beginning of
__folio_split() skip necessary count_vm_event() and count_mthp_stat() at
the end of the function. Fix these by replacing them with
"ret = <error code>; goto out;".
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/huge_memory.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ebc3ba0907fd..a42c4f29ce4f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3954,16 +3954,20 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
- if (folio != page_folio(split_at) || folio != page_folio(lock_at))
- return -EINVAL;
+ if (folio != page_folio(split_at) || folio != page_folio(lock_at)) {
+ ret = -EINVAL;
+ goto out;
+ }
- if (new_order >= old_order)
- return -EINVAL;
+ if (new_order >= old_order) {
+ ret = -EINVAL;
+ goto out;
+ }
ret = folio_check_splittable(folio, new_order, split_type,
/* warn = */ true);
if (ret)
- return ret;
+ goto out;
if (is_anon) {
/*
--
2.51.0
* Re: [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable()
2025-11-22 2:55 ` [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable() Zi Yan
@ 2025-11-23 1:50 ` Wei Yang
2025-11-23 18:38 ` Barry Song
2025-11-25 8:58 ` David Hildenbrand (Red Hat)
2 siblings, 0 replies; 32+ messages in thread
From: Wei Yang @ 2025-11-23 1:50 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand, Lorenzo Stoakes, Andrew Morton, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh,
linux-mm, linux-kernel
On Fri, Nov 21, 2025 at 09:55:26PM -0500, Zi Yan wrote:
>folio_split_supported() used in try_folio_split_to_order() requires
>folio->mapping to be non NULL, but current try_folio_split_to_order() does
>not check it. There is no issue in the current code, since
>try_folio_split_to_order() is only used in truncate_inode_partial_folio(),
>where folio->mapping is not NULL.
>
>To prevent future misuse, move folio->mapping NULL check (i.e., folio is
>truncated) into folio_split_supported(). Since folio->mapping NULL check
>returns -EBUSY and folio_split_supported() == false means -EINVAL, change
>folio_split_supported() return type from bool to int and return error
>numbers accordingly. Rename folio_split_supported() to
>folio_check_splittable() to match the return type change.
>
>While at it, move is_huge_zero_folio() check and folio_test_writeback()
>check into folio_check_splittable() and add kernel-doc.
>
>Signed-off-by: Zi Yan <ziy@nvidia.com>
LGTM, Thanks
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
--
Wei Yang
Help you, Help me
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-22 2:55 ` [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation Zi Yan
@ 2025-11-23 1:51 ` Wei Yang
2025-11-24 10:41 ` David Hildenbrand (Red Hat)
2025-11-24 22:14 ` Balbir Singh
2 siblings, 0 replies; 32+ messages in thread
From: Wei Yang @ 2025-11-23 1:51 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand, Lorenzo Stoakes, Andrew Morton, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh,
linux-mm, linux-kernel
On Fri, Nov 21, 2025 at 09:55:27PM -0500, Zi Yan wrote:
>can_split_folio() is just a refcount comparison, making sure only the
>split caller holds an extra pin. Open code it with
>folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
>used by folio_ref_freeze(), add folio_cache_references() to calculate it.
>
>Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>Signed-off-by: Zi Yan <ziy@nvidia.com>
LGTM, Thanks
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
--
Wei Yang
Help you, Help me
* Re: [PATCH v2 3/4] mm/huge_memory: make min_order_for_split() always return an order
2025-11-22 2:55 ` [PATCH v2 3/4] mm/huge_memory: make min_order_for_split() always return an order Zi Yan
@ 2025-11-23 1:53 ` Wei Yang
2025-11-24 10:43 ` David Hildenbrand (Red Hat)
2025-11-24 15:18 ` Lorenzo Stoakes
2 siblings, 0 replies; 32+ messages in thread
From: Wei Yang @ 2025-11-23 1:53 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand, Lorenzo Stoakes, Andrew Morton, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh,
linux-mm, linux-kernel
On Fri, Nov 21, 2025 at 09:55:28PM -0500, Zi Yan wrote:
>min_order_for_split() returns -EBUSY when the folio is truncated and cannot
>be split. In commit 77008e1b2ef7 ("mm/huge_memory: do not change
>split_huge_page*() target order silently"), memory_failure() does not
>handle it and passes -EBUSY to try_to_split_thp_page() directly.
>try_to_split_thp_page() returns -EINVAL since -EBUSY becomes 0xfffffff0 as
>new_order is unsigned int in __folio_split() and this large new_order is
>rejected as an invalid input. The code does not cause a bug.
>soft_offline_in_use_page() also uses min_order_for_split() but it always
>passes 0 as new_order for split.
>
>Fix it by making min_order_for_split() always return an order. When the
>given folio is truncated, namely folio->mapping == NULL, return 0 and let
>a subsequent split function handle the situation and return -EBUSY.
>
>Add kernel-doc to min_order_for_split() to clarify its use.
>
>Signed-off-by: Zi Yan <ziy@nvidia.com>
LGTM, Thanks
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
--
Wei Yang
Help you, Help me
* Re: [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting
2025-11-22 2:55 ` [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting Zi Yan
@ 2025-11-23 1:56 ` Wei Yang
2025-11-24 10:45 ` David Hildenbrand (Red Hat)
2025-11-24 15:21 ` Lorenzo Stoakes
2 siblings, 0 replies; 32+ messages in thread
From: Wei Yang @ 2025-11-23 1:56 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand, Lorenzo Stoakes, Andrew Morton, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh,
linux-mm, linux-kernel
On Fri, Nov 21, 2025 at 09:55:29PM -0500, Zi Yan wrote:
>The "return <error code>" statements for error checks at the beginning of
>__folio_split() skip necessary count_vm_event() and count_mthp_stat() at
>the end of the function. Fix these by replacing them with
>"ret = <error code>; goto out;".
>
>Signed-off-by: Zi Yan <ziy@nvidia.com>
Sounds reasonable, Thanks.
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
--
Wei Yang
Help you, Help me
* Re: [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable()
2025-11-22 2:55 ` [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable() Zi Yan
2025-11-23 1:50 ` Wei Yang
@ 2025-11-23 18:38 ` Barry Song
2025-11-24 10:33 ` David Hildenbrand (Red Hat)
2025-11-25 8:58 ` David Hildenbrand (Red Hat)
2 siblings, 1 reply; 32+ messages in thread
From: Barry Song @ 2025-11-23 18:38 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand, Lorenzo Stoakes, Andrew Morton, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
Hi Zi Yan,
Thanks for the nice cleanup.
On Sat, Nov 22, 2025 at 10:55 AM Zi Yan <ziy@nvidia.com> wrote:
>
> folio_split_supported() used in try_folio_split_to_order() requires
> folio->mapping to be non NULL, but current try_folio_split_to_order() does
> not check it. There is no issue in the current code, since
> try_folio_split_to_order() is only used in truncate_inode_partial_folio(),
> where folio->mapping is not NULL.
>
> To prevent future misuse, move folio->mapping NULL check (i.e., folio is
> truncated) into folio_split_supported(). Since folio->mapping NULL check
> returns -EBUSY and folio_split_supported() == false means -EINVAL, change
> folio_split_supported() return type from bool to int and return error
> numbers accordingly. Rename folio_split_supported() to
> folio_check_splittable() to match the return type change.
>
> While at it, move is_huge_zero_folio() check and folio_test_writeback()
> check into folio_check_splittable() and add kernel-doc.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> include/linux/huge_mm.h | 10 ++++--
> mm/huge_memory.c | 74 +++++++++++++++++++++++++----------------
> 2 files changed, 53 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 1d439de1ca2c..97686fb46e30 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -375,8 +375,8 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
> int min_order_for_split(struct folio *folio);
> int split_folio_to_list(struct folio *folio, struct list_head *list);
> -bool folio_split_supported(struct folio *folio, unsigned int new_order,
> - enum split_type split_type, bool warns);
> +int folio_check_splittable(struct folio *folio, unsigned int new_order,
> + enum split_type split_type, bool warns);
It feels a bit odd to have a warns parameter here, especially given that it's
a bool. I understand that in one case we're only checking whether a split is
possible, without actually performing it. In the other case, we are performing
the split, so we must confirm it's valid — otherwise it's a bug.
Could we rename split_type to something more like gfp_flags, where we have
variants such as __GFP_NOWARN or something similar? That would make the code
much more readable.
[...]
>
> @@ -3734,10 +3762,18 @@ bool folio_split_supported(struct folio *folio, unsigned int new_order,
> if ((split_type == SPLIT_TYPE_NON_UNIFORM || new_order) && folio_test_swapcache(folio)) {
> VM_WARN_ONCE(warns,
> "Cannot split swapcache folio to non-0 order");
> - return false;
> + return -EINVAL;
> }
>
> - return true;
> + if (is_huge_zero_folio(folio)) {
> + pr_warn_ratelimited("Called split_huge_page for huge zero page\n");
> + return -EINVAL;
> + }
However, I don’t quite understand why this doesn’t check warns or why it
isn’t using VM_WARN_ONCE. Why is the zero-huge case different?
Thanks
Barry
* Re: [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable()
2025-11-23 18:38 ` Barry Song
@ 2025-11-24 10:33 ` David Hildenbrand (Red Hat)
2025-11-24 16:38 ` Zi Yan
0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-24 10:33 UTC (permalink / raw)
To: Barry Song, Zi Yan
Cc: Lorenzo Stoakes, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
On 11/23/25 19:38, Barry Song wrote:
> Hi Zi Yan,
>
> Thanks for the nice cleanup.
>
> On Sat, Nov 22, 2025 at 10:55 AM Zi Yan <ziy@nvidia.com> wrote:
>>
>> folio_split_supported() used in try_folio_split_to_order() requires
>> folio->mapping to be non NULL, but current try_folio_split_to_order() does
>> not check it. There is no issue in the current code, since
>> try_folio_split_to_order() is only used in truncate_inode_partial_folio(),
>> where folio->mapping is not NULL.
>>
>> To prevent future misuse, move folio->mapping NULL check (i.e., folio is
>> truncated) into folio_split_supported(). Since folio->mapping NULL check
>> returns -EBUSY and folio_split_supported() == false means -EINVAL, change
>> folio_split_supported() return type from bool to int and return error
>> numbers accordingly. Rename folio_split_supported() to
>> folio_check_splittable() to match the return type change.
>>
>> While at it, move is_huge_zero_folio() check and folio_test_writeback()
>> check into folio_check_splittable() and add kernel-doc.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>> include/linux/huge_mm.h | 10 ++++--
>> mm/huge_memory.c | 74 +++++++++++++++++++++++++----------------
>> 2 files changed, 53 insertions(+), 31 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 1d439de1ca2c..97686fb46e30 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -375,8 +375,8 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
>> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>> int min_order_for_split(struct folio *folio);
>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>> -bool folio_split_supported(struct folio *folio, unsigned int new_order,
>> - enum split_type split_type, bool warns);
>> +int folio_check_splittable(struct folio *folio, unsigned int new_order,
>> + enum split_type split_type, bool warns);
>
>
> It feels a bit odd to have a warns parameter here, especially given that it's
> a bool. I understand that in one case we're only checking whether a split is
> possible, without actually performing it. In the other case, we are performing
> the split, so we must confirm it's valid — otherwise it's a bug.
>
> Could we rename split_type to something more like gfp_flags, where we have
> variants such as __GFP_NOWARN or something similar? That would make the code
> much more readable.
Could we get rid of the "warns" parameter and simply always do a
pr_warn_once()?
As an alternative, simply move the warning to the single caller
VM_WARN_ONCE(ret == -EINVAL, "Tried to split an unsplittable folio");
--
Cheers
David
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-22 2:55 ` [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation Zi Yan
2025-11-23 1:51 ` Wei Yang
@ 2025-11-24 10:41 ` David Hildenbrand (Red Hat)
2025-11-24 17:05 ` Zi Yan
2025-11-24 22:14 ` Balbir Singh
2 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-24 10:41 UTC (permalink / raw)
To: Zi Yan, Lorenzo Stoakes
Cc: Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
On 11/22/25 03:55, Zi Yan wrote:
> can_split_folio() is just a refcount comparison, making sure only the
> split caller holds an extra pin. Open code it with
> folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
> used by folio_ref_freeze(), add folio_cache_references() to calculate it.
>
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> include/linux/huge_mm.h | 1 -
> mm/huge_memory.c | 43 ++++++++++++++++-------------------------
> mm/vmscan.c | 3 ++-
> 3 files changed, 19 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 97686fb46e30..1ecaeccf39c9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -369,7 +369,6 @@ enum split_type {
> SPLIT_TYPE_NON_UNIFORM,
> };
>
> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> unsigned int new_order);
> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c1f1055165dd..6c821c1c0ac3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3455,23 +3455,6 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
> }
> }
>
> -/* Racy check whether the huge page can be split */
> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
> -{
> - int extra_pins;
> -
> - /* Additional pins from page cache */
> - if (folio_test_anon(folio))
> - extra_pins = folio_test_swapcache(folio) ?
> - folio_nr_pages(folio) : 0;
> - else
> - extra_pins = folio_nr_pages(folio);
> - if (pextra_pins)
> - *pextra_pins = extra_pins;
> - return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
> - caller_pins;
> -}
> -
> static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
> {
> for (; nr_pages; page++, nr_pages--)
> @@ -3776,17 +3759,26 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
> return 0;
> }
>
> +/* Number of folio references from the pagecache or the swapcache. */
> +static unsigned int folio_cache_references(const struct folio *folio)
> +{
> + if (folio_test_anon(folio) && !folio_test_swapcache(folio))
> + return 0;
> + return folio_nr_pages(folio);
> +}
> +
> static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct xa_state *xas,
> struct address_space *mapping, bool do_lru,
> struct list_head *list, enum split_type split_type,
> - pgoff_t end, int *nr_shmem_dropped, int extra_pins)
> + pgoff_t end, int *nr_shmem_dropped)
> {
> struct folio *end_folio = folio_next(folio);
> struct folio *new_folio, *next;
> int old_order = folio_order(folio);
> int ret = 0;
> struct deferred_split *ds_queue;
> + int extra_pins = folio_cache_references(folio);
Can we just inline the call do folio_cache_references() and get rid of extra_pins.
(which is a bad name either way)
if (folio_ref_freeze(folio, folio_cache_references(folio) + 1) {
BTW, now that we have this helper, I wonder if we should then also do for
clarification on the unfreeze path:
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0acdc2f26ee0c..7cbcf61b7971d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3824,8 +3824,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
zone_device_private_split_cb(folio, new_folio);
- expected_refs = folio_expected_ref_count(new_folio) + 1;
- folio_ref_unfreeze(new_folio, expected_refs);
+ folio_ref_unfreeze(new_folio, folio_cache_references(new_folio) + 1);
if (do_lru)
lru_add_split_folio(folio, new_folio, lruvec, list);
@@ -3868,8 +3867,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
* Otherwise, a parallel folio_try_get() can grab @folio
* and its caller can see stale page cache entries.
*/
- expected_refs = folio_expected_ref_count(folio) + 1;
- folio_ref_unfreeze(folio, expected_refs);
+ folio_ref_unfreeze(folio, folio_cache_references(folio) + 1);
if (do_lru)
unlock_page_lruvec(lruvec);
--
Cheers
David
* Re: [PATCH v2 3/4] mm/huge_memory: make min_order_for_split() always return an order
2025-11-22 2:55 ` [PATCH v2 3/4] mm/huge_memory: make min_order_for_split() always return an order Zi Yan
2025-11-23 1:53 ` Wei Yang
@ 2025-11-24 10:43 ` David Hildenbrand (Red Hat)
2025-11-24 15:18 ` Lorenzo Stoakes
2 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-24 10:43 UTC (permalink / raw)
To: Zi Yan, Lorenzo Stoakes
Cc: Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
On 11/22/25 03:55, Zi Yan wrote:
> min_order_for_split() returns -EBUSY when the folio is truncated and cannot
> be split. In commit 77008e1b2ef7 ("mm/huge_memory: do not change
> split_huge_page*() target order silently"), memory_failure() does not
> handle it and passes -EBUSY to try_to_split_thp_page() directly.
> try_to_split_thp_page() returns -EINVAL since -EBUSY becomes 0xfffffff0 as
> new_order is unsigned int in __folio_split() and this large new_order is
> rejected as an invalid input. The code does not cause a bug.
> soft_offline_in_use_page() also uses min_order_for_split() but it always
> passes 0 as new_order for split.
>
> Fix it by making min_order_for_split() always return an order. When the
> given folio is truncated, namely folio->mapping == NULL, return 0 and let
> a subsequent split function handle the situation and return -EBUSY.
>
> Add kernel-doc to min_order_for_split() to clarify its use.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
* Re: [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting
2025-11-22 2:55 ` [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting Zi Yan
2025-11-23 1:56 ` Wei Yang
@ 2025-11-24 10:45 ` David Hildenbrand (Red Hat)
2025-11-24 17:23 ` Zi Yan
2025-11-24 15:21 ` Lorenzo Stoakes
2 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-24 10:45 UTC (permalink / raw)
To: Zi Yan, Lorenzo Stoakes
Cc: Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
On 11/22/25 03:55, Zi Yan wrote:
> The "return <error code>" statements for error checks at the beginning of
> __folio_split() skip necessary count_vm_event() and count_mthp_stat() at
> the end of the function. Fix these by replacing them with
> "ret = <error code>; goto out;".
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> mm/huge_memory.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ebc3ba0907fd..a42c4f29ce4f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3954,16 +3954,20 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
>
> - if (folio != page_folio(split_at) || folio != page_folio(lock_at))
> - return -EINVAL;
> + if (folio != page_folio(split_at) || folio != page_folio(lock_at)) {
> + ret = -EINVAL;
> + goto out;
> + }
>
> - if (new_order >= old_order)
> - return -EINVAL;
> + if (new_order >= old_order) {
> + ret = -EINVAL;
> + goto out;
> + }
>
> ret = folio_check_splittable(folio, new_order, split_type,
> /* warn = */ true);
> if (ret)
> - return ret;
> + goto out;
>
> if (is_anon) {
> /*
I guess this is not Fixes:/stable material. Wonder if such early (mostly
-EINVAL etc) checks were at some point not intended to be counted.
In any case
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
* Re: [PATCH v2 3/4] mm/huge_memory: make min_order_for_split() always return an order
2025-11-22 2:55 ` [PATCH v2 3/4] mm/huge_memory: make min_order_for_split() always return an order Zi Yan
2025-11-23 1:53 ` Wei Yang
2025-11-24 10:43 ` David Hildenbrand (Red Hat)
@ 2025-11-24 15:18 ` Lorenzo Stoakes
2025-11-24 17:11 ` Zi Yan
2 siblings, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes @ 2025-11-24 15:18 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
On Fri, Nov 21, 2025 at 09:55:28PM -0500, Zi Yan wrote:
> min_order_for_split() returns -EBUSY when the folio is truncated and cannot
> be split. In commit 77008e1b2ef7 ("mm/huge_memory: do not change
> split_huge_page*() target order silently"), memory_failure() does not
> handle it and passes -EBUSY to try_to_split_thp_page() directly.
> try_to_split_thp_page() returns -EINVAL since -EBUSY becomes 0xfffffff0 as
> new_order is unsigned int in __folio_split() and this large new_order is
> rejected as an invalid input. The code does not cause a bug.
Yikes!
This class of bug is all too common... 'unexpectedly returning an error the
caller wasn't prepared for'.
> soft_offline_in_use_page() also uses min_order_for_split() but it always
> passes 0 as new_order for split.
>
> Fix it by making min_order_for_split() always return an order. When the
> given folio is truncated, namely folio->mapping == NULL, return 0 and let
> a subsequent split function handle the situation and return -EBUSY.
OK so we allow the split essentially or rather give a return value that is
essentially 'we don't care' because any attempt at the split will run into
something like:
anon_vma = folio_get_anon_vma(folio);
if (!anon_vma) {
ret = -EBUSY;
goto out;
}
In __folio_split() right?
>
> Add kernel-doc to min_order_for_split() to clarify its use.
Nice.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
LGTM, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/huge_mm.h | 6 +++---
> mm/huge_memory.c | 25 +++++++++++++++++++------
> 2 files changed, 22 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 1ecaeccf39c9..9b3a4e2b0668 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -372,7 +372,7 @@ enum split_type {
> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> unsigned int new_order);
> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
> -int min_order_for_split(struct folio *folio);
> +unsigned int min_order_for_split(struct folio *folio);
> int split_folio_to_list(struct folio *folio, struct list_head *list);
> int folio_check_splittable(struct folio *folio, unsigned int new_order,
> enum split_type split_type, bool warns);
> @@ -634,10 +634,10 @@ static inline int split_huge_page(struct page *page)
> return -EINVAL;
> }
>
> -static inline int min_order_for_split(struct folio *folio)
> +static inline unsigned int min_order_for_split(struct folio *folio)
> {
> VM_WARN_ON_ONCE_FOLIO(1, folio);
> - return -EINVAL;
> + return 0;
> }
>
> static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6c821c1c0ac3..ebc3ba0907fd 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4230,16 +4230,29 @@ int folio_split(struct folio *folio, unsigned int new_order,
> SPLIT_TYPE_NON_UNIFORM);
> }
>
> -int min_order_for_split(struct folio *folio)
> +/**
> + * min_order_for_split() - get the minimum order @folio can be split to
> + * @folio: folio to split
> + *
> + * min_order_for_split() tells the minimum order @folio can be split to.
> + * If a file-backed folio is truncated, 0 will be returned. Any subsequent
> + * split attempt should get -EBUSY from split checking code.
> + *
> + * Return: @folio's minimum order for split
> + */
> +unsigned int min_order_for_split(struct folio *folio)
> {
> if (folio_test_anon(folio))
> return 0;
>
> - if (!folio->mapping) {
> - if (folio_test_pmd_mappable(folio))
> - count_vm_event(THP_SPLIT_PAGE_FAILED);
> - return -EBUSY;
> - }
> + /*
> + * If the folio got truncated, we don't know the previous mapping and
> + * consequently the old min order. But it doesn't matter, as any split
> + * attempt will immediately fail with -EBUSY as the folio cannot get
> + * split until freed.
> + */
Nice to have a comment here to clarify this!
> + if (!folio->mapping)
> + return 0;
>
> return mapping_min_folio_order(folio->mapping);
> }
> --
> 2.51.0
>
Cheers, Lorenzo
* Re: [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting
2025-11-22 2:55 ` [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting Zi Yan
2025-11-23 1:56 ` Wei Yang
2025-11-24 10:45 ` David Hildenbrand (Red Hat)
@ 2025-11-24 15:21 ` Lorenzo Stoakes
2025-11-24 17:29 ` Zi Yan
2 siblings, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes @ 2025-11-24 15:21 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
On Fri, Nov 21, 2025 at 09:55:29PM -0500, Zi Yan wrote:
> The "return <error code>" statements for error checks at the beginning of
> __folio_split() skip necessary count_vm_event() and count_mthp_stat() at
> the end of the function. Fix these by replacing them with
> "ret = <error code>; goto out;".
I guess the xas_destroy() there will be a no-op in these cases!
Good spot, as David said, maybe one for stable then... not sure if necessary for
statistical stuff though?
But at the same time, maybe users will be misled if these are incorrect?
Has this bug been around since the beginning? Be curious to know if that's the
case or if it was introduced somewhere along the line?
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
LGTM, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/huge_memory.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ebc3ba0907fd..a42c4f29ce4f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3954,16 +3954,20 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
>
> - if (folio != page_folio(split_at) || folio != page_folio(lock_at))
> - return -EINVAL;
> + if (folio != page_folio(split_at) || folio != page_folio(lock_at)) {
> + ret = -EINVAL;
> + goto out;
> + }
>
> - if (new_order >= old_order)
> - return -EINVAL;
> + if (new_order >= old_order) {
> + ret = -EINVAL;
> + goto out;
> + }
>
> ret = folio_check_splittable(folio, new_order, split_type,
> /* warn = */ true);
> if (ret)
> - return ret;
> + goto out;
>
> if (is_anon) {
> /*
> --
> 2.51.0
>
Cheers, Lorenzo
* Re: [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable()
2025-11-24 10:33 ` David Hildenbrand (Red Hat)
@ 2025-11-24 16:38 ` Zi Yan
0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-24 16:38 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Barry Song, Lorenzo Stoakes, Andrew Morton, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
On 24 Nov 2025, at 5:33, David Hildenbrand (Red Hat) wrote:
> On 11/23/25 19:38, Barry Song wrote:
>> Hi Zi Yan,
>>
>> Thanks for the nice cleanup.
>>
>> On Sat, Nov 22, 2025 at 10:55 AM Zi Yan <ziy@nvidia.com> wrote:
>>>
>>> folio_split_supported() used in try_folio_split_to_order() requires
>>> folio->mapping to be non NULL, but current try_folio_split_to_order() does
>>> not check it. There is no issue in the current code, since
>>> try_folio_split_to_order() is only used in truncate_inode_partial_folio(),
>>> where folio->mapping is not NULL.
>>>
>>> To prevent future misuse, move folio->mapping NULL check (i.e., folio is
>>> truncated) into folio_split_supported(). Since folio->mapping NULL check
>>> returns -EBUSY and folio_split_supported() == false means -EINVAL, change
>>> folio_split_supported() return type from bool to int and return error
>>> numbers accordingly. Rename folio_split_supported() to
>>> folio_check_splittable() to match the return type change.
>>>
>>> While at it, move is_huge_zero_folio() check and folio_test_writeback()
>>> check into folio_check_splittable() and add kernel-doc.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>> include/linux/huge_mm.h | 10 ++++--
>>> mm/huge_memory.c | 74 +++++++++++++++++++++++++----------------
>>> 2 files changed, 53 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 1d439de1ca2c..97686fb46e30 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -375,8 +375,8 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
>>> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>>> int min_order_for_split(struct folio *folio);
>>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>>> -bool folio_split_supported(struct folio *folio, unsigned int new_order,
>>> - enum split_type split_type, bool warns);
>>> +int folio_check_splittable(struct folio *folio, unsigned int new_order,
>>> + enum split_type split_type, bool warns);
>>
>>
>> It feels a bit odd to have a warns parameter here, especially given that it's
>> a bool. I understand that in one case we're only checking whether a split is
>> possible, without actually performing it. In the other case, we are performing
>> the split, so we must confirm it's valid — otherwise it's a bug.
>>
>> Could we rename split_type to something more like gfp_flags, where we have
>> variants such as __GFP_NOWARN or something similar? That would make the code
>> much more readable.
We do not want to make folio split more complicated, especially since the long
term plan is to move to non-uniform split entirely. And this warn parameter is
solely for CONFIG_READ_ONLY_THP_FOR_FS, since large folios created via it
cannot be split in a non-uniform way.
>
> Could we get rid of the "warns" parameter and simply always do a pr_warn_once()?
The issue with this method is that truncating a large folio created via
CONFIG_READ_ONLY_THP_FOR_FS triggers an undesirable warning.
>
> As an alternative, simply move the warning to the single caller
>
> VM_WARN_ONCE(ret == -EINVAL, "Tried to split an unsplittable folio");
It sounds good to me. At most, it requires whoever triggers the warning to
add some code to find the actual violation.
I will do this in the next version. All VM_WARN_ONCE() and pr_warn_ratelimited()
in folio_check_splittable() will be removed and __folio_split() will
do it when ret is -EINVAL.
Best Regards,
Yan, Zi
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-24 10:41 ` David Hildenbrand (Red Hat)
@ 2025-11-24 17:05 ` Zi Yan
2025-11-24 19:22 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2025-11-24 17:05 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Lorenzo Stoakes, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
On 24 Nov 2025, at 5:41, David Hildenbrand (Red Hat) wrote:
> On 11/22/25 03:55, Zi Yan wrote:
>> can_split_folio() is just a refcount comparison, making sure only the
>> split caller holds an extra pin. Open code it with
>> folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
>> used by folio_ref_freeze(), add folio_cache_references() to calculate it.
>>
>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>> include/linux/huge_mm.h | 1 -
>> mm/huge_memory.c | 43 ++++++++++++++++-------------------------
>> mm/vmscan.c | 3 ++-
>> 3 files changed, 19 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 97686fb46e30..1ecaeccf39c9 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -369,7 +369,6 @@ enum split_type {
>> SPLIT_TYPE_NON_UNIFORM,
>> };
>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> unsigned int new_order);
>> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c1f1055165dd..6c821c1c0ac3 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3455,23 +3455,6 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>> }
>> }
>> -/* Racy check whether the huge page can be split */
>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>> -{
>> - int extra_pins;
>> -
>> - /* Additional pins from page cache */
>> - if (folio_test_anon(folio))
>> - extra_pins = folio_test_swapcache(folio) ?
>> - folio_nr_pages(folio) : 0;
>> - else
>> - extra_pins = folio_nr_pages(folio);
>> - if (pextra_pins)
>> - *pextra_pins = extra_pins;
>> - return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
>> - caller_pins;
>> -}
>> -
>> static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>> {
>> for (; nr_pages; page++, nr_pages--)
>> @@ -3776,17 +3759,26 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
>> return 0;
>> }
>> +/* Number of folio references from the pagecache or the swapcache. */
>> +static unsigned int folio_cache_references(const struct folio *folio)
>> +{
>> + if (folio_test_anon(folio) && !folio_test_swapcache(folio))
>> + return 0;
>> + return folio_nr_pages(folio);
>> +}
>> +
>> static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int new_order,
>> struct page *split_at, struct xa_state *xas,
>> struct address_space *mapping, bool do_lru,
>> struct list_head *list, enum split_type split_type,
>> - pgoff_t end, int *nr_shmem_dropped, int extra_pins)
>> + pgoff_t end, int *nr_shmem_dropped)
>> {
>> struct folio *end_folio = folio_next(folio);
>> struct folio *new_folio, *next;
>> int old_order = folio_order(folio);
>> int ret = 0;
>> struct deferred_split *ds_queue;
>> + int extra_pins = folio_cache_references(folio);
>
> Can we just inline the call do folio_cache_references() and get rid of extra_pins.
> (which is a bad name either way)
>
>
> if (folio_ref_freeze(folio, folio_cache_references(folio) + 1) {
>
>
> BTW, now that we have this helper, I wonder if we should then also do for
> clarification on the unfreeze path:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0acdc2f26ee0c..7cbcf61b7971d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3824,8 +3824,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
> zone_device_private_split_cb(folio, new_folio);
> - expected_refs = folio_expected_ref_count(new_folio) + 1;
> - folio_ref_unfreeze(new_folio, expected_refs);
> + folio_ref_unfreeze(new_folio, folio_cache_references(new_folio) + 1);
> if (do_lru)
> lru_add_split_folio(folio, new_folio, lruvec, list);
> @@ -3868,8 +3867,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
> * Otherwise, a parallel folio_try_get() can grab @folio
> * and its caller can see stale page cache entries.
> */
> - expected_refs = folio_expected_ref_count(folio) + 1;
> - folio_ref_unfreeze(folio, expected_refs);
> + folio_ref_unfreeze(folio, folio_cache_references(folio) + 1);
> if (do_lru)
> unlock_page_lruvec(lruvec);
>
>
Both make sense to me. Will make the change.
By comparing folio_cache_references() with folio_expected_ref_count(),
one difference is that folio_expected_ref_count() does not give the right
refcount for shmem folios in the swapcache.
This is the folio_expected_ref_count() code:
	if (folio_test_anon(folio)) {
		/* One reference per page from the swapcache. */
		ref_count += folio_test_swapcache(folio) << order;
	} else {
		/* One reference per page from the pagecache. */
		ref_count += !!folio->mapping << order;
		/* One reference from PG_private. */
		ref_count += folio_test_private(folio);
	}
Shmem in the swapcache means !folio_test_anon(folio) && folio_test_swapcache(folio).
The above code gives 0, but folio_cache_references() gives folio_nr_pages(folio).
It should not cause any issue, since IIUC shmem in the swapcache only happens
while the folio holds an additional ref, so
folio_expected_ref_count() != folio_ref_count() anyway. Splitting such folios is
not supported yet, so folio_expected_ref_count() in the split code does not
affect shmem in the swapcache. But folio_expected_ref_count() should still be
fixed, right?
Like:
	if (folio_test_anon(folio)) {
		/* One reference per page from the swapcache. */
		ref_count += folio_test_swapcache(folio) << order;
	} else {
		/* One reference per page from shmem in the swapcache. */
		ref_count += folio_test_swapcache(folio) << order;
		/* One reference per page from the pagecache. */
		ref_count += !!folio->mapping << order;
		/* One reference from PG_private. */
		ref_count += folio_test_private(folio);
	}
or simplified into
	if (!folio_test_anon(folio)) {
		/* One reference per page from the pagecache. */
		ref_count += !!folio->mapping << order;
		/* One reference from PG_private. */
		ref_count += folio_test_private(folio);
	}
	/* One reference per page from the swapcache (anon or shmem). */
	ref_count += folio_test_swapcache(folio) << order;
?
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 3/4] mm/huge_memory: make min_order_for_split() always return an order
2025-11-24 15:18 ` Lorenzo Stoakes
@ 2025-11-24 17:11 ` Zi Yan
0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-24 17:11 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
On 24 Nov 2025, at 10:18, Lorenzo Stoakes wrote:
> On Fri, Nov 21, 2025 at 09:55:28PM -0500, Zi Yan wrote:
>> min_order_for_split() returns -EBUSY when the folio is truncated and cannot
>> be split. In commit 77008e1b2ef7 ("mm/huge_memory: do not change
>> split_huge_page*() target order silently"), memory_failure() does not
>> handle it and pass -EBUSY to try_to_split_thp_page() directly.
>> try_to_split_thp_page() returns -EINVAL since -EBUSY becomes 0xfffffff0 as
>> new_order is unsigned int in __folio_split() and this large new_order is
>> rejected as an invalid input. The code does not cause a bug.
>
> Yikes!
>
> This class of bug is all too common... 'unexpectedly returning an error the
> caller wasn't prepared for'.
>
>> soft_offline_in_use_page() also uses min_order_for_split() but it always
>> passes 0 as new_order for split.
>>
>> Fix it by making min_order_for_split() always return an order. When the
>> given folio is truncated, namely folio->mapping == NULL, return 0 and let
>> a subsequent split function handle the situation and return -EBUSY.
>
> OK so we allow the split essentially or rather give a return value that is
> essentially 'we don't care' because any attempt at the split will run into
> something like:
>
> anon_vma = folio_get_anon_vma(folio);
> if (!anon_vma) {
> ret = -EBUSY;
> goto out;
> }
>
> In __folio_split() right?
Not that check for the issue I mentioned above, since it only applies to anon
folios and min_order_for_split() returns 0 for all anon folios. anon_vma == NULL
does not mean folio->mapping == NULL, since folio->mapping still has
FOLIO_MAPPING_ANON set. The fun never ends, right? :)
The above issue is handled by
	/*
	 * Folios that just got truncated cannot get split. Signal to the
	 * caller that there was a race.
	 *
	 * TODO: this will also currently refuse shmem folios that are in the
	 * swapcache.
	 */
	if (!folio_test_anon(folio) && !folio->mapping)
		return -EBUSY;
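To illustrate the anon_vma point above: the anon marker lives in the low bits
of folio->mapping, so the pointer stays non-NULL for anon folios even when
folio_get_anon_vma() fails. A simplified sketch (the helper name is made up
here; this is not the exact mainline folio_test_anon() source):

	/*
	 * Hypothetical illustration, not kernel source: an anon folio stores
	 * its anon_vma pointer in folio->mapping with FOLIO_MAPPING_ANON set,
	 * so folio->mapping != NULL even for a fully unmapped anon folio.
	 * Only a truncated pagecache folio really has folio->mapping == NULL.
	 */
	static inline bool folio_mapping_has_anon_bit(const struct folio *folio)
	{
		return ((unsigned long)folio->mapping & FOLIO_MAPPING_ANON) != 0;
	}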
>
>>
>> Add kernel-doc to min_order_for_split() to clarify its use.
>
> Nice.
>
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> LGTM, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks.
>
>> ---
>> include/linux/huge_mm.h | 6 +++---
>> mm/huge_memory.c | 25 +++++++++++++++++++------
>> 2 files changed, 22 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 1ecaeccf39c9..9b3a4e2b0668 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -372,7 +372,7 @@ enum split_type {
>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> unsigned int new_order);
>> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>> -int min_order_for_split(struct folio *folio);
>> +unsigned int min_order_for_split(struct folio *folio);
>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>> int folio_check_splittable(struct folio *folio, unsigned int new_order,
>> enum split_type split_type, bool warns);
>> @@ -634,10 +634,10 @@ static inline int split_huge_page(struct page *page)
>> return -EINVAL;
>> }
>>
>> -static inline int min_order_for_split(struct folio *folio)
>> +static inline unsigned int min_order_for_split(struct folio *folio)
>> {
>> VM_WARN_ON_ONCE_FOLIO(1, folio);
>> - return -EINVAL;
>> + return 0;
>> }
>>
>> static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 6c821c1c0ac3..ebc3ba0907fd 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4230,16 +4230,29 @@ int folio_split(struct folio *folio, unsigned int new_order,
>> SPLIT_TYPE_NON_UNIFORM);
>> }
>>
>> -int min_order_for_split(struct folio *folio)
>> +/**
>> + * min_order_for_split() - get the minimum order @folio can be split to
>> + * @folio: folio to split
>> + *
>> + * min_order_for_split() tells the minimum order @folio can be split to.
>> + * If a file-backed folio is truncated, 0 will be returned. Any subsequent
>> + * split attempt should get -EBUSY from split checking code.
>> + *
>> + * Return: @folio's minimum order for split
>> + */
>> +unsigned int min_order_for_split(struct folio *folio)
>> {
>> if (folio_test_anon(folio))
>> return 0;
>>
>> - if (!folio->mapping) {
>> - if (folio_test_pmd_mappable(folio))
>> - count_vm_event(THP_SPLIT_PAGE_FAILED);
>> - return -EBUSY;
>> - }
>> + /*
>> + * If the folio got truncated, we don't know the previous mapping and
>> + * consequently the old min order. But it doesn't matter, as any split
>> + * attempt will immediately fail with -EBUSY as the folio cannot get
>> + * split until freed.
>> + */
>
> Nice to have a comment here to clarify this!
>
>> + if (!folio->mapping)
>> + return 0;
>>
>> return mapping_min_folio_order(folio->mapping);
>> }
>> --
>> 2.51.0
>>
>
> Cheers, Lorenzo
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting
2025-11-24 10:45 ` David Hildenbrand (Red Hat)
@ 2025-11-24 17:23 ` Zi Yan
0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-24 17:23 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Lorenzo Stoakes, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
On 24 Nov 2025, at 5:45, David Hildenbrand (Red Hat) wrote:
> On 11/22/25 03:55, Zi Yan wrote:
>> The "return <error code>" statements for error checks at the beginning of
>> __folio_split() skip necessary count_vm_event() and count_mthp_stat() at
>> the end of the function. Fix these by replacing them with
>> "ret = <error code>; goto out;".
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>> mm/huge_memory.c | 14 +++++++++-----
>> 1 file changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index ebc3ba0907fd..a42c4f29ce4f 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3954,16 +3954,20 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
>> VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
>> - if (folio != page_folio(split_at) || folio != page_folio(lock_at))
>> - return -EINVAL;
>> + if (folio != page_folio(split_at) || folio != page_folio(lock_at)) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> - if (new_order >= old_order)
>> - return -EINVAL;
>> + if (new_order >= old_order) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> ret = folio_check_splittable(folio, new_order, split_type,
>> /* warn = */ true);
>> if (ret)
>> - return ret;
>> + goto out;
>> if (is_anon) {
>> /*
>
> I guess this is not Fixes:/stable material. Wonder if such early (mostly -EINVAL etc) checks were at some point not intended to be counted.
I do not think it is worth Fixes/stable, since most of these checks should be
caught during development and never trigger, except for the
folio_test_writeback(folio) one. And no one has complained so far.
The inconsistency started with commit 59807685a7e7 ("mm, THP, swap: support
splitting THP for THP swap out"), where "if (PageWriteback(page)) return -EBUSY;"
was added. Then commit 478d134e9506 ("mm/huge_memory: do not overkill when
splitting huge_zero_page") followed, and so on.
This patch is intended to make the code consistent.
>
> In any case
>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Thanks.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting
2025-11-24 15:21 ` Lorenzo Stoakes
@ 2025-11-24 17:29 ` Zi Yan
0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-24 17:29 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
On 24 Nov 2025, at 10:21, Lorenzo Stoakes wrote:
> On Fri, Nov 21, 2025 at 09:55:29PM -0500, Zi Yan wrote:
>> The "return <error code>" statements for error checks at the beginning of
>> __folio_split() skip necessary count_vm_event() and count_mthp_stat() at
>> the end of the function. Fix these by replacing them with
>> "ret = <error code>; goto out;".
>
> I guess the xas_destroy() there will be a no-op in these cases!
Right. And there is no memory leak, since xas_split_alloc() does the
memory allocation and none of the code after it returns directly, so
xas_destroy() gets to do its job.
>
> Good spot, as David said, maybe one for stable then... not sure if necessary for
> statistical stuff though?
>
> But at the same time, maybe users will be misled if these are incorrect?
>
> Has this bug been around since the beginning? Be curious to know if that's the
> case or if it was introduced somewhere along the line?
It started with commit 59807685a7e7 ("mm, THP, swap: support
splitting THP for THP swap out") back in 2017, and more inconsistent code
was added later.
Unless someone relies on split stats heavily, I am not sure we need to backport
it.
>
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> LGTM, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks.
>
>> ---
>> mm/huge_memory.c | 14 +++++++++-----
>> 1 file changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index ebc3ba0907fd..a42c4f29ce4f 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3954,16 +3954,20 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
>> VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
>>
>> - if (folio != page_folio(split_at) || folio != page_folio(lock_at))
>> - return -EINVAL;
>> + if (folio != page_folio(split_at) || folio != page_folio(lock_at)) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>>
>> - if (new_order >= old_order)
>> - return -EINVAL;
>> + if (new_order >= old_order) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>>
>> ret = folio_check_splittable(folio, new_order, split_type,
>> /* warn = */ true);
>> if (ret)
>> - return ret;
>> + goto out;
>>
>> if (is_anon) {
>> /*
>> --
>> 2.51.0
>>
>
> Cheers, Lorenzo
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-24 17:05 ` Zi Yan
@ 2025-11-24 19:22 ` David Hildenbrand (Red Hat)
2025-11-24 21:08 ` Zi Yan
0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-24 19:22 UTC (permalink / raw)
To: Zi Yan
Cc: Lorenzo Stoakes, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
On 11/24/25 18:05, Zi Yan wrote:
> On 24 Nov 2025, at 5:41, David Hildenbrand (Red Hat) wrote:
>
>> On 11/22/25 03:55, Zi Yan wrote:
>>> can_split_folio() is just a refcount comparison, making sure only the
>>> split caller holds an extra pin. Open code it with
>>> folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
>>> used by folio_ref_freeze(), add folio_cache_references() to calculate it.
>>>
>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>> include/linux/huge_mm.h | 1 -
>>> mm/huge_memory.c | 43 ++++++++++++++++-------------------------
>>> mm/vmscan.c | 3 ++-
>>> 3 files changed, 19 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 97686fb46e30..1ecaeccf39c9 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -369,7 +369,6 @@ enum split_type {
>>> SPLIT_TYPE_NON_UNIFORM,
>>> };
>>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> unsigned int new_order);
>>> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c1f1055165dd..6c821c1c0ac3 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3455,23 +3455,6 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>>> }
>>> }
>>> -/* Racy check whether the huge page can be split */
>>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>> -{
>>> - int extra_pins;
>>> -
>>> - /* Additional pins from page cache */
>>> - if (folio_test_anon(folio))
>>> - extra_pins = folio_test_swapcache(folio) ?
>>> - folio_nr_pages(folio) : 0;
>>> - else
>>> - extra_pins = folio_nr_pages(folio);
>>> - if (pextra_pins)
>>> - *pextra_pins = extra_pins;
>>> - return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
>>> - caller_pins;
>>> -}
>>> -
>>> static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>>> {
>>> for (; nr_pages; page++, nr_pages--)
>>> @@ -3776,17 +3759,26 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
>>> return 0;
>>> }
>>> +/* Number of folio references from the pagecache or the swapcache. */
>>> +static unsigned int folio_cache_references(const struct folio *folio)
>>> +{
>>> + if (folio_test_anon(folio) && !folio_test_swapcache(folio))
>>> + return 0;
>>> + return folio_nr_pages(folio);
>>> +}
>>> +
>>> static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int new_order,
>>> struct page *split_at, struct xa_state *xas,
>>> struct address_space *mapping, bool do_lru,
>>> struct list_head *list, enum split_type split_type,
>>> - pgoff_t end, int *nr_shmem_dropped, int extra_pins)
>>> + pgoff_t end, int *nr_shmem_dropped)
>>> {
>>> struct folio *end_folio = folio_next(folio);
>>> struct folio *new_folio, *next;
>>> int old_order = folio_order(folio);
>>> int ret = 0;
>>> struct deferred_split *ds_queue;
>>> + int extra_pins = folio_cache_references(folio);
>>
>> Can we just inline the call do folio_cache_references() and get rid of extra_pins.
>> (which is a bad name either way)
>>
>>
>> if (folio_ref_freeze(folio, folio_cache_references(folio) + 1) {
>>
>>
>> BTW, now that we have this helper, I wonder if we should then also do for
>> clarification on the unfreeze path:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 0acdc2f26ee0c..7cbcf61b7971d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3824,8 +3824,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>> zone_device_private_split_cb(folio, new_folio);
>> - expected_refs = folio_expected_ref_count(new_folio) + 1;
>> - folio_ref_unfreeze(new_folio, expected_refs);
>> + folio_ref_unfreeze(new_folio, folio_cache_references(new_folio) + 1);
>> if (do_lru)
>> lru_add_split_folio(folio, new_folio, lruvec, list);
>> @@ -3868,8 +3867,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>> * Otherwise, a parallel folio_try_get() can grab @folio
>> * and its caller can see stale page cache entries.
>> */
>> - expected_refs = folio_expected_ref_count(folio) + 1;
>> - folio_ref_unfreeze(folio, expected_refs);
>> + folio_ref_unfreeze(folio, folio_cache_references(folio) + 1);
>> if (do_lru)
>> unlock_page_lruvec(lruvec);
>>
>>
>
> Both make sense to me. Will make the change.
>
> By comparing folio_cache_references() with folio_expected_ref_count(),
> one difference is that folio_expected_ref_count() does not give right
> refcount for shmem in swapcache.
Good point. Likely nobody runs into that right now because nobody can
really do anything with these folios before they are re-added to the
pagecache or mapped into page tables.
>
> This is the folio_expected_ref_count() code:
>
> if (folio_test_anon(folio)) {
> /* One reference per page from the swapcache. */
> ref_count += folio_test_swapcache(folio) << order;
> } else {
> /* One reference per page from the pagecache. */
> ref_count += !!folio->mapping << order;
> /* One reference from PG_private. */
> ref_count += folio_test_private(folio);
> }
>
> shmem in swapcache mean !folio_test_anon(folio) && folio_test_swapcache(folio).
See below, it's actually
folio_test_anon(folio) && folio_test_swapbacked(folio) &&
folio_test_swapcache(folio)
I think ...
> The above code gives 0, but folio_cache_references() gives folio_nr_pages(folio).
> It should not cause any issue, since IIUC shmem in swapcache happens
> when the folio has an additional ref,
> folio_expected_ref_count() != folio_ref_count() anyway. For split, it is
> not supported yet,
Right.
> so folio_expected_ref_count() in split code does not
> affect shmem in swapcache. But folio_expected_ref_count() should be
> fixed, right?
We should handle it better, agreed.
Staring at the history of folio_expected_ref_count() once again: back
when we had folio_expected_refs() in the migration code, we didn't seem to
handle it either, I think.
-static int folio_expected_refs(struct address_space *mapping,
-		struct folio *folio)
-{
-	int refs = 1;
-	if (!mapping)
-		return refs;
-
-	refs += folio_nr_pages(folio);
-	if (folio_test_private(folio))
-		refs++;
-
-	return refs;
-}
gup.c doesn't care, because the pages are still mapped.
khugepaged.c similarly.
memfd.c doesn't care because the pages are still in the pagecache.
So I suspect nothing is broken, but the migration case needs a second look.
>
> Like:
>
> if (folio_test_anon(folio)) {
> /* One reference per page from the swapcache. */
> ref_count += folio_test_swapcache(folio) << order;
> } else {
> /* One reference per page from shmem in the swapcache. */
> ref_count += folio_test_swapcache(folio) << order;
> /* One reference per page from the pagecache. */
> ref_count += !!folio->mapping << order;
> /* One reference from PG_private. */
> ref_count += folio_test_private(folio);
> }
>
> or simplified into
>
> if (!folio_test_anon(folio)) {
> /* One reference per page from the pagecache. */
> ref_count += !!folio->mapping << order;
> /* One reference from PG_private. */
> ref_count += folio_test_private(folio);
> }
> /* One reference per page from the swapcache (anon or shmem). */
> ref_count += folio_test_swapcache(folio) << order;
> ?
That is incorrect I think, due to the swapcache test being able to give false
positives (PG_swapcache aliases PG_owner_priv_1).
--
Cheers
David
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-24 19:22 ` David Hildenbrand (Red Hat)
@ 2025-11-24 21:08 ` Zi Yan
2025-11-25 8:52 ` David Hildenbrand (Red Hat)
2025-11-25 9:10 ` Miaohe Lin
0 siblings, 2 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-24 21:08 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Lorenzo Stoakes, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
On 24 Nov 2025, at 14:22, David Hildenbrand (Red Hat) wrote:
> On 11/24/25 18:05, Zi Yan wrote:
>> On 24 Nov 2025, at 5:41, David Hildenbrand (Red Hat) wrote:
>>
>>> On 11/22/25 03:55, Zi Yan wrote:
>>>> can_split_folio() is just a refcount comparison, making sure only the
>>>> split caller holds an extra pin. Open code it with
>>>> folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
>>>> used by folio_ref_freeze(), add folio_cache_references() to calculate it.
>>>>
>>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>> ---
>>>> include/linux/huge_mm.h | 1 -
>>>> mm/huge_memory.c | 43 ++++++++++++++++-------------------------
>>>> mm/vmscan.c | 3 ++-
>>>> 3 files changed, 19 insertions(+), 28 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 97686fb46e30..1ecaeccf39c9 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -369,7 +369,6 @@ enum split_type {
>>>> SPLIT_TYPE_NON_UNIFORM,
>>>> };
>>>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> unsigned int new_order);
>>>> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index c1f1055165dd..6c821c1c0ac3 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3455,23 +3455,6 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>>>> }
>>>> }
>>>> -/* Racy check whether the huge page can be split */
>>>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>>> -{
>>>> - int extra_pins;
>>>> -
>>>> - /* Additional pins from page cache */
>>>> - if (folio_test_anon(folio))
>>>> - extra_pins = folio_test_swapcache(folio) ?
>>>> - folio_nr_pages(folio) : 0;
>>>> - else
>>>> - extra_pins = folio_nr_pages(folio);
>>>> - if (pextra_pins)
>>>> - *pextra_pins = extra_pins;
>>>> - return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
>>>> - caller_pins;
>>>> -}
>>>> -
>>>> static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>>>> {
>>>> for (; nr_pages; page++, nr_pages--)
>>>> @@ -3776,17 +3759,26 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
>>>> return 0;
>>>> }
>>>> +/* Number of folio references from the pagecache or the swapcache. */
>>>> +static unsigned int folio_cache_references(const struct folio *folio)
>>>> +{
>>>> + if (folio_test_anon(folio) && !folio_test_swapcache(folio))
>>>> + return 0;
>>>> + return folio_nr_pages(folio);
>>>> +}
>>>> +
>>>> static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int new_order,
>>>> struct page *split_at, struct xa_state *xas,
>>>> struct address_space *mapping, bool do_lru,
>>>> struct list_head *list, enum split_type split_type,
>>>> - pgoff_t end, int *nr_shmem_dropped, int extra_pins)
>>>> + pgoff_t end, int *nr_shmem_dropped)
>>>> {
>>>> struct folio *end_folio = folio_next(folio);
>>>> struct folio *new_folio, *next;
>>>> int old_order = folio_order(folio);
>>>> int ret = 0;
>>>> struct deferred_split *ds_queue;
>>>> + int extra_pins = folio_cache_references(folio);
>>>
>>> Can we just inline the call do folio_cache_references() and get rid of extra_pins.
>>> (which is a bad name either way)
>>>
>>>
>>> if (folio_ref_freeze(folio, folio_cache_references(folio) + 1) {
>>>
>>>
>>> BTW, now that we have this helper, I wonder if we should then also do for
>>> clarification on the unfreeze path:
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 0acdc2f26ee0c..7cbcf61b7971d 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3824,8 +3824,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>>> zone_device_private_split_cb(folio, new_folio);
>>> - expected_refs = folio_expected_ref_count(new_folio) + 1;
>>> - folio_ref_unfreeze(new_folio, expected_refs);
>>> + folio_ref_unfreeze(new_folio, folio_cache_references(new_folio) + 1);
>>> if (do_lru)
>>> lru_add_split_folio(folio, new_folio, lruvec, list);
>>> @@ -3868,8 +3867,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>>> * Otherwise, a parallel folio_try_get() can grab @folio
>>> * and its caller can see stale page cache entries.
>>> */
>>> - expected_refs = folio_expected_ref_count(folio) + 1;
>>> - folio_ref_unfreeze(folio, expected_refs);
>>> + folio_ref_unfreeze(folio, folio_cache_references(folio) + 1);
>>> if (do_lru)
>>> unlock_page_lruvec(lruvec);
>>>
>>>
>>
>> Both make sense to me. Will make the change.
>>
>> By comparing folio_cache_references() with folio_expected_ref_count(),
>> one difference is that folio_expected_ref_count() does not give right
>> refcount for shmem in swapcache.
>
> Good point. Likely nobody runs into that right now because nobody can really does anything with these folios before they were re-added to the pagecache or mapped into page tables.
>
>>
>> This is the folio_expected_ref_count() code:
>>
>> if (folio_test_anon(folio)) {
>> /* One reference per page from the swapcache. */
>> ref_count += folio_test_swapcache(folio) << order;
>> } else {
>> /* One reference per page from the pagecache. */
>> ref_count += !!folio->mapping << order;
>> /* One reference from PG_private. */
>> ref_count += folio_test_private(folio);
>> }
>>
>> shmem in swapcache mean !folio_test_anon(folio) && folio_test_swapcache(folio).
>
> See below, it's actually
>
> folio_test_anon(folio) && folio_test_swapbacked(folio)&& folio_test_swapcache(folio)
!folio_test_anon(folio) && folio_test_swapbacked(folio) &&
folio_test_swapcache(folio)
Right?
>
> I think ...
>
>> The above code gives 0, but folio_cache_references() gives folio_nr_pages(folio).
>> It should not cause any issue, since IIUC shmem in swapcache happens
>> when the folio has an additional ref,
>> folio_expected_ref_count() != folio_ref_count() anyway. For split, it is
>> not supported yet,
>
> Right.
>
>> so folio_expected_ref_count() in split code does not
>> affect shmem in swapcache. But folio_expected_ref_count() should be
>> fixed, right?
>
> We should better handle it, agreed.
>
> Staring at the history of folio_expected_ref_count() once again, back when we had folio_expected_refs() in migration code we didn't seem to handle it I think.
>
> -static int folio_expected_refs(struct address_space *mapping,
> - struct folio *folio)
> -{
> - int refs = 1;
> - if (!mapping)
> - return refs;
> -
> - refs += folio_nr_pages(folio);
> - if (folio_test_private(folio))
> - refs++;
> -
> - return refs;
> -}
>
>
> gup.c doesn't care, because the pages are still mapped.
>
> khugepaged.c similarly.
>
> memfd.c doesn't care because the pages are still in the pagecache.
>
> So I suspect nothing is broken, but the migration case needs a second look.
For migration, shmem in the swapcache happens in shmem_writeout(), where an
additional ref is placed on the folio. And the migration caller places
a ref on the folio before migrating it. The folio has 2 refs, which is not
equal to folio_expected_ref_count() (returning 0) + 1,
nor to folio_expected_refs() (returning 1).
So it is safe.
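To spell out the guard being relied on here, a paraphrased sketch of the
migration-side check (not a verbatim quote of __folio_migrate_mapping();
expected_count stands for the expected reference total on that path):

	/*
	 * Freezing only succeeds when the refcount matches the expected value,
	 * so an undercounted expectation for a shmem folio still in the
	 * swapcache just makes migration bail out with -EAGAIN.
	 */
	if (!folio_ref_freeze(folio, expected_count))
		return -EAGAIN;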
>
>>
>> Like:
>>
>> if (folio_test_anon(folio)) {
>> /* One reference per page from the swapcache. */
>> ref_count += folio_test_swapcache(folio) << order;
>> } else {
>> /* One reference per page from shmem in the swapcache. */
>> ref_count += folio_test_swapcache(folio) << order;
>> /* One reference per page from the pagecache. */
>> ref_count += !!folio->mapping << order;
>> /* One reference from PG_private. */
>> ref_count += folio_test_private(folio);
>> }
>>
>> or simplified into
>>
>> if (!folio_test_anon(folio)) {
>> /* One reference per page from the pagecache. */
>> ref_count += !!folio->mapping << order;
>> /* One reference from PG_private. */
>> ref_count += folio_test_private(folio);
>> }
>> /* One reference per page from the swapcache (anon or shmem). */
>> ref_count += folio_test_swapcache(folio) << order;
>> ?
>
> That is incorrect I think due to swapcache being able to give false positives (PG_owner_priv_1).
Got it. So it should be:
	if (folio_test_anon(folio)) {
		/* One reference per page from the swapcache. */
		ref_count += folio_test_swapcache(folio) << order;
	} else {
		/* One reference per page from shmem in the swapcache. */
		ref_count += (folio_test_swapbacked(folio) &&
			      folio_test_swapcache(folio)) << order;
		/* One reference per page from the pagecache. */
		ref_count += !!folio->mapping << order;
		/* One reference from PG_private. */
		ref_count += folio_test_private(folio);
	}
I wonder if we should have folio_test_shmem_in_swapcache() instead.
BTW, this page flag reuse is really confusing. I see PG_checked is
PG_owner_priv_1 too, and __folio_migrate_mapping() uses folio_test_swapcache()
to decide the number of i_pages entries. Wouldn't that cause an issue?
ext4 does not do release_folio() for migration when PG_checked is set, and
ubifs clears PG_checked in release_folio(). I have not checked all the other
filesystems yet. Maybe later.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-22 2:55 ` [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation Zi Yan
2025-11-23 1:51 ` Wei Yang
2025-11-24 10:41 ` David Hildenbrand (Red Hat)
@ 2025-11-24 22:14 ` Balbir Singh
2025-11-25 8:55 ` David Hildenbrand (Red Hat)
2 siblings, 1 reply; 32+ messages in thread
From: Balbir Singh @ 2025-11-24 22:14 UTC (permalink / raw)
To: Zi Yan, David Hildenbrand, Lorenzo Stoakes
Cc: Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, linux-mm, linux-kernel
On 11/22/25 13:55, Zi Yan wrote:
> can_split_folio() is just a refcount comparison, making sure only the
> split caller holds an extra pin. Open code it with
> folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
> used by folio_ref_freeze(), add folio_cache_references() to calculate it.
>
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> include/linux/huge_mm.h | 1 -
> mm/huge_memory.c | 43 ++++++++++++++++-------------------------
> mm/vmscan.c | 3 ++-
> 3 files changed, 19 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 97686fb46e30..1ecaeccf39c9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -369,7 +369,6 @@ enum split_type {
> SPLIT_TYPE_NON_UNIFORM,
> };
>
> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> unsigned int new_order);
> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c1f1055165dd..6c821c1c0ac3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3455,23 +3455,6 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
> }
> }
>
> -/* Racy check whether the huge page can be split */
> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
> -{
> - int extra_pins;
> -
> - /* Additional pins from page cache */
> - if (folio_test_anon(folio))
> - extra_pins = folio_test_swapcache(folio) ?
> - folio_nr_pages(folio) : 0;
> - else
> - extra_pins = folio_nr_pages(folio);
> - if (pextra_pins)
> - *pextra_pins = extra_pins;
> - return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
> - caller_pins;
> -}
> -
> static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
> {
> for (; nr_pages; page++, nr_pages--)
> @@ -3776,17 +3759,26 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
> return 0;
> }
>
> +/* Number of folio references from the pagecache or the swapcache. */
> +static unsigned int folio_cache_references(const struct folio *folio)
folio_cache_ref_count?
> +{
> + if (folio_test_anon(folio) && !folio_test_swapcache(folio))
> + return 0;
> + return folio_nr_pages(folio);
> +}
> +
Does this belong to include/linux/mm.h with the other helpers?
> static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct xa_state *xas,
> struct address_space *mapping, bool do_lru,
> struct list_head *list, enum split_type split_type,
> - pgoff_t end, int *nr_shmem_dropped, int extra_pins)
> + pgoff_t end, int *nr_shmem_dropped)
> {
> struct folio *end_folio = folio_next(folio);
> struct folio *new_folio, *next;
> int old_order = folio_order(folio);
> int ret = 0;
> struct deferred_split *ds_queue;
> + int extra_pins = folio_cache_references(folio);
>
> VM_WARN_ON_ONCE(!mapping && end);
> /* Prevent deferred_split_scan() touching ->_refcount */
> @@ -3956,7 +3948,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> struct folio *new_folio, *next;
> int nr_shmem_dropped = 0;
> int remap_flags = 0;
> - int extra_pins, ret;
> + int ret;
> pgoff_t end = 0;
>
> VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> @@ -4036,7 +4028,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> * Racy check if we can split the page, before unmap_folio() will
> * split PMDs
> */
> - if (!can_split_folio(folio, 1, &extra_pins)) {
> + if (folio_expected_ref_count(folio) != folio_ref_count(folio) - 1) {
> ret = -EAGAIN;
> goto out_unlock;
> }
> @@ -4059,8 +4051,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> }
>
> ret = __folio_freeze_and_split_unmapped(folio, new_order, split_at, &xas, mapping,
> - true, list, split_type, end, &nr_shmem_dropped,
> - extra_pins);
> + true, list, split_type, end, &nr_shmem_dropped);
> fail:
> if (mapping)
> xas_unlock(&xas);
> @@ -4134,20 +4125,20 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> */
> int folio_split_unmapped(struct folio *folio, unsigned int new_order)
> {
> - int extra_pins, ret = 0;
> + int ret = 0;
>
> VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
> VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
> VM_WARN_ON_ONCE_FOLIO(!folio_test_anon(folio), folio);
>
> - if (!can_split_folio(folio, 1, &extra_pins))
> + if (folio_expected_ref_count(folio) != folio_ref_count(folio) - 1)
> return -EAGAIN;
>
> local_irq_disable();
> ret = __folio_freeze_and_split_unmapped(folio, new_order, &folio->page, NULL,
> NULL, false, NULL, SPLIT_TYPE_UNIFORM,
> - 0, NULL, extra_pins);
> + 0, NULL);
> local_irq_enable();
> return ret;
> }
> @@ -4640,7 +4631,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
> * can be split or not. So skip the check here.
> */
> if (!folio_test_private(folio) &&
> - !can_split_folio(folio, 0, NULL))
> + folio_expected_ref_count(folio) != folio_ref_count(folio))
> goto next;
>
> if (!folio_trylock(folio))
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 92980b072121..3b85652a42b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1284,7 +1284,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> goto keep_locked;
> if (folio_test_large(folio)) {
> /* cannot split folio, skip it */
> - if (!can_split_folio(folio, 1, NULL))
> + if (folio_expected_ref_count(folio) !=
> + folio_ref_count(folio) - 1)
> goto activate_locked;
> /*
> * Split partially mapped folios right away.
Otherwise, LGTM
Acked-by: Balbir Singh <balbirs@nvidia.com>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-24 21:08 ` Zi Yan
@ 2025-11-25 8:52 ` David Hildenbrand (Red Hat)
2025-11-25 15:55 ` Zi Yan
2025-11-25 9:10 ` Miaohe Lin
1 sibling, 1 reply; 32+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-25 8:52 UTC (permalink / raw)
To: Zi Yan
Cc: Lorenzo Stoakes, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
>>
>>>
>>> Like:
>>>
>>> if (folio_test_anon(folio)) {
>>> /* One reference per page from the swapcache. */
>>> ref_count += folio_test_swapcache(folio) << order;
>>> } else {
>>> /* One reference per page from shmem in the swapcache. */
>>> ref_count += folio_test_swapcache(folio) << order;
>>> /* One reference per page from the pagecache. */
>>> ref_count += !!folio->mapping << order;
>>> /* One reference from PG_private. */
>>> ref_count += folio_test_private(folio);
>>> }
>>>
>>> or simplified into
>>>
>>> if (!folio_test_anon(folio)) {
>>> /* One reference per page from the pagecache. */
>>> ref_count += !!folio->mapping << order;
>>> /* One reference from PG_private. */
>>> ref_count += folio_test_private(folio);
>>> }
>>> /* One reference per page from the swapcache (anon or shmem). */
>>> ref_count += folio_test_swapcache(folio) << order;
>>> ?
>>
>> That is incorrect I think due to swapcache being able to give false positives (PG_owner_priv_1).
>
> Got it. So it should be:
>
> if (folio_test_anon(folio)) {
> /* One reference per page from the swapcache. */
> ref_count += folio_test_swapcache(folio) << order;
> } else {
> /* One reference per page from shmem in the swapcache. */
> ref_count += (folio_test_swapbacked (folio) &&
> folio_test_swapcache(folio)) << order;
> /* One reference per page from the pagecache. */
> ref_count += !!folio->mapping << order;
> /* One reference from PG_private. */
> ref_count += folio_test_private(folio);
> }
Interestingly, I think we would then also take proper care of to-be-anon folios
in the swapcache that are not marked anon yet. See __read_swap_cache_async().
I wonder if we can clean that up a bit, to highlight that PG_private etc.
do not apply.
	if (folio_test_anon(folio)) {
		/* One reference per page from the swapcache. */
		ref_count += folio_test_swapcache(folio) << order;
	} else if (folio_test_swapbacked(folio) && folio_test_swapcache(folio)) {
		/* to-be-anon or shmem folio in the swapcache (!folio->mapping) */
		ref_count += 1ul << order;
		VM_WARN_ON_ONCE(folio->mapping);
	} else {
		/* One reference per page from the pagecache. */
		ref_count += !!folio->mapping << order;
		/* One reference from PG_private. */
		ref_count += folio_test_private(folio);
	}
Or maybe simply:
	if (folio_test_swapbacked(folio) && folio_test_swapcache(folio)) {
		/*
		 * (to-be) anon or shmem (!folio->mapping) folio in the swapcache:
		 * One reference per page from the swapcache.
		 */
		ref_count += 1 << order;
		VM_WARN_ON_ONCE(!folio_test_anon(folio) && folio->mapping);
	} else if (!folio_test_anon(folio)) {
		/* One reference per page from the pagecache. */
		ref_count += !!folio->mapping << order;
		/* One reference from PG_private. */
		ref_count += folio_test_private(folio);
	}
>
> I wonder if we should have folio_test_shmem_in_swapcache() instead.
Interestingly, thinking about it, it would also match to-be-anon folios
and anon folios.
folio_in_swapcache(), maybe?
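A minimal sketch of what such a helper could look like, based purely on the
condition discussed above (the name is just the suggestion here, not an
existing helper):

	/*
	 * Hypothetical helper from this discussion: true for anon, to-be-anon
	 * and shmem folios that currently sit in the swapcache. The explicit
	 * swapbacked test guards against PG_owner_priv_1 aliases; as noted
	 * later in the thread, folio_test_swapcache() already folds that in,
	 * so this would mostly be a readability wrapper.
	 */
	static inline bool folio_in_swapcache(const struct folio *folio)
	{
		return folio_test_swapbacked(folio) && folio_test_swapcache(folio);
	}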
>
> BTW, this page flag reuse is really confusing.
Yes ...
> I see PG_checked is
> PG_owner_priv_1 too and __folio_migrate_mapping() uses folio_test_swapcache()
> to decide the number of i_pages entries. Wouldn’t that cause any issue?
Maybe at that point all false positives were ruled out?
It is horrible TBH.
--
Cheers
David
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-24 22:14 ` Balbir Singh
@ 2025-11-25 8:55 ` David Hildenbrand (Red Hat)
2025-11-25 15:41 ` Zi Yan
0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-25 8:55 UTC (permalink / raw)
To: Balbir Singh, Zi Yan, Lorenzo Stoakes
Cc: Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, linux-mm, linux-kernel
On 11/24/25 23:14, Balbir Singh wrote:
> On 11/22/25 13:55, Zi Yan wrote:
>> can_split_folio() is just a refcount comparison, making sure only the
>> split caller holds an extra pin. Open code it with
>> folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
>> used by folio_ref_freeze(), add folio_cache_references() to calculate it.
>>
>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>> include/linux/huge_mm.h | 1 -
>> mm/huge_memory.c | 43 ++++++++++++++++-------------------------
>> mm/vmscan.c | 3 ++-
>> 3 files changed, 19 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 97686fb46e30..1ecaeccf39c9 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -369,7 +369,6 @@ enum split_type {
>> SPLIT_TYPE_NON_UNIFORM,
>> };
>>
>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> unsigned int new_order);
>> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c1f1055165dd..6c821c1c0ac3 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3455,23 +3455,6 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>> }
>> }
>>
>> -/* Racy check whether the huge page can be split */
>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>> -{
>> - int extra_pins;
>> -
>> - /* Additional pins from page cache */
>> - if (folio_test_anon(folio))
>> - extra_pins = folio_test_swapcache(folio) ?
>> - folio_nr_pages(folio) : 0;
>> - else
>> - extra_pins = folio_nr_pages(folio);
>> - if (pextra_pins)
>> - *pextra_pins = extra_pins;
>> - return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
>> - caller_pins;
>> -}
>> -
>> static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>> {
>> for (; nr_pages; page++, nr_pages--)
>> @@ -3776,17 +3759,26 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
>> return 0;
>> }
>>
>> +/* Number of folio references from the pagecache or the swapcache. */
>> +static unsigned int folio_cache_references(const struct folio *folio)
>
> folio_cache_ref_count?
Yes, makes sense.
>
>> +{
>> + if (folio_test_anon(folio) && !folio_test_swapcache(folio))
>> + return 0;
>> + return folio_nr_pages(folio);
>> +}
>> +>
> Does this belong to include/linux/mm.h with the other helpers?
Not for now, I think; in particular, we rely on the earlier !folio->mapping
checks for it to give a correct answer. Most people should be using
folio_expected_ref_count().
--
Cheers
David
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable()
2025-11-22 2:55 ` [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable() Zi Yan
2025-11-23 1:50 ` Wei Yang
2025-11-23 18:38 ` Barry Song
@ 2025-11-25 8:58 ` David Hildenbrand (Red Hat)
2025-11-25 17:44 ` Andrew Morton
2 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-25 8:58 UTC (permalink / raw)
To: Zi Yan, Lorenzo Stoakes
Cc: Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
On 11/22/25 03:55, Zi Yan wrote:
> folio_split_supported() used in try_folio_split_to_order() requires
> folio->mapping to be non NULL, but current try_folio_split_to_order() does
> not check it. There is no issue in the current code, since
> try_folio_split_to_order() is only used in truncate_inode_partial_folio(),
> where folio->mapping is not NULL.
>
> To prevent future misuse, move folio->mapping NULL check (i.e., folio is
> truncated) into folio_split_supported(). Since folio->mapping NULL check
> returns -EBUSY and folio_split_supported() == false means -EINVAL, change
> folio_split_supported() return type from bool to int and return error
> numbers accordingly. Rename folio_split_supported() to
> folio_check_splittable() to match the return type change.
>
> While at it, move is_huge_zero_folio() check and folio_test_writeback()
> check into folio_check_splittable() and add kernel-doc.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> include/linux/huge_mm.h | 10 ++++--
> mm/huge_memory.c | 74 +++++++++++++++++++++++++----------------
> 2 files changed, 53 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 1d439de1ca2c..97686fb46e30 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -375,8 +375,8 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
> int min_order_for_split(struct folio *folio);
> int split_folio_to_list(struct folio *folio, struct list_head *list);
> -bool folio_split_supported(struct folio *folio, unsigned int new_order,
> - enum split_type split_type, bool warns);
> +int folio_check_splittable(struct folio *folio, unsigned int new_order,
> + enum split_type split_type, bool warns);
> int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
> struct list_head *list);
>
> @@ -407,7 +407,11 @@ static inline int split_huge_page_to_order(struct page *page, unsigned int new_o
> static inline int try_folio_split_to_order(struct folio *folio,
> struct page *page, unsigned int new_order)
> {
> - if (!folio_split_supported(folio, new_order, SPLIT_TYPE_NON_UNIFORM, /* warns= */ false))
> + int ret;
> +
> + ret = folio_check_splittable(folio, new_order, SPLIT_TYPE_NON_UNIFORM,
> + /* warns= */ false);
> + if (ret)
> return split_huge_page_to_order(&folio->page, new_order);
> return folio_split(folio, new_order, page, NULL);
> }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 041b554c7115..c1f1055165dd 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3688,15 +3688,43 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> return 0;
> }
>
> -bool folio_split_supported(struct folio *folio, unsigned int new_order,
> - enum split_type split_type, bool warns)
> +/**
> + * folio_check_splittable() - check if a folio can be split to a given order
> + * @folio: folio to be split
> + * @new_order: the smallest order of the after split folios (since buddy
> + * allocator like split generates folios with orders from @folio's
> + * order - 1 to new_order).
> + * @split_type: uniform or non-uniform split
> + * @warns: whether gives warnings or not for the checks in the function
> + *
> + * folio_check_splittable() checks if @folio can be split to @new_order using
> + * @split_type method. The truncated folio check must come first.
> + *
> + * Context: folio must be locked.
> + *
> + * Return: 0 - @folio can be split to @new_order, otherwise an error number is
> + * returned.
> + */
> +int folio_check_splittable(struct folio *folio, unsigned int new_order,
> + enum split_type split_type, bool warns)
> {
> + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> + /*
> + * Folios that just got truncated cannot get split. Signal to the
> + * caller that there was a race.
> + *
> + * TODO: this will also currently refuse shmem folios that are in the
> + * swapcache.
> + */
Per the other discussion, should this even be:
"this will also currently refuse folios without a mapping in the
swapcache (shmem or to-be-anon folios)"
IOW, to spell out that this also covers anon folios that were read into the
swapcache but are not yet mapped into page tables (which is where we set
folio->mapping).
--
Cheers
David
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-24 21:08 ` Zi Yan
2025-11-25 8:52 ` David Hildenbrand (Red Hat)
@ 2025-11-25 9:10 ` Miaohe Lin
2025-11-25 9:34 ` David Hildenbrand (Red Hat)
1 sibling, 1 reply; 32+ messages in thread
From: Miaohe Lin @ 2025-11-25 9:10 UTC (permalink / raw)
To: Zi Yan, David Hildenbrand (Red Hat)
Cc: Lorenzo Stoakes, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
On 2025/11/25 5:08, Zi Yan wrote:
> On 24 Nov 2025, at 14:22, David Hildenbrand (Red Hat) wrote:
>
<snip>
>
> BTW, this page flag reuse is really confusing. I see PG_checked is
> PG_owner_priv_1 too and __folio_migrate_mapping() uses folio_test_swapcache()
> to decide the number of i_pages entries. Wouldn’t that cause any issue?
> ext4 does not release_folio() for migration when PG_checked is set,
> ubifs clears PG_checked in release_folio(). I have not checked all other FS
> yet. Maybe later.
folio_test_swapbacked() is also checked in folio_test_swapcache:
static __always_inline bool folio_test_swapcache(struct folio *folio)
{
	return folio_test_swapbacked(folio) &&
			test_bit(PG_swapcache, folio_flags(folio, 0));
}
So IMHO the reuse of this page flag should work fine.
Thanks.
.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-25 9:10 ` Miaohe Lin
@ 2025-11-25 9:34 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-25 9:34 UTC (permalink / raw)
To: Miaohe Lin, Zi Yan
Cc: Lorenzo Stoakes, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm, linux-kernel
On 11/25/25 10:10, Miaohe Lin wrote:
> On 2025/11/25 5:08, Zi Yan wrote:
>> On 24 Nov 2025, at 14:22, David Hildenbrand (Red Hat) wrote:
>>
>
> <snip>
>
>>
>> BTW, this page flag reuse is really confusing. I see PG_checked is
>> PG_owner_priv_1 too and __folio_migrate_mapping() uses folio_test_swapcache()
>> to decide the number of i_pages entries. Wouldn’t that cause any issue?
>> ext4 does not release_folio() for migration when PG_checked is set,
>> ubifs clears PG_checked in release_folio(). I have not checked all other FS
>> yet. Maybe later.
>
> folio_test_swapbacked() is also checked in folio_test_swapcache:
>
> static __always_inline bool folio_test_swapcache(struct folio *folio)
> {
> return folio_test_swapbacked(folio) &&
> test_bit(PG_swapcache, folio_flags(folio, 0));
> }
>
> So IMHO the reuse of this page flag should work fine.
Ahh, thanks for pointing that out. Confusing, as usually the
folio_test_*() helpers are straight bit tests.
All good then :)
--
Cheers
David
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-25 8:55 ` David Hildenbrand (Red Hat)
@ 2025-11-25 15:41 ` Zi Yan
0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-25 15:41 UTC (permalink / raw)
To: Balbir Singh, David Hildenbrand (Red Hat)
Cc: Lorenzo Stoakes, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, linux-mm, linux-kernel
On 25 Nov 2025, at 3:55, David Hildenbrand (Red Hat) wrote:
> On 11/24/25 23:14, Balbir Singh wrote:
>> On 11/22/25 13:55, Zi Yan wrote:
>>> can_split_folio() is just a refcount comparison, making sure only the
>>> split caller holds an extra pin. Open code it with
>>> folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
>>> used by folio_ref_freeze(), add folio_cache_references() to calculate it.
>>>
>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>> include/linux/huge_mm.h | 1 -
>>> mm/huge_memory.c | 43 ++++++++++++++++-------------------------
>>> mm/vmscan.c | 3 ++-
>>> 3 files changed, 19 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 97686fb46e30..1ecaeccf39c9 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -369,7 +369,6 @@ enum split_type {
>>> SPLIT_TYPE_NON_UNIFORM,
>>> };
>>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> unsigned int new_order);
>>> int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c1f1055165dd..6c821c1c0ac3 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3455,23 +3455,6 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>>> }
>>> }
>>> -/* Racy check whether the huge page can be split */
>>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>> -{
>>> -	int extra_pins;
>>> -
>>> -	/* Additional pins from page cache */
>>> -	if (folio_test_anon(folio))
>>> -		extra_pins = folio_test_swapcache(folio) ?
>>> -				folio_nr_pages(folio) : 0;
>>> -	else
>>> -		extra_pins = folio_nr_pages(folio);
>>> -	if (pextra_pins)
>>> -		*pextra_pins = extra_pins;
>>> -	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
>>> -			caller_pins;
>>> -}
>>> -
>>> static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>>> {
>>> for (; nr_pages; page++, nr_pages--)
>>> @@ -3776,17 +3759,26 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
>>> return 0;
>>> }
>>> +/* Number of folio references from the pagecache or the swapcache. */
>>> +static unsigned int folio_cache_references(const struct folio *folio)
>>
>> folio_cache_ref_count?
>
> Yes, makes sense.
>
>>
>>> +{
>>> +	if (folio_test_anon(folio) && !folio_test_swapcache(folio))
>>> +		return 0;
>>> +	return folio_nr_pages(folio);
>>> +}
>>> +
>>
>> Does this belong to include/linux/mm.h with the other helpers?
>
> Not for now I think, in particular, as we require earlier !folio->mapping checks to give a correct answer. Most people should be using folio_expected_ref_count().
>
Got it. Will use folio_cache_ref_count() in the next version.
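That is, keep the helper from the hunk above and only change its name,
roughly like this (the next version may still tweak the details):
/* Number of folio references from the pagecache or the swapcache. */
static unsigned int folio_cache_ref_count(const struct folio *folio)
{
	if (folio_test_anon(folio) && !folio_test_swapcache(folio))
		return 0;
	return folio_nr_pages(folio);
}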
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
2025-11-25 8:52 ` David Hildenbrand (Red Hat)
@ 2025-11-25 15:55 ` Zi Yan
0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-11-25 15:55 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Lorenzo Stoakes, Andrew Morton, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
On 25 Nov 2025, at 3:52, David Hildenbrand (Red Hat) wrote:
>>>
>>>>
>>>> Like:
>>>>
>>>> if (folio_test_anon(folio)) {
>>>> 	/* One reference per page from the swapcache. */
>>>> 	ref_count += folio_test_swapcache(folio) << order;
>>>> } else {
>>>> 	/* One reference per page from shmem in the swapcache. */
>>>> 	ref_count += folio_test_swapcache(folio) << order;
>>>> 	/* One reference per page from the pagecache. */
>>>> 	ref_count += !!folio->mapping << order;
>>>> 	/* One reference from PG_private. */
>>>> 	ref_count += folio_test_private(folio);
>>>> }
>>>>
>>>> or simplified into
>>>>
>>>> if (!folio_test_anon(folio)) {
>>>> 	/* One reference per page from the pagecache. */
>>>> 	ref_count += !!folio->mapping << order;
>>>> 	/* One reference from PG_private. */
>>>> 	ref_count += folio_test_private(folio);
>>>> }
>>>> /* One reference per page from the swapcache (anon or shmem). */
>>>> ref_count += folio_test_swapcache(folio) << order;
>>>> ?
>>>
>>> That is incorrect I think due to swapcache being able to give false positives (PG_owner_priv_1).
>>
>> Got it. So it should be:
>>
>> if (folio_test_anon(folio)) {
>> 	/* One reference per page from the swapcache. */
>> 	ref_count += folio_test_swapcache(folio) << order;
>> } else {
>> 	/* One reference per page from shmem in the swapcache. */
>> 	ref_count += (folio_test_swapbacked(folio) &&
>> 		      folio_test_swapcache(folio)) << order;
>> 	/* One reference per page from the pagecache. */
>> 	ref_count += !!folio->mapping << order;
>> 	/* One reference from PG_private. */
>> 	ref_count += folio_test_private(folio);
>> }
>
> Interestingly, I think we would then also take proper care of anon folios in the
> swapcache that are not anon yet. See __read_swap_cache_async().
Right. After add_to_swap_cache() in __read_swap_cache_async(), the folio
there is in the same state as shmem in swapcache.
>
> I wonder if we can clean that up a bit, to highlight that PG_private etc
> do not apply.
>
> if (folio_test_anon(folio)) {
> 	/* One reference per page from the swapcache. */
> 	ref_count += folio_test_swapcache(folio) << order;
> } else if (folio_test_swapbacked(folio) && folio_test_swapcache(folio)) {
> 	/* to-be-anon or shmem folio in the swapcache (!folio->mapping) */
> 	ref_count += 1ul << order;
> 	VM_WARN_ON_ONCE(folio->mapping);
> } else {
> 	/* One reference per page from the pagecache. */
> 	ref_count += !!folio->mapping << order;
> 	/* One reference from PG_private. */
> 	ref_count += folio_test_private(folio);
> }
I like this better, will send a patch for folio_expected_ref_count()
separately. Since folio_test_swapcache(folio) implies
folio_test_swapbacked(folio), as Miaohe pointed out in another
email, I will get rid of folio_test_swapbacked(folio) in the above code.
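Roughly, the middle branch above would then become (just a sketch, the
actual folio_expected_ref_count() change may differ):
} else if (folio_test_swapcache(folio)) {
	/* to-be-anon or shmem folio in the swapcache (!folio->mapping) */
	ref_count += 1ul << order;
	VM_WARN_ON_ONCE(folio->mapping);
}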
>
> Or maybe simply:
>
>
> if (folio_test_swapbacked(folio) && folio_test_swapcache(folio)) {
> 	/*
> 	 * (to-be) anon or shmem (!folio->mapping) folio in the swapcache:
> 	 * One reference per page from the swapcache.
> 	 */
> 	ref_count += 1 << order;
> 	VM_WARN_ON_ONCE(!folio_test_anon(folio) && folio->mapping);
> } else if (!folio_test_anon(folio)) {
> 	/* One reference per page from the pagecache. */
> 	ref_count += !!folio->mapping << order;
> 	/* One reference from PG_private. */
> 	ref_count += folio_test_private(folio);
> }
>
>>
>> I wonder if we should have folio_test_shmem_in_swapcache() instead.
>
> Interestingly, thinking about it, I think it would also match to-be anon folios
> and anon folios.
>
> folio_in_swapcache() maybe ?
Yes, will come up with a patch for it and send along with
folio_expected_ref_count() patch.
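Probably something as simple as this (hypothetical sketch only, name and
exact placement still open):
/*
 * Hypothetical: anon, to-be-anon, or shmem folio in the swapcache.
 * folio_test_swapcache() already checks PG_swapbacked, since PG_swapcache
 * aliases PG_owner_priv_1.
 */
static inline bool folio_in_swapcache(struct folio *folio)
{
	return folio_test_swapcache(folio);
}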
>
>>
>> BTW, this page flag reuse is really confusing.
>
> Yes ...
>
>> I see PG_checked is
>> PG_owner_priv_1 too and __folio_migrate_mapping() uses folio_test_swapcache()
>> to decide the number of i_pages entries. Wouldn’t that cause any issue?
>
> Maybe at that point all false positives were ruled out?
>
> It is horrible TBH.
>
> --
> Cheers
>
> David
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable()
2025-11-25 8:58 ` David Hildenbrand (Red Hat)
@ 2025-11-25 17:44 ` Andrew Morton
0 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2025-11-25 17:44 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Zi Yan, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm,
linux-kernel
On Tue, 25 Nov 2025 09:58:03 +0100 "David Hildenbrand (Red Hat)" <david@kernel.org> wrote:
> > + * Return: 0 - @folio can be split to @new_order, otherwise an error number is
> > + * returned.
> > + */
> > +int folio_check_splittable(struct folio *folio, unsigned int new_order,
> > + enum split_type split_type, bool warns)
> > {
> > + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > + /*
> > + * Folios that just got truncated cannot get split. Signal to the
> > + * caller that there was a race.
> > + *
> > + * TODO: this will also currently refuse shmem folios that are in the
> > + * swapcache.
> > + */
>
> Per the other discussion, should this even be:
>
> "this will also currently refuse folios without a mapping in the
> swapcache (shmem or to-be-anon folios)"
>
> IOW, to spell out that anon folios that were read into the swapcache but
> not mapped yet into page tables (where we set folio->mapping).
This?
--- a/mm/huge_memory.c~mm-huge_memory-change-folio_split_supported-to-folio_check_splittable-fix
+++ a/mm/huge_memory.c
@@ -3714,7 +3714,8 @@ int folio_check_splittable(struct folio
 	 * caller that there was a race.
 	 *
 	 * TODO: this will also currently refuse shmem folios that are in the
-	 * swapcache.
+	 * swapcache. Currently it will also refuse folios without a mapping
+	 * in the swapcache (shmem or to-be-anon folios).
 	 */
 	if (!folio_test_anon(folio) && !folio->mapping)
 		return -EBUSY;
_
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2025-11-25 17:45 UTC | newest]
Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-22 2:55 [PATCH v2 0/4] Improve folio split related functions Zi Yan
2025-11-22 2:55 ` [PATCH v2 1/4] mm/huge_memory: change folio_split_supported() to folio_check_splittable() Zi Yan
2025-11-23 1:50 ` Wei Yang
2025-11-23 18:38 ` Barry Song
2025-11-24 10:33 ` David Hildenbrand (Red Hat)
2025-11-24 16:38 ` Zi Yan
2025-11-25 8:58 ` David Hildenbrand (Red Hat)
2025-11-25 17:44 ` Andrew Morton
2025-11-22 2:55 ` [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation Zi Yan
2025-11-23 1:51 ` Wei Yang
2025-11-24 10:41 ` David Hildenbrand (Red Hat)
2025-11-24 17:05 ` Zi Yan
2025-11-24 19:22 ` David Hildenbrand (Red Hat)
2025-11-24 21:08 ` Zi Yan
2025-11-25 8:52 ` David Hildenbrand (Red Hat)
2025-11-25 15:55 ` Zi Yan
2025-11-25 9:10 ` Miaohe Lin
2025-11-25 9:34 ` David Hildenbrand (Red Hat)
2025-11-24 22:14 ` Balbir Singh
2025-11-25 8:55 ` David Hildenbrand (Red Hat)
2025-11-25 15:41 ` Zi Yan
2025-11-22 2:55 ` [PATCH v2 3/4] mm/huge_memory: make min_order_for_split() always return an order Zi Yan
2025-11-23 1:53 ` Wei Yang
2025-11-24 10:43 ` David Hildenbrand (Red Hat)
2025-11-24 15:18 ` Lorenzo Stoakes
2025-11-24 17:11 ` Zi Yan
2025-11-22 2:55 ` [PATCH v2 4/4] mm/huge_memory: fix folio split stats counting Zi Yan
2025-11-23 1:56 ` Wei Yang
2025-11-24 10:45 ` David Hildenbrand (Red Hat)
2025-11-24 17:23 ` Zi Yan
2025-11-24 15:21 ` Lorenzo Stoakes
2025-11-24 17:29 ` Zi Yan