* [PATCH v4 00/10] Buddy allocator like folio split
From: Zi Yan @ 2025-01-06 16:55 UTC (permalink / raw)
To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
Zi Yan
Hi all,
This patchset adds a new buddy allocator like large folio split to reduce the
total number of resulting folios and the amount of memory needed for
multi-index xarray split, and to keep more large folios after a split. It is
on top of mm-everything-2025-01-04-04-41 and is just a resend of v3.
Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split()
can support both uniform split and buddy allocator like split. All
existing split_huge_page*() users can be gradually converted to use
folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().
The THP selftests passed for split_huge_page*() runs, and I also
tested folio_split() on anon large folios, pagecache folios, and
truncate. I also ran the truncate related tests from the xfstests quick test
group and saw no issues.
Changelog
===
From V3[5]:
1. Used xas_split_alloc(GFP_NOWAIT) instead of xas_nomem(), since extra
operations inside xas_split_alloc() are needed for correctness.
2. Enabled folio_split() for shmem and no issue was found with xfstests
quick test group.
3. Split both ends of a truncate range in truncate_inode_partial_folio()
to avoid wasting memory in shmem truncate (per David Hildenbrand).
4. Removed page_in_folio_offset() since page_folio() does the same
thing.
5. Finished truncate related tests from xfstests quick test group on XFS and
tmpfs without issues.
6. Disabled buddy allocator like split on CONFIG_READ_ONLY_THP_FOR_FS
and FS without large folio split. This check was missed in the prior
versions.
From V2[3]:
1. Incorporated all the feedback from Kirill[4].
2. Used GFP_NOWAIT for xas_nomem().
3. Tested the code path when xas_nomem() fails.
4. Added selftests for folio_split().
5. Fixed no THP config build error.
From V1[2]:
1. Split the original patch 1 into multiple ones for easy review (per
Kirill).
2. Added xas_destroy() to avoid memory leak.
3. Fixed nr_dropped not used error (per kernel test robot).
4. Added proper error handling when xas_nomem() fails to allocate memory
for xas_split() during buddy allocator like split.
From RFC[1]:
1. Merged backend code of split_huge_page_to_list_to_order() and
folio_split(). The same code is used for both uniform split and buddy
allocator like split.
2. Used xas_nomem() instead of xas_split_alloc() for folio_split().
3. folio_split() now leaves the first after-split folio unlocked,
instead of the one containing the given page, since
the caller of truncate_inode_partial_folio() locks and unlocks the
first folio.
4. Extended split_huge_page debugfs to use folio_split().
5. Added truncate_inode_partial_folio() as first user of folio_split().
Design
===
folio_split() splits a large folio in the same way as the buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if the user wants to free the
3rd subpage in an order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
O-1, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache,
since anon folios do not support order-1 yet.
The split process is similar to the existing approach:
1. Unmap all page mappings (split PMD mappings if they exist);
2. Split metadata like memcg, page owner, page alloc tag;
3. Copy metadata in struct folio to sub pages, but instead of splitting
the whole folio into multiple smaller ones with the same order in one
shot, this approach splits the folio iteratively. Taking the example
above, this approach first splits the original order-9 into two order-8,
then splits the left order-8 into two order-7, and so on;
4. Post-process split folios, like write mapping->i_pages for pagecache,
adjust folio refcounts, add split folios to the corresponding list;
5. Remap split folios;
6. Unlock split folios.
__split_unmapped_folio() and __split_folio_to_order() replace
__split_huge_page() and __split_huge_page_tail() respectively.
__split_unmapped_folio() uses different approaches to perform
uniform split and buddy allocator like split:
1. uniform split: one single call to __split_folio_to_order() is used to
uniformly split the given folio. All resulting folios are put back to
the list after split. The folio containing the given page is left to
caller to unlock and others are unlocked.
2. buddy allocator like split: old_order - new_order calls to
__split_folio_to_order() are used to split the given folio at order N to
order N-1. After each call, the target folio is changed to the one
containing the page, which is given via folio_split() parameters.
After each call, folios not containing the page are put back to the list.
The folio containing the page is put back to the list when its order
is new_order. All folios are unlocked except the first folio, which
is left to caller to unlock.
Patch Overview
===
1. Patch 1 fixed a selftest counting bug in split_huge_page_test and
patch 2 added tests for splitting a PMD huge page to any lower order.
They can be picked independent of this patchset.
2. Patch 3 enabled split shmem to any lower order, since large folio
support for shmem was upstreamed.
3. Patch 4 added __split_unmapped_folio() and
__split_folio_to_order() to prepare for moving to new backend split
code.
4. Patch 5 replaced __split_huge_page() with
__split_unmapped_folio() in split_huge_page_to_list_to_order().
5. Patch 6 added new folio_split().
6. Patch 7 removed __split_huge_page() and __split_huge_page_tail().
7. Patch 8 added a new in_folio_offset to split_huge_page debugfs for
folio_split() test.
8. Patch 9 used folio_split() for truncate operation.
9. Patch 10 added folio_split() tests.
Any comments and/or suggestions are welcome. Thanks.
[1] https://lore.kernel.org/linux-mm/20241008223748.555845-1-ziy@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20241028180932.1319265-1-ziy@nvidia.com/
[3] https://lore.kernel.org/linux-mm/20241101150357.1752726-1-ziy@nvidia.com/
[4] https://lore.kernel.org/linux-mm/e6ppwz5t4p4kvir6eqzoto4y5fmdjdxdyvxvtw43ncly4l4ogr@7ruqsay6i2h2/
[5] https://lore.kernel.org/linux-mm/20241205001839.2582020-1-ziy@nvidia.com/
Zi Yan (10):
selftests/mm: use selftests framework to print test result.
selftests/mm: add tests for splitting pmd THPs to all lower orders.
mm/huge_memory: allow split shmem large folio to any order
mm/huge_memory: add two new (not yet used) functions for folio_split()
mm/huge_memory: move folio split common code to __folio_split()
mm/huge_memory: add buddy allocator like folio_split()
mm/huge_memory: remove the old, unused __split_huge_page()
mm/huge_memory: add folio_split() to debugfs testing interface.
mm/truncate: use folio_split() for truncate operation.
selftests/mm: add tests for folio_split(), buddy allocator like split.
include/linux/huge_mm.h | 17 +
mm/huge_memory.c | 689 +++++++++++-------
mm/truncate.c | 31 +-
.../selftests/mm/split_huge_page_test.c | 70 +-
4 files changed, 522 insertions(+), 285 deletions(-)
--
2.45.2
* [PATCH v4 01/10] selftests/mm: use selftests framework to print test result.
From: Zi Yan @ 2025-01-06 16:55 UTC (permalink / raw)
To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
Zi Yan
Otherwise the number of tests does not match the reality.
Fixes: 391e86971161 ("mm: selftest to verify zero-filled pages are mapped to zeropage")
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
.../selftests/mm/split_huge_page_test.c | 34 +++++++------------
1 file changed, 12 insertions(+), 22 deletions(-)
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index eb6d1b9fc362..cd74ea9b1295 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -108,38 +108,28 @@ static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hp
unsigned long rss_anon_before, rss_anon_after;
size_t i;
- if (!check_huge_anon(one_page, 4, pmd_pagesize)) {
- printf("No THP is allocated\n");
- exit(EXIT_FAILURE);
- }
+ if (!check_huge_anon(one_page, 4, pmd_pagesize))
+ ksft_exit_fail_msg("No THP is allocated\n");
rss_anon_before = rss_anon();
- if (!rss_anon_before) {
- printf("No RssAnon is allocated before split\n");
- exit(EXIT_FAILURE);
- }
+ if (!rss_anon_before)
+ ksft_exit_fail_msg("No RssAnon is allocated before split\n");
/* split all THPs */
write_debugfs(PID_FMT, getpid(), (uint64_t)one_page,
(uint64_t)one_page + len, 0);
for (i = 0; i < len; i++)
- if (one_page[i] != (char)0) {
- printf("%ld byte corrupted\n", i);
- exit(EXIT_FAILURE);
- }
+ if (one_page[i] != (char)0)
+ ksft_exit_fail_msg("%ld byte corrupted\n", i);
- if (!check_huge_anon(one_page, 0, pmd_pagesize)) {
- printf("Still AnonHugePages not split\n");
- exit(EXIT_FAILURE);
- }
+ if (!check_huge_anon(one_page, 0, pmd_pagesize))
+ ksft_exit_fail_msg("Still AnonHugePages not split\n");
rss_anon_after = rss_anon();
- if (rss_anon_after >= rss_anon_before) {
- printf("Incorrect RssAnon value. Before: %ld After: %ld\n",
+ if (rss_anon_after >= rss_anon_before)
+ ksft_exit_fail_msg("Incorrect RssAnon value. Before: %ld After: %ld\n",
rss_anon_before, rss_anon_after);
- exit(EXIT_FAILURE);
- }
}
void split_pmd_zero_pages(void)
@@ -150,7 +140,7 @@ void split_pmd_zero_pages(void)
one_page = allocate_zero_filled_hugepage(len);
verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
- printf("Split zero filled huge pages successful\n");
+ ksft_test_result_pass("Split zero filled huge pages successful\n");
free(one_page);
}
@@ -491,7 +481,7 @@ int main(int argc, char **argv)
if (argc > 1)
optional_xfs_path = argv[1];
- ksft_set_plan(3+9);
+ ksft_set_plan(4+9);
pagesize = getpagesize();
pageshift = ffs(pagesize) - 1;
--
2.45.2
* [PATCH v4 02/10] selftests/mm: add tests for splitting pmd THPs to all lower orders.
From: Zi Yan @ 2025-01-06 16:55 UTC (permalink / raw)
To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
Zi Yan
Kernel already supports splitting a folio to any lower order. Test it.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
tools/testing/selftests/mm/split_huge_page_test.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index cd74ea9b1295..5bb159ebc83d 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -144,7 +144,7 @@ void split_pmd_zero_pages(void)
free(one_page);
}
-void split_pmd_thp(void)
+void split_pmd_thp_to_order(int order)
{
char *one_page;
size_t len = 4 * pmd_pagesize;
@@ -164,7 +164,7 @@ void split_pmd_thp(void)
/* split all THPs */
write_debugfs(PID_FMT, getpid(), (uint64_t)one_page,
- (uint64_t)one_page + len, 0);
+ (uint64_t)one_page + len, order);
for (i = 0; i < len; i++)
if (one_page[i] != (char)i)
@@ -174,7 +174,7 @@ void split_pmd_thp(void)
if (!check_huge_anon(one_page, 0, pmd_pagesize))
ksft_exit_fail_msg("Still AnonHugePages not split\n");
- ksft_test_result_pass("Split huge pages successful\n");
+ ksft_test_result_pass("Split huge pages to order %d successful\n", order);
free(one_page);
}
@@ -481,7 +481,7 @@ int main(int argc, char **argv)
if (argc > 1)
optional_xfs_path = argv[1];
- ksft_set_plan(4+9);
+ ksft_set_plan(1+9+2+9);
pagesize = getpagesize();
pageshift = ffs(pagesize) - 1;
@@ -492,7 +492,10 @@ int main(int argc, char **argv)
fd_size = 2 * pmd_pagesize;
split_pmd_zero_pages();
- split_pmd_thp();
+
+ for (i = 0; i < 9; i++)
+ split_pmd_thp_to_order(i);
+
split_pte_mapped_thp();
split_file_backed_thp();
--
2.45.2
* [PATCH v4 03/10] mm/huge_memory: allow split shmem large folio to any order
From: Zi Yan @ 2025-01-06 16:55 UTC (permalink / raw)
To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
Zi Yan
Commit 4d684b5f92ba ("mm: shmem: add large folio support for tmpfs") has
added large folio support to shmem. Remove the restriction in
split_huge_page*().
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/huge_memory.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c89aed1510f1..511b5b23894b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3287,7 +3287,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
/* Some pages can be beyond EOF: drop them from page cache */
if (tail->index >= end) {
if (shmem_mapping(folio->mapping))
- nr_dropped++;
+ nr_dropped += new_nr;
else if (folio_test_clear_dirty(tail))
folio_account_cleaned(tail,
inode_to_wb(folio->mapping->host));
@@ -3453,12 +3453,6 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
return -EINVAL;
}
} else if (new_order) {
- /* Split shmem folio to non-zero order not supported */
- if (shmem_mapping(folio->mapping)) {
- VM_WARN_ONCE(1,
- "Cannot split shmem folio to non-0 order");
- return -EINVAL;
- }
/*
* No split if the file system does not support large folio.
* Note that we might still have THPs in such mappings due to
--
2.45.2
* [PATCH v4 04/10] mm/huge_memory: add two new (not yet used) functions for folio_split()
From: Zi Yan @ 2025-01-06 16:55 UTC (permalink / raw)
To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
Zi Yan
This is a preparation patch; neither of the added functions is used yet.
The added __split_unmapped_folio() is able to split a folio with
its mapping removed in two manners: 1) uniform split (the existing way),
and 2) buddy allocator like split.
The added __split_folio_to_order() can split a folio into any lower order.
For uniform split, __split_unmapped_folio() calls it once to split
the given folio to the new order. For buddy allocator like split,
__split_unmapped_folio() calls it (folio_order - new_order) times,
and each time splits the folio containing the given page to one lower
order.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/huge_memory.c | 348 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 347 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 511b5b23894b..d8e743f81e76 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3134,7 +3134,6 @@ static void remap_page(struct folio *folio, unsigned long nr, int flags)
static void lru_add_page_tail(struct folio *folio, struct page *tail,
struct lruvec *lruvec, struct list_head *list)
{
- VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
VM_BUG_ON_FOLIO(PageLRU(tail), folio);
lockdep_assert_held(&lruvec->lru_lock);
@@ -3378,6 +3377,353 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
caller_pins;
}
+/*
+ * It splits @folio into @new_order folios and copies the @folio metadata to
+ * all the resulting folios.
+ */
+static int __split_folio_to_order(struct folio *folio, int new_order)
+{
+ int curr_order = folio_order(folio);
+ long nr_pages = folio_nr_pages(folio);
+ long new_nr_pages = 1 << new_order;
+ long index;
+
+ if (curr_order <= new_order)
+ return -EINVAL;
+
+ /*
+ * Skip the first new_nr_pages, since the new folio from them have all
+ * the flags from the original folio.
+ */
+ for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
+ struct page *head = &folio->page;
+ struct page *new_head = head + index;
+
+ /*
+ * Careful: new_folio is not a "real" folio before we cleared PageTail.
+ * Don't pass it around before clear_compound_head().
+ */
+ struct folio *new_folio = (struct folio *)new_head;
+
+ VM_BUG_ON_PAGE(atomic_read(&new_head->_mapcount) != -1, new_head);
+
+ /*
+ * Clone page flags before unfreezing refcount.
+ *
+ * After successful get_page_unless_zero() might follow flags change,
+ * for example lock_page() which set PG_waiters.
+ *
+ * Note that for mapped sub-pages of an anonymous THP,
+ * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
+ * the migration entry instead from where remap_page() will restore it.
+ * We can still have PG_anon_exclusive set on effectively unmapped and
+ * unreferenced sub-pages of an anonymous THP: we can simply drop
+ * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
+ */
+ new_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ new_head->flags |= (head->flags &
+ ((1L << PG_referenced) |
+ (1L << PG_swapbacked) |
+ (1L << PG_swapcache) |
+ (1L << PG_mlocked) |
+ (1L << PG_uptodate) |
+ (1L << PG_active) |
+ (1L << PG_workingset) |
+ (1L << PG_locked) |
+ (1L << PG_unevictable) |
+#ifdef CONFIG_ARCH_USES_PG_ARCH_2
+ (1L << PG_arch_2) |
+#endif
+#ifdef CONFIG_ARCH_USES_PG_ARCH_3
+ (1L << PG_arch_3) |
+#endif
+ (1L << PG_dirty) |
+ LRU_GEN_MASK | LRU_REFS_MASK));
+
+ /* ->mapping in first and second tail page is replaced by other uses */
+ VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
+ new_head);
+ new_head->mapping = head->mapping;
+ new_head->index = head->index + index;
+
+ /*
+ * page->private should not be set in tail pages. Fix up and warn once
+ * if private is unexpectedly set.
+ */
+ if (unlikely(new_head->private)) {
+ VM_WARN_ON_ONCE_PAGE(true, new_head);
+ new_head->private = 0;
+ }
+
+ if (folio_test_swapcache(folio))
+ new_folio->swap.val = folio->swap.val + index;
+
+ /* Page flags must be visible before we make the page non-compound. */
+ smp_wmb();
+
+ /*
+ * Clear PageTail before unfreezing page refcount.
+ *
+ * After successful get_page_unless_zero() might follow put_page()
+ * which needs correct compound_head().
+ */
+ clear_compound_head(new_head);
+ if (new_order) {
+ prep_compound_page(new_head, new_order);
+ folio_set_large_rmappable(new_folio);
+
+ folio_set_order(folio, new_order);
+ }
+
+ if (folio_test_young(folio))
+ folio_set_young(new_folio);
+ if (folio_test_idle(folio))
+ folio_set_idle(new_folio);
+
+ folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
+ }
+
+ if (!new_order)
+ ClearPageCompound(&folio->page);
+
+ return 0;
+}
+
+/*
+ * It splits an unmapped @folio to lower order smaller folios in two ways.
+ * @folio: the to-be-split folio
+ * @new_order: the smallest order of the after split folios (since buddy
+ * allocator like split generates folios with orders from @folio's
+ * order - 1 to new_order).
+ * @page: in buddy allocator like split, the folio containing @page will be
+ * split until its order becomes @new_order.
+ * @list: the after split folios will be added to @list if it is not NULL,
+ * otherwise to LRU lists.
+ * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
+ * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
+ * @mapping: @folio->mapping
+ * @uniform_split: if the split is uniform or not (buddy allocator like split)
+ *
+ *
+ * 1. uniform split: the given @folio into multiple @new_order small folios,
+ * where all small folios have the same order. This is done when
+ * uniform_split is true.
+ * 2. buddy allocator like split: the given @folio is split into half and one
+ * of the half (containing the given page) is split into half until the
+ * given @page's order becomes @new_order. This is done when uniform_split is
+ * false.
+ *
+ * The high level flow for these two methods are:
+ * 1. uniform split: a single __split_folio_to_order() is called to split the
+ * @folio into @new_order, then we traverse all the resulting folios one by
+ * one in PFN ascending order and perform stats, unfreeze, adding to list,
+ * and file mapping index operations.
+ * 2. buddy allocator like split: in general, folio_order - @new_order calls to
+ * __split_folio_to_order() are called in the for loop to split the @folio
+ * to one lower order at a time. The resulting small folios are processed
+ * like what is done during the traversal in 1, except the one containing
+ * @page, which is split in next for loop.
+ *
+ * After splitting, the caller's folio reference will be transferred to the
+ * folio containing @page. The other folios may be freed if they are not mapped.
+ *
+ * In terms of locking, after splitting,
+ * 1. uniform split leaves @page (or the folio contains it) locked;
+ * 2. buddy allocator like split leaves @folio locked.
+ *
+ *
+ * For !uniform_split, when -ENOMEM is returned, the original folio might be
+ * split. The caller needs to check the input folio.
+ */
+static int __split_unmapped_folio(struct folio *folio, int new_order,
+ struct page *page, struct list_head *list, pgoff_t end,
+ struct xa_state *xas, struct address_space *mapping,
+ bool uniform_split)
+{
+ struct lruvec *lruvec;
+ struct address_space *swap_cache = NULL;
+ struct folio *origin_folio = folio;
+ struct folio *next_folio = folio_next(folio);
+ struct folio *new_folio;
+ struct folio *next;
+ int order = folio_order(folio);
+ int split_order;
+ int start_order = uniform_split ? new_order : order - 1;
+ int nr_dropped = 0;
+ int ret = 0;
+ bool stop_split = false;
+
+ if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
+ /* a swapcache folio can only be uniformly split to order-0 */
+ if (!uniform_split || new_order != 0)
+ return -EINVAL;
+
+ swap_cache = swap_address_space(folio->swap);
+ xa_lock(&swap_cache->i_pages);
+ }
+
+ if (folio_test_anon(folio))
+ mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
+
+ /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
+ lruvec = folio_lruvec_lock(folio);
+
+ /*
+ * split to new_order one order at a time. For uniform split,
+ * folio is split to new_order directly.
+ */
+ for (split_order = start_order;
+ split_order >= new_order && !stop_split;
+ split_order--) {
+ int old_order = folio_order(folio);
+ struct folio *release;
+ struct folio *end_folio = folio_next(folio);
+ int status;
+
+ /* order-1 anonymous folio is not supported */
+ if (folio_test_anon(folio) && split_order == 1)
+ continue;
+ if (uniform_split && split_order != new_order)
+ continue;
+
+ if (mapping) {
+ /*
+ * uniform split has xas_split_alloc() called before
+ * irq is disabled, since xas_nomem() might not be
+ * able to allocate enough memory.
+ */
+ if (uniform_split)
+ xas_split(xas, folio, old_order);
+ else {
+ xas_set_order(xas, folio->index, split_order);
+ xas_split_alloc(xas, folio, folio_order(folio),
+ GFP_NOWAIT);
+ if (xas_error(xas)) {
+ ret = xas_error(xas);
+ stop_split = true;
+ goto after_split;
+ }
+ xas_split(xas, folio, old_order);
+ }
+ }
+
+ /* complete memcg works before add pages to LRU */
+ split_page_memcg(&folio->page, old_order, split_order);
+ split_page_owner(&folio->page, old_order, split_order);
+ pgalloc_tag_split(folio, old_order, split_order);
+
+ status = __split_folio_to_order(folio, split_order);
+
+ if (status < 0) {
+ stop_split = true;
+ ret = -EINVAL;
+ }
+
+after_split:
+ /*
+ * Iterate through after-split folios and perform related
+ * operations. But in buddy allocator like split, the folio
+ * containing the specified page is skipped until its order
+ * is new_order, since the folio will be worked on in next
+ * iteration.
+ */
+ for (release = folio, next = folio_next(folio);
+ release != end_folio;
+ release = next, next = folio_next(next)) {
+ /*
+ * for buddy allocator like split, the folio containing
+ * page will be split next and should not be released,
+ * until the folio's order is new_order or stop_split
+ * is set to true by the above xas_split() failure.
+ */
+ if (release == page_folio(page)) {
+ folio = release;
+ if (split_order != new_order && !stop_split)
+ continue;
+ }
+ if (folio_test_anon(release)) {
+ mod_mthp_stat(folio_order(release),
+ MTHP_STAT_NR_ANON, 1);
+ }
+
+ /*
+ * Unfreeze refcount first. Additional reference from
+ * page cache.
+ */
+ folio_ref_unfreeze(release,
+ 1 + ((!folio_test_anon(origin_folio) ||
+ folio_test_swapcache(origin_folio)) ?
+ folio_nr_pages(release) : 0));
+
+ if (release != origin_folio)
+ lru_add_page_tail(origin_folio, &release->page,
+ lruvec, list);
+
+ /* Some pages can be beyond EOF: drop them from page cache */
+ if (release->index >= end) {
+ if (shmem_mapping(origin_folio->mapping))
+ nr_dropped += folio_nr_pages(release);
+ else if (folio_test_clear_dirty(release))
+ folio_account_cleaned(release,
+ inode_to_wb(origin_folio->mapping->host));
+ __filemap_remove_folio(release, NULL);
+ folio_put(release);
+ } else if (!folio_test_anon(release)) {
+ __xa_store(&origin_folio->mapping->i_pages,
+ release->index, &release->page, 0);
+ } else if (swap_cache) {
+ __xa_store(&swap_cache->i_pages,
+ swap_cache_index(release->swap),
+ &release->page, 0);
+ }
+ }
+ }
+
+ unlock_page_lruvec(lruvec);
+
+ if (folio_test_anon(origin_folio)) {
+ if (folio_test_swapcache(origin_folio))
+ xa_unlock(&swap_cache->i_pages);
+ } else
+ xa_unlock(&mapping->i_pages);
+
+ /* Caller disabled irqs, so they are still disabled here */
+ local_irq_enable();
+
+ if (nr_dropped)
+ shmem_uncharge(mapping->host, nr_dropped);
+
+ remap_page(origin_folio, 1 << order,
+ folio_test_anon(origin_folio) ?
+ RMP_USE_SHARED_ZEROPAGE : 0);
+
+ /*
+ * At this point, folio should contain the specified page.
+ * For uniform split, it is left for caller to unlock.
+ * For buddy allocator like split, the first after-split folio is left
+ * for caller to unlock.
+ */
+ for (new_folio = origin_folio, next = folio_next(origin_folio);
+ new_folio != next_folio;
+ new_folio = next, next = folio_next(next)) {
+ if (uniform_split && new_folio == folio)
+ continue;
+ if (!uniform_split && new_folio == origin_folio)
+ continue;
+
+ folio_unlock(new_folio);
+ /*
+ * Subpages may be freed if there wasn't any mapping
+ * like if add_to_swap() is running on a lru page that
+ * had its mapping zapped. And freeing these pages
+ * requires taking the lru_lock so we do the put_page
+ * of the tail pages after the split is complete.
+ */
+ free_page_and_swap_cache(&new_folio->page);
+ }
+ return ret;
+}
+
/*
* This function splits a large folio into smaller folios of order @new_order.
* @page can point to any page of the large folio to split. The split operation
--
2.45.2
* [PATCH v4 05/10] mm/huge_memory: move folio split common code to __folio_split()
From: Zi Yan @ 2025-01-06 16:55 UTC (permalink / raw)
To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
Zi Yan
This is a preparation patch for folio_split().
In the upcoming patch folio_split() will share folio unmapping and
remapping code with split_huge_page_to_list_to_order(), so move the code
to a common function __folio_split() first.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/huge_memory.c | 107 +++++++++++++++++++++++++----------------------
1 file changed, 57 insertions(+), 50 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8e743f81e76..586870e60003 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3724,57 +3724,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
return ret;
}
-/*
- * This function splits a large folio into smaller folios of order @new_order.
- * @page can point to any page of the large folio to split. The split operation
- * does not change the position of @page.
- *
- * Prerequisites:
- *
- * 1) The caller must hold a reference on the @page's owning folio, also known
- * as the large folio.
- *
- * 2) The large folio must be locked.
- *
- * 3) The folio must not be pinned. Any unexpected folio references, including
- * GUP pins, will result in the folio not getting split; instead, the caller
- * will receive an -EAGAIN.
- *
- * 4) @new_order > 1, usually. Splitting to order-1 anonymous folios is not
- * supported for non-file-backed folios, because folio->_deferred_list, which
- * is used by partially mapped folios, is stored in subpage 2, but an order-1
- * folio only has subpages 0 and 1. File-backed order-1 folios are supported,
- * since they do not use _deferred_list.
- *
- * After splitting, the caller's folio reference will be transferred to @page,
- * resulting in a raised refcount of @page after this call. The other pages may
- * be freed if they are not mapped.
- *
- * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
- *
- * Pages in @new_order will inherit the mapping, flags, and so on from the
- * huge page.
- *
- * Returns 0 if the huge page was split successfully.
- *
- * Returns -EAGAIN if the folio has unexpected reference (e.g., GUP) or if
- * the folio was concurrently removed from the page cache.
- *
- * Returns -EBUSY when trying to split the huge zeropage, if the folio is
- * under writeback, if fs-specific folio metadata cannot currently be
- * released, or if some unexpected race happened (e.g., anon VMA disappeared,
- * truncation).
- *
- * Callers should ensure that the order respects the address space mapping
- * min-order if one is set for non-anonymous folios.
- *
- * Returns -EINVAL when trying to split to an order that is incompatible
- * with the folio. Splitting to order 0 is compatible with all folios.
- */
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
- unsigned int new_order)
+static int __folio_split(struct folio *folio, unsigned int new_order,
+ struct page *page, struct list_head *list)
{
- struct folio *folio = page_folio(page);
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
/* reset xarray order to new order after split */
XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
@@ -3984,6 +3936,61 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
return ret;
}
+/*
+ * This function splits a large folio into smaller folios of order @new_order.
+ * @page can point to any page of the large folio to split. The split operation
+ * does not change the position of @page.
+ *
+ * Prerequisites:
+ *
+ * 1) The caller must hold a reference on the @page's owning folio, also known
+ * as the large folio.
+ *
+ * 2) The large folio must be locked.
+ *
+ * 3) The folio must not be pinned. Any unexpected folio references, including
+ * GUP pins, will result in the folio not getting split; instead, the caller
+ * will receive an -EAGAIN.
+ *
+ * 4) @new_order > 1, usually. Splitting to order-1 anonymous folios is not
+ * supported for non-file-backed folios, because folio->_deferred_list, which
+ * is used by partially mapped folios, is stored in subpage 2, but an order-1
+ * folio only has subpages 0 and 1. File-backed order-1 folios are supported,
+ * since they do not use _deferred_list.
+ *
+ * After splitting, the caller's folio reference will be transferred to @page,
+ * resulting in a raised refcount of @page after this call. The other pages may
+ * be freed if they are not mapped.
+ *
+ * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
+ *
+ * Pages in @new_order will inherit the mapping, flags, and so on from the
+ * huge page.
+ *
+ * Returns 0 if the huge page was split successfully.
+ *
+ * Returns -EAGAIN if the folio has unexpected reference (e.g., GUP) or if
+ * the folio was concurrently removed from the page cache.
+ *
+ * Returns -EBUSY when trying to split the huge zeropage, if the folio is
+ * under writeback, if fs-specific folio metadata cannot currently be
+ * released, or if some unexpected race happened (e.g., anon VMA disappeared,
+ * truncation).
+ *
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set for non-anonymous folios.
+ *
+ * Returns -EINVAL when trying to split to an order that is incompatible
+ * with the folio. Splitting to order 0 is compatible with all folios.
+ */
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+ unsigned int new_order)
+{
+ struct folio *folio = page_folio(page);
+
+ return __folio_split(folio, new_order, page, list);
+}
+
int min_order_for_split(struct folio *folio)
{
if (folio_test_anon(folio))
--
2.45.2
* [PATCH v4 06/10] mm/huge_memory: add buddy allocator like folio_split()
2025-01-06 16:55 [PATCH v4 00/10] Buddy allocator like folio split Zi Yan
` (4 preceding siblings ...)
2025-01-06 16:55 ` [PATCH v4 05/10] mm/huge_memory: move folio split common code to __folio_split() Zi Yan
@ 2025-01-06 16:55 ` Zi Yan
2025-01-06 16:55 ` [PATCH v4 07/10] mm/huge_memory: remove the old, unused __split_huge_page() Zi Yan
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Zi Yan @ 2025-01-06 16:55 UTC (permalink / raw)
To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
Zi Yan
folio_split() splits a large folio in the same way as the buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if the user wants to free the
3rd subpage in an order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon
O-1, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache
The anon case has no O-1 entry because anon folios do not support order-1 yet.
This generates fewer folios than the existing page split approach, which
splits the order-9 folio into 512 order-0 folios.
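As an illustration of the example above, here is a minimal userspace sketch, not the kernel implementation. It models a buddy allocator like split as repeated halving of the block that contains the target page. It assumes that the order-1 restriction for anon folios is handled by turning an order-2 block directly into four order-0 folios. The names buddy_split_orders() and emit_orders() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: emit, in ascending page order, the orders of the
 * folios that result from splitting the block [start, start + 2^order)
 * around page index @target down to @new_order.
 */
static size_t emit_orders(unsigned long start, unsigned int order,
			  unsigned long target, unsigned int new_order,
			  bool allow_order1, unsigned int *out, size_t n)
{
	unsigned int child;
	unsigned long half;

	if (order == new_order) {
		out[n++] = order;
		return n;
	}

	child = order - 1;
	if (child == 1 && !allow_order1) {
		/* Assumption: skip order-1, so order-2 becomes 4 x order-0. */
		for (int i = 0; i < 4; i++)
			out[n++] = 0;
		return n;
	}

	half = 1UL << child;
	if (target < start + half) {
		/* Target is in the lower half; the upper half stays whole. */
		n = emit_orders(start, child, target, new_order,
				allow_order1, out, n);
		out[n++] = child;
	} else {
		/* Target is in the upper half; the lower half stays whole. */
		out[n++] = child;
		n = emit_orders(start + half, child, target, new_order,
				allow_order1, out, n);
	}
	return n;
}

/* @out must have room for at least @order + 3 entries. */
size_t buddy_split_orders(unsigned int order, unsigned long target,
			  unsigned int new_order, bool is_anon,
			  unsigned int *out)
{
	assert(order < 32 && new_order <= order);
	assert(target < (1UL << order));
	assert(!(is_anon && new_order == 1)); /* anon order-1 is unsupported */

	return emit_orders(0, order, target, new_order, !is_anon, out, 0);
}
```

For an order-9 folio and target page index 2 (the 3rd subpage), this model yields the two order lists quoted above: 11 folios for anon and 10 for pagecache, compared with 512 for a uniform split to order-0.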
folio_split() and the existing split_huge_page_to_list_to_order() share
the folio unmapping and remapping code in __folio_split() and the common
backend split code in __split_unmapped_folio(). The uniform_split
parameter distinguishes the two operations.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/huge_memory.c | 58 +++++++++++++++++++++++++++++++++---------------
1 file changed, 40 insertions(+), 18 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 586870e60003..e5f70ff88a1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3725,11 +3725,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
}
static int __folio_split(struct folio *folio, unsigned int new_order,
- struct page *page, struct list_head *list)
+ struct page *page, struct list_head *list, bool uniform_split)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
- /* reset xarray order to new order after split */
- XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
+ XA_STATE(xas, &folio->mapping->i_pages, folio->index);
bool is_anon = folio_test_anon(folio);
struct address_space *mapping = NULL;
struct anon_vma *anon_vma = NULL;
@@ -3750,14 +3749,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
VM_WARN_ONCE(1, "Cannot split to order-1 folio");
return -EINVAL;
}
- } else if (new_order) {
+ } else {
/*
* No split if the file system does not support large folio.
* Note that we might still have THPs in such mappings due to
* CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
* does not actually support large folios properly.
*/
- if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+ if ((!uniform_split || new_order) &&
+ IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
!mapping_large_folio_support(folio->mapping)) {
VM_WARN_ONCE(1,
"Cannot split file folio to non-0 order");
@@ -3766,7 +3766,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
}
/* Only swapping a whole PMD-mapped folio is supported */
- if (folio_test_swapcache(folio) && new_order)
+ if (folio_test_swapcache(folio) && (!uniform_split || new_order))
return -EINVAL;
is_hzp = is_huge_zero_folio(folio);
@@ -3823,10 +3823,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
goto out;
}
- xas_split_alloc(&xas, folio, folio_order(folio), gfp);
- if (xas_error(&xas)) {
- ret = xas_error(&xas);
- goto out;
+ if (uniform_split) {
+ xas_set_order(&xas, folio->index, new_order);
+ xas_split_alloc(&xas, folio, folio_order(folio), gfp);
+ if (xas_error(&xas)) {
+ ret = xas_error(&xas);
+ goto out;
+ }
}
anon_vma = NULL;
@@ -3891,7 +3894,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
if (mapping) {
int nr = folio_nr_pages(folio);
- xas_split(&xas, folio, folio_order(folio));
if (folio_test_pmd_mappable(folio) &&
new_order < HPAGE_PMD_ORDER) {
if (folio_test_swapbacked(folio)) {
@@ -3905,12 +3907,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
}
}
- if (is_anon) {
- mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
- mod_mthp_stat(new_order, MTHP_STAT_NR_ANON, 1 << (order - new_order));
- }
- __split_huge_page(page, list, end, new_order);
- ret = 0;
+ ret = __split_unmapped_folio(page_folio(page), new_order,
+ page, list, end, &xas, mapping, uniform_split);
} else {
spin_unlock(&ds_queue->split_queue_lock);
fail:
@@ -3988,7 +3986,31 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
{
struct folio *folio = page_folio(page);
- return __folio_split(folio, new_order, page, list);
+ return __folio_split(folio, new_order, page, list, true);
+}
+
+/*
+ * folio_split: split a folio at offset_in_new_order to a new_order folio
+ * @folio: folio to split
+ * @new_order: the order of the new folio
+ * @page: a page within the new folio
+ *
+ * return: 0: successful, <0 failed (if -ENOMEM is returned, @folio might be
+ * split but not to @new_order, the caller needs to check)
+ *
+ * Split a folio at offset_in_new_order to a new_order folio, leaving the
+ * remaining subpages of the original folio as large as possible. For example,
+ * split an order-9 folio at its third order-3 subpage to an order-3 folio.
+ * There are 2^6=64 order-3 subpages in an order-9 folio and the result will be
+ * a set of folios of different orders, with the new folio in braces:
+ * [order-4, {order-3}, order-3, order-5, order-6, order-7, order-8].
+ *
+ * After split, folio is left locked for caller.
+ */
+int folio_split(struct folio *folio, unsigned int new_order,
+ struct page *page, struct list_head *list)
+{
+ return __folio_split(folio, new_order, page, list, false);
}
int min_order_for_split(struct folio *folio)
--
2.45.2
* [PATCH v4 07/10] mm/huge_memory: remove the old, unused __split_huge_page()
2025-01-06 16:55 [PATCH v4 00/10] Buddy allocator like folio split Zi Yan
` (5 preceding siblings ...)
2025-01-06 16:55 ` [PATCH v4 06/10] mm/huge_memory: add buddy allocator like folio_split() Zi Yan
@ 2025-01-06 16:55 ` Zi Yan
2025-01-06 16:55 ` [PATCH v4 08/10] mm/huge_memory: add folio_split() to debugfs testing interface Zi Yan
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Zi Yan @ 2025-01-06 16:55 UTC (permalink / raw)
To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
Zi Yan
Now that split_huge_page_to_list_to_order() uses the new backend split code
in __split_unmapped_folio(), the old __split_huge_page() and
__split_huge_page_tail() can be removed.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/huge_memory.c | 207 -----------------------------------------------
1 file changed, 207 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5f70ff88a1a..ec27287b7cbb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3153,213 +3153,6 @@ static void lru_add_page_tail(struct folio *folio, struct page *tail,
}
}
-static void __split_huge_page_tail(struct folio *folio, int tail,
- struct lruvec *lruvec, struct list_head *list,
- unsigned int new_order)
-{
- struct page *head = &folio->page;
- struct page *page_tail = head + tail;
- /*
- * Careful: new_folio is not a "real" folio before we cleared PageTail.
- * Don't pass it around before clear_compound_head().
- */
- struct folio *new_folio = (struct folio *)page_tail;
-
- VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
-
- /*
- * Clone page flags before unfreezing refcount.
- *
- * After successful get_page_unless_zero() might follow flags change,
- * for example lock_page() which set PG_waiters.
- *
- * Note that for mapped sub-pages of an anonymous THP,
- * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
- * the migration entry instead from where remap_page() will restore it.
- * We can still have PG_anon_exclusive set on effectively unmapped and
- * unreferenced sub-pages of an anonymous THP: we can simply drop
- * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
- */
- page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
- page_tail->flags |= (head->flags &
- ((1L << PG_referenced) |
- (1L << PG_swapbacked) |
- (1L << PG_swapcache) |
- (1L << PG_mlocked) |
- (1L << PG_uptodate) |
- (1L << PG_active) |
- (1L << PG_workingset) |
- (1L << PG_locked) |
- (1L << PG_unevictable) |
-#ifdef CONFIG_ARCH_USES_PG_ARCH_2
- (1L << PG_arch_2) |
-#endif
-#ifdef CONFIG_ARCH_USES_PG_ARCH_3
- (1L << PG_arch_3) |
-#endif
- (1L << PG_dirty) |
- LRU_GEN_MASK | LRU_REFS_MASK));
-
- /* ->mapping in first and second tail page is replaced by other uses */
- VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
- page_tail);
- new_folio->mapping = folio->mapping;
- new_folio->index = folio->index + tail;
-
- /*
- * page->private should not be set in tail pages. Fix up and warn once
- * if private is unexpectedly set.
- */
- if (unlikely(page_tail->private)) {
- VM_WARN_ON_ONCE_PAGE(true, page_tail);
- page_tail->private = 0;
- }
- if (folio_test_swapcache(folio))
- new_folio->swap.val = folio->swap.val + tail;
-
- /* Page flags must be visible before we make the page non-compound. */
- smp_wmb();
-
- /*
- * Clear PageTail before unfreezing page refcount.
- *
- * After successful get_page_unless_zero() might follow put_page()
- * which needs correct compound_head().
- */
- clear_compound_head(page_tail);
- if (new_order) {
- prep_compound_page(page_tail, new_order);
- folio_set_large_rmappable(new_folio);
- }
-
- /* Finally unfreeze refcount. Additional reference from page cache. */
- page_ref_unfreeze(page_tail,
- 1 + ((!folio_test_anon(folio) || folio_test_swapcache(folio)) ?
- folio_nr_pages(new_folio) : 0));
-
- if (folio_test_young(folio))
- folio_set_young(new_folio);
- if (folio_test_idle(folio))
- folio_set_idle(new_folio);
-
- folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
-
- /*
- * always add to the tail because some iterators expect new
- * pages to show after the currently processed elements - e.g.
- * migrate_pages
- */
- lru_add_page_tail(folio, page_tail, lruvec, list);
-}
-
-static void __split_huge_page(struct page *page, struct list_head *list,
- pgoff_t end, unsigned int new_order)
-{
- struct folio *folio = page_folio(page);
- struct page *head = &folio->page;
- struct lruvec *lruvec;
- struct address_space *swap_cache = NULL;
- unsigned long offset = 0;
- int i, nr_dropped = 0;
- unsigned int new_nr = 1 << new_order;
- int order = folio_order(folio);
- unsigned int nr = 1 << order;
-
- /* complete memcg works before add pages to LRU */
- split_page_memcg(head, order, new_order);
-
- if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
- offset = swap_cache_index(folio->swap);
- swap_cache = swap_address_space(folio->swap);
- xa_lock(&swap_cache->i_pages);
- }
-
- /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
- lruvec = folio_lruvec_lock(folio);
-
- ClearPageHasHWPoisoned(head);
-
- for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
- struct folio *tail;
- __split_huge_page_tail(folio, i, lruvec, list, new_order);
- tail = page_folio(head + i);
- /* Some pages can be beyond EOF: drop them from page cache */
- if (tail->index >= end) {
- if (shmem_mapping(folio->mapping))
- nr_dropped += new_nr;
- else if (folio_test_clear_dirty(tail))
- folio_account_cleaned(tail,
- inode_to_wb(folio->mapping->host));
- __filemap_remove_folio(tail, NULL);
- folio_put(tail);
- } else if (!folio_test_anon(folio)) {
- __xa_store(&folio->mapping->i_pages, tail->index,
- tail, 0);
- } else if (swap_cache) {
- __xa_store(&swap_cache->i_pages, offset + i,
- tail, 0);
- }
- }
-
- if (!new_order)
- ClearPageCompound(head);
- else {
- struct folio *new_folio = (struct folio *)head;
-
- folio_set_order(new_folio, new_order);
- }
- unlock_page_lruvec(lruvec);
- /* Caller disabled irqs, so they are still disabled here */
-
- split_page_owner(head, order, new_order);
- pgalloc_tag_split(folio, order, new_order);
-
- /* See comment in __split_huge_page_tail() */
- if (folio_test_anon(folio)) {
- /* Additional pin to swap cache */
- if (folio_test_swapcache(folio)) {
- folio_ref_add(folio, 1 + new_nr);
- xa_unlock(&swap_cache->i_pages);
- } else {
- folio_ref_inc(folio);
- }
- } else {
- /* Additional pin to page cache */
- folio_ref_add(folio, 1 + new_nr);
- xa_unlock(&folio->mapping->i_pages);
- }
- local_irq_enable();
-
- if (nr_dropped)
- shmem_uncharge(folio->mapping->host, nr_dropped);
- remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
-
- /*
- * set page to its compound_head when split to non order-0 pages, so
- * we can skip unlocking it below, since PG_locked is transferred to
- * the compound_head of the page and the caller will unlock it.
- */
- if (new_order)
- page = compound_head(page);
-
- for (i = 0; i < nr; i += new_nr) {
- struct page *subpage = head + i;
- struct folio *new_folio = page_folio(subpage);
- if (subpage == page)
- continue;
- folio_unlock(new_folio);
-
- /*
- * Subpages may be freed if there wasn't any mapping
- * like if add_to_swap() is running on a lru page that
- * had its mapping zapped. And freeing these pages
- * requires taking the lru_lock so we do the put_page
- * of the tail pages after the split is complete.
- */
- free_page_and_swap_cache(subpage);
- }
-}
-
/* Racy check whether the huge page can be split */
bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
{
--
2.45.2
* [PATCH v4 08/10] mm/huge_memory: add folio_split() to debugfs testing interface.
2025-01-06 16:55 [PATCH v4 00/10] Buddy allocator like folio split Zi Yan
` (6 preceding siblings ...)
2025-01-06 16:55 ` [PATCH v4 07/10] mm/huge_memory: remove the old, unused __split_huge_page() Zi Yan
@ 2025-01-06 16:55 ` Zi Yan
2025-01-06 16:55 ` [PATCH v4 09/10] mm/truncate: use folio_split() for truncate operation Zi Yan
2025-01-06 16:55 ` [PATCH v4 10/10] selftests/mm: add tests for folio_split(), buddy allocator like split Zi Yan
9 siblings, 0 replies; 11+ messages in thread
From: Zi Yan @ 2025-01-06 16:55 UTC (permalink / raw)
To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
Zi Yan
This allows testing folio_split() by passing an additional in-folio page
offset parameter to the split_huge_pages debugfs interface.
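The extended input format can be sketched from userspace as follows. This is only an illustration: format_split_cmd() is a hypothetical helper, and the field layout is taken from the "%d,0x%lx,0x%lx,%d,%ld" sscanf() format in the diff below. A negative offset is treated as "not given", which matches the default of -1 in split_huge_pages_write() and keeps the uniform split path.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the string a test would write to the
 * split_huge_pages debugfs file for the pid-based variant.
 * Returns the string length, or -1 if @buf is too small.
 */
int format_split_cmd(char *buf, size_t len, int pid,
		     unsigned long vaddr_start, unsigned long vaddr_end,
		     int new_order, long in_folio_offset)
{
	int n;

	assert(buf && len > 0);

	if (in_folio_offset < 0)
		/* Old four-field form: uniform split to @new_order. */
		n = snprintf(buf, len, "%d,0x%lx,0x%lx,%d",
			     pid, vaddr_start, vaddr_end, new_order);
	else
		/* New five-field form: folio_split() at the given page. */
		n = snprintf(buf, len, "%d,0x%lx,0x%lx,%d,%ld",
			     pid, vaddr_start, vaddr_end, new_order,
			     in_folio_offset);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

A caller would write the resulting buffer to the debugfs file. Offsets at or beyond the number of pages in a folio also fall back to the uniform split, per the range checks added in this patch.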
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/huge_memory.c | 47 ++++++++++++++++++++++++++++++++++-------------
1 file changed, 34 insertions(+), 13 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec27287b7cbb..50f268741c4f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4125,7 +4125,8 @@ static inline bool vma_not_suitable_for_thp_split(struct vm_area_struct *vma)
}
static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
- unsigned long vaddr_end, unsigned int new_order)
+ unsigned long vaddr_end, unsigned int new_order,
+ long in_folio_offset)
{
int ret = 0;
struct task_struct *task;
@@ -4209,8 +4210,16 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
if (!folio_test_anon(folio) && folio->mapping != mapping)
goto unlock;
- if (!split_folio_to_order(folio, target_order))
- split++;
+ if (in_folio_offset < 0 ||
+ in_folio_offset >= folio_nr_pages(folio)) {
+ if (!split_folio_to_order(folio, target_order))
+ split++;
+ } else {
+ struct page *split_at = folio_page(folio,
+ in_folio_offset);
+ if (!folio_split(folio, target_order, split_at, NULL))
+ split++;
+ }
unlock:
@@ -4233,7 +4242,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
}
static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
- pgoff_t off_end, unsigned int new_order)
+ pgoff_t off_end, unsigned int new_order,
+ long in_folio_offset)
{
struct filename *file;
struct file *candidate;
@@ -4282,8 +4292,15 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
if (folio->mapping != mapping)
goto unlock;
- if (!split_folio_to_order(folio, target_order))
- split++;
+ if (in_folio_offset < 0 || in_folio_offset >= nr_pages) {
+ if (!split_folio_to_order(folio, target_order))
+ split++;
+ } else {
+ struct page *split_at = folio_page(folio,
+ in_folio_offset);
+ if (!folio_split(folio, target_order, split_at, NULL))
+ split++;
+ }
unlock:
folio_unlock(folio);
@@ -4316,6 +4333,7 @@ static ssize_t split_huge_pages_write(struct file *file, const char __user *buf,
int pid;
unsigned long vaddr_start, vaddr_end;
unsigned int new_order = 0;
+ long in_folio_offset = -1;
ret = mutex_lock_interruptible(&split_debug_mutex);
if (ret)
@@ -4344,30 +4362,33 @@ static ssize_t split_huge_pages_write(struct file *file, const char __user *buf,
goto out;
}
- ret = sscanf(tok_buf, "0x%lx,0x%lx,%d", &off_start,
- &off_end, &new_order);
- if (ret != 2 && ret != 3) {
+ ret = sscanf(tok_buf, "0x%lx,0x%lx,%d,%ld", &off_start, &off_end,
+ &new_order, &in_folio_offset);
+ if (ret != 2 && ret != 3 && ret != 4) {
ret = -EINVAL;
goto out;
}
- ret = split_huge_pages_in_file(file_path, off_start, off_end, new_order);
+ ret = split_huge_pages_in_file(file_path, off_start, off_end,
+ new_order, in_folio_offset);
if (!ret)
ret = input_len;
goto out;
}
- ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d", &pid, &vaddr_start, &vaddr_end, &new_order);
+ ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d,%ld", &pid, &vaddr_start,
+ &vaddr_end, &new_order, &in_folio_offset);
if (ret == 1 && pid == 1) {
split_huge_pages_all();
ret = strlen(input_buf);
goto out;
- } else if (ret != 3 && ret != 4) {
+ } else if (ret != 3 && ret != 4 && ret != 5) {
ret = -EINVAL;
goto out;
}
- ret = split_huge_pages_pid(pid, vaddr_start, vaddr_end, new_order);
+ ret = split_huge_pages_pid(pid, vaddr_start, vaddr_end, new_order,
+ in_folio_offset);
if (!ret)
ret = strlen(input_buf);
out:
--
2.45.2
* [PATCH v4 09/10] mm/truncate: use folio_split() for truncate operation.
2025-01-06 16:55 [PATCH v4 00/10] Buddy allocator like folio split Zi Yan
` (7 preceding siblings ...)
2025-01-06 16:55 ` [PATCH v4 08/10] mm/huge_memory: add folio_split() to debugfs testing interface Zi Yan
@ 2025-01-06 16:55 ` Zi Yan
2025-01-06 16:55 ` [PATCH v4 10/10] selftests/mm: add tests for folio_split(), buddy allocator like split Zi Yan
9 siblings, 0 replies; 11+ messages in thread
From: Zi Yan @ 2025-01-06 16:55 UTC (permalink / raw)
To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
Zi Yan
Instead of splitting the large folio uniformly during truncation, use
buddy allocator like split at the start of the truncation range to minimize
the number of resulting folios.
For example, to truncate an order-4 folio
[0, 1, 2, 3, 4, 5, ..., 15]
between [3, 10] (inclusive), folio_split() splits the folio to
[0,1], [2], [3], [4..7], [8..15]. [3] and [4..7] can be dropped, and
[8..15] is kept with zeros in [8..10]. Then another folio_split() is
done at 10, so [8..10] can be dropped.
One possible optimization is to make folio_split() split a folio
based on a given range, like [3..10] above. But that complicates
folio_split(), so it will be investigated when necessary.
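The two-step split in the example can be traced with a small userspace model. This is a sketch under stated assumptions, not the kernel code: it assumes a pagecache folio (order-1 allowed) and a minimum order of 0. struct piece, split_at() and count_inside() are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a folio: first page index and order. */
struct piece {
	unsigned long start;
	unsigned int order;
};

/*
 * Buddy-like split of the folio in f[0..n) that contains page @at, down
 * to order 0 at @at. Keeps the array sorted by start and returns the new
 * count. The caller provides room for n + (order of that folio) entries.
 */
size_t split_at(struct piece *f, size_t n, unsigned long at)
{
	struct piece left[32], right[32], cur;
	size_t idx = 0, nl = 0, nr = 0, extra, o;

	while (idx < n && !(at >= f[idx].start &&
			    at < f[idx].start + (1UL << f[idx].order)))
		idx++;
	assert(idx < n);

	cur = f[idx];
	while (cur.order > 0) {
		unsigned int child = cur.order - 1;
		unsigned long half = 1UL << child;

		if (at < cur.start + half) {
			/* Upper half stays whole. */
			right[nr++] = (struct piece){ cur.start + half, child };
		} else {
			/* Lower half stays whole. */
			left[nl++] = (struct piece){ cur.start, child };
			cur.start += half;
		}
		cur.order = child;
	}

	/* Make room, then write: lower halves, target page, upper halves. */
	extra = nl + nr;
	for (size_t i = n; i > idx + 1; i--)
		f[i - 1 + extra] = f[i - 1];
	o = idx;
	for (size_t i = 0; i < nl; i++)
		f[o++] = left[i];
	f[o++] = cur;
	while (nr > 0)
		f[o++] = right[--nr];
	return n + extra;
}

/* Count folios that lie entirely inside [first, last] (inclusive). */
size_t count_inside(const struct piece *f, size_t n,
		    unsigned long first, unsigned long last)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long end = f[i].start + (1UL << f[i].order) - 1;

		if (f[i].start >= first && end <= last)
			count++;
	}
	return count;
}
```

Starting from one order-4 folio, split_at(..., 3) gives the five folios listed above, and a second split_at(..., 10) gives eight folios in total: [0,1], [2], [3], [4..7], [8,9], [10], [11], [12..15]. Four of them lie entirely inside [3, 10] and can be dropped, whereas a uniform split to order-0 would produce 16 folios.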
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
include/linux/huge_mm.h | 17 +++++++++++++++++
mm/truncate.c | 31 ++++++++++++++++++++++++++++++-
2 files changed, 47 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93e509b6c00e..e5693820cb0d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -341,6 +341,17 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order);
int min_order_for_split(struct folio *folio);
int split_folio_to_list(struct folio *folio, struct list_head *list);
+int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
+ struct list_head *list);
+static inline int split_folio_at(struct folio *folio, struct page *page,
+ struct list_head *list)
+{
+ int ret = min_order_for_split(folio);
+
+ if (ret < 0)
+ return ret;
+ return folio_split(folio, ret, page, list);
+}
static inline int split_huge_page(struct page *page)
{
struct folio *folio = page_folio(page);
@@ -533,6 +544,12 @@ static inline int split_folio_to_list(struct folio *folio, struct list_head *lis
return 0;
}
+static inline int split_folio_at(struct folio *folio, struct page *page,
+ struct list_head *list)
+{
+ return 0;
+}
+
static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
diff --git a/mm/truncate.c b/mm/truncate.c
index 7c304d2f0052..a8bb0c6e685d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -178,6 +178,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
{
loff_t pos = folio_pos(folio);
unsigned int offset, length;
+ struct page *split_at, *split_at2;
if (pos < start)
offset = start - pos;
@@ -207,8 +208,36 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
folio_invalidate(folio, offset, length);
if (!folio_test_large(folio))
return true;
- if (split_folio(folio) == 0)
+
+ split_at = folio_page(folio, PAGE_ALIGN_DOWN(offset) / PAGE_SIZE);
+ split_at2 = folio_page(folio,
+ PAGE_ALIGN_DOWN(offset + length) / PAGE_SIZE);
+
+ if (!split_folio_at(folio, split_at, NULL)) {
+ /*
+ * try to split at offset + length to make sure folios within
+ * the range can be dropped, especially to avoid memory waste
+ * for shmem truncate
+ */
+ struct folio *folio2 = page_folio(split_at2);
+
+ if (!folio_try_get(folio2))
+ goto no_split;
+
+ if (!folio_test_large(folio2))
+ goto out;
+
+ if (!folio_trylock(folio2))
+ goto out;
+
+ /* split result does not matter here */
+ split_folio_at(folio2, split_at2, NULL);
+ folio_unlock(folio2);
+out:
+ folio_put(folio2);
+no_split:
return true;
+ }
if (folio_test_dirty(folio))
return false;
truncate_inode_folio(folio->mapping, folio);
--
2.45.2
* [PATCH v4 10/10] selftests/mm: add tests for folio_split(), buddy allocator like split.
2025-01-06 16:55 [PATCH v4 00/10] Buddy allocator like folio split Zi Yan
` (8 preceding siblings ...)
2025-01-06 16:55 ` [PATCH v4 09/10] mm/truncate: use folio_split() for truncate operation Zi Yan
@ 2025-01-06 16:55 ` Zi Yan
9 siblings, 0 replies; 11+ messages in thread
From: Zi Yan @ 2025-01-06 16:55 UTC (permalink / raw)
To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
Zi Yan
The new tests split page cache folios to orders from 0 to 8 at different
in-folio offsets.
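The number of new test cases follows from the loop bounds in the diff below. The sketch here replays that loop in isolation. count_offset_tests() and max_ul() are hypothetical names, and the 2 MiB PMD size and 4 KiB page size in the usage note are assumptions (typical for x86-64), not values taken from the patch.

```c
#include <assert.h>

/* Local stand-in for the MAX() macro that the test takes from sys/param.h. */
static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical helper: count how many (order, offset) pairs the nested
 * loop in main() visits for the given PMD and base page sizes.
 */
unsigned int count_offset_tests(unsigned long pmd_pagesize,
				unsigned long pagesize)
{
	unsigned long nr_pages;
	unsigned int count = 0;

	assert(pagesize > 0 && pmd_pagesize >= pagesize);
	nr_pages = pmd_pagesize / pagesize;

	for (int order = 0; order < 9; order++)
		for (unsigned long offset = 0; offset < nr_pages;
		     offset += max_ul(nr_pages / 4, 1UL << order))
			count++;

	return count;
}
```

With 512 pages per PMD, orders 0 to 7 step by 128 pages (four offsets each) and order 8 steps by 256 pages (two offsets), which gives the 8*4+2 term added to ksft_set_plan().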
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
.../selftests/mm/split_huge_page_test.c | 29 ++++++++++++++-----
1 file changed, 22 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index 5bb159ebc83d..1af8d6fa4465 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -14,6 +14,7 @@
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/mount.h>
+#include <sys/param.h>
#include <malloc.h>
#include <stdbool.h>
#include <time.h>
@@ -420,7 +421,8 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd,
return -1;
}
-void split_thp_in_pagecache_to_order(size_t fd_size, int order, const char *fs_loc)
+void split_thp_in_pagecache_to_order_at(size_t fd_size, const char *fs_loc,
+ int order, int offset)
{
int fd;
char *addr;
@@ -438,7 +440,12 @@ void split_thp_in_pagecache_to_order(size_t fd_size, int order, const char *fs_l
return;
err = 0;
- write_debugfs(PID_FMT, getpid(), (uint64_t)addr, (uint64_t)addr + fd_size, order);
+ if (offset == -1)
+ write_debugfs(PID_FMT, getpid(), (uint64_t)addr,
+ (uint64_t)addr + fd_size, order);
+ else
+ write_debugfs(PID_FMT, getpid(), (uint64_t)addr,
+ (uint64_t)addr + fd_size, order, offset);
for (i = 0; i < fd_size; i++)
if (*(addr + i) != (char)i) {
@@ -458,8 +465,8 @@ void split_thp_in_pagecache_to_order(size_t fd_size, int order, const char *fs_l
close(fd);
unlink(testfile);
if (err)
- ksft_exit_fail_msg("Split PMD-mapped pagecache folio to order %d failed\n", order);
- ksft_test_result_pass("Split PMD-mapped pagecache folio to order %d passed\n", order);
+ ksft_exit_fail_msg("Split PMD-mapped pagecache folio to order %d at in-folio offset %d failed\n", order, offset);
+ ksft_test_result_pass("Split PMD-mapped pagecache folio to order %d at in-folio offset %d passed\n", order, offset);
}
int main(int argc, char **argv)
@@ -470,6 +477,7 @@ int main(int argc, char **argv)
char fs_loc_template[] = "/tmp/thp_fs_XXXXXX";
const char *fs_loc;
bool created_tmp;
+ int offset;
ksft_print_header();
@@ -481,7 +489,7 @@ int main(int argc, char **argv)
if (argc > 1)
optional_xfs_path = argv[1];
- ksft_set_plan(1+9+2+9);
+ ksft_set_plan(1+8+2+9+8*4+2);
pagesize = getpagesize();
pageshift = ffs(pagesize) - 1;
@@ -494,7 +502,8 @@ int main(int argc, char **argv)
split_pmd_zero_pages();
for (i = 0; i < 9; i++)
- split_pmd_thp_to_order(i);
+ if (i != 1)
+ split_pmd_thp_to_order(i);
split_pte_mapped_thp();
split_file_backed_thp();
@@ -502,7 +511,13 @@ int main(int argc, char **argv)
created_tmp = prepare_thp_fs(optional_xfs_path, fs_loc_template,
&fs_loc);
for (i = 8; i >= 0; i--)
- split_thp_in_pagecache_to_order(fd_size, i, fs_loc);
+ split_thp_in_pagecache_to_order_at(fd_size, fs_loc, i, -1);
+
+ for (i = 0; i < 9; i++)
+ for (offset = 0;
+ offset < pmd_pagesize / pagesize;
+ offset += MAX(pmd_pagesize / pagesize / 4, 1 << i))
+ split_thp_in_pagecache_to_order_at(fd_size, fs_loc, i, offset);
cleanup_thp_fs(fs_loc, created_tmp);
ksft_finished();
--
2.45.2