linux-mm.kvack.org archive mirror
* [PATCH v2 0/6] Buddy allocator like folio split
@ 2024-11-01 15:03 Zi Yan
  2024-11-01 15:03 ` [PATCH v2 1/6] mm/huge_memory: add two new (yet used) functions for folio_split() Zi Yan
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Zi Yan @ 2024-11-01 15:03 UTC (permalink / raw)
  To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
	Zi Yan

Hi all,

This patchset adds a new buddy allocator like large folio split to minimize
the total number of resulting folios and the amount of memory needed for a
multi-index xarray split, and to keep more large folios after a split. It
is on top of mm-everything-2024-11-01-04-30.


Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split()
can support both uniform split and buddy allocator like split. All
existing split_huge_page*() users can be gradually converted to use
folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().

The THP selftests passed for split_huge_page*() runs, and I also tested
folio_split() with anon large folios, pagecache folios, and truncation.
I will run more extensive tests.

Changelog
===
From V1[2]:
1. Split the original patch 1 into multiple ones for easy review (per
   Kirill).
2. Added xas_destroy() to avoid a memory leak.
3. Fixed the unused nr_dropped error (per kernel test robot).
4. Added proper error handling when xas_nomem() fails to allocate memory
   for xas_split() during the buddy allocator like split.

From RFC[1]:
1. Merged backend code of split_huge_page_to_list_to_order() and
   folio_split(). The same code is used for both uniform split and buddy
   allocator like split.
2. Use xas_nomem() instead of xas_split_alloc() for folio_split().
3. folio_split() now leaves the first after-split folio locked for the
   caller to unlock, instead of the one containing the given page, since
   the caller of truncate_inode_partial_folio() locks and unlocks the
   first folio.
4. Extended split_huge_page debugfs to use folio_split().
5. Added truncate_inode_partial_folio() as first user of folio_split().


Design
===

folio_split() splits a large folio in the same way as the buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if a user wants to free the
3rd subpage in an order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
O-1,      O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache,
since anon folios do not support order-1 yet.
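
For reference, the order sequence above can be reproduced with a simple
buddy-style walk. The sketch below is illustrative user-space code, not
part of the patchset; it prints the orders produced when an order-9 folio
is split down to order-0 around a given in-folio offset, matching the
pagecache case (an anon folio would additionally replace the order-1 piece
with two order-0 pieces):

#include <stdio.h>

/* List the orders produced when an order @order folio is split buddy-style
 * so that the page at in-folio offset @offset ends up in an order
 * @new_order folio. */
static void buddy_split_orders(int order, int new_order, long offset)
{
	long base = 0;	/* start offset of the piece still being split */

	for (int split = order - 1; split >= new_order; split--) {
		long half = 1L << split;

		if (offset < base + half) {
			/* the page is in the lower half; the upper half is done */
			printf("O-%d at %ld\n", split, base + half);
		} else {
			/* the page is in the upper half; the lower half is done */
			printf("O-%d at %ld\n", split, base);
			base += half;
		}
	}
	printf("O-%d at %ld (contains the page)\n", new_order, base);
}

int main(void)
{
	buddy_split_orders(9, 0, 2);	/* free the 3rd subpage of an order-9 folio */
	return 0;
}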

The split process is similar to the existing approach:
1. Unmap all page mappings (split PMD mappings if they exist);
2. Split metadata like memcg, page owner, and page alloc tag;
3. Copy metadata in struct folio to sub pages; but instead of splitting
   the whole folio into multiple smaller ones of the same order in one
   shot, this approach splits the folio iteratively. Taking the example
   above, it first splits the original order-9 into two order-8s, then
   splits the left order-8 into two order-7s, and so on;
4. Post-process the split folios, e.g., write mapping->i_pages for
   pagecache, adjust folio refcounts, and add split folios to the
   corresponding list;
5. Remap the split folios;
6. Unlock the split folios.


__folio_split_without_mapping() and __split_folio_to_order() replace
__split_huge_page() and __split_huge_page_tail() respectively.
__folio_split_without_mapping() uses different approaches to perform
uniform split and buddy allocator like split:
1. uniform split: a single call to __split_folio_to_order() is used to
   uniformly split the given folio. All resulting folios are put back on
   the list after the split. The folio containing the given page is left
   for the caller to unlock; the others are unlocked.

2. buddy allocator like split: old_order - new_order calls to
   __split_folio_to_order() are used to split the given folio from order N
   to order N-1, one order at a time. After each call, the target folio is
   changed to the one containing the page, which is given via folio_split()'s
   parameters, and the folios not containing the page are put back on the
   list. The folio containing the page is put back on the list once its
   order reaches new_order. All folios are unlocked except the first folio,
   which is left for the caller to unlock.
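
As a rough sketch of the control flow described above (illustrative only;
the real code in patch 1 also handles the xarray, memcg, page owner,
refcounts, stats, and locking, and folio_containing_page() is a
hypothetical helper, not an existing kernel function):

	if (uniform_split) {
		/* a single call straight down to new_order */
		__split_folio_to_order(folio, new_order);
	} else {
		/* one order at a time, always continuing with the piece
		 * that contains @page */
		for (order = folio_order(folio) - 1; order >= new_order; order--) {
			__split_folio_to_order(folio, order);
			/* hypothetical: after-split folio that contains @page */
			folio = folio_containing_page(folio, page);
		}
	}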


Patch Overview
===
1. Patch 1 adds __folio_split_without_mapping() and
   __split_folio_to_order() to prepare for moving to the new backend split
   code.

2. Patch 2 replaces __split_huge_page() with
   __folio_split_without_mapping() in split_huge_page_to_list_to_order().

3. Patch 3 adds the new folio_split().

4. Patch 4 removes __split_huge_page() and __split_huge_page_tail().

5. Patch 5 adds a new in_folio_offset parameter to the split_huge_page
   debugfs interface for folio_split() testing.

6. Patch 6 uses folio_split() for the truncate operation.



Any comments and/or suggestions are welcome. Thanks.

[1] https://lore.kernel.org/linux-mm/20241008223748.555845-1-ziy@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20241028180932.1319265-1-ziy@nvidia.com/

Zi Yan (6):
  mm/huge_memory: add two new (yet used) functions for folio_split()
  mm/huge_memory: move folio split common code to __folio_split()
  mm/huge_memory: add buddy allocator like folio_split()
  mm/huge_memory: remove the old, unused __split_huge_page()
  mm/huge_memory: add folio_split() to debugfs testing interface.
  mm/truncate: use folio_split() for truncate operation.

 include/linux/huge_mm.h |  12 +
 mm/huge_memory.c        | 664 +++++++++++++++++++++++++---------------
 mm/truncate.c           |   5 +-
 3 files changed, 435 insertions(+), 246 deletions(-)

-- 
2.45.2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/6] mm/huge_memory: add two new (yet used) functions for folio_split()
  2024-11-01 15:03 [PATCH v2 0/6] Buddy allocator like folio split Zi Yan
@ 2024-11-01 15:03 ` Zi Yan
  2024-11-06 10:44   ` Kirill A . Shutemov
  2024-11-01 15:03 ` [PATCH v2 2/6] mm/huge_memory: move folio split common code to __folio_split() Zi Yan
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Zi Yan @ 2024-11-01 15:03 UTC (permalink / raw)
  To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
	Zi Yan

This is a preparation patch, both added functions are not used yet.

The added __folio_split_without_mapping() is able to split a folio with
its mapping removed in two manners: 1) uniform split (the existing way),
and 2) buddy allocator like split.

The added __split_folio_to_order() can split a folio into any lower order.
For uniform split, __folio_split_without_mapping() calls it once to split
the given folio to the new order. For buddy allocator split,
__folio_split_without_mapping() calls it (folio_order - new_order) times
and each time splits the folio containing the given page to one lower
order.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 328 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 327 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f92068864469..f7649043ddb7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3135,7 +3135,6 @@ static void remap_page(struct folio *folio, unsigned long nr, int flags)
 static void lru_add_page_tail(struct folio *folio, struct page *tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
-	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
 	VM_BUG_ON_FOLIO(PageLRU(tail), folio);
 	lockdep_assert_held(&lruvec->lru_lock);
 
@@ -3379,6 +3378,333 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 					caller_pins;
 }
 
+static long page_in_folio_offset(struct page *page, struct folio *folio)
+{
+	long nr_pages = folio_nr_pages(folio);
+	unsigned long pages_pfn = page_to_pfn(page);
+	unsigned long folios_pfn = folio_pfn(folio);
+
+	if (pages_pfn >= folios_pfn && pages_pfn < (folios_pfn + nr_pages))
+		return pages_pfn - folios_pfn;
+
+	return -EINVAL;
+}
+
+/*
+ * It splits @folio into @new_order folios and copies the @folio metadata to
+ * all the resulting folios.
+ */
+static int __split_folio_to_order(struct folio *folio, int new_order)
+{
+	int curr_order = folio_order(folio);
+	long nr_pages = folio_nr_pages(folio);
+	long new_nr_pages = 1 << new_order;
+	long index;
+
+	if (curr_order <= new_order)
+		return -EINVAL;
+
+	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
+		struct page *head = &folio->page;
+		struct page *second_head = head + index;
+
+		/*
+		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
+		 * Don't pass it around before clear_compound_head().
+		 */
+		struct folio *new_folio = (struct folio *)second_head;
+
+		VM_BUG_ON_PAGE(atomic_read(&second_head->_mapcount) != -1, second_head);
+
+		/*
+		 * Clone page flags before unfreezing refcount.
+		 *
+		 * After successful get_page_unless_zero() might follow flags change,
+		 * for example lock_page() which set PG_waiters.
+		 *
+		 * Note that for mapped sub-pages of an anonymous THP,
+		 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
+		 * the migration entry instead from where remap_page() will restore it.
+		 * We can still have PG_anon_exclusive set on effectively unmapped and
+		 * unreferenced sub-pages of an anonymous THP: we can simply drop
+		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
+		 */
+		second_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		second_head->flags |= (head->flags &
+				((1L << PG_referenced) |
+				 (1L << PG_swapbacked) |
+				 (1L << PG_swapcache) |
+				 (1L << PG_mlocked) |
+				 (1L << PG_uptodate) |
+				 (1L << PG_active) |
+				 (1L << PG_workingset) |
+				 (1L << PG_locked) |
+				 (1L << PG_unevictable) |
+#ifdef CONFIG_ARCH_USES_PG_ARCH_2
+				 (1L << PG_arch_2) |
+#endif
+#ifdef CONFIG_ARCH_USES_PG_ARCH_3
+				 (1L << PG_arch_3) |
+#endif
+				 (1L << PG_dirty) |
+				 LRU_GEN_MASK | LRU_REFS_MASK));
+
+		/* ->mapping in first and second tail page is replaced by other uses */
+		VM_BUG_ON_PAGE(new_nr_pages > 2 && second_head->mapping != TAIL_MAPPING,
+			       second_head);
+		second_head->mapping = head->mapping;
+		second_head->index = head->index + index;
+
+		/*
+		 * page->private should not be set in tail pages. Fix up and warn once
+		 * if private is unexpectedly set.
+		 */
+		if (unlikely(second_head->private)) {
+			VM_WARN_ON_ONCE_PAGE(true, second_head);
+			second_head->private = 0;
+		}
+		if (folio_test_swapcache(folio))
+			new_folio->swap.val = folio->swap.val + index;
+
+		/* Page flags must be visible before we make the page non-compound. */
+		smp_wmb();
+
+		/*
+		 * Clear PageTail before unfreezing page refcount.
+		 *
+		 * After successful get_page_unless_zero() might follow put_page()
+		 * which needs correct compound_head().
+		 */
+		clear_compound_head(second_head);
+		if (new_order) {
+			prep_compound_page(second_head, new_order);
+			folio_set_large_rmappable(new_folio);
+
+			folio_set_order(folio, new_order);
+		} else {
+			if (PageHead(head))
+				ClearPageCompound(head);
+		}
+
+		if (folio_test_young(folio))
+			folio_set_young(new_folio);
+		if (folio_test_idle(folio))
+			folio_set_idle(new_folio);
+
+		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
+	}
+
+	return 0;
+}
+
+#define for_each_folio_until_end_safe(iter, iter2, start, end)	\
+	for (iter = start, iter2 = folio_next(start);		\
+	     iter != end;					\
+	     iter = iter2, iter2 = folio_next(iter2))
+
+/*
+ * It splits a @folio (without mapping) to lower order smaller folios in two
+ * ways.
+ * 1. uniform split: the given @folio into multiple @new_order small folios,
+ *    where all small folios have the same order. This is done when
+ *    uniform_split is true.
+ * 2. buddy allocator like split: the given @folio is split into half and one
+ *    of the half (containing the given page) is split into half until the
+ *    given @page's order becomes @new_order. This is done when uniform_split is
+ *    false.
+ *
+ * The high level flow for these two methods are:
+ * 1. uniform split: a single __split_folio_to_order() is called to split the
+ *    @folio into @new_order, then we traverse all the resulting folios one by
+ *    one in PFN ascending order and perform stats, unfreeze, adding to list,
+ *    and file mapping index operations.
+ * 2. buddy allocator like split: in general, folio_order - @new_order calls to
+ *    __split_folio_to_order() are called in the for loop to split the @folio
+ *    to one lower order at a time. The resulting small folios are processed
+ *    like what is done during the traversal in 1, except the one containing
+ *    @page, which is split in next for loop.
+ *
+ * After splitting, the caller's folio reference will be transferred to the
+ * folio containing @page. The other folios may be freed if they are not mapped.
+ *
+ * In terms of locking, after splitting,
+ * 1. uniform split leaves @page (or the folio contains it) locked;
+ * 2. buddy allocator like split leaves @folio locked.
+ *
+ * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
+ *
+ * For !uniform_split, when -ENOMEM is returned, the original folio might be
+ * split. The caller needs to check the input folio.
+ */
+static int __folio_split_without_mapping(struct folio *folio, int new_order,
+		struct page *page, struct list_head *list, pgoff_t end,
+		struct xa_state *xas, struct address_space *mapping,
+		bool uniform_split)
+{
+	struct lruvec *lruvec;
+	struct address_space *swap_cache = NULL;
+	struct folio *origin_folio = folio;
+	struct folio *next_folio = folio_next(folio);
+	struct folio *new_folio;
+	struct folio *next;
+	int order = folio_order(folio);
+	int split_order = order - 1;
+	int nr_dropped = 0;
+	int ret = 0;
+
+	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
+		if (!uniform_split)
+			return -EINVAL;
+
+		swap_cache = swap_address_space(folio->swap);
+		xa_lock(&swap_cache->i_pages);
+	}
+
+	if (folio_test_anon(folio))
+		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
+
+	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
+	lruvec = folio_lruvec_lock(folio);
+
+	/*
+	 * split to new_order one order at a time. For uniform split,
+	 * intermediate orders are skipped
+	 */
+	for (split_order = order - 1; split_order >= new_order; split_order--) {
+		int old_order = folio_order(folio);
+		struct folio *release;
+		struct folio *end_folio = folio_next(folio);
+		int status;
+		bool stop_split = false;
+
+		if (folio_test_anon(folio) && split_order == 1)
+			continue;
+		if (uniform_split && split_order != new_order)
+			continue;
+
+		if (mapping) {
+			/*
+			 * uniform split has xas_split_alloc() called before
+			 * irq is disabled, since xas_nomem() might not be
+			 * able to allocate enough memory.
+			 */
+			if (uniform_split)
+				xas_split(xas, folio, old_order);
+			else {
+				xas_set_order(xas, folio->index, split_order);
+				xas_set_err(xas, -ENOMEM);
+				if (xas_nomem(xas, 0))
+					xas_split(xas, folio, old_order);
+				else {
+					stop_split = true;
+					ret = -ENOMEM;
+					goto after_split;
+				}
+			}
+		}
+
+		split_page_memcg(&folio->page, old_order, split_order);
+		split_page_owner(&folio->page, old_order, split_order);
+		pgalloc_tag_split(folio, old_order, split_order);
+
+		status = __split_folio_to_order(folio, split_order);
+
+		if (status < 0)
+			return status;
+
+after_split:
+		/*
+		 * Iterate through after-split folios and perform related
+		 * operations. But in buddy allocator like split, the folio
+		 * containing the specified page is skipped until its order
+		 * is new_order, since the folio will be worked on in next
+		 * iteration.
+		 */
+		for_each_folio_until_end_safe(release, next, folio, end_folio) {
+			if (page_in_folio_offset(page, release) >= 0) {
+				folio = release;
+				if (split_order != new_order && !stop_split)
+					continue;
+			}
+			if (folio_test_anon(release))
+				mod_mthp_stat(folio_order(release),
+						MTHP_STAT_NR_ANON, 1);
+
+			/*
+			 * Unfreeze refcount first. Additional reference from
+			 * page cache.
+			 */
+			folio_ref_unfreeze(release,
+				1 + ((!folio_test_anon(origin_folio) ||
+				     folio_test_swapcache(origin_folio)) ?
+					     folio_nr_pages(release) : 0));
+
+			if (release != origin_folio)
+				lru_add_page_tail(origin_folio, &release->page,
+						lruvec, list);
+
+			/* Some pages can be beyond EOF: drop them from page cache */
+			if (release->index >= end) {
+				if (shmem_mapping(origin_folio->mapping))
+					nr_dropped++;
+				else if (folio_test_clear_dirty(release))
+					folio_account_cleaned(release,
+						inode_to_wb(origin_folio->mapping->host));
+				__filemap_remove_folio(release, NULL);
+				folio_put(release);
+			} else if (!folio_test_anon(release)) {
+				__xa_store(&origin_folio->mapping->i_pages,
+						release->index, &release->page, 0);
+			} else if (swap_cache) {
+				__xa_store(&swap_cache->i_pages,
+						swap_cache_index(release->swap),
+						&release->page, 0);
+			}
+		}
+		xas_destroy(xas);
+	}
+
+	unlock_page_lruvec(lruvec);
+
+	if (folio_test_anon(origin_folio)) {
+		if (folio_test_swapcache(origin_folio))
+			xa_unlock(&swap_cache->i_pages);
+	} else
+		xa_unlock(&mapping->i_pages);
+
+	/* Caller disabled irqs, so they are still disabled here */
+	local_irq_enable();
+
+	if (nr_dropped)
+		shmem_uncharge(mapping->host, nr_dropped);
+
+	remap_page(origin_folio, 1 << order,
+			folio_test_anon(origin_folio) ?
+				RMP_USE_SHARED_ZEROPAGE : 0);
+
+	/*
+	 * At this point, folio should contain the specified page, so that it
+	 * will be left to the caller to unlock it.
+	 */
+	for_each_folio_until_end_safe(new_folio, next, origin_folio, next_folio) {
+		if (uniform_split && new_folio == folio)
+			continue;
+		if (!uniform_split && new_folio == origin_folio)
+			continue;
+
+		folio_unlock(new_folio);
+		/*
+		 * Subpages may be freed if there wasn't any mapping
+		 * like if add_to_swap() is running on a lru page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		free_page_and_swap_cache(&new_folio->page);
+	}
+	return ret;
+}
+
 /*
  * This function splits a large folio into smaller folios of order @new_order.
  * @page can point to any page of the large folio to split. The split operation
-- 
2.45.2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 2/6] mm/huge_memory: move folio split common code to __folio_split()
  2024-11-01 15:03 [PATCH v2 0/6] Buddy allocator like folio split Zi Yan
  2024-11-01 15:03 ` [PATCH v2 1/6] mm/huge_memory: add two new (yet used) functions for folio_split() Zi Yan
@ 2024-11-01 15:03 ` Zi Yan
  2024-11-01 15:03 ` [PATCH v2 3/6] mm/huge_memory: add buddy allocator like folio_split() Zi Yan
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Zi Yan @ 2024-11-01 15:03 UTC (permalink / raw)
  To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
	Zi Yan

This is a preparation patch for folio_split().

In the upcoming patch folio_split() will share folio unmapping and
remapping code with split_huge_page_to_list_to_order(), so move the code
to a common function __folio_split() first.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 107 +++++++++++++++++++++++++----------------------
 1 file changed, 57 insertions(+), 50 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f7649043ddb7..63ca870ca3fb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3705,57 +3705,9 @@ static int __folio_split_without_mapping(struct folio *folio, int new_order,
 	return ret;
 }
 
-/*
- * This function splits a large folio into smaller folios of order @new_order.
- * @page can point to any page of the large folio to split. The split operation
- * does not change the position of @page.
- *
- * Prerequisites:
- *
- * 1) The caller must hold a reference on the @page's owning folio, also known
- *    as the large folio.
- *
- * 2) The large folio must be locked.
- *
- * 3) The folio must not be pinned. Any unexpected folio references, including
- *    GUP pins, will result in the folio not getting split; instead, the caller
- *    will receive an -EAGAIN.
- *
- * 4) @new_order > 1, usually. Splitting to order-1 anonymous folios is not
- *    supported for non-file-backed folios, because folio->_deferred_list, which
- *    is used by partially mapped folios, is stored in subpage 2, but an order-1
- *    folio only has subpages 0 and 1. File-backed order-1 folios are supported,
- *    since they do not use _deferred_list.
- *
- * After splitting, the caller's folio reference will be transferred to @page,
- * resulting in a raised refcount of @page after this call. The other pages may
- * be freed if they are not mapped.
- *
- * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
- *
- * Pages in @new_order will inherit the mapping, flags, and so on from the
- * huge page.
- *
- * Returns 0 if the huge page was split successfully.
- *
- * Returns -EAGAIN if the folio has unexpected reference (e.g., GUP) or if
- * the folio was concurrently removed from the page cache.
- *
- * Returns -EBUSY when trying to split the huge zeropage, if the folio is
- * under writeback, if fs-specific folio metadata cannot currently be
- * released, or if some unexpected race happened (e.g., anon VMA disappeared,
- * truncation).
- *
- * Callers should ensure that the order respects the address space mapping
- * min-order if one is set for non-anonymous folios.
- *
- * Returns -EINVAL when trying to split to an order that is incompatible
- * with the folio. Splitting to order 0 is compatible with all folios.
- */
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-				     unsigned int new_order)
+static int __folio_split(struct folio *folio, unsigned int new_order,
+		struct page *page, struct list_head *list)
 {
-	struct folio *folio = page_folio(page);
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	/* reset xarray order to new order after split */
 	XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
@@ -3971,6 +3923,61 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
+/*
+ * This function splits a large folio into smaller folios of order @new_order.
+ * @page can point to any page of the large folio to split. The split operation
+ * does not change the position of @page.
+ *
+ * Prerequisites:
+ *
+ * 1) The caller must hold a reference on the @page's owning folio, also known
+ *    as the large folio.
+ *
+ * 2) The large folio must be locked.
+ *
+ * 3) The folio must not be pinned. Any unexpected folio references, including
+ *    GUP pins, will result in the folio not getting split; instead, the caller
+ *    will receive an -EAGAIN.
+ *
+ * 4) @new_order > 1, usually. Splitting to order-1 anonymous folios is not
+ *    supported for non-file-backed folios, because folio->_deferred_list, which
+ *    is used by partially mapped folios, is stored in subpage 2, but an order-1
+ *    folio only has subpages 0 and 1. File-backed order-1 folios are supported,
+ *    since they do not use _deferred_list.
+ *
+ * After splitting, the caller's folio reference will be transferred to @page,
+ * resulting in a raised refcount of @page after this call. The other pages may
+ * be freed if they are not mapped.
+ *
+ * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
+ *
+ * Pages in @new_order will inherit the mapping, flags, and so on from the
+ * huge page.
+ *
+ * Returns 0 if the huge page was split successfully.
+ *
+ * Returns -EAGAIN if the folio has unexpected reference (e.g., GUP) or if
+ * the folio was concurrently removed from the page cache.
+ *
+ * Returns -EBUSY when trying to split the huge zeropage, if the folio is
+ * under writeback, if fs-specific folio metadata cannot currently be
+ * released, or if some unexpected race happened (e.g., anon VMA disappeared,
+ * truncation).
+ *
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set for non-anonymous folios.
+ *
+ * Returns -EINVAL when trying to split to an order that is incompatible
+ * with the folio. Splitting to order 0 is compatible with all folios.
+ */
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+				     unsigned int new_order)
+{
+	struct folio *folio = page_folio(page);
+
+	return __folio_split(folio, new_order, page, list);
+}
+
 int min_order_for_split(struct folio *folio)
 {
 	if (folio_test_anon(folio))
-- 
2.45.2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 3/6] mm/huge_memory: add buddy allocator like folio_split()
  2024-11-01 15:03 [PATCH v2 0/6] Buddy allocator like folio split Zi Yan
  2024-11-01 15:03 ` [PATCH v2 1/6] mm/huge_memory: add two new (yet used) functions for folio_split() Zi Yan
  2024-11-01 15:03 ` [PATCH v2 2/6] mm/huge_memory: move folio split common code to __folio_split() Zi Yan
@ 2024-11-01 15:03 ` Zi Yan
  2024-11-01 15:03 ` [PATCH v2 4/6] mm/huge_memory: remove the old, unused __split_huge_page() Zi Yan
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Zi Yan @ 2024-11-01 15:03 UTC (permalink / raw)
  To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
	Zi Yan

folio_split() splits a large folio in the same way as the buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if a user wants to free the
3rd subpage in an order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
O-1,      O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache,
since anon folios do not support order-1 yet.

It generates fewer folios than the existing page split approach, which
splits the order-9 folio into 512 order-0 folios.

folio_split() and the existing split_huge_page_to_list_to_order() share
the folio unmapping and remapping code in __folio_split() and the common
backend split code in __folio_split_without_mapping(), using the
uniform_split variable to distinguish their operations.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 56 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 41 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 63ca870ca3fb..4f227d246ac5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3706,11 +3706,10 @@ static int __folio_split_without_mapping(struct folio *folio, int new_order,
 }
 
 static int __folio_split(struct folio *folio, unsigned int new_order,
-		struct page *page, struct list_head *list)
+		struct page *page, struct list_head *list, bool uniform_split)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
-	/* reset xarray order to new order after split */
-	XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
+	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
 	bool is_anon = folio_test_anon(folio);
 	struct address_space *mapping = NULL;
 	struct anon_vma *anon_vma = NULL;
@@ -3731,9 +3730,10 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			VM_WARN_ONCE(1, "Cannot split to order-1 folio");
 			return -EINVAL;
 		}
-	} else if (new_order) {
+	} else {
 		/* Split shmem folio to non-zero order not supported */
-		if (shmem_mapping(folio->mapping)) {
+		if ((!uniform_split || new_order) &&
+		    shmem_mapping(folio->mapping)) {
 			VM_WARN_ONCE(1,
 				"Cannot split shmem folio to non-0 order");
 			return -EINVAL;
@@ -3744,7 +3744,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		 * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
 		 * does not actually support large folios properly.
 		 */
-		if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+		if (new_order && IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
 		    !mapping_large_folio_support(folio->mapping)) {
 			VM_WARN_ONCE(1,
 				"Cannot split file folio to non-0 order");
@@ -3753,7 +3753,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	}
 
 	/* Only swapping a whole PMD-mapped folio is supported */
-	if (folio_test_swapcache(folio) && new_order)
+	if (folio_test_swapcache(folio) && (!uniform_split || new_order))
 		return -EINVAL;
 
 	is_hzp = is_huge_zero_folio(folio);
@@ -3810,10 +3810,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			goto out;
 		}
 
-		xas_split_alloc(&xas, folio, folio_order(folio), gfp);
-		if (xas_error(&xas)) {
-			ret = xas_error(&xas);
-			goto out;
+		if (uniform_split) {
+			xas_set_order(&xas, folio->index, new_order);
+			xas_split_alloc(&xas, folio, folio_order(folio), gfp);
+			if (xas_error(&xas)) {
+				ret = xas_error(&xas);
+				goto out;
+			}
 		}
 
 		anon_vma = NULL;
@@ -3878,7 +3881,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		if (mapping) {
 			int nr = folio_nr_pages(folio);
 
-			xas_split(&xas, folio, folio_order(folio));
 			if (folio_test_pmd_mappable(folio) &&
 			    new_order < HPAGE_PMD_ORDER) {
 				if (folio_test_swapbacked(folio)) {
@@ -3896,8 +3898,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
 			mod_mthp_stat(new_order, MTHP_STAT_NR_ANON, 1 << (order - new_order));
 		}
-		__split_huge_page(page, list, end, new_order);
-		ret = 0;
+		ret = __folio_split_without_mapping(page_folio(page), new_order,
+				page, list, end, &xas, mapping, uniform_split);
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:
@@ -3975,7 +3977,31 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 {
 	struct folio *folio = page_folio(page);
 
-	return __folio_split(folio, new_order, page, list);
+	return __folio_split(folio, new_order, page, list, true);
+}
+
+/*
+ * folio_split: split a folio at offset_in_new_order to a new_order folio
+ * @folio: folio to split
+ * @new_order: the order of the new folio
+ * @page: a page within the new folio
+ *
+ * return: 0: successful, <0 failed (if -ENOMEM is returned, @folio might be
+ * split but not to @new_order, the caller needs to check)
+ *
+ * Split a folio at offset_in_new_order to a new_order folio, leave the
+ * remaining subpages of the original folio as large as possible. For example,
+ * split an order-9 folio at its third order-3 subpages to an order-3 folio.
+ * There are 2^6=64 order-3 subpages in an order-9 folio and the result will be
+ * a set of folios with different order and the new folio is in bracket:
+ * [order-4, {order-3}, order-3, order-5, order-6, order-7, order-8].
+ *
+ * After split, folio is left locked for caller.
+ */
+int folio_split(struct folio *folio, unsigned int new_order,
+		struct page *page, struct list_head *list)
+{
+	return __folio_split(folio, new_order, page, list, false);
 }
 
 int min_order_for_split(struct folio *folio)
-- 
2.45.2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 4/6] mm/huge_memory: remove the old, unused __split_huge_page()
  2024-11-01 15:03 [PATCH v2 0/6] Buddy allocator like folio split Zi Yan
                   ` (2 preceding siblings ...)
  2024-11-01 15:03 ` [PATCH v2 3/6] mm/huge_memory: add buddy allocator like folio_split() Zi Yan
@ 2024-11-01 15:03 ` Zi Yan
  2024-11-01 15:03 ` [PATCH v2 5/6] mm/huge_memory: add folio_split() to debugfs testing interface Zi Yan
  2024-11-01 15:03 ` [PATCH v2 6/6] mm/truncate: use folio_split() for truncate operation Zi Yan
  5 siblings, 0 replies; 14+ messages in thread
From: Zi Yan @ 2024-11-01 15:03 UTC (permalink / raw)
  To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
	Zi Yan

Now split_huge_page_to_list_to_order() uses the new backend split code in
__folio_split_without_mapping(), the old __split_huge_page() and
__split_huge_page_tail() can be removed.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 207 -----------------------------------------------
 1 file changed, 207 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4f227d246ac5..f5094b677bb8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3154,213 +3154,6 @@ static void lru_add_page_tail(struct folio *folio, struct page *tail,
 	}
 }
 
-static void __split_huge_page_tail(struct folio *folio, int tail,
-		struct lruvec *lruvec, struct list_head *list,
-		unsigned int new_order)
-{
-	struct page *head = &folio->page;
-	struct page *page_tail = head + tail;
-	/*
-	 * Careful: new_folio is not a "real" folio before we cleared PageTail.
-	 * Don't pass it around before clear_compound_head().
-	 */
-	struct folio *new_folio = (struct folio *)page_tail;
-
-	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
-
-	/*
-	 * Clone page flags before unfreezing refcount.
-	 *
-	 * After successful get_page_unless_zero() might follow flags change,
-	 * for example lock_page() which set PG_waiters.
-	 *
-	 * Note that for mapped sub-pages of an anonymous THP,
-	 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
-	 * the migration entry instead from where remap_page() will restore it.
-	 * We can still have PG_anon_exclusive set on effectively unmapped and
-	 * unreferenced sub-pages of an anonymous THP: we can simply drop
-	 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
-	 */
-	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
-	page_tail->flags |= (head->flags &
-			((1L << PG_referenced) |
-			 (1L << PG_swapbacked) |
-			 (1L << PG_swapcache) |
-			 (1L << PG_mlocked) |
-			 (1L << PG_uptodate) |
-			 (1L << PG_active) |
-			 (1L << PG_workingset) |
-			 (1L << PG_locked) |
-			 (1L << PG_unevictable) |
-#ifdef CONFIG_ARCH_USES_PG_ARCH_2
-			 (1L << PG_arch_2) |
-#endif
-#ifdef CONFIG_ARCH_USES_PG_ARCH_3
-			 (1L << PG_arch_3) |
-#endif
-			 (1L << PG_dirty) |
-			 LRU_GEN_MASK | LRU_REFS_MASK));
-
-	/* ->mapping in first and second tail page is replaced by other uses */
-	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
-			page_tail);
-	new_folio->mapping = folio->mapping;
-	new_folio->index = folio->index + tail;
-
-	/*
-	 * page->private should not be set in tail pages. Fix up and warn once
-	 * if private is unexpectedly set.
-	 */
-	if (unlikely(page_tail->private)) {
-		VM_WARN_ON_ONCE_PAGE(true, page_tail);
-		page_tail->private = 0;
-	}
-	if (folio_test_swapcache(folio))
-		new_folio->swap.val = folio->swap.val + tail;
-
-	/* Page flags must be visible before we make the page non-compound. */
-	smp_wmb();
-
-	/*
-	 * Clear PageTail before unfreezing page refcount.
-	 *
-	 * After successful get_page_unless_zero() might follow put_page()
-	 * which needs correct compound_head().
-	 */
-	clear_compound_head(page_tail);
-	if (new_order) {
-		prep_compound_page(page_tail, new_order);
-		folio_set_large_rmappable(new_folio);
-	}
-
-	/* Finally unfreeze refcount. Additional reference from page cache. */
-	page_ref_unfreeze(page_tail,
-		1 + ((!folio_test_anon(folio) || folio_test_swapcache(folio)) ?
-			     folio_nr_pages(new_folio) : 0));
-
-	if (folio_test_young(folio))
-		folio_set_young(new_folio);
-	if (folio_test_idle(folio))
-		folio_set_idle(new_folio);
-
-	folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
-
-	/*
-	 * always add to the tail because some iterators expect new
-	 * pages to show after the currently processed elements - e.g.
-	 * migrate_pages
-	 */
-	lru_add_page_tail(folio, page_tail, lruvec, list);
-}
-
-static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned int new_order)
-{
-	struct folio *folio = page_folio(page);
-	struct page *head = &folio->page;
-	struct lruvec *lruvec;
-	struct address_space *swap_cache = NULL;
-	unsigned long offset = 0;
-	int i, nr_dropped = 0;
-	unsigned int new_nr = 1 << new_order;
-	int order = folio_order(folio);
-	unsigned int nr = 1 << order;
-
-	/* complete memcg works before add pages to LRU */
-	split_page_memcg(head, order, new_order);
-
-	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
-		offset = swap_cache_index(folio->swap);
-		swap_cache = swap_address_space(folio->swap);
-		xa_lock(&swap_cache->i_pages);
-	}
-
-	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
-	lruvec = folio_lruvec_lock(folio);
-
-	ClearPageHasHWPoisoned(head);
-
-	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
-		struct folio *tail;
-		__split_huge_page_tail(folio, i, lruvec, list, new_order);
-		tail = page_folio(head + i);
-		/* Some pages can be beyond EOF: drop them from page cache */
-		if (tail->index >= end) {
-			if (shmem_mapping(folio->mapping))
-				nr_dropped++;
-			else if (folio_test_clear_dirty(tail))
-				folio_account_cleaned(tail,
-					inode_to_wb(folio->mapping->host));
-			__filemap_remove_folio(tail, NULL);
-			folio_put(tail);
-		} else if (!folio_test_anon(folio)) {
-			__xa_store(&folio->mapping->i_pages, tail->index,
-					tail, 0);
-		} else if (swap_cache) {
-			__xa_store(&swap_cache->i_pages, offset + i,
-					tail, 0);
-		}
-	}
-
-	if (!new_order)
-		ClearPageCompound(head);
-	else {
-		struct folio *new_folio = (struct folio *)head;
-
-		folio_set_order(new_folio, new_order);
-	}
-	unlock_page_lruvec(lruvec);
-	/* Caller disabled irqs, so they are still disabled here */
-
-	split_page_owner(head, order, new_order);
-	pgalloc_tag_split(folio, order, new_order);
-
-	/* See comment in __split_huge_page_tail() */
-	if (folio_test_anon(folio)) {
-		/* Additional pin to swap cache */
-		if (folio_test_swapcache(folio)) {
-			folio_ref_add(folio, 1 + new_nr);
-			xa_unlock(&swap_cache->i_pages);
-		} else {
-			folio_ref_inc(folio);
-		}
-	} else {
-		/* Additional pin to page cache */
-		folio_ref_add(folio, 1 + new_nr);
-		xa_unlock(&folio->mapping->i_pages);
-	}
-	local_irq_enable();
-
-	if (nr_dropped)
-		shmem_uncharge(folio->mapping->host, nr_dropped);
-	remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
-
-	/*
-	 * set page to its compound_head when split to non order-0 pages, so
-	 * we can skip unlocking it below, since PG_locked is transferred to
-	 * the compound_head of the page and the caller will unlock it.
-	 */
-	if (new_order)
-		page = compound_head(page);
-
-	for (i = 0; i < nr; i += new_nr) {
-		struct page *subpage = head + i;
-		struct folio *new_folio = page_folio(subpage);
-		if (subpage == page)
-			continue;
-		folio_unlock(new_folio);
-
-		/*
-		 * Subpages may be freed if there wasn't any mapping
-		 * like if add_to_swap() is running on a lru page that
-		 * had its mapping zapped. And freeing these pages
-		 * requires taking the lru_lock so we do the put_page
-		 * of the tail pages after the split is complete.
-		 */
-		free_page_and_swap_cache(subpage);
-	}
-}
-
 /* Racy check whether the huge page can be split */
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 {
-- 
2.45.2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 5/6] mm/huge_memory: add folio_split() to debugfs testing interface.
  2024-11-01 15:03 [PATCH v2 0/6] Buddy allocator like folio split Zi Yan
                   ` (3 preceding siblings ...)
  2024-11-01 15:03 ` [PATCH v2 4/6] mm/huge_memory: remove the old, unused __split_huge_page() Zi Yan
@ 2024-11-01 15:03 ` Zi Yan
  2024-11-01 15:03 ` [PATCH v2 6/6] mm/truncate: use folio_split() for truncate operation Zi Yan
  5 siblings, 0 replies; 14+ messages in thread
From: Zi Yan @ 2024-11-01 15:03 UTC (permalink / raw)
  To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
	Zi Yan

This allows folio_split() to be tested by specifying an additional in-folio
page offset parameter to the split_huge_page debugfs interface.
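
With this change the debugfs input gains an optional trailing field.
Assuming debugfs is mounted at /sys/kernel/debug, usage would look like
(the values below are made up for illustration):

	echo "<pid>,<vaddr_start>,<vaddr_end>,<new_order>,<in_folio_offset>" > /sys/kernel/debug/split_huge_pages
	echo "1234,0x700000000000,0x700000200000,0,2" > /sys/kernel/debug/split_huge_pages

If in_folio_offset is omitted or out of range, the existing uniform
split_folio_to_order() path is used as before.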

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 46 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f5094b677bb8..1a2619324736 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4114,7 +4114,8 @@ static inline bool vma_not_suitable_for_thp_split(struct vm_area_struct *vma)
 }
 
 static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
-				unsigned long vaddr_end, unsigned int new_order)
+				unsigned long vaddr_end, unsigned int new_order,
+				long in_folio_offset)
 {
 	int ret = 0;
 	struct task_struct *task;
@@ -4198,8 +4199,16 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!folio_test_anon(folio) && folio->mapping != mapping)
 			goto unlock;
 
-		if (!split_folio_to_order(folio, target_order))
-			split++;
+		if (in_folio_offset < 0 ||
+		    in_folio_offset >= folio_nr_pages(folio)) {
+			if (!split_folio_to_order(folio, target_order))
+				split++;
+		} else {
+			struct page *split_at = folio_page(folio,
+							   in_folio_offset);
+			if (!folio_split(folio, target_order, split_at, NULL))
+				split++;
+		}
 
 unlock:
 
@@ -4222,7 +4231,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 }
 
 static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
-				pgoff_t off_end, unsigned int new_order)
+				pgoff_t off_end, unsigned int new_order,
+				long in_folio_offset)
 {
 	struct filename *file;
 	struct file *candidate;
@@ -4271,8 +4281,15 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		if (folio->mapping != mapping)
 			goto unlock;
 
-		if (!split_folio_to_order(folio, target_order))
-			split++;
+		if (in_folio_offset < 0 || in_folio_offset >= nr_pages) {
+			if (!split_folio_to_order(folio, target_order))
+				split++;
+		} else {
+			struct page *split_at = folio_page(folio,
+							   in_folio_offset);
+			if (!folio_split(folio, target_order, split_at, NULL))
+				split++;
+		}
 
 unlock:
 		folio_unlock(folio);
@@ -4305,6 +4322,7 @@ static ssize_t split_huge_pages_write(struct file *file, const char __user *buf,
 	int pid;
 	unsigned long vaddr_start, vaddr_end;
 	unsigned int new_order = 0;
+	long in_folio_offset = -1;
 
 	ret = mutex_lock_interruptible(&split_debug_mutex);
 	if (ret)
@@ -4333,29 +4351,33 @@ static ssize_t split_huge_pages_write(struct file *file, const char __user *buf,
 			goto out;
 		}
 
-		ret = sscanf(buf, "0x%lx,0x%lx,%d", &off_start, &off_end, &new_order);
-		if (ret != 2 && ret != 3) {
+		ret = sscanf(buf, "0x%lx,0x%lx,%d,%ld", &off_start, &off_end,
+				&new_order, &in_folio_offset);
+		if (ret != 2 && ret != 3 && ret != 4) {
 			ret = -EINVAL;
 			goto out;
 		}
-		ret = split_huge_pages_in_file(file_path, off_start, off_end, new_order);
+		ret = split_huge_pages_in_file(file_path, off_start, off_end,
+				new_order, in_folio_offset);
 		if (!ret)
 			ret = input_len;
 
 		goto out;
 	}
 
-	ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d", &pid, &vaddr_start, &vaddr_end, &new_order);
+	ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d,%ld", &pid, &vaddr_start,
+			&vaddr_end, &new_order, &in_folio_offset);
 	if (ret == 1 && pid == 1) {
 		split_huge_pages_all();
 		ret = strlen(input_buf);
 		goto out;
-	} else if (ret != 3 && ret != 4) {
+	} else if (ret != 3 && ret != 4 && ret != 5) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	ret = split_huge_pages_pid(pid, vaddr_start, vaddr_end, new_order);
+	ret = split_huge_pages_pid(pid, vaddr_start, vaddr_end, new_order,
+			in_folio_offset);
 	if (!ret)
 		ret = strlen(input_buf);
 out:
-- 
2.45.2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 6/6] mm/truncate: use folio_split() for truncate operation.
  2024-11-01 15:03 [PATCH v2 0/6] Buddy allocator like folio split Zi Yan
                   ` (4 preceding siblings ...)
  2024-11-01 15:03 ` [PATCH v2 5/6] mm/huge_memory: add folio_split() to debugfs testing interface Zi Yan
@ 2024-11-01 15:03 ` Zi Yan
  2024-11-02 15:39   ` kernel test robot
  2024-11-02 17:22   ` kernel test robot
  5 siblings, 2 replies; 14+ messages in thread
From: Zi Yan @ 2024-11-01 15:03 UTC (permalink / raw)
  To: linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel,
	Zi Yan

Instead of splitting the large folio uniformly during truncation, use a
buddy allocator like split at the start of the truncation range to
minimize the number of resulting folios.

For example, to truncate an order-4 folio
[0, 1, 2, 3, 4, 5, ..., 15]
between [3, 10] (inclusive), folio_split() splits the folio into
[0,1], [2], [3], [4..7], [8..15]; then [3] and [4..7] can be dropped, and
[8..15] is kept with pages [8..10] zeroed.

It is possible to do a further folio_split() at 10 so that more of the
resulting folios could be dropped. But that is left as a possible future
optimization if needed.

Another possible optimization is to make folio_split() split a folio
based on a given range, like [3..10] above. But that complicates
folio_split(), so it will be investigated when necessary.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 12 ++++++++++++
 mm/truncate.c           |  5 ++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b94c2e8ee918..8048500e7bc2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -339,6 +339,18 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
 int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
+int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
+		struct list_head *list);
+static inline int split_folio_at(struct folio *folio, struct page *page,
+		struct list_head *list)
+{
+	int ret = min_order_for_split(folio);
+
+	if (ret < 0)
+		return ret;
+
+	return folio_split(folio, ret, page, list);
+}
 static inline int split_huge_page(struct page *page)
 {
 	struct folio *folio = page_folio(page);
diff --git a/mm/truncate.c b/mm/truncate.c
index e5151703ba04..dbd81c21b460 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -179,6 +179,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 {
 	loff_t pos = folio_pos(folio);
 	unsigned int offset, length;
+	long in_folio_offset;
 
 	if (pos < start)
 		offset = start - pos;
@@ -208,7 +209,9 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 		folio_invalidate(folio, offset, length);
 	if (!folio_test_large(folio))
 		return true;
-	if (split_folio(folio) == 0)
+
+	in_folio_offset = PAGE_ALIGN_DOWN(offset) / PAGE_SIZE;
+	if (split_folio_at(folio, folio_page(folio, in_folio_offset), NULL) == 0)
 		return true;
 	if (folio_test_dirty(folio))
 		return false;
-- 
2.45.2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 6/6] mm/truncate: use folio_split() for truncate operation.
  2024-11-01 15:03 ` [PATCH v2 6/6] mm/truncate: use folio_split() for truncate operation Zi Yan
@ 2024-11-02 15:39   ` kernel test robot
  2024-11-02 17:22   ` kernel test robot
  1 sibling, 0 replies; 14+ messages in thread
From: kernel test robot @ 2024-11-02 15:39 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: llvm, oe-kbuild-all, Ryan Roberts, Hugh Dickins,
	David Hildenbrand, Yang Shi, Miaohe Lin, Kefeng Wang, Yu Zhao,
	John Hubbard, linux-kernel, Zi Yan

Hi Zi,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20241101]
[cannot apply to linus/master v6.12-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Zi-Yan/mm-huge_memory-add-two-new-yet-used-functions-for-folio_split/20241101-230623
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241101150357.1752726-7-ziy%40nvidia.com
patch subject: [PATCH v2 6/6] mm/truncate: use folio_split() for truncate operation.
config: arm-multi_v4t_defconfig (https://download.01.org/0day-ci/archive/20241102/202411022321.XN6rYrgx-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 639a7ac648f1e50ccd2556e17d401c04f9cce625)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241102/202411022321.XN6rYrgx-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411022321.XN6rYrgx-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/truncate.c:12:
   In file included from include/linux/backing-dev.h:16:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:8:
   In file included from include/linux/cacheflush.h:5:
   In file included from arch/arm/include/asm/cacheflush.h:10:
   In file included from include/linux/mm.h:2211:
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   In file included from mm/truncate.c:24:
   In file included from mm/internal.h:13:
   include/linux/mm_inline.h:47:41: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      47 |         __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
         |                                    ~~~~~~~~~~~ ^ ~~~
   include/linux/mm_inline.h:49:22: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      49 |                                 NR_ZONE_LRU_BASE + lru, nr_pages);
         |                                 ~~~~~~~~~~~~~~~~ ^ ~~~
>> mm/truncate.c:214:6: error: call to undeclared function 'split_folio_at'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     214 |         if (split_folio_at(folio, folio_page(folio, in_folio_offset), NULL) == 0)
         |             ^
   3 warnings and 1 error generated.


vim +/split_folio_at +214 mm/truncate.c

   166	
   167	/*
   168	 * Handle partial folios.  The folio may be entirely within the
   169	 * range if a split has raced with us.  If not, we zero the part of the
   170	 * folio that's within the [start, end] range, and then split the folio if
   171	 * it's large.  split_page_range() will discard pages which now lie beyond
   172	 * i_size, and we rely on the caller to discard pages which lie within a
   173	 * newly created hole.
   174	 *
   175	 * Returns false if splitting failed so the caller can avoid
   176	 * discarding the entire folio which is stubbornly unsplit.
   177	 */
   178	bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
   179	{
   180		loff_t pos = folio_pos(folio);
   181		unsigned int offset, length;
   182		long in_folio_offset;
   183	
   184		if (pos < start)
   185			offset = start - pos;
   186		else
   187			offset = 0;
   188		length = folio_size(folio);
   189		if (pos + length <= (u64)end)
   190			length = length - offset;
   191		else
   192			length = end + 1 - pos - offset;
   193	
   194		folio_wait_writeback(folio);
   195		if (length == folio_size(folio)) {
   196			truncate_inode_folio(folio->mapping, folio);
   197			return true;
   198		}
   199	
   200		/*
   201		 * We may be zeroing pages we're about to discard, but it avoids
   202		 * doing a complex calculation here, and then doing the zeroing
   203		 * anyway if the page split fails.
   204		 */
   205		if (!mapping_inaccessible(folio->mapping))
   206			folio_zero_range(folio, offset, length);
   207	
   208		if (folio_needs_release(folio))
   209			folio_invalidate(folio, offset, length);
   210		if (!folio_test_large(folio))
   211			return true;
   212	
   213		in_folio_offset = PAGE_ALIGN_DOWN(offset) / PAGE_SIZE;
 > 214		if (split_folio_at(folio, folio_page(folio, in_folio_offset), NULL) == 0)
   215			return true;
   216		if (folio_test_dirty(folio))
   217			return false;
   218		truncate_inode_folio(folio->mapping, folio);
   219		return true;
   220	}
   221	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 6/6] mm/truncate: use folio_split() for truncate operation.
  2024-11-01 15:03 ` [PATCH v2 6/6] mm/truncate: use folio_split() for truncate operation Zi Yan
  2024-11-02 15:39   ` kernel test robot
@ 2024-11-02 17:22   ` kernel test robot
  1 sibling, 0 replies; 14+ messages in thread
From: kernel test robot @ 2024-11-02 17:22 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: oe-kbuild-all, Ryan Roberts, Hugh Dickins, David Hildenbrand,
	Yang Shi, Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard,
	linux-kernel, Zi Yan

Hi Zi,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20241101]
[cannot apply to linus/master v6.12-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Zi-Yan/mm-huge_memory-add-two-new-yet-used-functions-for-folio_split/20241101-230623
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241101150357.1752726-7-ziy%40nvidia.com
patch subject: [PATCH v2 6/6] mm/truncate: use folio_split() for truncate operation.
config: arc-tb10x_defconfig (https://download.01.org/0day-ci/archive/20241103/202411030124.ZWzXWxPU-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241103/202411030124.ZWzXWxPU-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411030124.ZWzXWxPU-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/truncate.c: In function 'truncate_inode_partial_folio':
>> mm/truncate.c:214:13: error: implicit declaration of function 'split_folio_at'; did you mean 'split_folio'? [-Werror=implicit-function-declaration]
     214 |         if (split_folio_at(folio, folio_page(folio, in_folio_offset), NULL) == 0)
         |             ^~~~~~~~~~~~~~
         |             split_folio
   cc1: some warnings being treated as errors


vim +214 mm/truncate.c

   166	
   167	/*
   168	 * Handle partial folios.  The folio may be entirely within the
   169	 * range if a split has raced with us.  If not, we zero the part of the
   170	 * folio that's within the [start, end] range, and then split the folio if
   171	 * it's large.  split_page_range() will discard pages which now lie beyond
   172	 * i_size, and we rely on the caller to discard pages which lie within a
   173	 * newly created hole.
   174	 *
   175	 * Returns false if splitting failed so the caller can avoid
   176	 * discarding the entire folio which is stubbornly unsplit.
   177	 */
   178	bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
   179	{
   180		loff_t pos = folio_pos(folio);
   181		unsigned int offset, length;
   182		long in_folio_offset;
   183	
   184		if (pos < start)
   185			offset = start - pos;
   186		else
   187			offset = 0;
   188		length = folio_size(folio);
   189		if (pos + length <= (u64)end)
   190			length = length - offset;
   191		else
   192			length = end + 1 - pos - offset;
   193	
   194		folio_wait_writeback(folio);
   195		if (length == folio_size(folio)) {
   196			truncate_inode_folio(folio->mapping, folio);
   197			return true;
   198		}
   199	
   200		/*
   201		 * We may be zeroing pages we're about to discard, but it avoids
   202		 * doing a complex calculation here, and then doing the zeroing
   203		 * anyway if the page split fails.
   204		 */
   205		if (!mapping_inaccessible(folio->mapping))
   206			folio_zero_range(folio, offset, length);
   207	
   208		if (folio_needs_release(folio))
   209			folio_invalidate(folio, offset, length);
   210		if (!folio_test_large(folio))
   211			return true;
   212	
   213		in_folio_offset = PAGE_ALIGN_DOWN(offset) / PAGE_SIZE;
 > 214		if (split_folio_at(folio, folio_page(folio, in_folio_offset), NULL) == 0)
   215			return true;
   216		if (folio_test_dirty(folio))
   217			return false;
   218		truncate_inode_folio(folio->mapping, folio);
   219		return true;
   220	}
   221	
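
To make the split point concrete, a worked example with made-up values
(not taken from this report): for a 2MB, order-9 folio at pos = 0,
truncated with start = 12288 and end = LLONG_MAX:

	offset          = start - pos                         = 12288
	length          = folio_size(folio) - offset          (rest of the folio)
	in_folio_offset = PAGE_ALIGN_DOWN(offset) / PAGE_SIZE = 3

so only the range past the truncation point is zeroed, and
split_folio_at() is asked to split the folio around page 3, the first
page that lies entirely beyond the new end of the file.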

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/6] mm/huge_memory: add two new (yet used) functions for folio_split()
  2024-11-01 15:03 ` [PATCH v2 1/6] mm/huge_memory: add two new (yet used) functions for folio_split() Zi Yan
@ 2024-11-06 10:44   ` Kirill A . Shutemov
  2024-11-06 22:06     ` Zi Yan
  0 siblings, 1 reply; 14+ messages in thread
From: Kirill A . Shutemov @ 2024-11-06 10:44 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel

On Fri, Nov 01, 2024 at 11:03:52AM -0400, Zi Yan wrote:
> This is a preparation patch, both added functions are not used yet.
> 

In subject: s/yet/not yet/

> The added __folio_split_without_mapping() is able to split a folio with
> its mapping removed in two manners: 1) uniform split (the existing way),
> and 2) buddy allocator like split.
> 
> The added __split_folio_to_order() can split a folio into any lower order.
> For uniform split, __folio_split_without_mapping() calls it once to split
> the given folio to the new order. For buddy allocator split,
> __folio_split_without_mapping() calls it (folio_order - new_order) times
> and each time splits the folio containing the given page to one lower
> order.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/huge_memory.c | 328 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 327 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f92068864469..f7649043ddb7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3135,7 +3135,6 @@ static void remap_page(struct folio *folio, unsigned long nr, int flags)
>  static void lru_add_page_tail(struct folio *folio, struct page *tail,
>  		struct lruvec *lruvec, struct list_head *list)
>  {
> -	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>  	VM_BUG_ON_FOLIO(PageLRU(tail), folio);
>  	lockdep_assert_held(&lruvec->lru_lock);
>  
> @@ -3379,6 +3378,333 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>  					caller_pins;
>  }
>  
> +static long page_in_folio_offset(struct page *page, struct folio *folio)
> +{
> +	long nr_pages = folio_nr_pages(folio);
> +	unsigned long pages_pfn = page_to_pfn(page);
> +	unsigned long folios_pfn = folio_pfn(folio);
> +
> +	if (pages_pfn >= folios_pfn && pages_pfn < (folios_pfn + nr_pages))
> +		return pages_pfn - folios_pfn;
> +
> +	return -EINVAL;
> +}
> +
> +/*
> + * It splits @folio into @new_order folios and copies the @folio metadata to
> + * all the resulting folios.
> + */
> +static int __split_folio_to_order(struct folio *folio, int new_order)
> +{
> +	int curr_order = folio_order(folio);
> +	long nr_pages = folio_nr_pages(folio);
> +	long new_nr_pages = 1 << new_order;
> +	long index;
> +
> +	if (curr_order <= new_order)
> +		return -EINVAL;
> +
> +	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {

Hm. It is not clear why you skip the first new_nr_pages range. It's worth a
comment.

> +		struct page *head = &folio->page;
> +		struct page *second_head = head + index;

I am not sure about the 'second_head' name. Why is it better than page_tail?

> +
> +		/*
> +		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
> +		 * Don't pass it around before clear_compound_head().
> +		 */
> +		struct folio *new_folio = (struct folio *)second_head;
> +
> +		VM_BUG_ON_PAGE(atomic_read(&second_head->_mapcount) != -1, second_head);
> +
> +		/*
> +		 * Clone page flags before unfreezing refcount.
> +		 *
> +		 * After successful get_page_unless_zero() might follow flags change,
> +		 * for example lock_page() which set PG_waiters.
> +		 *
> +		 * Note that for mapped sub-pages of an anonymous THP,
> +		 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
> +		 * the migration entry instead from where remap_page() will restore it.
> +		 * We can still have PG_anon_exclusive set on effectively unmapped and
> +		 * unreferenced sub-pages of an anonymous THP: we can simply drop
> +		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
> +		 */
> +		second_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +		second_head->flags |= (head->flags &
> +				((1L << PG_referenced) |
> +				 (1L << PG_swapbacked) |
> +				 (1L << PG_swapcache) |
> +				 (1L << PG_mlocked) |
> +				 (1L << PG_uptodate) |
> +				 (1L << PG_active) |
> +				 (1L << PG_workingset) |
> +				 (1L << PG_locked) |
> +				 (1L << PG_unevictable) |
> +#ifdef CONFIG_ARCH_USES_PG_ARCH_2
> +				 (1L << PG_arch_2) |
> +#endif
> +#ifdef CONFIG_ARCH_USES_PG_ARCH_3
> +				 (1L << PG_arch_3) |
> +#endif
> +				 (1L << PG_dirty) |
> +				 LRU_GEN_MASK | LRU_REFS_MASK));
> +
> +		/* ->mapping in first and second tail page is replaced by other uses */
> +		VM_BUG_ON_PAGE(new_nr_pages > 2 && second_head->mapping != TAIL_MAPPING,
> +			       second_head);
> +		second_head->mapping = head->mapping;
> +		second_head->index = head->index + index;
> +
> +		/*
> +		 * page->private should not be set in tail pages. Fix up and warn once
> +		 * if private is unexpectedly set.
> +		 */
> +		if (unlikely(second_head->private)) {
> +			VM_WARN_ON_ONCE_PAGE(true, second_head);
> +			second_head->private = 0;
> +		}

New line.

> +		if (folio_test_swapcache(folio))
> +			new_folio->swap.val = folio->swap.val + index;
> +
> +		/* Page flags must be visible before we make the page non-compound. */
> +		smp_wmb();
> +
> +		/*
> +		 * Clear PageTail before unfreezing page refcount.
> +		 *
> +		 * After successful get_page_unless_zero() might follow put_page()
> +		 * which needs correct compound_head().
> +		 */
> +		clear_compound_head(second_head);
> +		if (new_order) {
> +			prep_compound_page(second_head, new_order);
> +			folio_set_large_rmappable(new_folio);
> +
> +			folio_set_order(folio, new_order);
> +		} else {
> +			if (PageHead(head))
> +				ClearPageCompound(head);

Huh? You only have to test for PageHead() because it is inside the loop.
It has to be done after loop is done.

> +		}
> +
> +		if (folio_test_young(folio))
> +			folio_set_young(new_folio);
> +		if (folio_test_idle(folio))
> +			folio_set_idle(new_folio);
> +
> +		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
> +	}
> +
> +	return 0;
> +}
> +
> +#define for_each_folio_until_end_safe(iter, iter2, start, end)	\
> +	for (iter = start, iter2 = folio_next(start);		\
> +	     iter != end;					\
> +	     iter = iter2, iter2 = folio_next(iter2))

I am not sure if hiding it inside the macro helps reading the code.

> +
> +/*
> + * It splits a @folio (without mapping) to lower order smaller folios in two
> + * ways.

What do you mean by "without mapping". I initially thought that ->mapping
is NULL, but it is obviously not true. 

Do you mean unmapped?

> + * 1. uniform split: the given @folio into multiple @new_order small folios,
> + *    where all small folios have the same order. This is done when
> + *    uniform_split is true.
> + * 2. buddy allocator like split: the given @folio is split into half and one
> + *    of the half (containing the given page) is split into half until the
> + *    given @page's order becomes @new_order. This is done when uniform_split is
> + *    false.
> + *
> + * The high level flow for these two methods are:
> + * 1. uniform split: a single __split_folio_to_order() is called to split the
> + *    @folio into @new_order, then we traverse all the resulting folios one by
> + *    one in PFN ascending order and perform stats, unfreeze, adding to list,
> + *    and file mapping index operations.
> + * 2. buddy allocator like split: in general, folio_order - @new_order calls to
> + *    __split_folio_to_order() are called in the for loop to split the @folio
> + *    to one lower order at a time. The resulting small folios are processed
> + *    like what is done during the traversal in 1, except the one containing
> + *    @page, which is split in next for loop.
> + *
> + * After splitting, the caller's folio reference will be transferred to the
> + * folio containing @page. The other folios may be freed if they are not mapped.
> + *
> + * In terms of locking, after splitting,
> + * 1. uniform split leaves @page (or the folio contains it) locked;
> + * 2. buddy allocator like split leaves @folio locked.
> + *
> + * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
> + *
> + * For !uniform_split, when -ENOMEM is returned, the original folio might be
> + * split. The caller needs to check the input folio.
> + */
> +static int __folio_split_without_mapping(struct folio *folio, int new_order,
> +		struct page *page, struct list_head *list, pgoff_t end,
> +		struct xa_state *xas, struct address_space *mapping,
> +		bool uniform_split)

It is not clear what state xas has to be on call.

> +{
> +	struct lruvec *lruvec;
> +	struct address_space *swap_cache = NULL;
> +	struct folio *origin_folio = folio;
> +	struct folio *next_folio = folio_next(folio);
> +	struct folio *new_folio;
> +	struct folio *next;
> +	int order = folio_order(folio);
> +	int split_order = order - 1;
> +	int nr_dropped = 0;
> +	int ret = 0;
> +
> +	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
> +		if (!uniform_split)
> +			return -EINVAL;

Why this limitation?

> +		swap_cache = swap_address_space(folio->swap);
> +		xa_lock(&swap_cache->i_pages);
> +	}
> +
> +	if (folio_test_anon(folio))
> +		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
> +
> +	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> +	lruvec = folio_lruvec_lock(folio);
> +
> +	/*
> +	 * split to new_order one order at a time. For uniform split,
> +	 * intermediate orders are skipped
> +	 */
> +	for (split_order = order - 1; split_order >= new_order; split_order--) {
> +		int old_order = folio_order(folio);
> +		struct folio *release;
> +		struct folio *end_folio = folio_next(folio);
> +		int status;
> +		bool stop_split = false;
> +
> +		if (folio_test_anon(folio) && split_order == 1)

Comment is missing.

> +			continue;
> +		if (uniform_split && split_order != new_order)
> +			continue;

What's the point of the loop for uniform_split?

> +
> +		if (mapping) {
> +			/*
> +			 * uniform split has xas_split_alloc() called before
> +			 * irq is disabled, since xas_nomem() might not be
> +			 * able to allocate enough memory.
> +			 */
> +			if (uniform_split)
> +				xas_split(xas, folio, old_order);
> +			else {
> +				xas_set_order(xas, folio->index, split_order);
> +				xas_set_err(xas, -ENOMEM);
> +				if (xas_nomem(xas, 0))

0 gfp?

> +					xas_split(xas, folio, old_order);
> +				else {
> +					stop_split = true;
> +					ret = -ENOMEM;
> +					goto after_split;
> +				}
> +			}
> +		}
> +
> +		split_page_memcg(&folio->page, old_order, split_order);

__split_huge_page() has a comment for split_page_memcg(). Do we want to
keep it? Is it safe to call it under lruvec lock?

> +		split_page_owner(&folio->page, old_order, split_order);
> +		pgalloc_tag_split(folio, old_order, split_order);
> +
> +		status = __split_folio_to_order(folio, split_order);
> +
> +		if (status < 0)
> +			return status;
> +
> +after_split:
> +		/*
> +		 * Iterate through after-split folios and perform related
> +		 * operations. But in buddy allocator like split, the folio
> +		 * containing the specified page is skipped until its order
> +		 * is new_order, since the folio will be worked on in next
> +		 * iteration.
> +		 */
> +		for_each_folio_until_end_safe(release, next, folio, end_folio) {
> +			if (page_in_folio_offset(page, release) >= 0) {
> +				folio = release;
> +				if (split_order != new_order && !stop_split)
> +					continue;

I don't understand this condition.

> +			}
> +			if (folio_test_anon(release))
> +				mod_mthp_stat(folio_order(release),
> +						MTHP_STAT_NR_ANON, 1);

Add { } around the block.

> +
> +			/*
> +			 * Unfreeze refcount first. Additional reference from
> +			 * page cache.
> +			 */
> +			folio_ref_unfreeze(release,
> +				1 + ((!folio_test_anon(origin_folio) ||
> +				     folio_test_swapcache(origin_folio)) ?
> +					     folio_nr_pages(release) : 0));
> +
> +			if (release != origin_folio)
> +				lru_add_page_tail(origin_folio, &release->page,
> +						lruvec, list);
> +
> +			/* Some pages can be beyond EOF: drop them from page cache */
> +			if (release->index >= end) {
> +				if (shmem_mapping(origin_folio->mapping))
> +					nr_dropped++;
> +				else if (folio_test_clear_dirty(release))
> +					folio_account_cleaned(release,
> +						inode_to_wb(origin_folio->mapping->host));
> +				__filemap_remove_folio(release, NULL);
> +				folio_put(release);
> +			} else if (!folio_test_anon(release)) {
> +				__xa_store(&origin_folio->mapping->i_pages,
> +						release->index, &release->page, 0);
> +			} else if (swap_cache) {
> +				__xa_store(&swap_cache->i_pages,
> +						swap_cache_index(release->swap),
> +						&release->page, 0);
> +			}
> +		}
> +		xas_destroy(xas);
> +	}
> +
> +	unlock_page_lruvec(lruvec);
> +
> +	if (folio_test_anon(origin_folio)) {
> +		if (folio_test_swapcache(origin_folio))
> +			xa_unlock(&swap_cache->i_pages);
> +	} else
> +		xa_unlock(&mapping->i_pages);
> +
> +	/* Caller disabled irqs, so they are still disabled here */
> +	local_irq_enable();
> +
> +	if (nr_dropped)
> +		shmem_uncharge(mapping->host, nr_dropped);
> +
> +	remap_page(origin_folio, 1 << order,
> +			folio_test_anon(origin_folio) ?
> +				RMP_USE_SHARED_ZEROPAGE : 0);
> +
> +	/*
> +	 * At this point, folio should contain the specified page, so that it
> +	 * will be left to the caller to unlock it.
> +	 */
> +	for_each_folio_until_end_safe(new_folio, next, origin_folio, next_folio) {
> +		if (uniform_split && new_folio == folio)
> +			continue;
> +		if (!uniform_split && new_folio == origin_folio)
> +			continue;
> +
> +		folio_unlock(new_folio);
> +		/*
> +		 * Subpages may be freed if there wasn't any mapping
> +		 * like if add_to_swap() is running on a lru page that
> +		 * had its mapping zapped. And freeing these pages
> +		 * requires taking the lru_lock so we do the put_page
> +		 * of the tail pages after the split is complete.
> +		 */
> +		free_page_and_swap_cache(&new_folio->page);
> +	}
> +	return ret;
> +}
> +
>  /*
>   * This function splits a large folio into smaller folios of order @new_order.
>   * @page can point to any page of the large folio to split. The split operation
> -- 
> 2.45.2
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/6] mm/huge_memory: add two new (yet used) functions for folio_split()
  2024-11-06 10:44   ` Kirill A . Shutemov
@ 2024-11-06 22:06     ` Zi Yan
  2024-11-07 14:01       ` Kirill A . Shutemov
  2024-11-07 15:01       ` Zi Yan
  0 siblings, 2 replies; 14+ messages in thread
From: Zi Yan @ 2024-11-06 22:06 UTC (permalink / raw)
  To: Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: linux-mm, Ryan Roberts, Hugh Dickins, David Hildenbrand,
	Yang Shi, Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard,
	linux-kernel

On 6 Nov 2024, at 5:44, Kirill A . Shutemov wrote:

> On Fri, Nov 01, 2024 at 11:03:52AM -0400, Zi Yan wrote:
>> This is a preparation patch, both added functions are not used yet.
>>
>
> In subject: s/yet/not yet/

Ack.

>
>> The added __folio_split_without_mapping() is able to split a folio with
>> its mapping removed in two manners: 1) uniform split (the existing way),
>> and 2) buddy allocator like split.
>>
>> The added __split_folio_to_order() can split a folio into any lower order.
>> For uniform split, __folio_split_without_mapping() calls it once to split
>> the given folio to the new order. For buddy allocator split,
>> __folio_split_without_mapping() calls it (folio_order - new_order) times
>> and each time splits the folio containing the given page to one lower
>> order.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  mm/huge_memory.c | 328 ++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 327 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index f92068864469..f7649043ddb7 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3135,7 +3135,6 @@ static void remap_page(struct folio *folio, unsigned long nr, int flags)
>>  static void lru_add_page_tail(struct folio *folio, struct page *tail,
>>  		struct lruvec *lruvec, struct list_head *list)
>>  {
>> -	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>  	VM_BUG_ON_FOLIO(PageLRU(tail), folio);
>>  	lockdep_assert_held(&lruvec->lru_lock);
>>
>> @@ -3379,6 +3378,333 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>  					caller_pins;
>>  }
>>
>> +static long page_in_folio_offset(struct page *page, struct folio *folio)
>> +{
>> +	long nr_pages = folio_nr_pages(folio);
>> +	unsigned long pages_pfn = page_to_pfn(page);
>> +	unsigned long folios_pfn = folio_pfn(folio);
>> +
>> +	if (pages_pfn >= folios_pfn && pages_pfn < (folios_pfn + nr_pages))
>> +		return pages_pfn - folios_pfn;
>> +
>> +	return -EINVAL;
>> +}
>> +
>> +/*
>> + * It splits @folio into @new_order folios and copies the @folio metadata to
>> + * all the resulting folios.
>> + */
>> +static int __split_folio_to_order(struct folio *folio, int new_order)
>> +{
>> +	int curr_order = folio_order(folio);
>> +	long nr_pages = folio_nr_pages(folio);
>> +	long new_nr_pages = 1 << new_order;
>> +	long index;
>> +
>> +	if (curr_order <= new_order)
>> +		return -EINVAL;
>> +
>> +	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
>
> Hm. It is not clear why you skip the first new_nr_pages range. It's worth a
> comment.

The first new_nr_pages range belongs to the original folio, so no copies
are needed. Will add this comment.
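
Something along these lines, as one possible wording rather than the
final text:

	/*
	 * The first new_nr_pages pages stay in @folio itself, so no
	 * metadata copy is needed for them; start from the second chunk.
	 */
	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {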

>
>> +		struct page *head = &folio->page;
>> +		struct page *second_head = head + index;
>
> I am not sure about the 'second_head' name. Why is it better than page_tail?

new_head might be better, as it means the head of a new folio that we are
working on. 'second_head' was legacy code, since in my unpublished version
I was always splitting the folio in half.

>
>> +
>> +		/*
>> +		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
>> +		 * Don't pass it around before clear_compound_head().
>> +		 */
>> +		struct folio *new_folio = (struct folio *)second_head;
>> +
>> +		VM_BUG_ON_PAGE(atomic_read(&second_head->_mapcount) != -1, second_head);
>> +
>> +		/*
>> +		 * Clone page flags before unfreezing refcount.
>> +		 *
>> +		 * After successful get_page_unless_zero() might follow flags change,
>> +		 * for example lock_page() which set PG_waiters.
>> +		 *
>> +		 * Note that for mapped sub-pages of an anonymous THP,
>> +		 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
>> +		 * the migration entry instead from where remap_page() will restore it.
>> +		 * We can still have PG_anon_exclusive set on effectively unmapped and
>> +		 * unreferenced sub-pages of an anonymous THP: we can simply drop
>> +		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
>> +		 */
>> +		second_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>> +		second_head->flags |= (head->flags &
>> +				((1L << PG_referenced) |
>> +				 (1L << PG_swapbacked) |
>> +				 (1L << PG_swapcache) |
>> +				 (1L << PG_mlocked) |
>> +				 (1L << PG_uptodate) |
>> +				 (1L << PG_active) |
>> +				 (1L << PG_workingset) |
>> +				 (1L << PG_locked) |
>> +				 (1L << PG_unevictable) |
>> +#ifdef CONFIG_ARCH_USES_PG_ARCH_2
>> +				 (1L << PG_arch_2) |
>> +#endif
>> +#ifdef CONFIG_ARCH_USES_PG_ARCH_3
>> +				 (1L << PG_arch_3) |
>> +#endif
>> +				 (1L << PG_dirty) |
>> +				 LRU_GEN_MASK | LRU_REFS_MASK));
>> +
>> +		/* ->mapping in first and second tail page is replaced by other uses */
>> +		VM_BUG_ON_PAGE(new_nr_pages > 2 && second_head->mapping != TAIL_MAPPING,
>> +			       second_head);
>> +		second_head->mapping = head->mapping;
>> +		second_head->index = head->index + index;
>> +
>> +		/*
>> +		 * page->private should not be set in tail pages. Fix up and warn once
>> +		 * if private is unexpectedly set.
>> +		 */
>> +		if (unlikely(second_head->private)) {
>> +			VM_WARN_ON_ONCE_PAGE(true, second_head);
>> +			second_head->private = 0;
>> +		}
>
> New line.
Ack.

>
>> +		if (folio_test_swapcache(folio))
>> +			new_folio->swap.val = folio->swap.val + index;
>> +
>> +		/* Page flags must be visible before we make the page non-compound. */
>> +		smp_wmb();
>> +
>> +		/*
>> +		 * Clear PageTail before unfreezing page refcount.
>> +		 *
>> +		 * After successful get_page_unless_zero() might follow put_page()
>> +		 * which needs correct compound_head().
>> +		 */
>> +		clear_compound_head(second_head);
>> +		if (new_order) {
>> +			prep_compound_page(second_head, new_order);
>> +			folio_set_large_rmappable(new_folio);
>> +
>> +			folio_set_order(folio, new_order);
>> +		} else {
>> +			if (PageHead(head))
>> +				ClearPageCompound(head);
>
> Huh? You only have to test for PageHead() because it is inside the loop.
> It has to be done after loop is done.

You are right, will remove this and add the code below after the loop.

if (!new_order && PageHead(&folio->page))
	ClearPageCompound(&folio->page);

>
>> +		}
>> +
>> +		if (folio_test_young(folio))
>> +			folio_set_young(new_folio);
>> +		if (folio_test_idle(folio))
>> +			folio_set_idle(new_folio);
>> +
>> +		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +#define for_each_folio_until_end_safe(iter, iter2, start, end)	\
>> +	for (iter = start, iter2 = folio_next(start);		\
>> +	     iter != end;					\
>> +	     iter = iter2, iter2 = folio_next(iter2))
>
> I am not sure if hiding it inside the macro helps reading the code.
>

OK, I will remove the macro, since it is only used in the function below.
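
For reference, open-coding it at the first call site is just the macro
expanded in place:

	for (release = folio, next = folio_next(folio);
	     release != end_folio;
	     release = next, next = folio_next(next)) {
		...
	}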

>> +
>> +/*
>> + * It splits a @folio (without mapping) to lower order smaller folios in two
>> + * ways.
>
> What do you mean by "without mapping". I initially thought that ->mapping
> is NULL, but it is obviously not true.
>
> Do you mean unmapped?

Yes. I will rename it to __split_unmapped_folio() and fix the comment too.

>
>> + * 1. uniform split: the given @folio into multiple @new_order small folios,
>> + *    where all small folios have the same order. This is done when
>> + *    uniform_split is true.
>> + * 2. buddy allocator like split: the given @folio is split into half and one
>> + *    of the half (containing the given page) is split into half until the
>> + *    given @page's order becomes @new_order. This is done when uniform_split is
>> + *    false.
>> + *
>> + * The high level flow for these two methods are:
>> + * 1. uniform split: a single __split_folio_to_order() is called to split the
>> + *    @folio into @new_order, then we traverse all the resulting folios one by
>> + *    one in PFN ascending order and perform stats, unfreeze, adding to list,
>> + *    and file mapping index operations.
>> + * 2. buddy allocator like split: in general, folio_order - @new_order calls to
>> + *    __split_folio_to_order() are called in the for loop to split the @folio
>> + *    to one lower order at a time. The resulting small folios are processed
>> + *    like what is done during the traversal in 1, except the one containing
>> + *    @page, which is split in next for loop.
>> + *
>> + * After splitting, the caller's folio reference will be transferred to the
>> + * folio containing @page. The other folios may be freed if they are not mapped.
>> + *
>> + * In terms of locking, after splitting,
>> + * 1. uniform split leaves @page (or the folio contains it) locked;
>> + * 2. buddy allocator like split leaves @folio locked.
>> + *
>> + * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
>> + *
>> + * For !uniform_split, when -ENOMEM is returned, the original folio might be
>> + * split. The caller needs to check the input folio.
>> + */
>> +static int __folio_split_without_mapping(struct folio *folio, int new_order,
>> +		struct page *page, struct list_head *list, pgoff_t end,
>> +		struct xa_state *xas, struct address_space *mapping,
>> +		bool uniform_split)
>
> It is not clear what state xas has to be on call.

xas needs to point to folio->mapping->i_pages and be locked. Will add this to
the comment above.
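
A possible wording for it (not the exact text that will land):

 * @xas must be set up on folio->mapping->i_pages and the caller must
 * already hold its xa_lock when calling this function.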

>
>> +{
>> +	struct lruvec *lruvec;
>> +	struct address_space *swap_cache = NULL;
>> +	struct folio *origin_folio = folio;
>> +	struct folio *next_folio = folio_next(folio);
>> +	struct folio *new_folio;
>> +	struct folio *next;
>> +	int order = folio_order(folio);
>> +	int split_order = order - 1;
>> +	int nr_dropped = 0;
>> +	int ret = 0;
>> +
>> +	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
>> +		if (!uniform_split)
>> +			return -EINVAL;
>
> Why this limitation?

I am not closely following the status of mTHP support in swap. If it
is supported, this can be removed. Right now, split_huge_page_to_list_to_order()
only allows splitting a swapcache folio to order 0 [1].

[1] https://elixir.bootlin.com/linux/v6.12-rc6/source/mm/huge_memory.c#L3397

>
>> +		swap_cache = swap_address_space(folio->swap);
>> +		xa_lock(&swap_cache->i_pages);
>> +	}
>> +
>> +	if (folio_test_anon(folio))
>> +		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
>> +
>> +	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
>> +	lruvec = folio_lruvec_lock(folio);
>> +
>> +	/*
>> +	 * split to new_order one order at a time. For uniform split,
>> +	 * intermediate orders are skipped
>> +	 */
>> +	for (split_order = order - 1; split_order >= new_order; split_order--) {
>> +		int old_order = folio_order(folio);
>> +		struct folio *release;
>> +		struct folio *end_folio = folio_next(folio);
>> +		int status;
>> +		bool stop_split = false;
>> +
>> +		if (folio_test_anon(folio) && split_order == 1)
>
> Comment is missing.

Will add “order-1 anonymous folio is not supported”.
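
So that check becomes:

	/* order-1 anonymous folio is not supported */
	if (folio_test_anon(folio) && split_order == 1)
		continue;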
>
>> +			continue;
>> +		if (uniform_split && split_order != new_order)
>> +			continue;
>
> What's the point of the loop for uniform_split?

Will just start the loop with new_order for uniform_split.
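
Roughly like this; start_order is just a name picked here for
illustration:

	int start_order = uniform_split ? new_order : order - 1;

	for (split_order = start_order; split_order >= new_order; split_order--) {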

>
>> +
>> +		if (mapping) {
>> +			/*
>> +			 * uniform split has xas_split_alloc() called before
>> +			 * irq is disabled, since xas_nomem() might not be
>> +			 * able to allocate enough memory.
>> +			 */
>> +			if (uniform_split)
>> +				xas_split(xas, folio, old_order);
>> +			else {
>> +				xas_set_order(xas, folio->index, split_order);
>> +				xas_set_err(xas, -ENOMEM);
>> +				if (xas_nomem(xas, 0))
>
> 0 gfp?

This is inside lru_lock and the allocation cannot sleep, so I am not sure
current_gfp_context(mapping_gfp_mask(mapping) & GFP_RECLAIM_MASK) can
be used.

I need Matthew to help me out about this.


>
>> +					xas_split(xas, folio, old_order);
>> +				else {
>> +					stop_split = true;
>> +					ret = -ENOMEM;
>> +					goto after_split;
>> +				}
>> +			}
>> +		}
>> +
>> +		split_page_memcg(&folio->page, old_order, split_order);
>
> __split_huge_page() has a comment for split_page_memcg(). Do we want to
> keep it? Is it safe to call it under lruvec lock?

Will add the comment back.

split_page_memcg() assigns memcg_data to new folios and bumps memcg ref counts,
so I assume it should be fine.

>
>> +		split_page_owner(&folio->page, old_order, split_order);
>> +		pgalloc_tag_split(folio, old_order, split_order);
>> +
>> +		status = __split_folio_to_order(folio, split_order);
>> +
>> +		if (status < 0)
>> +			return status;
>> +
>> +after_split:
>> +		/*
>> +		 * Iterate through after-split folios and perform related
>> +		 * operations. But in buddy allocator like split, the folio
>> +		 * containing the specified page is skipped until its order
>> +		 * is new_order, since the folio will be worked on in next
>> +		 * iteration.
>> +		 */
>> +		for_each_folio_until_end_safe(release, next, folio, end_folio) {
>> +			if (page_in_folio_offset(page, release) >= 0) {
>> +				folio = release;
>> +				if (split_order != new_order && !stop_split)
>> +					continue;
>
> I don't understand this condition.

This is for the buddy allocator like split. If split_order != new_order,
we are going to split “folio” (the one containing the provided page)
further, so we do not update the related stats or put it back on a list yet.
If stop_split is true, the folio failed to split in the code above,
so we stop splitting, put it back on the list, and return.

OK, I think I need to add code to bail out of the outer loop when stop_split
is true.
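
A sketch of that idea (not the posted fix): once the inner loop over the
after-split folios has finished for the current split_order, check the
flag and leave the outer loop:

	/* the xas_nomem() failure above ends the split here */
	if (stop_split)
		break;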

>
>> +			}
>> +			if (folio_test_anon(release))
>> +				mod_mthp_stat(folio_order(release),
>> +						MTHP_STAT_NR_ANON, 1);
>
> Add { } around the block.

Sure.

>
>> +
>> +			/*
>> +			 * Unfreeze refcount first. Additional reference from
>> +			 * page cache.
>> +			 */
>> +			folio_ref_unfreeze(release,
>> +				1 + ((!folio_test_anon(origin_folio) ||
>> +				     folio_test_swapcache(origin_folio)) ?
>> +					     folio_nr_pages(release) : 0));
>> +
>> +			if (release != origin_folio)
>> +				lru_add_page_tail(origin_folio, &release->page,
>> +						lruvec, list);
>> +
>> +			/* Some pages can be beyond EOF: drop them from page cache */
>> +			if (release->index >= end) {
>> +				if (shmem_mapping(origin_folio->mapping))
>> +					nr_dropped++;
>> +				else if (folio_test_clear_dirty(release))
>> +					folio_account_cleaned(release,
>> +						inode_to_wb(origin_folio->mapping->host));
>> +				__filemap_remove_folio(release, NULL);
>> +				folio_put(release);
>> +			} else if (!folio_test_anon(release)) {
>> +				__xa_store(&origin_folio->mapping->i_pages,
>> +						release->index, &release->page, 0);
>> +			} else if (swap_cache) {
>> +				__xa_store(&swap_cache->i_pages,
>> +						swap_cache_index(release->swap),
>> +						&release->page, 0);
>> +			}
>> +		}
>> +		xas_destroy(xas);
>> +	}
>> +
>> +	unlock_page_lruvec(lruvec);
>> +
>> +	if (folio_test_anon(origin_folio)) {
>> +		if (folio_test_swapcache(origin_folio))
>> +			xa_unlock(&swap_cache->i_pages);
>> +	} else
>> +		xa_unlock(&mapping->i_pages);
>> +
>> +	/* Caller disabled irqs, so they are still disabled here */
>> +	local_irq_enable();
>> +
>> +	if (nr_dropped)
>> +		shmem_uncharge(mapping->host, nr_dropped);
>> +
>> +	remap_page(origin_folio, 1 << order,
>> +			folio_test_anon(origin_folio) ?
>> +				RMP_USE_SHARED_ZEROPAGE : 0);
>> +
>> +	/*
>> +	 * At this point, folio should contain the specified page, so that it
>> +	 * will be left to the caller to unlock it.
>> +	 */
>> +	for_each_folio_until_end_safe(new_folio, next, origin_folio, next_folio) {
>> +		if (uniform_split && new_folio == folio)
>> +			continue;
>> +		if (!uniform_split && new_folio == origin_folio)
>> +			continue;
>> +
>> +		folio_unlock(new_folio);
>> +		/*
>> +		 * Subpages may be freed if there wasn't any mapping
>> +		 * like if add_to_swap() is running on a lru page that
>> +		 * had its mapping zapped. And freeing these pages
>> +		 * requires taking the lru_lock so we do the put_page
>> +		 * of the tail pages after the split is complete.
>> +		 */
>> +		free_page_and_swap_cache(&new_folio->page);
>> +	}
>> +	return ret;
>> +}
>> +
>>  /*
>>   * This function splits a large folio into smaller folios of order @new_order.
>>   * @page can point to any page of the large folio to split. The split operation
>> -- 
>> 2.45.2

Thank you for the review. I will address all the concerns in the next version.

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/6] mm/huge_memory: add two new (yet used) functions for folio_split()
  2024-11-06 22:06     ` Zi Yan
@ 2024-11-07 14:01       ` Kirill A . Shutemov
  2024-11-07 14:42         ` Zi Yan
  2024-11-07 15:01       ` Zi Yan
  1 sibling, 1 reply; 14+ messages in thread
From: Kirill A . Shutemov @ 2024-11-07 14:01 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox (Oracle),
	linux-mm, Ryan Roberts, Hugh Dickins, David Hildenbrand,
	Yang Shi, Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard,
	linux-kernel

On Wed, Nov 06, 2024 at 05:06:32PM -0500, Zi Yan wrote:
> >> +		} else {
> >> +			if (PageHead(head))
> >> +				ClearPageCompound(head);
> >
> > Huh? You only have to test for PageHead() because it is inside the loop.
> > It has to be done after loop is done.
> 
> You are right, will remove this and add the code below after the loop.
> 
> if (!new_order && PageHead(&folio->page))
> 	ClearPageCompound(&folio->page);

PageHead(&folio->page) is always true, isn't it?

> >> +	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
> >> +		if (!uniform_split)
> >> +			return -EINVAL;
> >
> > Why this limitation?
> 
> I am not closely following the status of mTHP support in swap. If it
> is supported, this can be removed. Right now, split_huge_page_to_list_to_order()
> only allows splitting a swapcache folio to order 0 [1].
> 
> [1] https://elixir.bootlin.com/linux/v6.12-rc6/source/mm/huge_memory.c#L3397

It would be nice to clarify this or at least add a comment.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/6] mm/huge_memory: add two new (yet used) functions for folio_split()
  2024-11-07 14:01       ` Kirill A . Shutemov
@ 2024-11-07 14:42         ` Zi Yan
  0 siblings, 0 replies; 14+ messages in thread
From: Zi Yan @ 2024-11-07 14:42 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Matthew Wilcox (Oracle),
	linux-mm, Ryan Roberts, Hugh Dickins, David Hildenbrand,
	Yang Shi, Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1173 bytes --]

On 7 Nov 2024, at 9:01, Kirill A . Shutemov wrote:

> On Wed, Nov 06, 2024 at 05:06:32PM -0500, Zi Yan wrote:
>>>> +		} else {
>>>> +			if (PageHead(head))
>>>> +				ClearPageCompound(head);
>>>
>>> Huh? You only have to test for PageHead() because it is inside the loop.
>>> It has to be done after loop is done.
>>
>> You are right, will remove this and add the code below after the loop.
>>
>> if (!new_order && PageHead(&folio->page))
>> 	ClearPageCompound(&folio->page);
>
> PageHead(&forlio->page) is always true, isn't it?

Yes. Will remove that if part. Thank you for pointing this out.
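
So, restating the agreed change as a sketch, the post-loop code is
simply:

	if (!new_order)
		ClearPageCompound(&folio->page);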

>
>>>> +	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
>>>> +		if (!uniform_split)
>>>> +			return -EINVAL;
>>>
>>> Why this limitation?
>>
>> I am not closely following the status of mTHP support in swap. If it
>> is supported, this can be removed. Right now, split_huge_page_to_list_to_order()
>> only allows splitting a swapcache folio to order 0 [1].
>>
>> [1] https://elixir.bootlin.com/linux/v6.12-rc6/source/mm/huge_memory.c#L3397
>
> It would be nice to clarify this or at least add a comment.

Sure. Will add a comment about it.
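
A possible wording, next to the existing check (not the final text):

	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
		/*
		 * Anonymous swap cache folios can currently only be
		 * split uniformly to order 0 (see
		 * split_huge_page_to_list_to_order()), so reject buddy
		 * allocator like split here.
		 */
		if (!uniform_split)
			return -EINVAL;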

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/6] mm/huge_memory: add two new (yet used) functions for folio_split()
  2024-11-06 22:06     ` Zi Yan
  2024-11-07 14:01       ` Kirill A . Shutemov
@ 2024-11-07 15:01       ` Zi Yan
  1 sibling, 0 replies; 14+ messages in thread
From: Zi Yan @ 2024-11-07 15:01 UTC (permalink / raw)
  To: Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: linux-mm, Ryan Roberts, Hugh Dickins, David Hildenbrand,
	Yang Shi, Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard,
	linux-kernel

>>
>>> +
>>> +		if (mapping) {
>>> +			/*
>>> +			 * uniform split has xas_split_alloc() called before
>>> +			 * irq is disabled, since xas_nomem() might not be
>>> +			 * able to allocate enough memory.
>>> +			 */
>>> +			if (uniform_split)
>>> +				xas_split(xas, folio, old_order);
>>> +			else {
>>> +				xas_set_order(xas, folio->index, split_order);
>>> +				xas_set_err(xas, -ENOMEM);
>>> +				if (xas_nomem(xas, 0))
>>
>> 0 gfp?
>
> This is inside lru_lock and the allocation cannot sleep, so I am not sure
> current_gfp_context(mapping_gfp_mask(mapping) & GFP_RECLAIM_MASK) can
> be used.
>
> I need Matthew to help me out about this.

Talked to Matthew about this, will use GFP_NOWAIT here, since we can fail
here and probably should not get into atomic reserves.
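
So the !uniform_split branch above would become roughly:

	xas_set_order(xas, folio->index, split_order);
	xas_set_err(xas, -ENOMEM);
	if (xas_nomem(xas, GFP_NOWAIT))
		xas_split(xas, folio, old_order);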

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-11-07 15:01 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-01 15:03 [PATCH v2 0/6] Buddy allocator like folio split Zi Yan
2024-11-01 15:03 ` [PATCH v2 1/6] mm/huge_memory: add two new (yet used) functions for folio_split() Zi Yan
2024-11-06 10:44   ` Kirill A . Shutemov
2024-11-06 22:06     ` Zi Yan
2024-11-07 14:01       ` Kirill A . Shutemov
2024-11-07 14:42         ` Zi Yan
2024-11-07 15:01       ` Zi Yan
2024-11-01 15:03 ` [PATCH v2 2/6] mm/huge_memory: move folio split common code to __folio_split() Zi Yan
2024-11-01 15:03 ` [PATCH v2 3/6] mm/huge_memory: add buddy allocator like folio_split() Zi Yan
2024-11-01 15:03 ` [PATCH v2 4/6] mm/huge_memory: remove the old, unused __split_huge_page() Zi Yan
2024-11-01 15:03 ` [PATCH v2 5/6] mm/huge_memory: add folio_split() to debugfs testing interface Zi Yan
2024-11-01 15:03 ` [PATCH v2 6/6] mm/truncate: use folio_split() for truncate operation Zi Yan
2024-11-02 15:39   ` kernel test robot
2024-11-02 17:22   ` kernel test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox