[PATCH v1 0/3] Buddy allocator like folio split

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v1 0/3] Buddy allocator like folio split
@ 2024-10-28 18:09 Zi Yan
  2024-10-28 18:09 ` [PATCH v1 1/3] mm/huge_memory: buddy allocator like folio_split() Zi Yan
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Zi Yan @ 2024-10-28 18:09 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, Kirill A . Shutemov,
	David Hildenbrand, Yang Shi, Miaohe Lin, Kefeng Wang, Yu Zhao,
	John Hubbard, linux-kernel, Zi Yan

Hi all

This patchset adds a new buddy allocator like large folio split to the total
number of resulting folios, the amount of memory needed for multi-index xarray
split, and keep more large folios after a split. It is on top of
mm-everything-2024-10-27-03-16.

Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split()
can support both uniform split and buddy allocator like split. All
existing split_huge_page*() users can be gradually converted to use
folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().

THP tests in selftesting passed for split_huge_page*() runs and I also
tested folio_split() for anon large folio, pagecache folio, and
truncate. I will run more extensive tests.

Changelog
===
From RFC[1]:
1. Merged backend code of split_huge_page_to_list_to_order() and
   folio_split(). The same code is used for both uniform split and buddy
   allocator like split.
2. Use xas_nomem() instead of xas_split_alloc() for folio_split().
3. folio_split() now leaves the first after-split folio unlocked,
   instead of the one containing the given page, since
   the caller of truncate_inode_partial_folio() locks and unlocks the
   first folio.
4. Extended split_huge_page debugfs to use folio_split().
5. Added truncate_inode_partial_folio() as first user of folio_split().

Design
===

folio_split() splits a large folio in the same way as buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if user wants to free the
3rd subpage in a order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon
O-1,      O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-9 if it is pagecache
Since anon folio does not support order-1 yet.

The split process is similar to existing approach:
1. Unmap all page mappings (split PMD mappings if exist);
2. Split meta data like memcg, page owner, page alloc tag;
3. Copy meta data in struct folio to sub pages, but instead of spliting
   the whole folio into multiple smaller ones with the same order in a
   shot, this approach splits the folio iteratively. Taking the example
   above, this approach first splits the original order-9 into two order-8,
   then splits left part of order-8 to two order-7 and so on;
4. Post-process split folios, like write mapping->i_pages for pagecache,
   adjust folio refcounts, add split folios to corresponding list;
5. Remap split folios
6. Unlock split folios.

__folio_split_without_mapping() and __split_folio_to_order() replace
__split_huge_page() and __split_huge_page_tail() respectively.
__folio_split_without_mapping() uses different approaches to perform
uniform split and buddy allocator like split:
1. uniform split: one single call to __split_folio_to_order() is used to
   uniformly split the given folio. All resulting folios are put back to
   the list after split. The folio containing the given page is left to
   caller to unlock and others are unlocked.

2. buddy allocator like split: old_order - new_order calls to
   __split_folio_to_order() are used to split the given folio at order N to
   order N-1. After each call, the target folio is changed to the one
   containing the page, which is given via folio_split() parameters.
   After each call, folios not containing the page are put back to the list.
   The folio containing the page is put back to the list when its order
   is new_order. All folios are unlocked except the first folio, which
   is left to caller to unlock.

TODOs
===
1. xas_nomem() is used in buddy allocator like split like Matthew
   suggested, but I see kmemleak during its use. I would like to get some
   code review from Matthew on it.

2. A proper error handling is needed when xas_nomem() fails to allocate memory
   for xas_split(). The target folio should be put back to the list and
   all after-split folios should be unlocked. (I realize this when I am
   writing the cover letter)

Any comments and/or suggestions are welcome. Thanks.

[1] https://lore.kernel.org/linux-mm/20241008223748.555845-1-ziy@nvidia.com/

Zi Yan (3):
  mm/huge_memory: buddy allocator like folio_split()
  mm/huge_memory: add folio_split() to debugfs testing interface.
  mm/truncate: use folio_split() for truncate operation.

 include/linux/huge_mm.h |  12 +
 mm/huge_memory.c        | 650 +++++++++++++++++++++++++---------------
 mm/truncate.c           |   5 +-
 3 files changed, 422 insertions(+), 245 deletions(-)

-- 
2.45.2

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v1 1/3] mm/huge_memory: buddy allocator like folio_split()
  2024-10-28 18:09 [PATCH v1 0/3] Buddy allocator like folio split Zi Yan
@ 2024-10-28 18:09 ` Zi Yan
  2024-10-29 10:24   ` kernel test robot
  2024-10-31 10:14   ` Kirill A . Shutemov
  2024-10-28 18:09 ` [PATCH v1 2/3] mm/huge_memory: add folio_split() to debugfs testing interface Zi Yan
  2024-10-28 18:09 ` [PATCH v1 3/3] mm/truncate: use folio_split() for truncate operation Zi Yan
  2 siblings, 2 replies; 7+ messages in thread
From: Zi Yan @ 2024-10-28 18:09 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, Kirill A . Shutemov,
	David Hildenbrand, Yang Shi, Miaohe Lin, Kefeng Wang, Yu Zhao,
	John Hubbard, linux-kernel, Zi Yan

folio_split() splits a large folio in the same way as buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if user wants to free the
3rd subpage in a order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon
O-1,      O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-9 if it is pagecache
Since anon folio does not support order-1 yet.

It generates fewer folios than existing page split approach, which splits
the order-9 to 512 order-0 folios.

To minimize code duplication, __split_huge_page() and
__split_huge_page_tail() are replaced by __folio_split_without_mapping()
and __split_folio_to_order() respectively.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 604 +++++++++++++++++++++++++++++------------------
 1 file changed, 372 insertions(+), 232 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 832ca761b4c3..0224925e4c3c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3135,7 +3135,6 @@ static void remap_page(struct folio *folio, unsigned long nr, int flags)
 static void lru_add_page_tail(struct folio *folio, struct page *tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
-	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
 	VM_BUG_ON_FOLIO(PageLRU(tail), folio);
 	lockdep_assert_held(&lruvec->lru_lock);
 
@@ -3155,202 +3154,325 @@ static void lru_add_page_tail(struct folio *folio, struct page *tail,
 	}
 }
 
-static void __split_huge_page_tail(struct folio *folio, int tail,
-		struct lruvec *lruvec, struct list_head *list,
-		unsigned int new_order)
+/* Racy check whether the huge page can be split */
+bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 {
-	struct page *head = &folio->page;
-	struct page *page_tail = head + tail;
-	/*
-	 * Careful: new_folio is not a "real" folio before we cleared PageTail.
-	 * Don't pass it around before clear_compound_head().
-	 */
-	struct folio *new_folio = (struct folio *)page_tail;
+	int extra_pins;
 
-	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
+	/* Additional pins from page cache */
+	if (folio_test_anon(folio))
+		extra_pins = folio_test_swapcache(folio) ?
+				folio_nr_pages(folio) : 0;
+	else
+		extra_pins = folio_nr_pages(folio);
+	if (pextra_pins)
+		*pextra_pins = extra_pins;
+	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
+					caller_pins;
+}
 
-	/*
-	 * Clone page flags before unfreezing refcount.
-	 *
-	 * After successful get_page_unless_zero() might follow flags change,
-	 * for example lock_page() which set PG_waiters.
-	 *
-	 * Note that for mapped sub-pages of an anonymous THP,
-	 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
-	 * the migration entry instead from where remap_page() will restore it.
-	 * We can still have PG_anon_exclusive set on effectively unmapped and
-	 * unreferenced sub-pages of an anonymous THP: we can simply drop
-	 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
-	 */
-	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
-	page_tail->flags |= (head->flags &
-			((1L << PG_referenced) |
-			 (1L << PG_swapbacked) |
-			 (1L << PG_swapcache) |
-			 (1L << PG_mlocked) |
-			 (1L << PG_uptodate) |
-			 (1L << PG_active) |
-			 (1L << PG_workingset) |
-			 (1L << PG_locked) |
-			 (1L << PG_unevictable) |
+static long page_in_folio_offset(struct page *page, struct folio *folio)
+{
+	long nr_pages = folio_nr_pages(folio);
+	unsigned long pages_pfn = page_to_pfn(page);
+	unsigned long folios_pfn = folio_pfn(folio);
+
+	if (pages_pfn >= folios_pfn && pages_pfn < (folios_pfn + nr_pages))
+		return pages_pfn - folios_pfn;
+
+	return -EINVAL;
+}
+
+/*
+ * It splits @folio into @new_order folios and copies the @folio metadata to
+ * all the resulting folios.
+ */
+static int __split_folio_to_order(struct folio *folio, int new_order)
+{
+	int curr_order = folio_order(folio);
+	long nr_pages = folio_nr_pages(folio);
+	long new_nr_pages = 1 << new_order;
+	long index;
+
+	if (curr_order <= new_order)
+		return -EINVAL;
+
+	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
+		struct page *head = &folio->page;
+		struct page *second_head = head + index;
+
+		/*
+		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
+		 * Don't pass it around before clear_compound_head().
+		 */
+		struct folio *new_folio = (struct folio *)second_head;
+
+		VM_BUG_ON_PAGE(atomic_read(&second_head->_mapcount) != -1, second_head);
+
+		/*
+		 * Clone page flags before unfreezing refcount.
+		 *
+		 * After successful get_page_unless_zero() might follow flags change,
+		 * for example lock_page() which set PG_waiters.
+		 *
+		 * Note that for mapped sub-pages of an anonymous THP,
+		 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
+		 * the migration entry instead from where remap_page() will restore it.
+		 * We can still have PG_anon_exclusive set on effectively unmapped and
+		 * unreferenced sub-pages of an anonymous THP: we can simply drop
+		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
+		 */
+		second_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		second_head->flags |= (head->flags &
+				((1L << PG_referenced) |
+				 (1L << PG_swapbacked) |
+				 (1L << PG_swapcache) |
+				 (1L << PG_mlocked) |
+				 (1L << PG_uptodate) |
+				 (1L << PG_active) |
+				 (1L << PG_workingset) |
+				 (1L << PG_locked) |
+				 (1L << PG_unevictable) |
 #ifdef CONFIG_ARCH_USES_PG_ARCH_2
-			 (1L << PG_arch_2) |
+				 (1L << PG_arch_2) |
 #endif
 #ifdef CONFIG_ARCH_USES_PG_ARCH_3
-			 (1L << PG_arch_3) |
+				 (1L << PG_arch_3) |
 #endif
-			 (1L << PG_dirty) |
-			 LRU_GEN_MASK | LRU_REFS_MASK));
+				 (1L << PG_dirty) |
+				 LRU_GEN_MASK | LRU_REFS_MASK));
 
-	/* ->mapping in first and second tail page is replaced by other uses */
-	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
-			page_tail);
-	new_folio->mapping = folio->mapping;
-	new_folio->index = folio->index + tail;
+		/* ->mapping in first and second tail page is replaced by other uses */
+		VM_BUG_ON_PAGE(new_nr_pages > 2 && second_head->mapping != TAIL_MAPPING,
+			       second_head);
+		second_head->mapping = head->mapping;
+		second_head->index = head->index + index;
 
-	/*
-	 * page->private should not be set in tail pages. Fix up and warn once
-	 * if private is unexpectedly set.
-	 */
-	if (unlikely(page_tail->private)) {
-		VM_WARN_ON_ONCE_PAGE(true, page_tail);
-		page_tail->private = 0;
-	}
-	if (folio_test_swapcache(folio))
-		new_folio->swap.val = folio->swap.val + tail;
+		/*
+		 * page->private should not be set in tail pages. Fix up and warn once
+		 * if private is unexpectedly set.
+		 */
+		if (unlikely(second_head->private)) {
+			VM_WARN_ON_ONCE_PAGE(true, second_head);
+			second_head->private = 0;
+		}
+		if (folio_test_swapcache(folio))
+			new_folio->swap.val = folio->swap.val + index;
 
-	/* Page flags must be visible before we make the page non-compound. */
-	smp_wmb();
+		/* Page flags must be visible before we make the page non-compound. */
+		smp_wmb();
 
-	/*
-	 * Clear PageTail before unfreezing page refcount.
-	 *
-	 * After successful get_page_unless_zero() might follow put_page()
-	 * which needs correct compound_head().
-	 */
-	clear_compound_head(page_tail);
-	if (new_order) {
-		prep_compound_page(page_tail, new_order);
-		folio_set_large_rmappable(new_folio);
-	}
+		/*
+		 * Clear PageTail before unfreezing page refcount.
+		 *
+		 * After successful get_page_unless_zero() might follow put_page()
+		 * which needs correct compound_head().
+		 */
+		clear_compound_head(second_head);
+		if (new_order) {
+			prep_compound_page(second_head, new_order);
+			folio_set_large_rmappable(new_folio);
 
-	/* Finally unfreeze refcount. Additional reference from page cache. */
-	page_ref_unfreeze(page_tail,
-		1 + ((!folio_test_anon(folio) || folio_test_swapcache(folio)) ?
-			     folio_nr_pages(new_folio) : 0));
+			folio_set_order(folio, new_order);
+		} else {
+			if (PageHead(head))
+				ClearPageCompound(head);
+		}
 
-	if (folio_test_young(folio))
-		folio_set_young(new_folio);
-	if (folio_test_idle(folio))
-		folio_set_idle(new_folio);
+		if (folio_test_young(folio))
+			folio_set_young(new_folio);
+		if (folio_test_idle(folio))
+			folio_set_idle(new_folio);
 
-	folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
+		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
+	}
 
-	/*
-	 * always add to the tail because some iterators expect new
-	 * pages to show after the currently processed elements - e.g.
-	 * migrate_pages
-	 */
-	lru_add_page_tail(folio, page_tail, lruvec, list);
+	return 0;
 }
 
-static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned int new_order)
+#define for_each_folio_until_end_safe(iter, iter2, start, end)	\
+	for (iter = start, iter2 = folio_next(start);		\
+	     iter != end;					\
+	     iter = iter2, iter2 = folio_next(iter2))
+
+/*
+ * It splits a @folio (without mapping) to lower order smaller folios in two
+ * ways.
+ * 1. uniform split: the given @folio into multiple @new_order small folios,
+ *    where all small folios have the same order. This is done when
+ *    uniform_split is true.
+ * 2. buddy allocator like split: the given @folio is split into half and one
+ *    of the half (containing the given page) is split into half until the
+ *    given @page's order becomes @new_order. This is done when uniform_split is
+ *    false.
+ *
+ * The high level flow for these two methods are:
+ * 1. uniform split: a single __split_folio_to_order() is called to split the
+ *    @folio into @new_order, then we traverse all the resulting folios one by
+ *    one in PFN ascending order and perform stats, unfreeze, adding to list,
+ *    and file mapping index operations.
+ * 2. buddy allocator like split: in general, folio_order - @new_order calls to
+ *    __split_folio_to_order() are called in the for loop to split the @folio
+ *    to one lower order at a time. The resulting small folios are processed
+ *    like what is done during the traversal in 1, except the one containing
+ *    @page, which is split in next for loop.
+ *
+ * After splitting, the caller's folio reference will be transferred to the
+ * folio containing @page. The other folios may be freed if they are not mapped.
+ *
+ * In terms of locking, after splitting,
+ * 1. uniform split leaves @page (or the folio contains it) locked;
+ * 2. buddy allocator like split leaves @folio locked.
+ *
+ * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
+ */
+static int __folio_split_without_mapping(struct folio *folio, int new_order,
+		struct page *page, struct list_head *list, pgoff_t end,
+		struct xa_state *xas, struct address_space *mapping,
+		bool uniform_split)
 {
-	struct folio *folio = page_folio(page);
-	struct page *head = &folio->page;
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
-	unsigned long offset = 0;
-	int i, nr_dropped = 0;
-	unsigned int new_nr = 1 << new_order;
+	struct folio *origin_folio = folio;
+	struct folio *next_folio = folio_next(folio);
+	struct folio *new_folio;
+	struct folio *next;
 	int order = folio_order(folio);
-	unsigned int nr = 1 << order;
-
-	/* complete memcg works before add pages to LRU */
-	split_page_memcg(head, order, new_order);
+	int split_order = order - 1;
+	int nr_dropped = 0;
 
 	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
-		offset = swap_cache_index(folio->swap);
+		if (!uniform_split)
+			return -EINVAL;
+
 		swap_cache = swap_address_space(folio->swap);
 		xa_lock(&swap_cache->i_pages);
 	}
 
+	if (folio_test_anon(folio))
+		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
+
 	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
 	lruvec = folio_lruvec_lock(folio);
 
-	ClearPageHasHWPoisoned(head);
-
-	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
-		struct folio *tail;
-		__split_huge_page_tail(folio, i, lruvec, list, new_order);
-		tail = page_folio(head + i);
-		/* Some pages can be beyond EOF: drop them from page cache */
-		if (tail->index >= end) {
-			if (shmem_mapping(folio->mapping))
-				nr_dropped++;
-			else if (folio_test_clear_dirty(tail))
-				folio_account_cleaned(tail,
-					inode_to_wb(folio->mapping->host));
-			__filemap_remove_folio(tail, NULL);
-			folio_put(tail);
-		} else if (!folio_test_anon(folio)) {
-			__xa_store(&folio->mapping->i_pages, tail->index,
-					tail, 0);
-		} else if (swap_cache) {
-			__xa_store(&swap_cache->i_pages, offset + i,
-					tail, 0);
+	/*
+	 * split to new_order one order at a time. For uniform split,
+	 * intermediate orders are skipped
+	 */
+	for (split_order = order - 1; split_order >= new_order; split_order--) {
+		int old_order = folio_order(folio);
+		struct folio *release;
+		struct folio *end_folio = folio_next(folio);
+		int status;
+
+		if (folio_test_anon(folio) && split_order == 1)
+			continue;
+		if (uniform_split && split_order != new_order)
+			continue;
+
+		if (mapping) {
+			/*
+			 * uniform split has xas_split_alloc() called before
+			 * irq is disabled, since xas_nomem() might not be
+			 * able to allocate enough memory.
+			 */
+			if (uniform_split)
+				xas_split(xas, folio, old_order);
+			else {
+				xas_set_order(xas, folio->index, split_order);
+				xas_set_err(xas, -ENOMEM);
+				if (xas_nomem(xas, 0))
+					xas_split(xas, folio, old_order);
+				else
+					return -ENOMEM;
+			}
 		}
-	}
 
-	if (!new_order)
-		ClearPageCompound(head);
-	else {
-		struct folio *new_folio = (struct folio *)head;
+		split_page_memcg(&folio->page, old_order, split_order);
+		split_page_owner(&folio->page, old_order, split_order);
+		pgalloc_tag_split(folio, old_order, split_order);
 
-		folio_set_order(new_folio, new_order);
-	}
-	unlock_page_lruvec(lruvec);
-	/* Caller disabled irqs, so they are still disabled here */
+		status = __split_folio_to_order(folio, split_order);
 
-	split_page_owner(head, order, new_order);
-	pgalloc_tag_split(folio, order, new_order);
+		if (status < 0)
+			return status;
 
-	/* See comment in __split_huge_page_tail() */
-	if (folio_test_anon(folio)) {
-		/* Additional pin to swap cache */
-		if (folio_test_swapcache(folio)) {
-			folio_ref_add(folio, 1 + new_nr);
-			xa_unlock(&swap_cache->i_pages);
-		} else {
-			folio_ref_inc(folio);
+		/*
+		 * Iterate through after-split folios and perform related
+		 * operations. But in buddy allocator like split, the folio
+		 * containing the specified page is skipped until its order
+		 * is new_order, since the folio will be worked on in next
+		 * iteration.
+		 */
+		for_each_folio_until_end_safe(release, next, folio, end_folio) {
+			if (page_in_folio_offset(page, release) >= 0) {
+				folio = release;
+				if (split_order != new_order)
+					continue;
+			}
+			if (folio_test_anon(release))
+				mod_mthp_stat(folio_order(release),
+						MTHP_STAT_NR_ANON, 1);
+
+			/*
+			 * Unfreeze refcount first. Additional reference from
+			 * page cache.
+			 */
+			folio_ref_unfreeze(release,
+				1 + ((!folio_test_anon(origin_folio) ||
+				     folio_test_swapcache(origin_folio)) ?
+					     folio_nr_pages(release) : 0));
+
+			if (release != origin_folio)
+				lru_add_page_tail(origin_folio, &release->page,
+						lruvec, list);
+
+			/* Some pages can be beyond EOF: drop them from page cache */
+			if (release->index >= end) {
+				if (shmem_mapping(origin_folio->mapping))
+					nr_dropped++;
+				else if (folio_test_clear_dirty(release))
+					folio_account_cleaned(release,
+						inode_to_wb(origin_folio->mapping->host));
+				__filemap_remove_folio(release, NULL);
+				folio_put(release);
+			} else if (!folio_test_anon(release)) {
+				__xa_store(&origin_folio->mapping->i_pages,
+						release->index, &release->page, 0);
+			} else if (swap_cache) {
+				__xa_store(&swap_cache->i_pages,
+						swap_cache_index(release->swap),
+						&release->page, 0);
+			}
 		}
-	} else {
-		/* Additional pin to page cache */
-		folio_ref_add(folio, 1 + new_nr);
-		xa_unlock(&folio->mapping->i_pages);
 	}
+
+	unlock_page_lruvec(lruvec);
+
+	if (folio_test_anon(origin_folio)) {
+		if (folio_test_swapcache(origin_folio))
+			xa_unlock(&swap_cache->i_pages);
+	} else
+		xa_unlock(&mapping->i_pages);
+
+	/* Caller disabled irqs, so they are still disabled here */
 	local_irq_enable();
 
-	if (nr_dropped)
-		shmem_uncharge(folio->mapping->host, nr_dropped);
-	remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
+	remap_page(origin_folio, 1 << order,
+			folio_test_anon(origin_folio) ?
+				RMP_USE_SHARED_ZEROPAGE : 0);
 
 	/*
-	 * set page to its compound_head when split to non order-0 pages, so
-	 * we can skip unlocking it below, since PG_locked is transferred to
-	 * the compound_head of the page and the caller will unlock it.
+	 * At this point, folio should contain the specified page, so that it
+	 * will be left to the caller to unlock it.
 	 */
-	if (new_order)
-		page = compound_head(page);
-
-	for (i = 0; i < nr; i += new_nr) {
-		struct page *subpage = head + i;
-		struct folio *new_folio = page_folio(subpage);
-		if (subpage == page)
+	for_each_folio_until_end_safe(new_folio, next, origin_folio, next_folio) {
+		if (uniform_split && new_folio == folio)
+			continue;
+		if (!uniform_split && new_folio == origin_folio)
 			continue;
-		folio_unlock(new_folio);
 
+		folio_unlock(new_folio);
 		/*
 		 * Subpages may be freed if there wasn't any mapping
 		 * like if add_to_swap() is running on a lru page that
@@ -3358,81 +3480,18 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		 * requires taking the lru_lock so we do the put_page
 		 * of the tail pages after the split is complete.
 		 */
-		free_page_and_swap_cache(subpage);
+		free_page_and_swap_cache(&new_folio->page);
 	}
+	return 0;
 }
 
-/* Racy check whether the huge page can be split */
-bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
-{
-	int extra_pins;
 
-	/* Additional pins from page cache */
-	if (folio_test_anon(folio))
-		extra_pins = folio_test_swapcache(folio) ?
-				folio_nr_pages(folio) : 0;
-	else
-		extra_pins = folio_nr_pages(folio);
-	if (pextra_pins)
-		*pextra_pins = extra_pins;
-	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
-					caller_pins;
-}
 
-/*
- * This function splits a large folio into smaller folios of order @new_order.
- * @page can point to any page of the large folio to split. The split operation
- * does not change the position of @page.
- *
- * Prerequisites:
- *
- * 1) The caller must hold a reference on the @page's owning folio, also known
- *    as the large folio.
- *
- * 2) The large folio must be locked.
- *
- * 3) The folio must not be pinned. Any unexpected folio references, including
- *    GUP pins, will result in the folio not getting split; instead, the caller
- *    will receive an -EAGAIN.
- *
- * 4) @new_order > 1, usually. Splitting to order-1 anonymous folios is not
- *    supported for non-file-backed folios, because folio->_deferred_list, which
- *    is used by partially mapped folios, is stored in subpage 2, but an order-1
- *    folio only has subpages 0 and 1. File-backed order-1 folios are supported,
- *    since they do not use _deferred_list.
- *
- * After splitting, the caller's folio reference will be transferred to @page,
- * resulting in a raised refcount of @page after this call. The other pages may
- * be freed if they are not mapped.
- *
- * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
- *
- * Pages in @new_order will inherit the mapping, flags, and so on from the
- * huge page.
- *
- * Returns 0 if the huge page was split successfully.
- *
- * Returns -EAGAIN if the folio has unexpected reference (e.g., GUP) or if
- * the folio was concurrently removed from the page cache.
- *
- * Returns -EBUSY when trying to split the huge zeropage, if the folio is
- * under writeback, if fs-specific folio metadata cannot currently be
- * released, or if some unexpected race happened (e.g., anon VMA disappeared,
- * truncation).
- *
- * Callers should ensure that the order respects the address space mapping
- * min-order if one is set for non-anonymous folios.
- *
- * Returns -EINVAL when trying to split to an order that is incompatible
- * with the folio. Splitting to order 0 is compatible with all folios.
- */
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-				     unsigned int new_order)
+static int __folio_split(struct folio *folio, unsigned int new_order,
+		struct page *page, struct list_head *list, bool uniform_split)
 {
-	struct folio *folio = page_folio(page);
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
-	/* reset xarray order to new order after split */
-	XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
+	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
 	bool is_anon = folio_test_anon(folio);
 	struct address_space *mapping = NULL;
 	struct anon_vma *anon_vma = NULL;
@@ -3453,9 +3512,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			VM_WARN_ONCE(1, "Cannot split to order-1 folio");
 			return -EINVAL;
 		}
-	} else if (new_order) {
+	} else {
 		/* Split shmem folio to non-zero order not supported */
-		if (shmem_mapping(folio->mapping)) {
+		if ((!uniform_split || new_order) &&
+		    shmem_mapping(folio->mapping)) {
 			VM_WARN_ONCE(1,
 				"Cannot split shmem folio to non-0 order");
 			return -EINVAL;
@@ -3466,7 +3526,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		 * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
 		 * does not actually support large folios properly.
 		 */
-		if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+		if (new_order && IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
 		    !mapping_large_folio_support(folio->mapping)) {
 			VM_WARN_ONCE(1,
 				"Cannot split file folio to non-0 order");
@@ -3475,7 +3535,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	}
 
 	/* Only swapping a whole PMD-mapped folio is supported */
-	if (folio_test_swapcache(folio) && new_order)
+	if (folio_test_swapcache(folio) && (!uniform_split || new_order))
 		return -EINVAL;
 
 	is_hzp = is_huge_zero_folio(folio);
@@ -3532,10 +3592,13 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			goto out;
 		}
 
-		xas_split_alloc(&xas, folio, folio_order(folio), gfp);
-		if (xas_error(&xas)) {
-			ret = xas_error(&xas);
-			goto out;
+		if (uniform_split) {
+			xas_set_order(&xas, folio->index, new_order);
+			xas_split_alloc(&xas, folio, folio_order(folio), gfp);
+			if (xas_error(&xas)) {
+				ret = xas_error(&xas);
+				goto out;
+			}
 		}
 
 		anon_vma = NULL;
@@ -3600,7 +3663,6 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		if (mapping) {
 			int nr = folio_nr_pages(folio);
 
-			xas_split(&xas, folio, folio_order(folio));
 			if (folio_test_pmd_mappable(folio) &&
 			    new_order < HPAGE_PMD_ORDER) {
 				if (folio_test_swapbacked(folio)) {
@@ -3618,8 +3680,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
 			mod_mthp_stat(new_order, MTHP_STAT_NR_ANON, 1 << (order - new_order));
 		}
-		__split_huge_page(page, list, end, new_order);
-		ret = 0;
+		ret = __folio_split_without_mapping(page_folio(page), new_order,
+				page, list, end, &xas, mapping, uniform_split);
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:
@@ -3645,6 +3707,61 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
+/*
+ * This function splits a large folio into smaller folios of order @new_order.
+ * @page can point to any page of the large folio to split. The split operation
+ * does not change the position of @page.
+ *
+ * Prerequisites:
+ *
+ * 1) The caller must hold a reference on the @page's owning folio, also known
+ *    as the large folio.
+ *
+ * 2) The large folio must be locked.
+ *
+ * 3) The folio must not be pinned. Any unexpected folio references, including
+ *    GUP pins, will result in the folio not getting split; instead, the caller
+ *    will receive an -EAGAIN.
+ *
+ * 4) @new_order > 1, usually. Splitting to order-1 anonymous folios is not
+ *    supported for non-file-backed folios, because folio->_deferred_list, which
+ *    is used by partially mapped folios, is stored in subpage 2, but an order-1
+ *    folio only has subpages 0 and 1. File-backed order-1 folios are supported,
+ *    since they do not use _deferred_list.
+ *
+ * After splitting, the caller's folio reference will be transferred to @page,
+ * resulting in a raised refcount of @page after this call. The other pages may
+ * be freed if they are not mapped.
+ *
+ * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
+ *
+ * Pages in @new_order will inherit the mapping, flags, and so on from the
+ * huge page.
+ *
+ * Returns 0 if the huge page was split successfully.
+ *
+ * Returns -EAGAIN if the folio has unexpected reference (e.g., GUP) or if
+ * the folio was concurrently removed from the page cache.
+ *
+ * Returns -EBUSY when trying to split the huge zeropage, if the folio is
+ * under writeback, if fs-specific folio metadata cannot currently be
+ * released, or if some unexpected race happened (e.g., anon VMA disappeared,
+ * truncation).
+ *
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set for non-anonymous folios.
+ *
+ * Returns -EINVAL when trying to split to an order that is incompatible
+ * with the folio. Splitting to order 0 is compatible with all folios.
+ */
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+				     unsigned int new_order)
+{
+	struct folio *folio = page_folio(page);
+
+	return __folio_split(folio, new_order, page, list, true);
+}
+
 int min_order_for_split(struct folio *folio)
 {
 	if (folio_test_anon(folio))
@@ -3669,6 +3786,29 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
 	return split_huge_page_to_list_to_order(&folio->page, list, ret);
 }
 
+/*
+ * folio_split: split a folio at offset_in_new_order to a new_order folio
+ * @folio: folio to split
+ * @new_order: the order of the new folio
+ * @page: a page within the new folio
+ *
+ * return: 0: successful, <0 failed
+ *
+ * Split a folio at offset_in_new_order to a new_order folio, leave the
+ * remaining subpages of the original folio as large as possible. For example,
+ * split an order-9 folio at its third order-3 subpages to an order-3 folio.
+ * There are 2^6=64 order-3 subpages in an order-9 folio and the result will be
+ * a set of folios with different order and the new folio is in bracket:
+ * [order-4, {order-3}, order-3, order-5, order-6, order-7, order-8].
+ *
+ * After split, folio is left locked for caller.
+ */
+static int folio_split(struct folio *folio, unsigned int new_order,
+		struct page *page, struct list_head *list)
+{
+	return __folio_split(folio, new_order, page, list, false);
+}
+
 void __folio_undo_large_rmappable(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
-- 
2.45.2



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v1 2/3] mm/huge_memory: add folio_split() to debugfs testing interface.
  2024-10-28 18:09 [PATCH v1 0/3] Buddy allocator like folio split Zi Yan
  2024-10-28 18:09 ` [PATCH v1 1/3] mm/huge_memory: buddy allocator like folio_split() Zi Yan
@ 2024-10-28 18:09 ` Zi Yan
  2024-10-28 18:09 ` [PATCH v1 3/3] mm/truncate: use folio_split() for truncate operation Zi Yan
  2 siblings, 0 replies; 7+ messages in thread
From: Zi Yan @ 2024-10-28 18:09 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, Kirill A . Shutemov,
	David Hildenbrand, Yang Shi, Miaohe Lin, Kefeng Wang, Yu Zhao,
	John Hubbard, linux-kernel, Zi Yan

This allows to test folio_split() by specifying an additional in folio
page offset parameter to split_huge_page debugfs interface.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 46 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0224925e4c3c..4ccd23473e2b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4072,7 +4072,8 @@ static inline bool vma_not_suitable_for_thp_split(struct vm_area_struct *vma)
 }
 
 static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
-				unsigned long vaddr_end, unsigned int new_order)
+				unsigned long vaddr_end, unsigned int new_order,
+				long in_folio_offset)
 {
 	int ret = 0;
 	struct task_struct *task;
@@ -4156,8 +4157,16 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!folio_test_anon(folio) && folio->mapping != mapping)
 			goto unlock;
 
-		if (!split_folio_to_order(folio, target_order))
-			split++;
+		if (in_folio_offset < 0 ||
+		    in_folio_offset >= folio_nr_pages(folio)) {
+			if (!split_folio_to_order(folio, target_order))
+				split++;
+		} else {
+			struct page *split_at = folio_page(folio,
+							   in_folio_offset);
+			if (!folio_split(folio, target_order, split_at, NULL))
+				split++;
+		}
 
 unlock:
 
@@ -4180,7 +4189,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 }
 
 static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
-				pgoff_t off_end, unsigned int new_order)
+				pgoff_t off_end, unsigned int new_order,
+				long in_folio_offset)
 {
 	struct filename *file;
 	struct file *candidate;
@@ -4229,8 +4239,15 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		if (folio->mapping != mapping)
 			goto unlock;
 
-		if (!split_folio_to_order(folio, target_order))
-			split++;
+		if (in_folio_offset < 0 || in_folio_offset >= nr_pages) {
+			if (!split_folio_to_order(folio, target_order))
+				split++;
+		} else {
+			struct page *split_at = folio_page(folio,
+							   in_folio_offset);
+			if (!folio_split(folio, target_order, split_at, NULL))
+				split++;
+		}
 
 unlock:
 		folio_unlock(folio);
@@ -4263,6 +4280,7 @@ static ssize_t split_huge_pages_write(struct file *file, const char __user *buf,
 	int pid;
 	unsigned long vaddr_start, vaddr_end;
 	unsigned int new_order = 0;
+	long in_folio_offset = -1;
 
 	ret = mutex_lock_interruptible(&split_debug_mutex);
 	if (ret)
@@ -4291,29 +4309,33 @@ static ssize_t split_huge_pages_write(struct file *file, const char __user *buf,
 			goto out;
 		}
 
-		ret = sscanf(buf, "0x%lx,0x%lx,%d", &off_start, &off_end, &new_order);
-		if (ret != 2 && ret != 3) {
+		ret = sscanf(buf, "0x%lx,0x%lx,%d,%ld", &off_start, &off_end,
+				&new_order, &in_folio_offset);
+		if (ret != 2 && ret != 3 && ret != 4) {
 			ret = -EINVAL;
 			goto out;
 		}
-		ret = split_huge_pages_in_file(file_path, off_start, off_end, new_order);
+		ret = split_huge_pages_in_file(file_path, off_start, off_end,
+				new_order, in_folio_offset);
 		if (!ret)
 			ret = input_len;
 
 		goto out;
 	}
 
-	ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d", &pid, &vaddr_start, &vaddr_end, &new_order);
+	ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d,%ld", &pid, &vaddr_start,
+			&vaddr_end, &new_order, &in_folio_offset);
 	if (ret == 1 && pid == 1) {
 		split_huge_pages_all();
 		ret = strlen(input_buf);
 		goto out;
-	} else if (ret != 3 && ret != 4) {
+	} else if (ret != 3 && ret != 4 && ret != 5) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	ret = split_huge_pages_pid(pid, vaddr_start, vaddr_end, new_order);
+	ret = split_huge_pages_pid(pid, vaddr_start, vaddr_end, new_order,
+			in_folio_offset);
 	if (!ret)
 		ret = strlen(input_buf);
 out:
-- 
2.45.2



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v1 3/3] mm/truncate: use folio_split() for truncate operation.
  2024-10-28 18:09 [PATCH v1 0/3] Buddy allocator like folio split Zi Yan
  2024-10-28 18:09 ` [PATCH v1 1/3] mm/huge_memory: buddy allocator like folio_split() Zi Yan
  2024-10-28 18:09 ` [PATCH v1 2/3] mm/huge_memory: add folio_split() to debugfs testing interface Zi Yan
@ 2024-10-28 18:09 ` Zi Yan
  2 siblings, 0 replies; 7+ messages in thread
From: Zi Yan @ 2024-10-28 18:09 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, Kirill A . Shutemov,
	David Hildenbrand, Yang Shi, Miaohe Lin, Kefeng Wang, Yu Zhao,
	John Hubbard, linux-kernel, Zi Yan

Instead of splitting the large folio uniformly during truncation, use
buddy allocator like split at the start of truncation range to minimize
the number of resulting folios.

For example, to truncate a order-4 folio
[0, 1, 2, 3, 4, 5, ..., 15]
between [3, 10] (inclusive), folio_split() splits the folio to
[0,1], [2], [3], [4..7], [8..15] and [3], [4..7] can be dropped and
[8..15] is kept with zeros in [8..10].

It is possible to further do a folio_split() at 10, so more resulting
folios can be dropped. But it is left as future possible optimization
if needed.

Another possible optimization is to make folio_split() to split a folio
based on a given range, like [3..10] above. But that complicates
folio_split(), so it will investigated when necessary.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 12 ++++++++++++
 mm/huge_memory.c        |  2 +-
 mm/truncate.c           |  5 ++++-
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b94c2e8ee918..8048500e7bc2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -339,6 +339,18 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
 int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
+int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
+		struct list_head *list);
+static inline int split_folio_at(struct folio *folio, struct page *page,
+		struct list_head *list)
+{
+	int ret = min_order_for_split(folio);
+
+	if (ret < 0)
+		return ret;
+
+	return folio_split(folio, ret, page, list);
+}
 static inline int split_huge_page(struct page *page)
 {
 	struct folio *folio = page_folio(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4ccd23473e2b..a688b73fa793 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3803,7 +3803,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
  *
  * After split, folio is left locked for caller.
  */
-static int folio_split(struct folio *folio, unsigned int new_order,
+int folio_split(struct folio *folio, unsigned int new_order,
 		struct page *page, struct list_head *list)
 {
 	return __folio_split(folio, new_order, page, list, false);
diff --git a/mm/truncate.c b/mm/truncate.c
index e5151703ba04..dbd81c21b460 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -179,6 +179,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 {
 	loff_t pos = folio_pos(folio);
 	unsigned int offset, length;
+	long in_folio_offset;
 
 	if (pos < start)
 		offset = start - pos;
@@ -208,7 +209,9 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 		folio_invalidate(folio, offset, length);
 	if (!folio_test_large(folio))
 		return true;
-	if (split_folio(folio) == 0)
+
+	in_folio_offset = PAGE_ALIGN_DOWN(offset) / PAGE_SIZE;
+	if (split_folio_at(folio, folio_page(folio, in_folio_offset), NULL) == 0)
 		return true;
 	if (folio_test_dirty(folio))
 		return false;
-- 
2.45.2



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v1 1/3] mm/huge_memory: buddy allocator like folio_split()
  2024-10-28 18:09 ` [PATCH v1 1/3] mm/huge_memory: buddy allocator like folio_split() Zi Yan
@ 2024-10-29 10:24   ` kernel test robot
  2024-10-31 10:14   ` Kirill A . Shutemov
  1 sibling, 0 replies; 7+ messages in thread
From: kernel test robot @ 2024-10-29 10:24 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Matthew Wilcox (Oracle)
  Cc: llvm, oe-kbuild-all, Ryan Roberts, Hugh Dickins,
	Kirill A . Shutemov, David Hildenbrand, Yang Shi, Miaohe Lin,
	Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel, Zi Yan

Hi Zi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20241029]
[cannot apply to linus/master v6.12-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Zi-Yan/mm-huge_memory-buddy-allocator-like-folio_split/20241029-021200
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241028180932.1319265-2-ziy%40nvidia.com
patch subject: [PATCH v1 1/3] mm/huge_memory: buddy allocator like folio_split()
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20241029/202410291853.lBOeTPTK-lkp@intel.com/config)
compiler: clang version 19.1.2 (https://github.com/llvm/llvm-project 7ba7d8e2f7b6445b60679da826210cdde29eaf8b)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241029/202410291853.lBOeTPTK-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410291853.lBOeTPTK-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from mm/huge_memory.c:8:
   In file included from include/linux/mm.h:2213:
   include/linux/vmstat.h:504:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     504 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     505 |                            item];
         |                            ~~~~
   include/linux/vmstat.h:511:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     511 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     512 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   include/linux/vmstat.h:524:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     524 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     525 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   In file included from mm/huge_memory.c:18:
   include/linux/mm_inline.h:47:41: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      47 |         __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
         |                                    ~~~~~~~~~~~ ^ ~~~
   include/linux/mm_inline.h:49:22: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      49 |                                 NR_ZONE_LRU_BASE + lru, nr_pages);
         |                                 ~~~~~~~~~~~~~~~~ ^ ~~~
>> mm/huge_memory.c:3342:6: warning: variable 'nr_dropped' set but not used [-Wunused-but-set-variable]
    3342 |         int nr_dropped = 0;
         |             ^
   mm/huge_memory.c:3806:12: warning: unused function 'folio_split' [-Wunused-function]
    3806 | static int folio_split(struct folio *folio, unsigned int new_order,
         |            ^~~~~~~~~~~
   8 warnings generated.


vim +/nr_dropped +3342 mm/huge_memory.c

  3292	
  3293	#define for_each_folio_until_end_safe(iter, iter2, start, end)	\
  3294		for (iter = start, iter2 = folio_next(start);		\
  3295		     iter != end;					\
  3296		     iter = iter2, iter2 = folio_next(iter2))
  3297	
  3298	/*
  3299	 * It splits a @folio (without mapping) to lower order smaller folios in two
  3300	 * ways.
  3301	 * 1. uniform split: the given @folio into multiple @new_order small folios,
  3302	 *    where all small folios have the same order. This is done when
  3303	 *    uniform_split is true.
  3304	 * 2. buddy allocator like split: the given @folio is split into half and one
  3305	 *    of the half (containing the given page) is split into half until the
  3306	 *    given @page's order becomes @new_order. This is done when uniform_split is
  3307	 *    false.
  3308	 *
  3309	 * The high level flow for these two methods are:
  3310	 * 1. uniform split: a single __split_folio_to_order() is called to split the
  3311	 *    @folio into @new_order, then we traverse all the resulting folios one by
  3312	 *    one in PFN ascending order and perform stats, unfreeze, adding to list,
  3313	 *    and file mapping index operations.
  3314	 * 2. buddy allocator like split: in general, folio_order - @new_order calls to
  3315	 *    __split_folio_to_order() are called in the for loop to split the @folio
  3316	 *    to one lower order at a time. The resulting small folios are processed
  3317	 *    like what is done during the traversal in 1, except the one containing
  3318	 *    @page, which is split in next for loop.
  3319	 *
  3320	 * After splitting, the caller's folio reference will be transferred to the
  3321	 * folio containing @page. The other folios may be freed if they are not mapped.
  3322	 *
  3323	 * In terms of locking, after splitting,
  3324	 * 1. uniform split leaves @page (or the folio contains it) locked;
  3325	 * 2. buddy allocator like split leaves @folio locked.
  3326	 *
  3327	 * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
  3328	 */
  3329	static int __folio_split_without_mapping(struct folio *folio, int new_order,
  3330			struct page *page, struct list_head *list, pgoff_t end,
  3331			struct xa_state *xas, struct address_space *mapping,
  3332			bool uniform_split)
  3333	{
  3334		struct lruvec *lruvec;
  3335		struct address_space *swap_cache = NULL;
  3336		struct folio *origin_folio = folio;
  3337		struct folio *next_folio = folio_next(folio);
  3338		struct folio *new_folio;
  3339		struct folio *next;
  3340		int order = folio_order(folio);
  3341		int split_order = order - 1;
> 3342		int nr_dropped = 0;
  3343	
  3344		if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
  3345			if (!uniform_split)
  3346				return -EINVAL;
  3347	
  3348			swap_cache = swap_address_space(folio->swap);
  3349			xa_lock(&swap_cache->i_pages);
  3350		}
  3351	
  3352		if (folio_test_anon(folio))
  3353			mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
  3354	
  3355		/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
  3356		lruvec = folio_lruvec_lock(folio);
  3357	
  3358		/*
  3359		 * split to new_order one order at a time. For uniform split,
  3360		 * intermediate orders are skipped
  3361		 */
  3362		for (split_order = order - 1; split_order >= new_order; split_order--) {
  3363			int old_order = folio_order(folio);
  3364			struct folio *release;
  3365			struct folio *end_folio = folio_next(folio);
  3366			int status;
  3367	
  3368			if (folio_test_anon(folio) && split_order == 1)
  3369				continue;
  3370			if (uniform_split && split_order != new_order)
  3371				continue;
  3372	
  3373			if (mapping) {
  3374				/*
  3375				 * uniform split has xas_split_alloc() called before
  3376				 * irq is disabled, since xas_nomem() might not be
  3377				 * able to allocate enough memory.
  3378				 */
  3379				if (uniform_split)
  3380					xas_split(xas, folio, old_order);
  3381				else {
  3382					xas_set_order(xas, folio->index, split_order);
  3383					xas_set_err(xas, -ENOMEM);
  3384					if (xas_nomem(xas, 0))
  3385						xas_split(xas, folio, old_order);
  3386					else
  3387						return -ENOMEM;
  3388				}
  3389			}
  3390	
  3391			split_page_memcg(&folio->page, old_order, split_order);
  3392			split_page_owner(&folio->page, old_order, split_order);
  3393			pgalloc_tag_split(folio, old_order, split_order);
  3394	
  3395			status = __split_folio_to_order(folio, split_order);
  3396	
  3397			if (status < 0)
  3398				return status;
  3399	
  3400			/*
  3401			 * Iterate through after-split folios and perform related
  3402			 * operations. But in buddy allocator like split, the folio
  3403			 * containing the specified page is skipped until its order
  3404			 * is new_order, since the folio will be worked on in next
  3405			 * iteration.
  3406			 */
  3407			for_each_folio_until_end_safe(release, next, folio, end_folio) {
  3408				if (page_in_folio_offset(page, release) >= 0) {
  3409					folio = release;
  3410					if (split_order != new_order)
  3411						continue;
  3412				}
  3413				if (folio_test_anon(release))
  3414					mod_mthp_stat(folio_order(release),
  3415							MTHP_STAT_NR_ANON, 1);
  3416	
  3417				/*
  3418				 * Unfreeze refcount first. Additional reference from
  3419				 * page cache.
  3420				 */
  3421				folio_ref_unfreeze(release,
  3422					1 + ((!folio_test_anon(origin_folio) ||
  3423					     folio_test_swapcache(origin_folio)) ?
  3424						     folio_nr_pages(release) : 0));
  3425	
  3426				if (release != origin_folio)
  3427					lru_add_page_tail(origin_folio, &release->page,
  3428							lruvec, list);
  3429	
  3430				/* Some pages can be beyond EOF: drop them from page cache */
  3431				if (release->index >= end) {
  3432					if (shmem_mapping(origin_folio->mapping))
  3433						nr_dropped++;
  3434					else if (folio_test_clear_dirty(release))
  3435						folio_account_cleaned(release,
  3436							inode_to_wb(origin_folio->mapping->host));
  3437					__filemap_remove_folio(release, NULL);
  3438					folio_put(release);
  3439				} else if (!folio_test_anon(release)) {
  3440					__xa_store(&origin_folio->mapping->i_pages,
  3441							release->index, &release->page, 0);
  3442				} else if (swap_cache) {
  3443					__xa_store(&swap_cache->i_pages,
  3444							swap_cache_index(release->swap),
  3445							&release->page, 0);
  3446				}
  3447			}
  3448		}
  3449	
  3450		unlock_page_lruvec(lruvec);
  3451	
  3452		if (folio_test_anon(origin_folio)) {
  3453			if (folio_test_swapcache(origin_folio))
  3454				xa_unlock(&swap_cache->i_pages);
  3455		} else
  3456			xa_unlock(&mapping->i_pages);
  3457	
  3458		/* Caller disabled irqs, so they are still disabled here */
  3459		local_irq_enable();
  3460	
  3461		remap_page(origin_folio, 1 << order,
  3462				folio_test_anon(origin_folio) ?
  3463					RMP_USE_SHARED_ZEROPAGE : 0);
  3464	
  3465		/*
  3466		 * At this point, folio should contain the specified page, so that it
  3467		 * will be left to the caller to unlock it.
  3468		 */
  3469		for_each_folio_until_end_safe(new_folio, next, origin_folio, next_folio) {
  3470			if (uniform_split && new_folio == folio)
  3471				continue;
  3472			if (!uniform_split && new_folio == origin_folio)
  3473				continue;
  3474	
  3475			folio_unlock(new_folio);
  3476			/*
  3477			 * Subpages may be freed if there wasn't any mapping
  3478			 * like if add_to_swap() is running on a lru page that
  3479			 * had its mapping zapped. And freeing these pages
  3480			 * requires taking the lru_lock so we do the put_page
  3481			 * of the tail pages after the split is complete.
  3482			 */
  3483			free_page_and_swap_cache(&new_folio->page);
  3484		}
  3485		return 0;
  3486	}
  3487	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v1 1/3] mm/huge_memory: buddy allocator like folio_split()
  2024-10-28 18:09 ` [PATCH v1 1/3] mm/huge_memory: buddy allocator like folio_split() Zi Yan
  2024-10-29 10:24   ` kernel test robot
@ 2024-10-31 10:14   ` Kirill A . Shutemov
  2024-10-31 14:04     ` Zi Yan
  1 sibling, 1 reply; 7+ messages in thread
From: Kirill A . Shutemov @ 2024-10-31 10:14 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel

On Mon, Oct 28, 2024 at 02:09:30PM -0400, Zi Yan wrote:
>  mm/huge_memory.c | 604 +++++++++++++++++++++++++++++------------------
>  1 file changed, 372 insertions(+), 232 deletions(-)

The patch is really hard to follow. Could you split it into multiple
smaller patches?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v1 1/3] mm/huge_memory: buddy allocator like folio_split()
  2024-10-31 10:14   ` Kirill A . Shutemov
@ 2024-10-31 14:04     ` Zi Yan
  0 siblings, 0 replies; 7+ messages in thread
From: Zi Yan @ 2024-10-31 14:04 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: linux-mm, Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, linux-kernel

On 31 Oct 2024, at 6:14, Kirill A . Shutemov wrote:

> On Mon, Oct 28, 2024 at 02:09:30PM -0400, Zi Yan wrote:
>>  mm/huge_memory.c | 604 +++++++++++++++++++++++++++++------------------
>>  1 file changed, 372 insertions(+), 232 deletions(-)
>
> The patch is really hard to follow. Could you split it into multiple
> smaller patches?

How about these patches instead of this one?

1. add folio_split() backend code, including __folio_split_without_mapping() and __split_folio_to_order();
2. change split_huge_page_to_list_to_order() to use the backend code in 1;
3. add folio_split();
4. remove __split_huge_page() and __split_huge_page_tail().

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-10-31 14:04 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-28 18:09 [PATCH v1 0/3] Buddy allocator like folio split Zi Yan
2024-10-28 18:09 ` [PATCH v1 1/3] mm/huge_memory: buddy allocator like folio_split() Zi Yan
2024-10-29 10:24   ` kernel test robot
2024-10-31 10:14   ` Kirill A . Shutemov
2024-10-31 14:04     ` Zi Yan
2024-10-28 18:09 ` [PATCH v1 2/3] mm/huge_memory: add folio_split() to debugfs testing interface Zi Yan
2024-10-28 18:09 ` [PATCH v1 3/3] mm/truncate: use folio_split() for truncate operation Zi Yan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox