* [PATCH 0/5] Clean up __folio_put()
@ 2024-04-05 15:32 Matthew Wilcox (Oracle)
2024-04-05 15:32 ` [PATCH 1/5] mm: Free non-hugetlb large folios in a batch Matthew Wilcox (Oracle)
` (5 more replies)
0 siblings, 6 replies; 11+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-04-05 15:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm
With all the changes over the last few years, __folio_put_small and
__folio_put_large have become almost identical to each other ... except
you can't tell because they're spread over two files. Rearrange it all
so that you can tell, and then inline them both into __folio_put().
Matthew Wilcox (Oracle) (5):
mm: Free non-hugetlb large folios in a batch
mm: Combine free_the_page() and free_unref_page()
mm: Inline destroy_large_folio() into __folio_put_large()
mm: Combine __folio_put_small, __folio_put_large and __folio_put
mm: Convert free_zone_device_page to free_zone_device_folio
include/linux/mm.h | 2 --
mm/internal.h | 2 +-
mm/memremap.c | 30 ++++++++++++++++--------------
mm/page_alloc.c | 37 ++++++++++---------------------------
mm/swap.c | 44 ++++++++++++++++----------------------------
5 files changed, 43 insertions(+), 72 deletions(-)
--
2.43.0
* [PATCH 1/5] mm: Free non-hugetlb large folios in a batch
2024-04-05 15:32 [PATCH 0/5] Clean up __folio_put() Matthew Wilcox (Oracle)
@ 2024-04-05 15:32 ` Matthew Wilcox (Oracle)
2024-04-24 15:20 ` Peter Xu
2024-04-05 15:32 ` [PATCH 2/5] mm: Combine free_the_page() and free_unref_page() Matthew Wilcox (Oracle)
` (4 subsequent siblings)
5 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-04-05 15:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm
free_unref_folios() can now handle non-hugetlb large folios, so keep
normal large folios in the batch. hugetlb folios still need to be
handled specially. I believe that folios freed using put_pages_list()
cannot be accounted to a memcg (or the small folios would trip the "page
still charged to cgroup" warning), but put an assertion in to check that.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
mm/swap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index f72364e92d5f..4643e0d53124 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -158,10 +158,11 @@ void put_pages_list(struct list_head *pages)
list_for_each_entry_safe(folio, next, pages, lru) {
if (!folio_put_testzero(folio))
continue;
- if (folio_test_large(folio)) {
- __folio_put_large(folio);
+ if (folio_test_hugetlb(folio)) {
+ free_huge_folio(folio);
continue;
}
+ VM_BUG_ON_FOLIO(folio_memcg(folio), folio);
/* LRU flag must be clear because it's passed using the lru */
if (folio_batch_add(&fbatch, folio) > 0)
continue;
--
2.43.0
* [PATCH 2/5] mm: Combine free_the_page() and free_unref_page()
2024-04-05 15:32 [PATCH 0/5] Clean up __folio_put() Matthew Wilcox (Oracle)
2024-04-05 15:32 ` [PATCH 1/5] mm: Free non-hugetlb large folios in a batch Matthew Wilcox (Oracle)
@ 2024-04-05 15:32 ` Matthew Wilcox (Oracle)
2024-04-05 15:32 ` [PATCH 3/5] mm: Inline destroy_large_folio() into __folio_put_large() Matthew Wilcox (Oracle)
` (3 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-04-05 15:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm
The pcp_allowed_order() check in free_the_page() was only being skipped by
__folio_put_small(), which is about to be rearranged.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
mm/page_alloc.c | 25 +++++++++++--------------
1 file changed, 11 insertions(+), 14 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 808b4b720834..8d255f18f6db 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -541,14 +541,6 @@ static inline bool pcp_allowed_order(unsigned int order)
return false;
}
-static inline void free_the_page(struct page *page, unsigned int order)
-{
- if (pcp_allowed_order(order)) /* Via pcp? */
- free_unref_page(page, order);
- else
- __free_pages_ok(page, order, FPI_NONE);
-}
-
/*
* Higher-order pages are called "compound pages". They are structured thusly:
*
@@ -584,7 +576,7 @@ void destroy_large_folio(struct folio *folio)
folio_undo_large_rmappable(folio);
mem_cgroup_uncharge(folio);
- free_the_page(&folio->page, folio_order(folio));
+ free_unref_page(&folio->page, folio_order(folio));
}
static inline void set_buddy_order(struct page *page, unsigned int order)
@@ -2614,6 +2606,11 @@ void free_unref_page(struct page *page, unsigned int order)
unsigned long pfn = page_to_pfn(page);
int migratetype;
+ if (!pcp_allowed_order(order)) {
+ __free_pages_ok(page, order, FPI_NONE);
+ return;
+ }
+
if (!free_pages_prepare(page, order))
return;
@@ -4800,11 +4797,11 @@ void __free_pages(struct page *page, unsigned int order)
struct alloc_tag *tag = pgalloc_tag_get(page);
if (put_page_testzero(page))
- free_the_page(page, order);
+ free_unref_page(page, order);
else if (!head) {
pgalloc_tag_sub_pages(tag, (1 << order) - 1);
while (order-- > 0)
- free_the_page(page + (1 << order), order);
+ free_unref_page(page + (1 << order), order);
}
}
EXPORT_SYMBOL(__free_pages);
@@ -4866,7 +4863,7 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
if (page_ref_sub_and_test(page, count))
- free_the_page(page, compound_order(page));
+ free_unref_page(page, compound_order(page));
}
EXPORT_SYMBOL(__page_frag_cache_drain);
@@ -4907,7 +4904,7 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
goto refill;
if (unlikely(nc->pfmemalloc)) {
- free_the_page(page, compound_order(page));
+ free_unref_page(page, compound_order(page));
goto refill;
}
@@ -4951,7 +4948,7 @@ void page_frag_free(void *addr)
struct page *page = virt_to_head_page(addr);
if (unlikely(put_page_testzero(page)))
- free_the_page(page, compound_order(page));
+ free_unref_page(page, compound_order(page));
}
EXPORT_SYMBOL(page_frag_free);
--
2.43.0
* [PATCH 3/5] mm: Inline destroy_large_folio() into __folio_put_large()
2024-04-05 15:32 [PATCH 0/5] Clean up __folio_put() Matthew Wilcox (Oracle)
2024-04-05 15:32 ` [PATCH 1/5] mm: Free non-hugetlb large folios in a batch Matthew Wilcox (Oracle)
2024-04-05 15:32 ` [PATCH 2/5] mm: Combine free_the_page() and free_unref_page() Matthew Wilcox (Oracle)
@ 2024-04-05 15:32 ` Matthew Wilcox (Oracle)
2024-04-05 15:32 ` [PATCH 4/5] mm: Combine __folio_put_small, __folio_put_large and __folio_put Matthew Wilcox (Oracle)
` (2 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-04-05 15:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm
destroy_large_folio() has only one caller; move its contents there.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
include/linux/mm.h | 2 --
mm/page_alloc.c | 14 --------------
mm/swap.c | 13 ++++++++++---
3 files changed, 10 insertions(+), 19 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d7d311d13712..b9173e230804 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1318,8 +1318,6 @@ void folio_copy(struct folio *dst, struct folio *src);
unsigned long nr_free_buffer_pages(void);
-void destroy_large_folio(struct folio *folio);
-
/* Returns the number of bytes in this potentially compound page. */
static inline unsigned long page_size(struct page *page)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8d255f18f6db..171b7863868e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -565,20 +565,6 @@ void prep_compound_page(struct page *page, unsigned int order)
prep_compound_head(page, order);
}
-void destroy_large_folio(struct folio *folio)
-{
- if (folio_test_hugetlb(folio)) {
- free_huge_folio(folio);
- return;
- }
-
- if (folio_test_large_rmappable(folio))
- folio_undo_large_rmappable(folio);
-
- mem_cgroup_uncharge(folio);
- free_unref_page(&folio->page, folio_order(folio));
-}
-
static inline void set_buddy_order(struct page *page, unsigned int order)
{
set_page_private(page, order);
diff --git a/mm/swap.c b/mm/swap.c
index 4643e0d53124..4cfb98304742 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -127,9 +127,16 @@ static void __folio_put_large(struct folio *folio)
* (it's never listed to any LRU lists) and no memcg routines should
* be called for hugetlb (it has a separate hugetlb_cgroup.)
*/
- if (!folio_test_hugetlb(folio))
- page_cache_release(folio);
- destroy_large_folio(folio);
+ if (folio_test_hugetlb(folio)) {
+ free_huge_folio(folio);
+ return;
+ }
+
+ page_cache_release(folio);
+ if (folio_test_large_rmappable(folio))
+ folio_undo_large_rmappable(folio);
+ mem_cgroup_uncharge(folio);
+ free_unref_page(&folio->page, folio_order(folio));
}
void __folio_put(struct folio *folio)
--
2.43.0
* [PATCH 4/5] mm: Combine __folio_put_small, __folio_put_large and __folio_put
2024-04-05 15:32 [PATCH 0/5] Clean up __folio_put() Matthew Wilcox (Oracle)
` (2 preceding siblings ...)
2024-04-05 15:32 ` [PATCH 3/5] mm: Inline destroy_large_folio() into __folio_put_large() Matthew Wilcox (Oracle)
@ 2024-04-05 15:32 ` Matthew Wilcox (Oracle)
2024-04-05 15:32 ` [PATCH 5/5] mm: Convert free_zone_device_page to free_zone_device_folio Matthew Wilcox (Oracle)
2024-04-05 16:34 ` [PATCH 0/5] Clean up __folio_put() Zi Yan
5 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-04-05 15:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm
It's now obvious that __folio_put_small() and __folio_put_large() do
almost exactly the same thing. Inline them both into __folio_put().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
mm/swap.c | 32 ++++++--------------------------
1 file changed, 6 insertions(+), 26 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 4cfb98304742..15a7da725576 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -112,42 +112,22 @@ static void page_cache_release(struct folio *folio)
unlock_page_lruvec_irqrestore(lruvec, flags);
}
-static void __folio_put_small(struct folio *folio)
-{
- page_cache_release(folio);
- mem_cgroup_uncharge(folio);
- free_unref_page(&folio->page, 0);
-}
-
-static void __folio_put_large(struct folio *folio)
+void __folio_put(struct folio *folio)
{
- /*
- * __page_cache_release() is supposed to be called for thp, not for
- * hugetlb. This is because hugetlb page does never have PageLRU set
- * (it's never listed to any LRU lists) and no memcg routines should
- * be called for hugetlb (it has a separate hugetlb_cgroup.)
- */
- if (folio_test_hugetlb(folio)) {
+ if (unlikely(folio_is_zone_device(folio))) {
+ free_zone_device_page(&folio->page);
+ return;
+ } else if (folio_test_hugetlb(folio)) {
free_huge_folio(folio);
return;
}
page_cache_release(folio);
- if (folio_test_large_rmappable(folio))
+ if (folio_test_large(folio) && folio_test_large_rmappable(folio))
folio_undo_large_rmappable(folio);
mem_cgroup_uncharge(folio);
free_unref_page(&folio->page, folio_order(folio));
}
-
-void __folio_put(struct folio *folio)
-{
- if (unlikely(folio_is_zone_device(folio)))
- free_zone_device_page(&folio->page);
- else if (unlikely(folio_test_large(folio)))
- __folio_put_large(folio);
- else
- __folio_put_small(folio);
-}
EXPORT_SYMBOL(__folio_put);
/**
--
2.43.0
* [PATCH 5/5] mm: Convert free_zone_device_page to free_zone_device_folio
2024-04-05 15:32 [PATCH 0/5] Clean up __folio_put() Matthew Wilcox (Oracle)
` (3 preceding siblings ...)
2024-04-05 15:32 ` [PATCH 4/5] mm: Combine __folio_put_small, __folio_put_large and __folio_put Matthew Wilcox (Oracle)
@ 2024-04-05 15:32 ` Matthew Wilcox (Oracle)
2024-04-05 16:34 ` [PATCH 0/5] Clean up __folio_put() Zi Yan
5 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-04-05 15:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm
Both callers already have a folio; pass it in and save a few calls to
compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
mm/internal.h | 2 +-
mm/memremap.c | 30 ++++++++++++++++--------------
mm/swap.c | 4 ++--
3 files changed, 19 insertions(+), 17 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 56eed0b6eecb..57c1055d5568 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1149,7 +1149,7 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end);
int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
unsigned long addr, int page_nid, int *flags);
-void free_zone_device_page(struct page *page);
+void free_zone_device_folio(struct folio *folio);
int migrate_device_coherent_page(struct page *page);
/*
diff --git a/mm/memremap.c b/mm/memremap.c
index 9e9fb1972fff..e1776693e2ea 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -456,21 +456,23 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
}
EXPORT_SYMBOL_GPL(get_dev_pagemap);
-void free_zone_device_page(struct page *page)
+void free_zone_device_folio(struct folio *folio)
{
- if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
+ if (WARN_ON_ONCE(!folio->page.pgmap->ops ||
+ !folio->page.pgmap->ops->page_free))
return;
- mem_cgroup_uncharge(page_folio(page));
+ mem_cgroup_uncharge(folio);
/*
* Note: we don't expect anonymous compound pages yet. Once supported
* and we could PTE-map them similar to THP, we'd have to clear
* PG_anon_exclusive on all tail pages.
*/
- VM_BUG_ON_PAGE(PageAnon(page) && PageCompound(page), page);
- if (PageAnon(page))
- __ClearPageAnonExclusive(page);
+ if (folio_test_anon(folio)) {
+ VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
+ __ClearPageAnonExclusive(folio_page(folio, 0));
+ }
/*
* When a device managed page is freed, the folio->mapping field
@@ -481,20 +483,20 @@ void free_zone_device_page(struct page *page)
*
* For other types of ZONE_DEVICE pages, migration is either
* handled differently or not done at all, so there is no need
- * to clear page->mapping.
+ * to clear folio->mapping.
*/
- page->mapping = NULL;
- page->pgmap->ops->page_free(page);
+ folio->mapping = NULL;
+ folio->page.pgmap->ops->page_free(folio_page(folio, 0));
- if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
- page->pgmap->type != MEMORY_DEVICE_COHERENT)
+ if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
+ folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
/*
- * Reset the page count to 1 to prepare for handing out the page
+ * Reset the refcount to 1 to prepare for handing out the page
* again.
*/
- set_page_count(page, 1);
+ folio_set_count(folio, 1);
else
- put_dev_pagemap(page->pgmap);
+ put_dev_pagemap(folio->page.pgmap);
}
void zone_device_page_init(struct page *page)
diff --git a/mm/swap.c b/mm/swap.c
index 15a7da725576..f0d478eee292 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -115,7 +115,7 @@ static void page_cache_release(struct folio *folio)
void __folio_put(struct folio *folio)
{
if (unlikely(folio_is_zone_device(folio))) {
- free_zone_device_page(&folio->page);
+ free_zone_device_folio(folio);
return;
} else if (folio_test_hugetlb(folio)) {
free_huge_folio(folio);
@@ -984,7 +984,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
if (put_devmap_managed_page_refs(&folio->page, nr_refs))
continue;
if (folio_ref_sub_and_test(folio, nr_refs))
- free_zone_device_page(&folio->page);
+ free_zone_device_folio(folio);
continue;
}
--
2.43.0
* Re: [PATCH 0/5] Clean up __folio_put()
2024-04-05 15:32 [PATCH 0/5] Clean up __folio_put() Matthew Wilcox (Oracle)
` (4 preceding siblings ...)
2024-04-05 15:32 ` [PATCH 5/5] mm: Convert free_zone_device_page to free_zone_device_folio Matthew Wilcox (Oracle)
@ 2024-04-05 16:34 ` Zi Yan
5 siblings, 0 replies; 11+ messages in thread
From: Zi Yan @ 2024-04-05 16:34 UTC (permalink / raw)
To: Matthew Wilcox (Oracle); +Cc: Andrew Morton, linux-mm
On 5 Apr 2024, at 11:32, Matthew Wilcox (Oracle) wrote:
> With all the changes over the last few years, __folio_put_small and
> __folio_put_large have become almost identical to each other ... except
> you can't tell because they're spread over two files. Rearrange it all
> so that you can tell, and then inline them both into __folio_put().
>
> Matthew Wilcox (Oracle) (5):
> mm: Free non-hugetlb large folios in a batch
> mm: Combine free_the_page() and free_unref_page()
> mm: Inline destroy_large_folio() into __folio_put_large()
> mm: Combine __folio_put_small, __folio_put_large and __folio_put
> mm: Convert free_zone_device_page to free_zone_device_folio
>
> include/linux/mm.h | 2 --
> mm/internal.h | 2 +-
> mm/memremap.c | 30 ++++++++++++++++--------------
> mm/page_alloc.c | 37 ++++++++++---------------------------
> mm/swap.c | 44 ++++++++++++++++----------------------------
> 5 files changed, 43 insertions(+), 72 deletions(-)
The whole series looks good to me.

Reviewed-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
* Re: [PATCH 1/5] mm: Free non-hugetlb large folios in a batch
2024-04-05 15:32 ` [PATCH 1/5] mm: Free non-hugetlb large folios in a batch Matthew Wilcox (Oracle)
@ 2024-04-24 15:20 ` Peter Xu
2024-04-25 3:39 ` Matthew Wilcox
0 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2024-04-24 15:20 UTC (permalink / raw)
To: Matthew Wilcox (Oracle); +Cc: Andrew Morton, linux-mm, Alex Williamson
On Fri, Apr 05, 2024 at 04:32:23PM +0100, Matthew Wilcox (Oracle) wrote:
> free_unref_folios() can now handle non-hugetlb large folios, so keep
> normal large folios in the batch. hugetlb folios still need to be
> handled specially. I believe that folios freed using put_pages_list()
> cannot be accounted to a memcg (or the small folios would trip the "page
> still charged to cgroup" warning), but put an assertion in to check that.
There is such a user: iommu uses put_pages_list() to free IOMMU pgtables, and
those can be memcg accounted; since 2023, iommu_map has used
GFP_KERNEL_ACCOUNT.

I hit the panic below when testing my local branch on top of mm-everything
while running some VFIO workloads.

For this specific vfio use case, see 160912fc3d4a ("vfio/type1: account
iommu allocations").

I think we should remove the VM_BUG_ON_FOLIO() line, as the memcg will then
be properly taken care of later in free_pages_prepare(). A fixup that fixes
this crash for me is attached at the end.
Thanks,
[ 10.092411] kernel BUG at mm/swap.c:152!
[ 10.092686] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 10.093034] CPU: 3 PID: 634 Comm: vfio-pci-mmap-t Tainted: G W 6.9.0-rc4-peterx+ #2
[ 10.093628] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 10.094361] RIP: 0010:put_pages_list+0x12b/0x150
[ 10.094675] Code: 6d 08 48 81 c4 00 01 00 00 5b 5d c3 cc cc cc cc 48 c7 c6 f0 fd 9f 82 e8 63 e8 03 00 0f 0b 48 c7 c6 48 00 a0 82 e8 55 e8 03 00 <0f> 0b 48 c7 c6 28 fe 9f 82 e8 47f
[ 10.095896] RSP: 0018:ffffc9000221bc50 EFLAGS: 00010282
[ 10.096242] RAX: 0000000000000038 RBX: ffffea00042695c0 RCX: 0000000000000000
[ 10.096707] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 10.097177] RBP: ffffc9000221bd68 R08: 0000000000000000 R09: 0000000000000003
[ 10.097642] R10: ffffc9000221bb08 R11: ffffffff8335db48 R12: ffff8881070172c0
[ 10.098113] R13: ffff888102fd0000 R14: ffff888107017210 R15: ffff888110a6c7c0
[ 10.098586] FS: 0000000000000000(0000) GS:ffff888276a00000(0000) knlGS:0000000000000000
[ 10.099117] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 10.099494] CR2: 00007f1910000000 CR3: 000000000323c006 CR4: 0000000000770ef0
[ 10.099972] PKRU: 55555554
[ 10.100154] Call Trace:
[ 10.100321] <TASK>
[ 10.100466] ? die+0x32/0x80
[ 10.100666] ? do_trap+0xd9/0x100
[ 10.100897] ? put_pages_list+0x12b/0x150
[ 10.101168] ? put_pages_list+0x12b/0x150
[ 10.101434] ? do_error_trap+0x81/0x110
[ 10.101688] ? put_pages_list+0x12b/0x150
[ 10.101957] ? exc_invalid_op+0x4c/0x60
[ 10.102216] ? put_pages_list+0x12b/0x150
[ 10.102484] ? asm_exc_invalid_op+0x16/0x20
[ 10.102771] ? put_pages_list+0x12b/0x150
[ 10.103026] ? 0xffffffff81000000
[ 10.103246] ? dma_pte_list_pagetables.isra.0+0x38/0xa0
[ 10.103592] ? dma_pte_list_pagetables.isra.0+0x9b/0xa0
[ 10.103933] ? dma_pte_clear_level+0x18c/0x1a0
[ 10.104228] ? domain_unmap+0x65/0x130
[ 10.104481] ? domain_unmap+0xe6/0x130
[ 10.104735] domain_exit+0x47/0x80
[ 10.104968] vfio_iommu_type1_detach_group+0x3f1/0x5f0
[ 10.105308] ? vfio_group_detach_container+0x3c/0x1a0
[ 10.105644] vfio_group_detach_container+0x60/0x1a0
[ 10.105977] vfio_group_fops_release+0x46/0x80
[ 10.106274] __fput+0x9a/0x2d0
[ 10.106479] task_work_run+0x55/0x90
[ 10.106717] do_exit+0x32f/0xb70
[ 10.106945] ? _raw_spin_unlock_irq+0x24/0x50
[ 10.107237] do_group_exit+0x32/0xa0
[ 10.107481] __x64_sys_exit_group+0x14/0x20
[ 10.107760] do_syscall_64+0x75/0x190
[ 10.108007] entry_SYSCALL_64_after_hwframe+0x76/0x7e
==================================
diff --git a/mm/swap.c b/mm/swap.c
index f0d478eee292..8ae5cd4ed180 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -149,7 +149,6 @@ void put_pages_list(struct list_head *pages)
free_huge_folio(folio);
continue;
}
- VM_BUG_ON_FOLIO(folio_memcg(folio), folio);
/* LRU flag must be clear because it's passed using the lru */
if (folio_batch_add(&fbatch, folio) > 0)
continue;
--
Peter Xu
* Re: [PATCH 1/5] mm: Free non-hugetlb large folios in a batch
2024-04-24 15:20 ` Peter Xu
@ 2024-04-25 3:39 ` Matthew Wilcox
2024-04-25 15:00 ` Peter Xu
2024-04-25 20:54 ` Andrew Morton
0 siblings, 2 replies; 11+ messages in thread
From: Matthew Wilcox @ 2024-04-25 3:39 UTC (permalink / raw)
To: Peter Xu; +Cc: Andrew Morton, linux-mm, Alex Williamson
On Wed, Apr 24, 2024 at 11:20:28AM -0400, Peter Xu wrote:
> On Fri, Apr 05, 2024 at 04:32:23PM +0100, Matthew Wilcox (Oracle) wrote:
> > free_unref_folios() can now handle non-hugetlb large folios, so keep
> > normal large folios in the batch. hugetlb folios still need to be
> > handled specially. I believe that folios freed using put_pages_list()
> > cannot be accounted to a memcg (or the small folios would trip the "page
> > still charged to cgroup" warning), but put an assertion in to check that.
>
> There is such a user: iommu uses put_pages_list() to free IOMMU pgtables, and
> those can be memcg accounted; since 2023, iommu_map has used
> GFP_KERNEL_ACCOUNT.
>
> I hit the panic below when testing my local branch on top of mm-everything
> while running some VFIO workloads.
>
> For this specific vfio use case, see 160912fc3d4a ("vfio/type1: account
> iommu allocations").
>
> I think we should remove the VM_BUG_ON_FOLIO() line, as the memcg will then
> be properly taken care of later in free_pages_prepare(). A fixup that fixes
> this crash for me is attached at the end.
Yes, I think you're right.
I was concerned about the deferred split list / memcg charge problem,
but (a) page table pages can never be on the deferred split list, and (b)
just passing them through to free_unref_folios() works fine. The problem
was that folios_put_refs() was uncharging a batch before passing it
to free_unref_folios().

That does bring up the question, though ... should we be uncharging
these folios as a batch for better performance? Do you have a workload
which frees a lot of page tables? Presumably an exit would do that.
If so, does adding a call to mem_cgroup_uncharge_folios() before calling
free_unref_folios() improve performance in any noticeable way?
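
Roughly what I have in mind is something like the below -- just an untested
sketch, and the surrounding put_pages_list() code here is written from
memory, so treat everything other than the two added
mem_cgroup_uncharge_folios() calls as illustrative:

void put_pages_list(struct list_head *pages)
{
	struct folio_batch fbatch;
	struct folio *folio, *next;

	folio_batch_init(&fbatch);
	list_for_each_entry_safe(folio, next, pages, lru) {
		if (!folio_put_testzero(folio))
			continue;
		if (folio_test_hugetlb(folio)) {
			free_huge_folio(folio);
			continue;
		}
		/* LRU flag must be clear because it's passed using the lru */
		if (folio_batch_add(&fbatch, folio) > 0)
			continue;
		/* Uncharge the whole batch, then free it in one go */
		mem_cgroup_uncharge_folios(&fbatch);
		free_unref_folios(&fbatch);
	}

	if (fbatch.nr) {
		mem_cgroup_uncharge_folios(&fbatch);
		free_unref_folios(&fbatch);
	}
	INIT_LIST_HEAD(pages);
}

That would mirror what folios_put_refs() already does, uncharging the batch
once instead of hitting the memcg code one folio at a time in
free_pages_prepare().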
In the meantime, this patch:
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
although I think Andrew will just fold it into
"mm: free non-hugetlb large folios in a batch"
Andrew, if you do do that, please also edit out the last couple of
sentences from the commit message:
free_unref_folios() can now handle non-hugetlb large folios, so keep
normal large folios in the batch. hugetlb folios still need to be handled
- specially. I believe that folios freed using put_pages_list() cannot be
- accounted to a memcg (or the small folios would trip the "page still
- charged to cgroup" warning), but put an assertion in to check that.
+ specially.
* Re: [PATCH 1/5] mm: Free non-hugetlb large folios in a batch
2024-04-25 3:39 ` Matthew Wilcox
@ 2024-04-25 15:00 ` Peter Xu
2024-04-25 20:54 ` Andrew Morton
1 sibling, 0 replies; 11+ messages in thread
From: Peter Xu @ 2024-04-25 15:00 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm, Alex Williamson
On Thu, Apr 25, 2024 at 04:39:14AM +0100, Matthew Wilcox wrote:
> On Wed, Apr 24, 2024 at 11:20:28AM -0400, Peter Xu wrote:
> > On Fri, Apr 05, 2024 at 04:32:23PM +0100, Matthew Wilcox (Oracle) wrote:
> > > free_unref_folios() can now handle non-hugetlb large folios, so keep
> > > normal large folios in the batch. hugetlb folios still need to be
> > > handled specially. I believe that folios freed using put_pages_list()
> > > cannot be accounted to a memcg (or the small folios would trip the "page
> > > still charged to cgroup" warning), but put an assertion in to check that.
> >
> > There is such a user: iommu uses put_pages_list() to free IOMMU pgtables, and
> > those can be memcg accounted; since 2023, iommu_map has used
> > GFP_KERNEL_ACCOUNT.
> >
> > I hit the panic below when testing my local branch on top of mm-everything
> > while running some VFIO workloads.
> >
> > For this specific vfio use case, see 160912fc3d4a ("vfio/type1: account
> > iommu allocations").
> >
> > I think we should remove the VM_BUG_ON_FOLIO() line, as the memcg will then
> > be properly taken care of later in free_pages_prepare(). A fixup that fixes
> > this crash for me is attached at the end.
>
> Yes, I think you're right.
>
> I was concerned about the deferred split list / memcg charge problem,
> but (a) page table pages can't ever be on the deferred split list, (b)
> just passing them through to free_unref_folios() works fine. The problem
> was that folios_put_refs() was uncharging a batch before passing them
> to free_unref_folios().
>
> That does bring up the question though ... should we be uncharging
> these folios as a batch for better performance? Do you have a workload
> which frees a lot of page tables? Presumably an exit would do that.
> If so, does adding a call to mem_cgroup_uncharge_folios() before calling
> free_unref_folios() improve performance in any noticable way?
Looks like something worth trying indeed.

The trace I hit was in an exit path, but we can double-check whether this
can also happen in iommu hot paths like unmap; such a change might be
easier to justify in that case.

AFAIU, based on my reading of the current iommu pgtable management, it is
more aggressive about freeing pgtable pages than the CPU page tables are,
so such a batched release can happen during an iommu unmap too, rather
than only at exit:
intel_iommu_unmap
domain_unmap
dma_pte_clear_level
dma_pte_list_pagetables
But it is still worth checking in a test; perhaps the easiest way is to use
ioctl(VFIO_IOMMU_[UN]MAP_DMA).
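
Something like the below could drive that. It is only a rough sketch: the
group number, iova and size are placeholders, error handling is dropped,
and the device still needs to be bound to vfio-pci for the group to be
usable:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/0", O_RDWR);	/* placeholder group */
	size_t size = 1UL << 30;			/* placeholder size */
	struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map) };
	struct vfio_iommu_type1_dma_unmap unmap = { .argsz = sizeof(unmap) };
	void *buf;
	int i;

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (unsigned long)buf;
	map.iova = 0;
	map.size = size;
	unmap.iova = 0;
	unmap.size = size;

	/* Build and tear down the mapping repeatedly so the iommu
	 * pgtable pages are released in bulk on every unmap. */
	for (i = 0; i < 100; i++) {
		ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
		ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
	}
	return 0;
}

Run it in a memcg so the GFP_KERNEL_ACCOUNT pgtable allocations are
actually charged, and compare timings with and without the batched
uncharge.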
>
> In the meantime, this patch:
>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>
> although I think Andrew will just fold it into
> "mm: free non-hugetlb large folios in a batch"
>
> Andrew, if you do do that, please also edit out the last couple of
> sentences from the commit message:
>
> free_unref_folios() can now handle non-hugetlb large folios, so keep
> normal large folios in the batch. hugetlb folios still need to be handled
> - specially. I believe that folios freed using put_pages_list() cannot be
> - accounted to a memcg (or the small folios would trip the "page still
> - charged to cgroup" warning), but put an assertion in to check that.
> + specially.
Yes, we should drop these lines too.
Thanks,
--
Peter Xu
* Re: [PATCH 1/5] mm: Free non-hugetlb large folios in a batch
2024-04-25 3:39 ` Matthew Wilcox
2024-04-25 15:00 ` Peter Xu
@ 2024-04-25 20:54 ` Andrew Morton
1 sibling, 0 replies; 11+ messages in thread
From: Andrew Morton @ 2024-04-25 20:54 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Peter Xu, linux-mm, Alex Williamson
On Thu, 25 Apr 2024 04:39:14 +0100 Matthew Wilcox <willy@infradead.org> wrote:
> Andrew, if you do do that, please also edit out the last couple of
> sentences from the commit message:
>
> free_unref_folios() can now handle non-hugetlb large folios, so keep
> normal large folios in the batch. hugetlb folios still need to be handled
> - specially. I believe that folios freed using put_pages_list() cannot be
> - accounted to a memcg (or the small folios would trip the "page still
> - charged to cgroup" warning), but put an assertion in to check that.
> + specially.
Done, thanks.