* [RFC PATCH v1 01/10] mm/hugetlb: rename isolate_hugetlb() to folio_isolate_hugetlb()
From: Fuad Tabba @ 2024-11-08 16:20 UTC (permalink / raw)
To: linux-mm
Cc: kvm, nouveau, dri-devel, david, rppt, jglisse, akpm, muchun.song,
simona, airlied, pbonzini, seanjc, willy, jgg, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat, tabba
From: David Hildenbrand <david@redhat.com>
Let's make the function name match "folio_isolate_lru()", and add some
kernel doc.
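For illustration only (not part of this patch), a typical caller pairs the
isolate with the putback helper; the function and list names below are
invented, and the caller is assumed to hold a folio reference and to not
hold hugetlb_lock:

static void example_isolate_and_putback(struct folio *folio)
{
	LIST_HEAD(isolated);

	/*
	 * Only an allocated (refcount > 0) hugetlb folio can be isolated;
	 * on success an extra reference is taken and the folio moves from
	 * the hugetlb active list onto our private list.
	 */
	if (!folio_test_hugetlb(folio) ||
	    !folio_isolate_hugetlb(folio, &isolated))
		return;

	/* ... operate on the isolated folio ... */

	/* Undo the isolation and drop the extra reference. */
	folio_putback_active_hugetlb(folio);
}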
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/linux/hugetlb.h | 4 ++--
mm/gup.c | 2 +-
mm/hugetlb.c | 23 ++++++++++++++++++++---
mm/mempolicy.c | 2 +-
mm/migrate.c | 6 +++---
5 files changed, 27 insertions(+), 10 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ae4fe8615bb6..b0cf8dbfeb6a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -153,7 +153,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
vm_flags_t vm_flags);
long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
long freed);
-bool isolate_hugetlb(struct folio *folio, struct list_head *list);
+bool folio_isolate_hugetlb(struct folio *folio, struct list_head *list);
int get_hwpoison_hugetlb_folio(struct folio *folio, bool *hugetlb, bool unpoison);
int get_huge_page_for_hwpoison(unsigned long pfn, int flags,
bool *migratable_cleared);
@@ -414,7 +414,7 @@ static inline pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
return NULL;
}
-static inline bool isolate_hugetlb(struct folio *folio, struct list_head *list)
+static inline bool folio_isolate_hugetlb(struct folio *folio, struct list_head *list)
{
return false;
}
diff --git a/mm/gup.c b/mm/gup.c
index 28ae330ec4dd..40bbcffca865 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2301,7 +2301,7 @@ static unsigned long collect_longterm_unpinnable_folios(
continue;
if (folio_test_hugetlb(folio)) {
- isolate_hugetlb(folio, movable_folio_list);
+ folio_isolate_hugetlb(folio, movable_folio_list);
continue;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cec4b121193f..e17bb2847572 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2868,7 +2868,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
* Fail with -EBUSY if not possible.
*/
spin_unlock_irq(&hugetlb_lock);
- isolated = isolate_hugetlb(old_folio, list);
+ isolated = folio_isolate_hugetlb(old_folio, list);
ret = isolated ? 0 : -EBUSY;
spin_lock_irq(&hugetlb_lock);
goto free_new;
@@ -2953,7 +2953,7 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
if (hstate_is_gigantic(h))
return -ENOMEM;
- if (folio_ref_count(folio) && isolate_hugetlb(folio, list))
+ if (folio_ref_count(folio) && folio_isolate_hugetlb(folio, list))
ret = 0;
else if (!folio_ref_count(folio))
ret = alloc_and_dissolve_hugetlb_folio(h, folio, list);
@@ -7396,7 +7396,24 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
-bool isolate_hugetlb(struct folio *folio, struct list_head *list)
+/**
+ * folio_isolate_hugetlb: try to isolate an allocated hugetlb folio
+ * @folio: the folio to isolate
+ * @list: the list to add the folio to on success
+ *
+ * Isolate an allocated (refcount > 0) hugetlb folio, marking it as
+ * isolated/non-migratable, and moving it from the active list to the
+ * given list.
+ *
+ * Isolation will fail if @folio is not an allocated hugetlb folio, or if
+ * it is already isolated/non-migratable.
+ *
+ * On success, an additional folio reference is taken that must be dropped
+ * using folio_putback_active_hugetlb() to undo the isolation.
+ *
+ * Return: True if isolation worked, otherwise False.
+ */
+bool folio_isolate_hugetlb(struct folio *folio, struct list_head *list)
{
bool ret = true;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bb37cd1a51d8..41bdff67757c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -647,7 +647,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
*/
if ((flags & MPOL_MF_MOVE_ALL) ||
(!folio_likely_mapped_shared(folio) && !hugetlb_pmd_shared(pte)))
- if (!isolate_hugetlb(folio, qp->pagelist))
+ if (!folio_isolate_hugetlb(folio, qp->pagelist))
qp->nr_failed++;
unlock:
spin_unlock(ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index dfb5eba3c522..55585b5f57ec 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -136,7 +136,7 @@ static void putback_movable_folio(struct folio *folio)
*
* This function shall be used whenever the isolated pageset has been
* built from lru, balloon, hugetlbfs page. See isolate_migratepages_range()
- * and isolate_hugetlb().
+ * and folio_isolate_hugetlb().
*/
void putback_movable_pages(struct list_head *l)
{
@@ -177,7 +177,7 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
bool isolated, lru;
if (folio_test_hugetlb(folio))
- return isolate_hugetlb(folio, list);
+ return folio_isolate_hugetlb(folio, list);
lru = !__folio_test_movable(folio);
if (lru)
@@ -2208,7 +2208,7 @@ static int __add_folio_for_migration(struct folio *folio, int node,
return -EACCES;
if (folio_test_hugetlb(folio)) {
- if (isolate_hugetlb(folio, pagelist))
+ if (folio_isolate_hugetlb(folio, pagelist))
return 1;
} else if (folio_isolate_lru(folio)) {
list_add_tail(&folio->lru, pagelist);
--
2.47.0.277.g8800431eea-goog
* [RFC PATCH v1 02/10] mm/migrate: don't call folio_putback_active_hugetlb() on dst hugetlb folio
From: Fuad Tabba @ 2024-11-08 16:20 UTC (permalink / raw)
To: linux-mm
Cc: kvm, nouveau, dri-devel, david, rppt, jglisse, akpm, muchun.song,
simona, airlied, pbonzini, seanjc, willy, jgg, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat, tabba
From: David Hildenbrand <david@redhat.com>
We replaced a simple put_page() with a putback_active_hugepage() call
in commit 3aaa76e125c1 ("mm: migrate: hugetlb: putback destination
hugepage to active list"), to set the "active" flag on the dst hugetlb
folio.
Nowadays, the "active" list is decoupled from the flag, which is simply
called "migratable".
Calling "putback" on something that wasn't allocated is weird and not
future proof, especially if we might reach that path when migration failed
and we just want to free the freshly allocated hugetlb folio.
Let's simply set the "migratable" flag in move_hugetlb_state(), where we
know that allocation succeeded, and use simple folio_put() to return
our reference.
Do we need the hugetlb_lock for setting that flag? Staring at other
users of folio_set_hugetlb_migratable(), it does not look like it. After
all, the dst folio should already be on the active list, and we are not
modifying that list.
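Condensed, the dst folio handling in unmap_and_move_huge_page() then
looks roughly like this (illustrative sketch only, simplified from the
function and the hunk below):

	if (rc == MIGRATEPAGE_SUCCESS) {
		/* Migration worked: the new folio is marked "migratable" here. */
		move_hugetlb_state(src, dst, reason);
		put_new_folio = NULL;
	}

	/* ... unlocking and src putback elided ... */

	if (put_new_folio)
		/* Migration failed: return dst to its special allocator. */
		put_new_folio(dst, private);
	else
		/* Otherwise simply drop our reference to the dst folio. */
		folio_put(dst);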
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
mm/hugetlb.c | 5 +++++
mm/migrate.c | 8 ++++----
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e17bb2847572..da3fe1840ab8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7508,6 +7508,11 @@ void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio, int re
}
spin_unlock_irq(&hugetlb_lock);
}
+ /*
+ * Our old folio is isolated and has "migratable" cleared until it
+ * is putback. As migration succeeded, set the new folio "migratable".
+ */
+ folio_set_hugetlb_migratable(new_folio);
}
static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 55585b5f57ec..b129dc41c140 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1547,14 +1547,14 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
list_move_tail(&src->lru, ret);
/*
- * If migration was not successful and there's a freeing callback, use
- * it. Otherwise, put_page() will drop the reference grabbed during
- * isolation.
+ * If migration was not successful and there's a freeing callback,
+ * return the folio to that special allocator. Otherwise, simply drop
+ * our additional reference.
*/
if (put_new_folio)
put_new_folio(dst, private);
else
- folio_putback_active_hugetlb(dst);
+ folio_put(dst);
return rc;
}
--
2.47.0.277.g8800431eea-goog
* [RFC PATCH v1 03/10] mm/hugetlb: rename "folio_putback_active_hugetlb()" to "folio_putback_hugetlb()"
From: Fuad Tabba @ 2024-11-08 16:20 UTC (permalink / raw)
To: linux-mm
Cc: kvm, nouveau, dri-devel, david, rppt, jglisse, akpm, muchun.song,
simona, airlied, pbonzini, seanjc, willy, jgg, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat, tabba
From: David Hildenbrand <david@redhat.com>
Now that folio_putback_hugetlb() is only called on folios that were
previously isolated through folio_isolate_hugetlb(), let's rename it to
match folio_putback_lru().
Add some kernel doc to clarify how this function is supposed to be used.
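For illustration (not part of the patch), the undo path for an isolated
folio now reads symmetrically to the LRU case, roughly:

	/* Both helpers also drop the reference taken at isolation time. */
	if (folio_test_hugetlb(folio))
		folio_putback_hugetlb(folio);
	else
		folio_putback_lru(folio);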
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/linux/hugetlb.h | 4 ++--
mm/hugetlb.c | 15 +++++++++++++--
mm/migrate.c | 6 +++---
3 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b0cf8dbfeb6a..e846d7dac77c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -157,7 +157,7 @@ bool folio_isolate_hugetlb(struct folio *folio, struct list_head *list);
int get_hwpoison_hugetlb_folio(struct folio *folio, bool *hugetlb, bool unpoison);
int get_huge_page_for_hwpoison(unsigned long pfn, int flags,
bool *migratable_cleared);
-void folio_putback_active_hugetlb(struct folio *folio);
+void folio_putback_hugetlb(struct folio *folio);
void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio, int reason);
void hugetlb_fix_reserve_counts(struct inode *inode);
extern struct mutex *hugetlb_fault_mutex_table;
@@ -430,7 +430,7 @@ static inline int get_huge_page_for_hwpoison(unsigned long pfn, int flags,
return 0;
}
-static inline void folio_putback_active_hugetlb(struct folio *folio)
+static inline void folio_putback_hugetlb(struct folio *folio)
{
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index da3fe1840ab8..d58bd815fdf2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7409,7 +7409,7 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
* it is already isolated/non-migratable.
*
* On success, an additional folio reference is taken that must be dropped
- * using folio_putback_active_hugetlb() to undo the isolation.
+ * using folio_putback_hugetlb() to undo the isolation.
*
* Return: True if isolation worked, otherwise False.
*/
@@ -7461,7 +7461,18 @@ int get_huge_page_for_hwpoison(unsigned long pfn, int flags,
return ret;
}
-void folio_putback_active_hugetlb(struct folio *folio)
+/**
+ * folio_putback_hugetlb: unisolate a hugetlb folio
+ * @folio: the isolated hugetlb folio
+ *
+ * Putback/un-isolate the hugetlb folio that was previously isolated using
+ * folio_isolate_hugetlb(): marking it non-isolated/migratable and putting it
+ * back onto the active list.
+ *
+ * Will drop the additional folio reference obtained through
+ * folio_isolate_hugetlb().
+ */
+void folio_putback_hugetlb(struct folio *folio)
{
spin_lock_irq(&hugetlb_lock);
folio_set_hugetlb_migratable(folio);
diff --git a/mm/migrate.c b/mm/migrate.c
index b129dc41c140..89292d131148 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -145,7 +145,7 @@ void putback_movable_pages(struct list_head *l)
list_for_each_entry_safe(folio, folio2, l, lru) {
if (unlikely(folio_test_hugetlb(folio))) {
- folio_putback_active_hugetlb(folio);
+ folio_putback_hugetlb(folio);
continue;
}
list_del(&folio->lru);
@@ -1459,7 +1459,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
if (folio_ref_count(src) == 1) {
/* page was freed from under us. So we are done. */
- folio_putback_active_hugetlb(src);
+ folio_putback_hugetlb(src);
return MIGRATEPAGE_SUCCESS;
}
@@ -1542,7 +1542,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
folio_unlock(src);
out:
if (rc == MIGRATEPAGE_SUCCESS)
- folio_putback_active_hugetlb(src);
+ folio_putback_hugetlb(src);
else if (rc != -EAGAIN)
list_move_tail(&src->lru, ret);
--
2.47.0.277.g8800431eea-goog
* [RFC PATCH v1 04/10] mm/hugetlb-cgroup: convert hugetlb_cgroup_css_offline() to work on folios
From: Fuad Tabba @ 2024-11-08 16:20 UTC (permalink / raw)
To: linux-mm
Cc: kvm, nouveau, dri-devel, david, rppt, jglisse, akpm, muchun.song,
simona, airlied, pbonzini, seanjc, willy, jgg, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat, tabba
From: David Hildenbrand <david@redhat.com>
Let's convert hugetlb_cgroup_css_offline() and
hugetlb_cgroup_move_parent() to work on folios. hugepage_activelist
contains folios, not pages.
While at it, rename page_hcg simply to hcg, removing most of the "page"
terminology.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
mm/hugetlb_cgroup.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index d8d0e665caed..1bdeaf25f640 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -195,24 +195,23 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
* cannot fail.
*/
static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
- struct page *page)
+ struct folio *folio)
{
unsigned int nr_pages;
struct page_counter *counter;
- struct hugetlb_cgroup *page_hcg;
+ struct hugetlb_cgroup *hcg;
struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
- struct folio *folio = page_folio(page);
- page_hcg = hugetlb_cgroup_from_folio(folio);
+ hcg = hugetlb_cgroup_from_folio(folio);
/*
* We can have pages in active list without any cgroup
* ie, hugepage with less than 3 pages. We can safely
* ignore those pages.
*/
- if (!page_hcg || page_hcg != h_cg)
+ if (!hcg || hcg != h_cg)
goto out;
- nr_pages = compound_nr(page);
+ nr_pages = folio_nr_pages(folio);
if (!parent) {
parent = root_h_cgroup;
/* root has no limit */
@@ -235,13 +234,13 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
{
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
struct hstate *h;
- struct page *page;
+ struct folio *folio;
do {
for_each_hstate(h) {
spin_lock_irq(&hugetlb_lock);
- list_for_each_entry(page, &h->hugepage_activelist, lru)
- hugetlb_cgroup_move_parent(hstate_index(h), h_cg, page);
+ list_for_each_entry(folio, &h->hugepage_activelist, lru)
+ hugetlb_cgroup_move_parent(hstate_index(h), h_cg, folio);
spin_unlock_irq(&hugetlb_lock);
}
--
2.47.0.277.g8800431eea-goog
* [RFC PATCH v1 05/10] mm/hugetlb: use folio->lru in demote_free_hugetlb_folios()
From: Fuad Tabba @ 2024-11-08 16:20 UTC (permalink / raw)
To: linux-mm
Cc: kvm, nouveau, dri-devel, david, rppt, jglisse, akpm, muchun.song,
simona, airlied, pbonzini, seanjc, willy, jgg, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat, tabba
From: David Hildenbrand <david@redhat.com>
Let's avoid messing with pages.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
mm/hugetlb.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d58bd815fdf2..a64852280213 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3806,13 +3806,15 @@ static long demote_free_hugetlb_folios(struct hstate *src, struct hstate *dst,
for (i = 0; i < pages_per_huge_page(src); i += pages_per_huge_page(dst)) {
struct page *page = folio_page(folio, i);
+ struct folio *new_folio;
page->mapping = NULL;
clear_compound_head(page);
prep_compound_page(page, dst->order);
+ new_folio = page_folio(page);
- init_new_hugetlb_folio(dst, page_folio(page));
- list_add(&page->lru, &dst_list);
+ init_new_hugetlb_folio(dst, new_folio);
+ list_add(&new_folio->lru, &dst_list);
}
}
--
2.47.0.277.g8800431eea-goog
* [RFC PATCH v1 06/10] mm/hugetlb: use separate folio->_hugetlb_list for hugetlb-internals
From: Fuad Tabba @ 2024-11-08 16:20 UTC (permalink / raw)
To: linux-mm
Cc: kvm, nouveau, dri-devel, david, rppt, jglisse, akpm, muchun.song,
simona, airlied, pbonzini, seanjc, willy, jgg, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat, tabba
From: David Hildenbrand <david@redhat.com>
Let's use a separate list head in the folio while hugetlb folios are
not isolated. This way, we can reuse folio->lru for a different purpose
(e.g., owner_ops) as long as the folio is not isolated.
Consequently, folio->lru will only be used while there is an additional
folio reference that cannot be dropped until putback/un-isolated.
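In short, an illustrative condensation of the folio_isolate_hugetlb()
and folio_putback_hugetlb() hunks below; isolation is the only window
in which ->lru is used:

	/* folio_isolate_hugetlb(): leave hugetlb's lists, hand ->lru to the caller */
	folio_clear_hugetlb_migratable(folio);
	list_del_init(&folio->_hugetlb_list);
	list_add_tail(&folio->lru, list);

	/* folio_putback_hugetlb(): reclaim ->lru, rejoin the hugetlb active list */
	folio_set_hugetlb_migratable(folio);
	list_del_init(&folio->lru);
	list_add_tail(&folio->_hugetlb_list, &(folio_hstate(folio))->hugepage_activelist);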
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/linux/mm_types.h | 18 +++++++++
mm/hugetlb.c | 81 +++++++++++++++++++++-------------------
mm/hugetlb_cgroup.c | 4 +-
mm/hugetlb_vmemmap.c | 8 ++--
4 files changed, 66 insertions(+), 45 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 80fef38d9d64..365c73be0bb4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -310,6 +310,7 @@ typedef struct {
* @_hugetlb_cgroup: Do not use directly, use accessor in hugetlb_cgroup.h.
* @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
* @_hugetlb_hwpoison: Do not use directly, call raw_hwp_list_head().
+ * @_hugetlb_list: To be used in hugetlb core code only.
* @_deferred_list: Folios to be split under memory pressure.
* @_unused_slab_obj_exts: Placeholder to match obj_exts in struct slab.
*
@@ -397,6 +398,17 @@ struct folio {
};
struct page __page_2;
};
+ union {
+ struct {
+ unsigned long _flags_3;
+ unsigned long _head_3;
+ /* public: */
+ struct list_head _hugetlb_list;
+ /* private: the union with struct page is transitional */
+ };
+ struct page __page_3;
+ };
+
};
#define FOLIO_MATCH(pg, fl) \
@@ -433,6 +445,12 @@ FOLIO_MATCH(compound_head, _head_2);
FOLIO_MATCH(flags, _flags_2a);
FOLIO_MATCH(compound_head, _head_2a);
#undef FOLIO_MATCH
+#define FOLIO_MATCH(pg, fl) \
+ static_assert(offsetof(struct folio, fl) == \
+ offsetof(struct page, pg) + 3 * sizeof(struct page))
+FOLIO_MATCH(flags, _flags_3);
+FOLIO_MATCH(compound_head, _head_3);
+#undef FOLIO_MATCH
/**
* struct ptdesc - Memory descriptor for page tables.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a64852280213..2308e94d8615 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1316,7 +1316,7 @@ static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
lockdep_assert_held(&hugetlb_lock);
VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
- list_move(&folio->lru, &h->hugepage_freelists[nid]);
+ list_move(&folio->_hugetlb_list, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
folio_set_hugetlb_freed(folio);
@@ -1329,14 +1329,14 @@ static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h,
bool pin = !!(current->flags & PF_MEMALLOC_PIN);
lockdep_assert_held(&hugetlb_lock);
- list_for_each_entry(folio, &h->hugepage_freelists[nid], lru) {
+ list_for_each_entry(folio, &h->hugepage_freelists[nid], _hugetlb_list) {
if (pin && !folio_is_longterm_pinnable(folio))
continue;
if (folio_test_hwpoison(folio))
continue;
- list_move(&folio->lru, &h->hugepage_activelist);
+ list_move(&folio->_hugetlb_list, &h->hugepage_activelist);
folio_ref_unfreeze(folio, 1);
folio_clear_hugetlb_freed(folio);
h->free_huge_pages--;
@@ -1599,7 +1599,7 @@ static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
- list_del(&folio->lru);
+ list_del(&folio->_hugetlb_list);
if (folio_test_hugetlb_freed(folio)) {
folio_clear_hugetlb_freed(folio);
@@ -1616,8 +1616,9 @@ static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
* pages. Otherwise, someone (memory error handling) may try to write
* to tail struct pages.
*/
- if (!folio_test_hugetlb_vmemmap_optimized(folio))
+ if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
__folio_clear_hugetlb(folio);
+ }
h->nr_huge_pages--;
h->nr_huge_pages_node[nid]--;
@@ -1632,7 +1633,7 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
lockdep_assert_held(&hugetlb_lock);
- INIT_LIST_HEAD(&folio->lru);
+ INIT_LIST_HEAD(&folio->_hugetlb_list);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
@@ -1640,8 +1641,8 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
h->surplus_huge_pages++;
h->surplus_huge_pages_node[nid]++;
}
-
__folio_set_hugetlb(folio);
+
folio_change_private(folio, NULL);
/*
* We have to set hugetlb_vmemmap_optimized again as above
@@ -1789,8 +1790,8 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
* hugetlb pages with vmemmap we will free up memory so that we
* can allocate vmemmap for more hugetlb pages.
*/
- list_for_each_entry_safe(folio, t_folio, non_hvo_folios, lru) {
- list_del(&folio->lru);
+ list_for_each_entry_safe(folio, t_folio, non_hvo_folios, _hugetlb_list) {
+ list_del(&folio->_hugetlb_list);
spin_lock_irq(&hugetlb_lock);
__folio_clear_hugetlb(folio);
spin_unlock_irq(&hugetlb_lock);
@@ -1808,14 +1809,14 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
* If are able to restore vmemmap and free one hugetlb page, we
* quit processing the list to retry the bulk operation.
*/
- list_for_each_entry_safe(folio, t_folio, folio_list, lru)
+ list_for_each_entry_safe(folio, t_folio, folio_list, _hugetlb_list)
if (hugetlb_vmemmap_restore_folio(h, folio)) {
- list_del(&folio->lru);
+ list_del(&folio->_hugetlb_list);
spin_lock_irq(&hugetlb_lock);
add_hugetlb_folio(h, folio, true);
spin_unlock_irq(&hugetlb_lock);
} else {
- list_del(&folio->lru);
+ list_del(&folio->_hugetlb_list);
spin_lock_irq(&hugetlb_lock);
__folio_clear_hugetlb(folio);
spin_unlock_irq(&hugetlb_lock);
@@ -1856,12 +1857,12 @@ static void update_and_free_pages_bulk(struct hstate *h,
VM_WARN_ON(ret < 0);
if (!list_empty(&non_hvo_folios) && ret) {
spin_lock_irq(&hugetlb_lock);
- list_for_each_entry(folio, &non_hvo_folios, lru)
+ list_for_each_entry(folio, &non_hvo_folios, _hugetlb_list)
__folio_clear_hugetlb(folio);
spin_unlock_irq(&hugetlb_lock);
}
- list_for_each_entry_safe(folio, t_folio, &non_hvo_folios, lru) {
+ list_for_each_entry_safe(folio, t_folio, &non_hvo_folios, _hugetlb_list) {
update_and_free_hugetlb_folio(h, folio, false);
cond_resched();
}
@@ -1959,7 +1960,7 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid)
static void init_new_hugetlb_folio(struct hstate *h, struct folio *folio)
{
__folio_set_hugetlb(folio);
- INIT_LIST_HEAD(&folio->lru);
+ INIT_LIST_HEAD(&folio->_hugetlb_list);
hugetlb_set_folio_subpool(folio, NULL);
set_hugetlb_cgroup(folio, NULL);
set_hugetlb_cgroup_rsvd(folio, NULL);
@@ -2112,7 +2113,7 @@ static void prep_and_add_allocated_folios(struct hstate *h,
/* Add all new pool pages to free lists in one lock cycle */
spin_lock_irqsave(&hugetlb_lock, flags);
- list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+ list_for_each_entry_safe(folio, tmp_f, folio_list, _hugetlb_list) {
__prep_account_new_huge_page(h, folio_nid(folio));
enqueue_hugetlb_folio(h, folio);
}
@@ -2165,7 +2166,7 @@ static struct folio *remove_pool_hugetlb_folio(struct hstate *h,
if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
!list_empty(&h->hugepage_freelists[node])) {
folio = list_entry(h->hugepage_freelists[node].next,
- struct folio, lru);
+ struct folio, _hugetlb_list);
remove_hugetlb_folio(h, folio, acct_surplus);
break;
}
@@ -2491,7 +2492,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
alloc_ok = false;
break;
}
- list_add(&folio->lru, &surplus_list);
+ list_add(&folio->_hugetlb_list, &surplus_list);
cond_resched();
}
allocated += i;
@@ -2526,7 +2527,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
ret = 0;
/* Free the needed pages to the hugetlb pool */
- list_for_each_entry_safe(folio, tmp, &surplus_list, lru) {
+ list_for_each_entry_safe(folio, tmp, &surplus_list, _hugetlb_list) {
if ((--needed) < 0)
break;
/* Add the page to the hugetlb allocator */
@@ -2539,7 +2540,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
* Free unnecessary surplus pages to the buddy allocator.
* Pages have no ref count, call free_huge_folio directly.
*/
- list_for_each_entry_safe(folio, tmp, &surplus_list, lru)
+ list_for_each_entry_safe(folio, tmp, &surplus_list, _hugetlb_list)
free_huge_folio(folio);
spin_lock_irq(&hugetlb_lock);
@@ -2588,7 +2589,7 @@ static void return_unused_surplus_pages(struct hstate *h,
if (!folio)
goto out;
- list_add(&folio->lru, &page_list);
+ list_add(&folio->_hugetlb_list, &page_list);
}
out:
@@ -3051,7 +3052,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
- list_add(&folio->lru, &h->hugepage_activelist);
+ list_add(&folio->_hugetlb_list, &h->hugepage_activelist);
folio_ref_unfreeze(folio, 1);
/* Fall through */
}
@@ -3211,7 +3212,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
/* Send list for bulk vmemmap optimization processing */
hugetlb_vmemmap_optimize_folios(h, folio_list);
- list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+ list_for_each_entry_safe(folio, tmp_f, folio_list, _hugetlb_list) {
if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
/*
* If HVO fails, initialize all tail struct pages
@@ -3260,7 +3261,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
hugetlb_folio_init_vmemmap(folio, h,
HUGETLB_VMEMMAP_RESERVE_PAGES);
init_new_hugetlb_folio(h, folio);
- list_add(&folio->lru, &folio_list);
+ list_add(&folio->_hugetlb_list, &folio_list);
/*
* We need to restore the 'stolen' pages to totalram_pages
@@ -3317,7 +3318,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
&node_states[N_MEMORY], NULL);
if (!folio)
break;
- list_add(&folio->lru, &folio_list);
+ list_add(&folio->_hugetlb_list, &folio_list);
}
cond_resched();
}
@@ -3379,7 +3380,7 @@ static void __init hugetlb_pages_alloc_boot_node(unsigned long start, unsigned l
if (!folio)
break;
- list_move(&folio->lru, &folio_list);
+ list_move(&folio->_hugetlb_list, &folio_list);
cond_resched();
}
@@ -3544,13 +3545,13 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
for_each_node_mask(i, *nodes_allowed) {
struct folio *folio, *next;
struct list_head *freel = &h->hugepage_freelists[i];
- list_for_each_entry_safe(folio, next, freel, lru) {
+ list_for_each_entry_safe(folio, next, freel, _hugetlb_list) {
if (count >= h->nr_huge_pages)
goto out;
if (folio_test_highmem(folio))
continue;
remove_hugetlb_folio(h, folio, false);
- list_add(&folio->lru, &page_list);
+ list_add(&folio->_hugetlb_list, &page_list);
}
}
@@ -3703,7 +3704,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
goto out;
}
- list_add(&folio->lru, &page_list);
+ list_add(&folio->_hugetlb_list, &page_list);
allocated++;
/* Bail for signals. Probably ctrl-c from user */
@@ -3750,7 +3751,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
if (!folio)
break;
- list_add(&folio->lru, &page_list);
+ list_add(&folio->_hugetlb_list, &page_list);
}
/* free the pages after dropping lock */
spin_unlock_irq(&hugetlb_lock);
@@ -3793,13 +3794,13 @@ static long demote_free_hugetlb_folios(struct hstate *src, struct hstate *dst,
*/
mutex_lock(&dst->resize_lock);
- list_for_each_entry_safe(folio, next, src_list, lru) {
+ list_for_each_entry_safe(folio, next, src_list, _hugetlb_list) {
int i;
if (folio_test_hugetlb_vmemmap_optimized(folio))
continue;
- list_del(&folio->lru);
+ list_del(&folio->_hugetlb_list);
split_page_owner(&folio->page, huge_page_order(src), huge_page_order(dst));
pgalloc_tag_split(folio, huge_page_order(src), huge_page_order(dst));
@@ -3814,7 +3815,7 @@ static long demote_free_hugetlb_folios(struct hstate *src, struct hstate *dst,
new_folio = page_folio(page);
init_new_hugetlb_folio(dst, new_folio);
- list_add(&new_folio->lru, &dst_list);
+ list_add(&new_folio->_hugetlb_list, &dst_list);
}
}
@@ -3847,12 +3848,12 @@ static long demote_pool_huge_page(struct hstate *src, nodemask_t *nodes_allowed,
LIST_HEAD(list);
struct folio *folio, *next;
- list_for_each_entry_safe(folio, next, &src->hugepage_freelists[node], lru) {
+ list_for_each_entry_safe(folio, next, &src->hugepage_freelists[node], _hugetlb_list) {
if (folio_test_hwpoison(folio))
continue;
remove_hugetlb_folio(src, folio, false);
- list_add(&folio->lru, &list);
+ list_add(&folio->_hugetlb_list, &list);
if (++nr_demoted == nr_to_demote)
break;
@@ -3864,8 +3865,8 @@ static long demote_pool_huge_page(struct hstate *src, nodemask_t *nodes_allowed,
spin_lock_irq(&hugetlb_lock);
- list_for_each_entry_safe(folio, next, &list, lru) {
- list_del(&folio->lru);
+ list_for_each_entry_safe(folio, next, &list, _hugetlb_list) {
+ list_del(&folio->_hugetlb_list);
add_hugetlb_folio(src, folio, false);
nr_demoted--;
@@ -7427,7 +7428,8 @@ bool folio_isolate_hugetlb(struct folio *folio, struct list_head *list)
goto unlock;
}
folio_clear_hugetlb_migratable(folio);
- list_move_tail(&folio->lru, list);
+ list_del_init(&folio->_hugetlb_list);
+ list_add_tail(&folio->lru, list);
unlock:
spin_unlock_irq(&hugetlb_lock);
return ret;
@@ -7478,7 +7480,8 @@ void folio_putback_hugetlb(struct folio *folio)
{
spin_lock_irq(&hugetlb_lock);
folio_set_hugetlb_migratable(folio);
- list_move_tail(&folio->lru, &(folio_hstate(folio))->hugepage_activelist);
+ list_del_init(&folio->lru);
+ list_add_tail(&folio->_hugetlb_list, &(folio_hstate(folio))->hugepage_activelist);
spin_unlock_irq(&hugetlb_lock);
folio_put(folio);
}
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 1bdeaf25f640..ee720eeaf6b1 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -239,7 +239,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
do {
for_each_hstate(h) {
spin_lock_irq(&hugetlb_lock);
- list_for_each_entry(folio, &h->hugepage_activelist, lru)
+ list_for_each_entry(folio, &h->hugepage_activelist, _hugetlb_list)
hugetlb_cgroup_move_parent(hstate_index(h), h_cg, folio);
spin_unlock_irq(&hugetlb_lock);
@@ -933,7 +933,7 @@ void hugetlb_cgroup_migrate(struct folio *old_folio, struct folio *new_folio)
/* move the h_cg details to new cgroup */
set_hugetlb_cgroup(new_folio, h_cg);
set_hugetlb_cgroup_rsvd(new_folio, h_cg_rsvd);
- list_move(&new_folio->lru, &h->hugepage_activelist);
+ list_move(&new_folio->_hugetlb_list, &h->hugepage_activelist);
spin_unlock_irq(&hugetlb_lock);
return;
}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 57b7f591eee8..b2cb8d328aac 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -519,7 +519,7 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
long ret = 0;
unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
- list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
+ list_for_each_entry_safe(folio, t_folio, folio_list, _hugetlb_list) {
if (folio_test_hugetlb_vmemmap_optimized(folio)) {
ret = __hugetlb_vmemmap_restore_folio(h, folio, flags);
/* only need to synchronize_rcu() once for each batch */
@@ -531,7 +531,7 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
}
/* Add non-optimized folios to output list */
- list_move(&folio->lru, non_hvo_folios);
+ list_move(&folio->_hugetlb_list, non_hvo_folios);
}
if (restored)
@@ -651,7 +651,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
LIST_HEAD(vmemmap_pages);
unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
- list_for_each_entry(folio, folio_list, lru) {
+ list_for_each_entry(folio, folio_list, _hugetlb_list) {
int ret = hugetlb_vmemmap_split_folio(h, folio);
/*
@@ -666,7 +666,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
flush_tlb_all();
- list_for_each_entry(folio, folio_list, lru) {
+ list_for_each_entry(folio, folio_list, _hugetlb_list) {
int ret;
ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
--
2.47.0.277.g8800431eea-goog
* Re: [RFC PATCH v1 06/10] mm/hugetlb: use separate folio->_hugetlb_list for hugetlb-internals
From: wang wei @ 2024-11-12 15:28 UTC (permalink / raw)
To: tabba
Cc: ackerleytng, airlied, akpm, david, dri-devel, jgg, jglisse,
jhubbard, keirf, kirill.shutemov, linux-mm, mail, maz,
muchun.song, nouveau, pbonzini, qperret, quic_eberman, roypat,
rppt, seanjc, simona, vannapurve, will, willy, wang wei
Signed-off-by: wang wei <a929244872@163.com>
---
>diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>index 80fef38d9d64..365c73be0bb4 100644
>--- a/include/linux/mm_types.h
>+++ b/include/linux/mm_types.h
>@@ -310,6 +310,7 @@ typedef struct {
> * @_hugetlb_cgroup: Do not use directly, use accessor in hugetlb_cgroup.h.
> * @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
> * @_hugetlb_hwpoison: Do not use directly, call raw_hwp_list_head().
>+ * @_hugetlb_list: To be used in hugetlb core code only.
> * @_deferred_list: Folios to be split under memory pressure.
> * @_unused_slab_obj_exts: Placeholder to match obj_exts in struct slab.
> *
>@@ -397,6 +398,17 @@ struct folio {
> };
> struct page __page_2;
> };
>+ union {
>+ struct {
>+ unsigned long _flags_3;
>+ unsigned long _head_3;
>+ /* public: */
>+ struct list_head _hugetlb_list;
>+ /* private: the union with struct page is transitional */
>+ };
>+ struct page __page_3;
>+ };
>+
> };
>
As far as I know, increasing the size of struct folio may decrease
the memory savings from HVO. Did you measure that?
> #define FOLIO_MATCH(pg, fl) \
>@@ -433,6 +445,12 @@ FOLIO_MATCH(compound_head, _head_2);
> FOLIO_MATCH(flags, _flags_2a);
> FOLIO_MATCH(compound_head, _head_2a);
> #undef FOLIO_MATCH
>+#define FOLIO_MATCH(pg, fl) \
>+ static_assert(offsetof(struct folio, fl) == \
>+ offsetof(struct page, pg) + 3 * sizeof(struct page))
>+FOLIO_MATCH(flags, _flags_3);
>+FOLIO_MATCH(compound_head, _head_3);
>+#undef FOLIO_MATCH
>
> /**
> * struct ptdesc - Memory descriptor for page tables.
--
2.25.1
* Re: [RFC PATCH v1 06/10] mm/hugetlb: use separate folio->_hugetlb_list for hugetlb-internals
From: David Hildenbrand @ 2024-11-12 15:48 UTC (permalink / raw)
To: wang wei, tabba
Cc: ackerleytng, airlied, akpm, dri-devel, jgg, jglisse, jhubbard,
keirf, kirill.shutemov, linux-mm, mail, maz, muchun.song,
nouveau, pbonzini, qperret, quic_eberman, roypat, rppt, seanjc,
simona, vannapurve, will, willy
On 12.11.24 16:28, wang wei wrote:
> Signed-off-by: wang wei <a929244872@163.com>
> ---
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 80fef38d9d64..365c73be0bb4 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -310,6 +310,7 @@ typedef struct {
>> * @_hugetlb_cgroup: Do not use directly, use accessor in hugetlb_cgroup.h.
>> * @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
>> * @_hugetlb_hwpoison: Do not use directly, call raw_hwp_list_head().
>> + * @_hugetlb_list: To be used in hugetlb core code only.
>> * @_deferred_list: Folios to be split under memory pressure.
>> * @_unused_slab_obj_exts: Placeholder to match obj_exts in struct slab.
>> *
>> @@ -397,6 +398,17 @@ struct folio {
>> };
>> struct page __page_2;
>> };
>> + union {
>> + struct {
>> + unsigned long _flags_3;
>> + unsigned long _head_3;
>> + /* public: */
>> + struct list_head _hugetlb_list;
>> + /* private: the union with struct page is transitional */
>> + };
>> + struct page __page_3;
>> + };
>> +
>> };
>>
> As far as I know, increasing the size of folio maybe decrease
> the revenue of HVO, do you measure it?
We should always have a writable page2, even with HVO. So nothing should
change at this point. Thanks!
--
Cheers,
David / dhildenb
* [RFC PATCH v1 07/10] mm: Introduce struct folio_owner_ops
From: Fuad Tabba @ 2024-11-08 16:20 UTC (permalink / raw)
To: linux-mm
Cc: kvm, nouveau, dri-devel, david, rppt, jglisse, akpm, muchun.song,
simona, airlied, pbonzini, seanjc, willy, jgg, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat, tabba
Introduce struct folio_owner_ops, a method table that contains
callbacks to owners of folios that need special handling for
certain operations. For now, it only contains a callback for
folio free(), which is called immediately after the folio
refcount drops to 0.
Add a pointer to this struct, overlaid on struct page
compound_head, pgmap, and struct page/folio lru. Users of this
struct either do not use lru at all (e.g., zone device), or can
easily tell when lru is in use (e.g., hugetlb) and handle it
accordingly. While folios are isolated, they cannot get freed and
the owner_ops field cannot be relied upon. This is sufficient for
the current use case of returning these folios to a custom
allocator.
To identify that a folio has owner_ops, we set bit 1 of the field,
similar to how bit 0 of compound_head is used to identify compound
pages.
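As a hypothetical illustration (all "my_pool" names are invented and
not part of this patch), an owner installs the ops and the final
folio_put() is then routed to its free() callback instead of the
normal freeing path:

static void my_pool_free_folio(struct folio *folio)
{
	/* Return the folio to the owner's private pool instead of freeing it. */
}

static const struct folio_owner_ops my_pool_owner_ops = {
	.free = my_pool_free_folio,
};

static void my_pool_hand_out_folio(struct folio *folio)
{
	/* Tags bit 1 of folio->owner_ops so __folio_put() can detect it. */
	folio_set_owner_ops(folio, &my_pool_owner_ops);
}

/*
 * When the last reference is dropped, folio_put() ends up in __folio_put(),
 * which finds the owner_ops and calls my_pool_free_folio(folio).
 */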
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/linux/mm_types.h | 64 +++++++++++++++++++++++++++++++++++++---
mm/swap.c | 19 ++++++++++++
2 files changed, 79 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 365c73be0bb4..6e06286f44f1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -41,10 +41,12 @@ struct mem_cgroup;
*
* If you allocate the page using alloc_pages(), you can use some of the
* space in struct page for your own purposes. The five words in the main
- * union are available, except for bit 0 of the first word which must be
- * kept clear. Many users use this word to store a pointer to an object
- * which is guaranteed to be aligned. If you use the same storage as
- * page->mapping, you must restore it to NULL before freeing the page.
+ * union are available, except for bit 0 (used for compound_head pages)
+ * and bit 1 (used for owner_ops) of the first word, which must be kept
+ * clear and used with care. Many users use this word to store a pointer
+ * to an object which is guaranteed to be aligned. If you use the same
+ * storage as page->mapping, you must restore it to NULL before freeing
+ * the page.
*
* The mapcount field must not be used for own purposes.
*
@@ -283,10 +285,16 @@ typedef struct {
unsigned long val;
} swp_entry_t;
+struct folio_owner_ops;
+
/**
* struct folio - Represents a contiguous set of bytes.
* @flags: Identical to the page flags.
* @lru: Least Recently Used list; tracks how recently this folio was used.
+ * @owner_ops: Pointer to callback operations of the folio owner. Valid if bit 1
+ * is set.
+ * NOTE: Cannot be used with lru, since it is overlaid with it. To use lru,
+ * owner_ops must be cleared first, and restored once done with lru.
* @mlock_count: Number of times this folio has been pinned by mlock().
* @mapping: The file this page belongs to, or refers to the anon_vma for
* anonymous memory.
@@ -330,6 +338,7 @@ struct folio {
unsigned long flags;
union {
struct list_head lru;
+ const struct folio_owner_ops *owner_ops; /* Bit 1 is set */
/* private: avoid cluttering the output */
struct {
void *__filler;
@@ -417,6 +426,7 @@ FOLIO_MATCH(flags, flags);
FOLIO_MATCH(lru, lru);
FOLIO_MATCH(mapping, mapping);
FOLIO_MATCH(compound_head, lru);
+FOLIO_MATCH(compound_head, owner_ops);
FOLIO_MATCH(index, index);
FOLIO_MATCH(private, private);
FOLIO_MATCH(_mapcount, _mapcount);
@@ -452,6 +462,13 @@ FOLIO_MATCH(flags, _flags_3);
FOLIO_MATCH(compound_head, _head_3);
#undef FOLIO_MATCH
+struct folio_owner_ops {
+ /*
+ * Called once the folio refcount reaches 0.
+ */
+ void (*free)(struct folio *folio);
+};
+
/**
* struct ptdesc - Memory descriptor for page tables.
* @__page_flags: Same as page flags. Powerpc only.
@@ -560,6 +577,45 @@ static inline void *folio_get_private(struct folio *folio)
return folio->private;
}
+/*
+ * Use bit 1, since bit 0 is used to indicate a compound page in compound_head,
+ * which owner_ops is overlaid with.
+ */
+#define FOLIO_OWNER_OPS_BIT 1UL
+#define FOLIO_OWNER_OPS (1UL << FOLIO_OWNER_OPS_BIT)
+
+/*
+ * Set the folio owner_ops as well as bit 1 of the pointer to indicate that the
+ * folio has owner_ops.
+ */
+static inline void folio_set_owner_ops(struct folio *folio, const struct folio_owner_ops *owner_ops)
+{
+ owner_ops = (const struct folio_owner_ops *)((unsigned long)owner_ops | FOLIO_OWNER_OPS);
+ folio->owner_ops = owner_ops;
+}
+
+/*
+ * Clear the folio owner_ops including bit 1 of the pointer.
+ */
+static inline void folio_clear_owner_ops(struct folio *folio)
+{
+ folio->owner_ops = NULL;
+}
+
+/*
+ * Return the folio's owner_ops if it has them, otherwise, return NULL.
+ */
+static inline const struct folio_owner_ops *folio_get_owner_ops(struct folio *folio)
+{
+ const struct folio_owner_ops *owner_ops = folio->owner_ops;
+
+ if (!((unsigned long)owner_ops & FOLIO_OWNER_OPS))
+ return NULL;
+
+ owner_ops = (const struct folio_owner_ops *)((unsigned long)owner_ops & ~FOLIO_OWNER_OPS);
+ return owner_ops;
+}
+
struct page_frag_cache {
void * va;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
diff --git a/mm/swap.c b/mm/swap.c
index 638a3f001676..767ff6d8f47b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -110,6 +110,13 @@ static void page_cache_release(struct folio *folio)
void __folio_put(struct folio *folio)
{
+ const struct folio_owner_ops *owner_ops = folio_get_owner_ops(folio);
+
+ if (unlikely(owner_ops)) {
+ owner_ops->free(folio);
+ return;
+ }
+
if (unlikely(folio_is_zone_device(folio))) {
free_zone_device_folio(folio);
return;
@@ -929,10 +936,22 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
for (i = 0, j = 0; i < folios->nr; i++) {
struct folio *folio = folios->folios[i];
unsigned int nr_refs = refs ? refs[i] : 1;
+ const struct folio_owner_ops *owner_ops;
if (is_huge_zero_folio(folio))
continue;
+ owner_ops = folio_get_owner_ops(folio);
+ if (unlikely(owner_ops)) {
+ if (lruvec) {
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = NULL;
+ }
+ if (folio_ref_sub_and_test(folio, nr_refs))
+ owner_ops->free(folio);
+ continue;
+ }
+
if (folio_is_zone_device(folio)) {
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
--
2.47.0.277.g8800431eea-goog
* [RFC PATCH v1 08/10] mm: Use getters and setters to access page pgmap
From: Fuad Tabba @ 2024-11-08 16:20 UTC (permalink / raw)
To: linux-mm
Cc: kvm, nouveau, dri-devel, david, rppt, jglisse, akpm, muchun.song,
simona, airlied, pbonzini, seanjc, willy, jgg, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat, tabba
The pointer to pgmap in struct page is overlaid with folio
owner_ops. To indicate that a page/folio has owner ops, bit 1 is
set. Therefore, before we can start using owner_ops, we need
to ensure that all accesses to page pgmap sanitize the pointer
value.
This patch introduces the accessors, which will be modified in
the following patch to sanitize the pointer values.
No functional change intended.
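The conversion itself is mechanical; for example (illustrative,
mirroring the is_device_private_page() hunk below, with the
surrounding zone-device checks omitted):

	/* Before: */
	return page->pgmap->type == MEMORY_DEVICE_PRIVATE;

	/* After this patch (the accessor is still a plain wrapper): */
	return page_get_pgmap(page)->type == MEMORY_DEVICE_PRIVATE;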
Signed-off-by: Fuad Tabba <tabba@google.com>
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 4 +++-
drivers/pci/p2pdma.c | 8 +++++---
include/linux/memremap.h | 6 +++---
include/linux/mm_types.h | 13 +++++++++++++
lib/test_hmm.c | 2 +-
mm/hmm.c | 2 +-
mm/memory.c | 2 +-
mm/memremap.c | 19 +++++++++++--------
mm/migrate_device.c | 4 ++--
mm/mm_init.c | 2 +-
10 files changed, 41 insertions(+), 21 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1a072568cef6..d7d9d9476bb0 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -88,7 +88,9 @@ struct nouveau_dmem {
static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
{
- return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
+ struct dev_pagemap *pgmap = page_get_pgmap(page);
+
+ return container_of(pgmap, struct nouveau_dmem_chunk, pagemap);
}
static struct nouveau_drm *page_to_drm(struct page *page)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4f47a13cb500..19519bb4ba56 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -193,7 +193,7 @@ static const struct attribute_group p2pmem_group = {
static void p2pdma_page_free(struct page *page)
{
- struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+ struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_get_pgmap(page));
/* safe to dereference while a reference is held to the percpu ref */
struct pci_p2pdma *p2pdma =
rcu_dereference_protected(pgmap->provider->p2pdma, 1);
@@ -1016,8 +1016,10 @@ enum pci_p2pdma_map_type
pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
struct scatterlist *sg)
{
- if (state->pgmap != sg_page(sg)->pgmap) {
- state->pgmap = sg_page(sg)->pgmap;
+ struct dev_pagemap *pgmap = page_get_pgmap(sg_page(sg));
+
+ if (state->pgmap != pgmap) {
+ state->pgmap = pgmap;
state->map = pci_p2pdma_map_type(state->pgmap, dev);
state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
}
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3f7143ade32c..060e27b6aee0 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -161,7 +161,7 @@ static inline bool is_device_private_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+ page_get_pgmap(page)->type == MEMORY_DEVICE_PRIVATE;
}
static inline bool folio_is_device_private(const struct folio *folio)
@@ -173,13 +173,13 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+ page_get_pgmap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
}
static inline bool is_device_coherent_page(const struct page *page)
{
return is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_COHERENT;
+ page_get_pgmap(page)->type == MEMORY_DEVICE_COHERENT;
}
static inline bool folio_is_device_coherent(const struct folio *folio)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e06286f44f1..27075ea24e67 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -616,6 +616,19 @@ static inline const struct folio_owner_ops *folio_get_owner_ops(struct folio *fo
return owner_ops;
}
+/*
+ * Get the page dev_pagemap pgmap pointer.
+ */
+#define page_get_pgmap(page) ((page)->pgmap)
+
+/*
+ * Set the page dev_pagemap pgmap pointer.
+ */
+static inline void page_set_pgmap(struct page *page, struct dev_pagemap *pgmap)
+{
+ page->pgmap = pgmap;
+}
+
struct page_frag_cache {
void * va;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 056f2e411d7b..d3e3843f57dd 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -195,7 +195,7 @@ static int dmirror_fops_release(struct inode *inode, struct file *filp)
static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
{
- return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+ return container_of(page_get_pgmap(page), struct dmirror_chunk, pagemap);
}
static struct dmirror_device *dmirror_page_to_device(struct page *page)
diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229ae4a5a..b5f5ac218fda 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -248,7 +248,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
* just report the PFN.
*/
if (is_device_private_entry(entry) &&
- pfn_swap_entry_to_page(entry)->pgmap->owner ==
+ page_get_pgmap(pfn_swap_entry_to_page(entry))->owner ==
range->dev_private_owner) {
cpu_flags = HMM_PFN_VALID;
if (is_writable_device_private_entry(entry))
diff --git a/mm/memory.c b/mm/memory.c
index 80850cad0e6f..5853fa5767c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4276,7 +4276,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+ ret = page_get_pgmap(vmf->page)->ops->migrate_to_ram(vmf);
put_page(vmf->page);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
diff --git a/mm/memremap.c b/mm/memremap.c
index 40d4547ce514..931bc85da1df 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -458,8 +458,9 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
void free_zone_device_folio(struct folio *folio)
{
- if (WARN_ON_ONCE(!folio->page.pgmap->ops ||
- !folio->page.pgmap->ops->page_free))
+ struct dev_pagemap *pgmap = page_get_pgmap(&folio->page);
+
+ if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
return;
mem_cgroup_uncharge(folio);
@@ -486,17 +487,17 @@ void free_zone_device_folio(struct folio *folio)
* to clear folio->mapping.
*/
folio->mapping = NULL;
- folio->page.pgmap->ops->page_free(folio_page(folio, 0));
+ pgmap->ops->page_free(folio_page(folio, 0));
- if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
- folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
+ if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
+ pgmap->type != MEMORY_DEVICE_COHERENT)
/*
* Reset the refcount to 1 to prepare for handing out the page
* again.
*/
folio_set_count(folio, 1);
else
- put_dev_pagemap(folio->page.pgmap);
+ put_dev_pagemap(pgmap);
}
void zone_device_page_init(struct page *page)
@@ -505,7 +506,7 @@ void zone_device_page_init(struct page *page)
* Drivers shouldn't be allocating pages after calling
* memunmap_pages().
*/
- WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
+ WARN_ON_ONCE(!percpu_ref_tryget_live(&page_get_pgmap(page)->ref));
set_page_count(page, 1);
lock_page(page);
}
@@ -514,7 +515,9 @@ EXPORT_SYMBOL_GPL(zone_device_page_init);
#ifdef CONFIG_FS_DAX
bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
{
- if (folio->page.pgmap->type != MEMORY_DEVICE_FS_DAX)
+ struct dev_pagemap *pgmap = page_get_pgmap(&folio->page);
+
+ if (pgmap->type != MEMORY_DEVICE_FS_DAX)
return false;
/*
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 9cf26592ac93..368def358d02 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -135,7 +135,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
page = pfn_swap_entry_to_page(entry);
if (!(migrate->flags &
MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
- page->pgmap->owner != migrate->pgmap_owner)
+ page_get_pgmap(page)->owner != migrate->pgmap_owner)
goto next;
mpfn = migrate_pfn(page_to_pfn(page)) |
@@ -156,7 +156,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
goto next;
else if (page && is_device_coherent_page(page) &&
(!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
- page->pgmap->owner != migrate->pgmap_owner))
+ page_get_pgmap(page)->owner != migrate->pgmap_owner))
goto next;
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 1c205b0a86ed..279cdaebfd2b 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -995,7 +995,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
* and zone_device_data. It is a bug if a ZONE_DEVICE page is
* ever freed or placed on a driver-private list.
*/
- page->pgmap = pgmap;
+ page_set_pgmap(page, pgmap);
page->zone_device_data = NULL;
/*
--
2.47.0.277.g8800431eea-goog
* [RFC PATCH v1 09/10] mm: Use owner_ops on folio_put for zone device pages
From: Fuad Tabba @ 2024-11-08 16:20 UTC (permalink / raw)
To: linux-mm
Cc: kvm, nouveau, dri-devel, david, rppt, jglisse, akpm, muchun.song,
simona, airlied, pbonzini, seanjc, willy, jgg, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat, tabba
Now that we have the folio_owner_ops callback, use it for zone
device pages instead of using a dedicated callback.
Note that struct dev_pagemap (pgmap) in struct page is overlaid
with struct folio owner_ops. Therefore, make struct dev_pagemap
contain an instance of struct folio_owner_ops as its first member,
so that a tagged pgmap pointer can be handled the same way as a
folio owner_ops pointer.
Also note that, although struct dev_pagemap_ops has a page_free()
function, it does not have the same intent or behavior as the
folio_owner_ops free() callback: page_free() is an optional
callback that merely informs ZONE_DEVICE drivers that the page is
being freed.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/linux/memremap.h | 8 +++++++
include/linux/mm_types.h | 16 ++++++++++++--
mm/internal.h | 1 -
mm/memremap.c | 44 --------------------------------------
mm/mm_init.c | 46 ++++++++++++++++++++++++++++++++++++++++
mm/swap.c | 18 ++--------------
6 files changed, 70 insertions(+), 63 deletions(-)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 060e27b6aee0..5b68bbc588a3 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -106,6 +106,7 @@ struct dev_pagemap_ops {
/**
* struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @folio_ops: method table for folio operations.
* @altmap: pre-allocated/reserved memory for vmemmap allocations
* @ref: reference count that pins the devm_memremap_pages() mapping
* @done: completion for @ref
@@ -125,6 +126,7 @@ struct dev_pagemap_ops {
* @ranges: array of ranges to be mapped when nr_range > 1
*/
struct dev_pagemap {
+ struct folio_owner_ops folio_ops;
struct vmem_altmap altmap;
struct percpu_ref ref;
struct completion done;
@@ -140,6 +142,12 @@ struct dev_pagemap {
};
};
+/*
+ * The folio_owner_ops structure needs to be first since pgmap in struct page is
+ * overlaid with owner_ops in struct folio.
+ */
+static_assert(offsetof(struct dev_pagemap, folio_ops) == 0);
+
static inline bool pgmap_has_memory_failure(struct dev_pagemap *pgmap)
{
return pgmap->ops && pgmap->ops->memory_failure;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 27075ea24e67..a72fda20d5e9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -427,6 +427,7 @@ FOLIO_MATCH(lru, lru);
FOLIO_MATCH(mapping, mapping);
FOLIO_MATCH(compound_head, lru);
FOLIO_MATCH(compound_head, owner_ops);
+FOLIO_MATCH(pgmap, owner_ops);
FOLIO_MATCH(index, index);
FOLIO_MATCH(private, private);
FOLIO_MATCH(_mapcount, _mapcount);
@@ -618,15 +619,26 @@ static inline const struct folio_owner_ops *folio_get_owner_ops(struct folio *fo
/*
* Get the page dev_pagemap pgmap pointer.
+ *
+ * The page pgmap is overlaid with the folio owner_ops, where bit 1 is used to
+ * indicate that the page/folio has owner ops. The dev_pagemap contains
+ * owner_ops and is handled the same way. The getter returns a sanitized
+ * pointer.
*/
-#define page_get_pgmap(page) ((page)->pgmap)
+#define page_get_pgmap(page) \
+ ((struct dev_pagemap *)((unsigned long)(page)->pgmap & ~FOLIO_OWNER_OPS))
/*
* Set the page dev_pagemap pgmap pointer.
+ *
+ * The page pgmap is overlaid with the folio owner_ops, where bit 1 is used to
+ * indicate that the page/folio has owner ops. The dev_pagemap contains
+ * owner_ops and is handled the same way. The setter sets bit 1 to indicate
+ * that the page has owner_ops.
*/
static inline void page_set_pgmap(struct page *page, struct dev_pagemap *pgmap)
{
- page->pgmap = pgmap;
+ page->pgmap = (struct dev_pagemap *)((unsigned long)pgmap | FOLIO_OWNER_OPS);
}
struct page_frag_cache {
diff --git a/mm/internal.h b/mm/internal.h
index 5a7302baeed7..a041247bed10 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1262,7 +1262,6 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
unsigned long addr, int *flags, bool writable,
int *last_cpupid);
-void free_zone_device_folio(struct folio *folio);
int migrate_device_coherent_folio(struct folio *folio);
struct vm_struct *__get_vm_area_node(unsigned long size,
diff --git a/mm/memremap.c b/mm/memremap.c
index 931bc85da1df..9fd5f57219eb 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -456,50 +456,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
}
EXPORT_SYMBOL_GPL(get_dev_pagemap);
-void free_zone_device_folio(struct folio *folio)
-{
- struct dev_pagemap *pgmap = page_get_pgmap(&folio->page);
-
- if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
- return;
-
- mem_cgroup_uncharge(folio);
-
- /*
- * Note: we don't expect anonymous compound pages yet. Once supported
- * and we could PTE-map them similar to THP, we'd have to clear
- * PG_anon_exclusive on all tail pages.
- */
- if (folio_test_anon(folio)) {
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
- __ClearPageAnonExclusive(folio_page(folio, 0));
- }
-
- /*
- * When a device managed page is freed, the folio->mapping field
- * may still contain a (stale) mapping value. For example, the
- * lower bits of folio->mapping may still identify the folio as an
- * anonymous folio. Ultimately, this entire field is just stale
- * and wrong, and it will cause errors if not cleared.
- *
- * For other types of ZONE_DEVICE pages, migration is either
- * handled differently or not done at all, so there is no need
- * to clear folio->mapping.
- */
- folio->mapping = NULL;
- pgmap->ops->page_free(folio_page(folio, 0));
-
- if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
- pgmap->type != MEMORY_DEVICE_COHERENT)
- /*
- * Reset the refcount to 1 to prepare for handing out the page
- * again.
- */
- folio_set_count(folio, 1);
- else
- put_dev_pagemap(pgmap);
-}
-
void zone_device_page_init(struct page *page)
{
/*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 279cdaebfd2b..47c1f8fd4914 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -974,6 +974,51 @@ static void __init memmap_init(void)
}
#ifdef CONFIG_ZONE_DEVICE
+
+static void free_zone_device_folio(struct folio *folio)
+{
+ struct dev_pagemap *pgmap = page_get_pgmap(&folio->page);
+
+ if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
+ return;
+
+ mem_cgroup_uncharge(folio);
+
+ /*
+ * Note: we don't expect anonymous compound pages yet. Once supported
+ * and we could PTE-map them similar to THP, we'd have to clear
+ * PG_anon_exclusive on all tail pages.
+ */
+ if (folio_test_anon(folio)) {
+ VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
+ __ClearPageAnonExclusive(folio_page(folio, 0));
+ }
+
+ /*
+ * When a device managed page is freed, the folio->mapping field
+ * may still contain a (stale) mapping value. For example, the
+ * lower bits of folio->mapping may still identify the folio as an
+ * anonymous folio. Ultimately, this entire field is just stale
+ * and wrong, and it will cause errors if not cleared.
+ *
+ * For other types of ZONE_DEVICE pages, migration is either
+ * handled differently or not done at all, so there is no need
+ * to clear folio->mapping.
+ */
+ folio->mapping = NULL;
+ pgmap->ops->page_free(folio_page(folio, 0));
+
+ if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
+ pgmap->type != MEMORY_DEVICE_COHERENT)
+ /*
+ * Reset the refcount to 1 to prepare for handing out the page
+ * again.
+ */
+ folio_set_count(folio, 1);
+ else
+ put_dev_pagemap(pgmap);
+}
+
static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
unsigned long zone_idx, int nid,
struct dev_pagemap *pgmap)
@@ -995,6 +1040,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
* and zone_device_data. It is a bug if a ZONE_DEVICE page is
* ever freed or placed on a driver-private list.
*/
+ pgmap->folio_ops.free = free_zone_device_folio;
page_set_pgmap(page, pgmap);
page->zone_device_data = NULL;
diff --git a/mm/swap.c b/mm/swap.c
index 767ff6d8f47b..d2578465e270 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -117,11 +117,6 @@ void __folio_put(struct folio *folio)
return;
}
- if (unlikely(folio_is_zone_device(folio))) {
- free_zone_device_folio(folio);
- return;
- }
-
if (folio_test_hugetlb(folio)) {
free_huge_folio(folio);
return;
@@ -947,20 +942,11 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- if (folio_ref_sub_and_test(folio, nr_refs))
- owner_ops->free(folio);
- continue;
- }
-
- if (folio_is_zone_device(folio)) {
- if (lruvec) {
- unlock_page_lruvec_irqrestore(lruvec, flags);
- lruvec = NULL;
- }
+ /* fenced by folio_is_zone_device() */
if (put_devmap_managed_folio_refs(folio, nr_refs))
continue;
if (folio_ref_sub_and_test(folio, nr_refs))
- free_zone_device_folio(folio);
+ owner_ops->free(folio);
continue;
}
--
2.47.0.277.g8800431eea-goog
^ permalink raw reply [flat|nested] 23+ messages in thread
* [RFC PATCH v1 10/10] mm: hugetlb: Use owner_ops on folio_put for hugetlb
2024-11-08 16:20 [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops Fuad Tabba
` (8 preceding siblings ...)
2024-11-08 16:20 ` [RFC PATCH v1 09/10] mm: Use owner_ops on folio_put for zone device pages Fuad Tabba
@ 2024-11-08 16:20 ` Fuad Tabba
2024-11-08 17:05 ` [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops Jason Gunthorpe
10 siblings, 0 replies; 23+ messages in thread
From: Fuad Tabba @ 2024-11-08 16:20 UTC (permalink / raw)
To: linux-mm
Cc: kvm, nouveau, dri-devel, david, rppt, jglisse, akpm, muchun.song,
simona, airlied, pbonzini, seanjc, willy, jgg, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat, tabba
Now that we have the folio_owner_ops callback, use it for hugetlb
pages instead of using a dedicated callback.
Since owner_ops is overlaid with lru, we need to unset owner_ops
to allow the use of lru while the folio is isolated. At that point
we know that the reference count is elevated, will not reach 0,
and thus cannot trigger the callback. Therefore, it is safe to do
so, provided we restore owner_ops before putting the folio back.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/linux/hugetlb.h | 2 --
mm/hugetlb.c | 57 +++++++++++++++++++++++++++++++++--------
mm/swap.c | 14 ----------
3 files changed, 47 insertions(+), 26 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e846d7dac77c..500848862702 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -20,8 +20,6 @@ struct user_struct;
struct mmu_gather;
struct node;
-void free_huge_folio(struct folio *folio);
-
#ifdef CONFIG_HUGETLB_PAGE
#include <linux/pagemap.h>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2308e94d8615..4e1c87e37968 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -89,6 +89,33 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
static struct resv_map *vma_resv_map(struct vm_area_struct *vma);
+static void free_huge_folio(struct folio *folio);
+
+static const struct folio_owner_ops hugetlb_owner_ops = {
+ .free = free_huge_folio,
+};
+
+/*
+ * Mark this folio as a hugetlb-owned folio.
+ *
+ * Set the folio hugetlb flag and owner operations.
+ */
+static void folio_set_hugetlb_owner(struct folio *folio)
+{
+ __folio_set_hugetlb(folio);
+ folio_set_owner_ops(folio, &hugetlb_owner_ops);
+}
+
+/*
+ * Unmark this folio from being a hugetlb-owned folio.
+ *
+ * Clear the folio hugetlb flag and owner operations.
+ */
+static void folio_clear_hugetlb_owner(struct folio *folio)
+{
+ folio_clear_owner_ops(folio);
+ __folio_clear_hugetlb(folio);
+}
static void hugetlb_free_folio(struct folio *folio)
{
@@ -1617,7 +1644,7 @@ static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
* to tail struct pages.
*/
if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
- __folio_clear_hugetlb(folio);
+ folio_clear_hugetlb_owner(folio);
}
h->nr_huge_pages--;
@@ -1641,7 +1668,7 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
h->surplus_huge_pages++;
h->surplus_huge_pages_node[nid]++;
}
- __folio_set_hugetlb(folio);
+ folio_set_hugetlb_owner(folio);
folio_change_private(folio, NULL);
/*
@@ -1692,7 +1719,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
*/
if (folio_test_hugetlb(folio)) {
spin_lock_irq(&hugetlb_lock);
- __folio_clear_hugetlb(folio);
+ folio_clear_hugetlb_owner(folio);
spin_unlock_irq(&hugetlb_lock);
}
@@ -1793,7 +1820,7 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
list_for_each_entry_safe(folio, t_folio, non_hvo_folios, _hugetlb_list) {
list_del(&folio->_hugetlb_list);
spin_lock_irq(&hugetlb_lock);
- __folio_clear_hugetlb(folio);
+ folio_clear_hugetlb_owner(folio);
spin_unlock_irq(&hugetlb_lock);
update_and_free_hugetlb_folio(h, folio, false);
cond_resched();
@@ -1818,7 +1845,7 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
} else {
list_del(&folio->_hugetlb_list);
spin_lock_irq(&hugetlb_lock);
- __folio_clear_hugetlb(folio);
+ folio_clear_hugetlb_owner(folio);
spin_unlock_irq(&hugetlb_lock);
update_and_free_hugetlb_folio(h, folio, false);
cond_resched();
@@ -1851,14 +1878,14 @@ static void update_and_free_pages_bulk(struct hstate *h,
* should only be pages on the non_hvo_folios list.
* Do note that the non_hvo_folios list could be empty.
* Without HVO enabled, ret will be 0 and there is no need to call
- * __folio_clear_hugetlb as this was done previously.
+ * folio_clear_hugetlb_owner as this was done previously.
*/
VM_WARN_ON(!list_empty(folio_list));
VM_WARN_ON(ret < 0);
if (!list_empty(&non_hvo_folios) && ret) {
spin_lock_irq(&hugetlb_lock);
list_for_each_entry(folio, &non_hvo_folios, _hugetlb_list)
- __folio_clear_hugetlb(folio);
+ folio_clear_hugetlb_owner(folio);
spin_unlock_irq(&hugetlb_lock);
}
@@ -1879,7 +1906,7 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
}
-void free_huge_folio(struct folio *folio)
+static void free_huge_folio(struct folio *folio)
{
/*
* Can't pass hstate in here because it is called from the
@@ -1959,7 +1986,7 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid)
static void init_new_hugetlb_folio(struct hstate *h, struct folio *folio)
{
- __folio_set_hugetlb(folio);
+ folio_set_hugetlb_owner(folio);
INIT_LIST_HEAD(&folio->_hugetlb_list);
hugetlb_set_folio_subpool(folio, NULL);
set_hugetlb_cgroup(folio, NULL);
@@ -7428,6 +7455,14 @@ bool folio_isolate_hugetlb(struct folio *folio, struct list_head *list)
goto unlock;
}
folio_clear_hugetlb_migratable(folio);
+ /*
+ * Clear folio->owner_ops; now we can use folio->lru.
+ * Note that the folio cannot get freed because we are holding a
+ * reference. The reference will be put in folio_putback_hugetlb(),
+ * after restoring folio->owner_ops.
+ */
+ folio_clear_owner_ops(folio);
+ INIT_LIST_HEAD(&folio->lru);
list_del_init(&folio->_hugetlb_list);
list_add_tail(&folio->lru, list);
unlock:
@@ -7480,7 +7515,9 @@ void folio_putback_hugetlb(struct folio *folio)
{
spin_lock_irq(&hugetlb_lock);
folio_set_hugetlb_migratable(folio);
- list_del_init(&folio->lru);
+ list_del(&folio->lru);
+ /* Restore folio->owner_ops since we can no longer use folio->lru. */
+ folio_set_owner_ops(folio, &hugetlb_owner_ops);
list_add_tail(&folio->_hugetlb_list, &(folio_hstate(folio))->hugepage_activelist);
spin_unlock_irq(&hugetlb_lock);
folio_put(folio);
diff --git a/mm/swap.c b/mm/swap.c
index d2578465e270..9798ca47f26a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -117,11 +117,6 @@ void __folio_put(struct folio *folio)
return;
}
- if (folio_test_hugetlb(folio)) {
- free_huge_folio(folio);
- return;
- }
-
page_cache_release(folio);
folio_unqueue_deferred_split(folio);
mem_cgroup_uncharge(folio);
@@ -953,15 +948,6 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
if (!folio_ref_sub_and_test(folio, nr_refs))
continue;
- /* hugetlb has its own memcg */
- if (folio_test_hugetlb(folio)) {
- if (lruvec) {
- unlock_page_lruvec_irqrestore(lruvec, flags);
- lruvec = NULL;
- }
- free_huge_folio(folio);
- continue;
- }
folio_unqueue_deferred_split(folio);
__page_cache_release(folio, &lruvec, &flags);
--
2.47.0.277.g8800431eea-goog
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops
2024-11-08 16:20 [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops Fuad Tabba
` (9 preceding siblings ...)
2024-11-08 16:20 ` [RFC PATCH v1 10/10] mm: hugetlb: Use owner_ops on folio_put for hugetlb Fuad Tabba
@ 2024-11-08 17:05 ` Jason Gunthorpe
2024-11-08 19:33 ` David Hildenbrand
10 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2024-11-08 17:05 UTC (permalink / raw)
To: Fuad Tabba
Cc: linux-mm, kvm, nouveau, dri-devel, david, rppt, jglisse, akpm,
muchun.song, simona, airlied, pbonzini, seanjc, willy, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat
On Fri, Nov 08, 2024 at 04:20:30PM +0000, Fuad Tabba wrote:
> Some folios, such as hugetlb folios and zone device folios,
> require special handling when the folio's reference count reaches
> 0, before being freed. Moreover, guest_memfd folios will likely
> require special handling to notify it once a folio's reference
> count reaches 0, to facilitate shared to private folio conversion
> [*]. Currently, each usecase has a dedicated callback when the
> folio refcount reaches 0 to that effect. Adding yet more
> callbacks is not ideal.
Honestly, I question this thesis. How complex would it be to have 'yet
more callbacks'? Is the challenge really that the mm can't detect when
guestmemfd is the owner of the page because the page will be
ZONE_NORMAL?
So the point of this is really to allow ZONE_NORMAL pages to have a
per-allocator callback?
But this is also why I suggested to shift them to ZONE_DEVICE for
guestmemfd, because then you get these things for free from the pgmap.
(this is not a disagreement - this is a valid solution - but a request
that you explain much more about what it is you actually need and
compare it with the other existing options)
Jason
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops
2024-11-08 17:05 ` [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops Jason Gunthorpe
@ 2024-11-08 19:33 ` David Hildenbrand
2024-11-11 8:26 ` Fuad Tabba
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2024-11-08 19:33 UTC (permalink / raw)
To: Jason Gunthorpe, Fuad Tabba
Cc: linux-mm, kvm, nouveau, dri-devel, rppt, jglisse, akpm,
muchun.song, simona, airlied, pbonzini, seanjc, willy, jhubbard,
ackerleytng, vannapurve, mail, kirill.shutemov, quic_eberman,
maz, will, qperret, keirf, roypat
On 08.11.24 18:05, Jason Gunthorpe wrote:
> On Fri, Nov 08, 2024 at 04:20:30PM +0000, Fuad Tabba wrote:
>> Some folios, such as hugetlb folios and zone device folios,
>> require special handling when the folio's reference count reaches
>> 0, before being freed. Moreover, guest_memfd folios will likely
>> require special handling to notify it once a folio's reference
>> count reaches 0, to facilitate shared to private folio conversion
>> [*]. Currently, each usecase has a dedicated callback when the
>> folio refcount reaches 0 to that effect. Adding yet more
>> callbacks is not ideal.
>
Thanks for having a look!
Replying to clarify some things. Fuad, feel free to add additional
information.
> Honestly, I question this thesis. How complex would it be to have 'yet
> more callbacks'? Is the challenge really that the mm can't detect when
> guestmemfd is the owner of the page because the page will be
> ZONE_NORMAL?
Fuad might have been a bit imprecise here: we don't want an ever-growing
list of checks+callbacks on the page freeing fast path.
This series replaces the two cases we have with a single generic one,
which is nice independent of guest_memfd, I think.
>
> So the point of this is really to allow ZONE_NORMAL pages to have a
> per-allocator callback?
To intercept the refcount going to zero independent of any zones or
magic page types, with as little overhead as possible in the common
page freeing path.
It can be used to implement custom allocators, as is factored out for
hugetlb in this series. It's not necessarily limited to that, though. It
can be used as a form of "asynchronous page ref freezing", where you get
notified once all references are gone.
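To make the intended usage a bit more concrete (untested sketch; everything
prefixed my_ is made up, only folio_set_owner_ops()/folio_clear_owner_ops()
and the .free callback are what this series adds):

static void my_owner_free_folio(struct folio *folio)
{
	/* Called once the refcount has dropped to 0: recycle or clean up. */
}

static const struct folio_owner_ops my_owner_ops = {
	.free = my_owner_free_folio,
};

static void my_owner_adopt_folio(struct folio *folio)
{
	/* owner_ops overlays folio->lru, so only while not on an LRU list. */
	folio_set_owner_ops(folio, &my_owner_ops);
}

static void my_owner_release_folio(struct folio *folio)
{
	/* Hand the folio back to the normal freeing path. */
	folio_clear_owner_ops(folio);
}

The owner then gets its .free() callback instead of the folio going
straight back to the buddy.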
(I might have another use case with PageOffline, where we want to
prevent virtio-mem ones of them from getting accidentally leaked into
the buddy during memory offlining with speculative references --
virtio_mem_fake_offline_going_offline() contains the interesting bits.
But I did not look into the dirty details yet, just some thought where
we'd want to intercept the refcount going to 0.)
>
> But this is also why I suggested to shift them to ZONE_DEVICE for
> guestmemfd, because then you get these things for free from the pgmap.
With this series even hugetlb gets it for "free", and hugetlb is not
quite the nail for the ZONE_DEVICE hammer IMHO :)
For things we can statically set aside early during boot and never
really want to return to the buddy/another allocator, I would agree that
static ZONE_DEVICE would have been possible.
Whenever the buddy or other allocators are involved, and we might have
a granularity of a handful of pages (e.g., taken from the buddy), getting
ZONE_DEVICE involved is not a good (or even feasible) approach.
After all, all we want is to intercept the refcount going to 0.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops
2024-11-08 19:33 ` David Hildenbrand
@ 2024-11-11 8:26 ` Fuad Tabba
2024-11-12 5:26 ` Matthew Wilcox
0 siblings, 1 reply; 23+ messages in thread
From: Fuad Tabba @ 2024-11-11 8:26 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jason Gunthorpe, linux-mm, kvm, nouveau, dri-devel, rppt,
jglisse, akpm, muchun.song, simona, airlied, pbonzini, seanjc,
willy, jhubbard, ackerleytng, vannapurve, mail, kirill.shutemov,
quic_eberman, maz, will, qperret, keirf, roypat
Hi Jason and David,
On Fri, 8 Nov 2024 at 19:33, David Hildenbrand <david@redhat.com> wrote:
>
> On 08.11.24 18:05, Jason Gunthorpe wrote:
> > On Fri, Nov 08, 2024 at 04:20:30PM +0000, Fuad Tabba wrote:
> >> Some folios, such as hugetlb folios and zone device folios,
> >> require special handling when the folio's reference count reaches
> >> 0, before being freed. Moreover, guest_memfd folios will likely
> >> require special handling to notify it once a folio's reference
> >> count reaches 0, to facilitate shared to private folio conversion
> >> [*]. Currently, each usecase has a dedicated callback when the
> >> folio refcount reaches 0 to that effect. Adding yet more
> >> callbacks is not ideal.
> >
>
> Thanks for having a look!
>
> Replying to clarify some things. Fuad, feel free to add additional
> information.
Thanks for your comments Jason, and for clarifying my cover letter
David. I think David has covered everything, and I'll make sure to
clarify this in the cover letter when I respin.
Cheers,
/fuad
>
> > Honestly, I question this thesis. How complex would it be to have 'yet
> > more callbacks'? Is the challenge really that the mm can't detect when
> > guestmemfd is the owner of the page because the page will be
> > ZONE_NORMAL?
>
> Fuad might have been a bit imprecise here: We don't want an ever growing
> list of checks+callbacks on the page freeing fast path.
>
> This series replaces the two cases we have by a single generic one,
> which is nice independent of guest_memfd I think.
>
> >
> > So the point of this is really to allow ZONE_NORMAL pages to have a
> > per-allocator callback?
>
> To intercept the refcount going to zero independent of any zones or
> magic page types, without as little overhead in the common page freeing
> path.
>
> It can be used to implement custom allocators, like factored out for
> hugetlb in this series. It's not necessarily limited to that, though. It
> can be used as a form of "asynchronous page ref freezing", where you get
> notified once all references are gone.
>
> (I might have another use case with PageOffline, where we want to
> prevent virtio-mem ones of them from getting accidentally leaked into
> the buddy during memory offlining with speculative references --
> virtio_mem_fake_offline_going_offline() contains the interesting bits.
> But I did not look into the dirty details yet, just some thought where
> we'd want to intercept the refcount going to 0.)
>
> >
> > But this is also why I suggested to shift them to ZONE_DEVICE for
> > guestmemfd, because then you get these things for free from the pgmap.
>
> With this series even hugetlb gets it for "free", and hugetlb is not
> quite the nail for the ZONE_DEVICE hammer IMHO :)
>
> For things we can statically set aside early during boot and never
> really want to return to the buddy/another allocator, I would agree that
> static ZONE_DEVICE would have possible.
>
> Whenever the buddy or other allocators are involved, and we might have
> granularity as a handful of pages (e.g., taken from the buddy), getting
> ZONE_DEVICE involved is not a good (or even feasible) approach.
>
> After all, all we want is intercept the refcount going to 0.
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops
2024-11-11 8:26 ` Fuad Tabba
@ 2024-11-12 5:26 ` Matthew Wilcox
2024-11-12 9:10 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2024-11-12 5:26 UTC (permalink / raw)
To: Fuad Tabba
Cc: David Hildenbrand, Jason Gunthorpe, linux-mm, kvm, nouveau,
dri-devel, rppt, jglisse, akpm, muchun.song, simona, airlied,
pbonzini, seanjc, jhubbard, ackerleytng, vannapurve, mail,
kirill.shutemov, quic_eberman, maz, will, qperret, keirf, roypat
On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
> Thanks for your comments Jason, and for clarifying my cover letter
> David. I think David has covered everything, and I'll make sure to
> clarify this in the cover letter when I respin.
I don't want you to respin. I think this is a bad idea.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops
2024-11-12 5:26 ` Matthew Wilcox
@ 2024-11-12 9:10 ` David Hildenbrand
2024-11-12 13:53 ` Jason Gunthorpe
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2024-11-12 9:10 UTC (permalink / raw)
To: Matthew Wilcox, Fuad Tabba
Cc: Jason Gunthorpe, linux-mm, kvm, nouveau, dri-devel, rppt,
jglisse, akpm, muchun.song, simona, airlied, pbonzini, seanjc,
jhubbard, ackerleytng, vannapurve, mail, kirill.shutemov,
quic_eberman, maz, will, qperret, keirf, roypat
On 12.11.24 06:26, Matthew Wilcox wrote:
> On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
>> Thanks for your comments Jason, and for clarifying my cover letter
>> David. I think David has covered everything, and I'll make sure to
>> clarify this in the cover letter when I respin.
>
> I don't want you to respin. I think this is a bad idea.
I'm hoping you'll find some more time to explain what exactly you don't
like, because this series only refactors what we already have.
I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c.
I don't particularly enjoy overlaying folio->lru, primarily because we
have to temporarily "evacuate" it when someone wants to make use of
folio->lru (e.g., hugetlb isolation). So it's not completely "sticky",
at least for hugetlb.
Overlaying folio->mapping, similar to how "struct movable_operations"
overlays page->mapping, is not an option, because folio->mapping will be
used for other purposes.
We'd need some sticky and reliable way to tell the folio freeing code
that someone wants to intercept when the refcount of that folio goes to
0, and to identify who to notify.
Maybe folio->private/page->private could be overlaid? hugetlb only uses
folio->private for flags, which we could move to some other tail page
(e.g., simply putting them into flags1).
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops
2024-11-12 9:10 ` David Hildenbrand
@ 2024-11-12 13:53 ` Jason Gunthorpe
2024-11-12 14:22 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2024-11-12 13:53 UTC (permalink / raw)
To: David Hildenbrand
Cc: Matthew Wilcox, Fuad Tabba, linux-mm, kvm, nouveau, dri-devel,
rppt, jglisse, akpm, muchun.song, simona, airlied, pbonzini,
seanjc, jhubbard, ackerleytng, vannapurve, mail, kirill.shutemov,
quic_eberman, maz, will, qperret, keirf, roypat
On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
> On 12.11.24 06:26, Matthew Wilcox wrote:
> > On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
> > > Thanks for your comments Jason, and for clarifying my cover letter
> > > David. I think David has covered everything, and I'll make sure to
> > > clarify this in the cover letter when I respin.
> >
> > I don't want you to respin. I think this is a bad idea.
>
> I'm hoping you'll find some more time to explain what exactly you don't
> like, because this series only refactors what we already have.
>
> I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c.
>
> I don't particularly enjoy overlaying folio->lru, primarily because we have
> to temporarily "evacuate" it when someone wants to make use of folio->lru
> (e.g., hugetlb isolation). So it's not completely "sticky", at least for
> hugetlb.
This is really the worst part of it, though.
And, IMHO, seems like overkill. We have only a handful of cases -
maybe we shouldn't be trying to get to full generality but just handle
a couple of cases directly? I don't really think it is such a bad
thing to have an if ladder on the free path if we have only a couple of
things. It certainly looks better than doing overlaying tricks.
Also how does this translate to Matthew's memdesc world?
Jason
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops
2024-11-12 13:53 ` Jason Gunthorpe
@ 2024-11-12 14:22 ` David Hildenbrand
2024-11-13 4:57 ` Matthew Wilcox
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2024-11-12 14:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Matthew Wilcox, Fuad Tabba, linux-mm, kvm, nouveau, dri-devel,
rppt, jglisse, akpm, muchun.song, simona, airlied, pbonzini,
seanjc, jhubbard, ackerleytng, vannapurve, mail, kirill.shutemov,
quic_eberman, maz, will, qperret, keirf, roypat
On 12.11.24 14:53, Jason Gunthorpe wrote:
> On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
>> On 12.11.24 06:26, Matthew Wilcox wrote:
>>> On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
>>>> Thanks for your comments Jason, and for clarifying my cover letter
>>>> David. I think David has covered everything, and I'll make sure to
>>>> clarify this in the cover letter when I respin.
>>>
>>> I don't want you to respin. I think this is a bad idea.
>>
>> I'm hoping you'll find some more time to explain what exactly you don't
>> like, because this series only refactors what we already have.
>>
>> I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c.
>>
>> I don't particularly enjoy overlaying folio->lru, primarily because we have
>> to temporarily "evacuate" it when someone wants to make use of folio->lru
>> (e.g., hugetlb isolation). So it's not completely "sticky", at least for
>> hugetlb.
>
> This is really the worst part of it though
Yes.
>
> And, IMHO, seems like overkill. We have only a handful of cases -
> maybe we shouldn't be trying to get to full generality but just handle
> a couple of cases directly? I don't really think it is such a bad
> thing to have an if ladder on the free path if we have only a couple
> things. Certainly it looks good instead of doing overlaying tricks.
I'd really like to abstract hugetlb handling if possible. The way it
stands, it's just very odd.
We'll need some reliable way to identify these folios that need care.
guest_memfd will be using folio->mapcount for now, so we can't set a
page type like hugetlb does.
> Also how does this translate to Matthew's memdesc world?
guest_memfd and hugetlb would be operating on folios (at least for now),
which contain the refcount, lru, private, ... so nothing special there.
Once we actually decouple "struct folio" from "struct page", we *might*
have to play fewer tricks, because we could just have a callback pointer
there. But well, maybe we also want to save some space in there.
Do we want dedicated memdescs for hugetlb/guest_memfd that extend folios
in the future? I don't know, maybe.
I'm currently wondering if we can use folio->private for the time being.
Either
(a) If folio->private is still set once the refcount drops to 0, it
indicates that there is a freeing callback/owner_ops. We'll have to make
hugetlb not use folio->private and convert others to clear
folio->private before freeing.
(b) Use bitX of folio->private to indicate that this has "owner_ops"
meaning. We'll have to make hugetlb not use folio->private and make
others not use bitX. Might be harder and overkill, because right now we
only really need the callback when refcount==0.
(c) Use some other indication that folio->private contains folio_ops.
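A very rough sketch of (a) above, with made-up names (folio_get_private()
is the only existing helper used here):

static bool folio_free_notify_owner(struct folio *folio)
{
	/* (a): a non-NULL ->private at refcount 0 means "call the owner". */
	const struct folio_owner_ops *ops = folio_get_private(folio);

	if (!ops)
		return false;

	ops->free(folio);
	return true;
}

__folio_put()/folios_put_refs() would call that instead of carrying
per-type special cases.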
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops
2024-11-12 14:22 ` David Hildenbrand
@ 2024-11-13 4:57 ` Matthew Wilcox
2024-11-13 11:27 ` David Hildenbrand
2024-11-14 4:02 ` John Hubbard
0 siblings, 2 replies; 23+ messages in thread
From: Matthew Wilcox @ 2024-11-13 4:57 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jason Gunthorpe, Fuad Tabba, linux-mm, kvm, nouveau, dri-devel,
rppt, jglisse, akpm, muchun.song, simona, airlied, pbonzini,
seanjc, jhubbard, ackerleytng, vannapurve, mail, kirill.shutemov,
quic_eberman, maz, will, qperret, keirf, roypat
On Tue, Nov 12, 2024 at 03:22:46PM +0100, David Hildenbrand wrote:
> On 12.11.24 14:53, Jason Gunthorpe wrote:
> > On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
> > > On 12.11.24 06:26, Matthew Wilcox wrote:
> > > > I don't want you to respin. I think this is a bad idea.
> > >
> > > I'm hoping you'll find some more time to explain what exactly you don't
> > > like, because this series only refactors what we already have.
> > >
> > > I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c.
I don't. The list of 'if's is better than the indirect function call.
That's terribly expensive, and the way we reuse the lru.next field
is fragile. Not to mention that it introduces a new thing for the
hardening people to fret over.
> > And, IMHO, seems like overkill. We have only a handful of cases -
> > maybe we shouldn't be trying to get to full generality but just handle
> > a couple of cases directly? I don't really think it is such a bad
> > thing to have an if ladder on the free path if we have only a couple
> > things. Certainly it looks good instead of doing overlaying tricks.
>
> I'd really like to abstract hugetlb handling if possible. The way it stands
> it's just very odd.
There might be ways to make that better. I haven't really been looking
too hard at making that special handling go away.
> We'll need some reliable way to identify these folios that need care.
> guest_memfd will be using folio->mapcount for now, so for now we couldn't
> set a page type like hugetlb does.
If hugetlb can set lru.next at a certain point, then guestmemfd could
set a page type at a similar point, no?
> > Also how does this translate to Matthew's memdesc world?
In a memdesc world, pages no longer have a refcount. We might still
have put_page() which will now be a very complicated (and out-of-line)
function that looks up what kind of memdesc it is and operates on the
memdesc's refcount ... if it has one. I don't know if it'll be exported
to modules; I can see uses in the mm code, but I'm not sure if modules
will have a need.
Each memdesc type will have its own function to call to free the memdesc.
So we'll still have folio_put(). But slab does not have, need nor want
a refcount, so it'll just slab_free(). I expect us to keep around a
list of recently-freed memdescs of a particular type with their pages
still attached so that we can allocate them again quickly (or reclaim
them under memory pressure). Once that freelist overflows, we'll free
a batch of them to the buddy allocator (for the pages) and the slab
allocator (for the memdesc itself).
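Hand-waving sketch of that dispatch; every name below is invented, none
of it exists:

void put_page(struct page *page)
{
	struct memdesc *desc = page_memdesc(page);	/* invented lookup */

	switch (memdesc_type(desc)) {
	case MEMDESC_FOLIO:
		folio_put(memdesc_to_folio(desc));	/* folios keep a refcount */
		break;
	case MEMDESC_SLAB:
		/* slab has no refcount; nothing to put */
		break;
	default:
		/* ... other memdesc types ... */
		break;
	}
}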
> guest_memfd and hugetlb would be operating on folios (at least for now),
> which contain the refcount,lru,private, ... so nothing special there.
>
> Once we actually decoupled "struct folio" from "struct page", we *might*
> have to play less tricks, because we could just have a callback pointer
> there. But well, maybe we also want to save some space in there.
>
> Do we want dedicated memdescs for hugetlb/guest_memfd that extend folios in
> the future? I don't know, maybe.
I've certainly considered going so far as a per-fs folio. So we'd
have an ext4_folio, a btrfs_folio, an iomap_folio, etc. That'd let us
get rid of folio->private, but I'm not sure that C's type system can
really handle this nicely. Maybe in a Rust world ;-)
What I'm thinking about is that I'd really like to be able to declare
that all the functions in ext4_aops only accept pointers to ext4_folio,
so ext4_dirty_folio() can't be called with pointers to _any_ folio,
but specifically folios which were previously allocated for ext4.
I don't know if Rust lets you do something like that.
> I'm currently wondering if we can use folio->private for the time being.
> Either
>
> (a) If folio->private is still set once the refcount drops to 0, it
> indicates that there is a freeing callback/owner_ops. We'll have to make
> hugetlb not use folio->private and convert others to clear folio->private
> before freeing.
>
> (b) Use bitX of folio->private to indicate that this has "owner_ops"
> meaning. We'll have to make hugetlb not use folio->private and make others
> not use bitX. Might be harder and overkill, because right now we only really
> need the callback when refcount==0.
>
> (c) Use some other indication that folio->private contains folio_ops.
I really don't want to use folio_ops / folio_owner_ops. I read
https://lore.kernel.org/all/CAGtprH_JP2w-4rq02h_Ugvq5KuHX7TUvegOS7xUs_iy5hriE7g@mail.gmail.com/
and I still don't understand what you're trying to do.
Would it work to use aops->free_folio() to notify you when the folio is
being removed from the address space?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops
2024-11-13 4:57 ` Matthew Wilcox
@ 2024-11-13 11:27 ` David Hildenbrand
2024-11-14 4:02 ` John Hubbard
1 sibling, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2024-11-13 11:27 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jason Gunthorpe, Fuad Tabba, linux-mm, kvm, nouveau, dri-devel,
rppt, jglisse, akpm, muchun.song, simona, airlied, pbonzini,
seanjc, jhubbard, ackerleytng, vannapurve, mail, kirill.shutemov,
quic_eberman, maz, will, qperret, keirf, roypat
On 13.11.24 05:57, Matthew Wilcox wrote:
> On Tue, Nov 12, 2024 at 03:22:46PM +0100, David Hildenbrand wrote:
>> On 12.11.24 14:53, Jason Gunthorpe wrote:
>>> On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
>>>> On 12.11.24 06:26, Matthew Wilcox wrote:
>>>>> I don't want you to respin. I think this is a bad idea.
>>>>
>>>> I'm hoping you'll find some more time to explain what exactly you don't
>>>> like, because this series only refactors what we already have.
>>>>
>>>> I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c.
>
> I don't. The list of 'if's is better than the indirect function call.
> That's terribly expensive, and the way we reuse the lru.next field
> is fragile. Not to mention that it introduces a new thing for the
> hardening people to fret over.
Right, indirect calls are nasty and probably more fragile/insecure, but this is
really what ZONE_DEVICE is already using internally ... still, I agree that it
is less desirable to abstract that.
[...]
>> We'll need some reliable way to identify these folios that need care.
>> guest_memfd will be using folio->mapcount for now, so for now we couldn't
>> set a page type like hugetlb does.
>
> If hugetlb can set lru.next at a certain point, then guestmemfd could
> set a page type at a similar point, no?
The main problem is that we will have to set it on small folios that can be
mapped to user space. As long as mapcount overlays page_type, that's
... not going to work.
We won't be truncating+freeing the folio as long as it is mapped to user space,
though. So one workaround would be to set the page type only as long as it
isn't mapped to user space.
Won't win a beauty prize, but it could work until we decouple the type
from the mapcount.
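Roughly (pure sketch; assumes the PGTY_guestmem type from the patch
further down and FOLIO_TYPE_OPS()-generated helpers that don't exist yet):

/* Last user-space mapping is gone; the mapcount field is free again. */
static void guestmem_mark_unmapped(struct folio *folio)
{
	if (!folio_mapped(folio))
		__folio_set_guestmem(folio);
}

/* About to map to user space; give the field back to the mapcount. */
static void guestmem_prepare_map(struct folio *folio)
{
	if (folio_test_guestmem(folio))
		__folio_clear_guestmem(folio);
}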
[...]
>
>> guest_memfd and hugetlb would be operating on folios (at least for now),
>> which contain the refcount,lru,private, ... so nothing special there.
>>
>> Once we actually decoupled "struct folio" from "struct page", we *might*
>> have to play less tricks, because we could just have a callback pointer
>> there. But well, maybe we also want to save some space in there.
>>
>> Do we want dedicated memdescs for hugetlb/guest_memfd that extend folios in
>> the future? I don't know, maybe.
>
> I've certainly considered going so far as a per-fs folio. So we'd
> have an ext4_folio, an btrfs_folio, an iomap_folio, etc. That'd let us
> get rid of folio->private, but I'm not sure that C's type system can
> really handle this nicely. Maybe in a Rust world ;-)
:)
>
> What I'm thinking about is that I'd really like to be able to declare
> that all the functions in ext4_aops only accept pointers to ext4_folio,
> so ext4_dirty_folio() can't be called with pointers to _any_ folio,
> but specifically folios which were previously allocated for ext4.
Yes, that sounds reasonable. hugetlb definitely might be another such candidate.
>
> I don't know if Rust lets you do something like that.
>
>> I'm currently wondering if we can use folio->private for the time being.
>> Either
>>
>> (a) If folio->private is still set once the refcount drops to 0, it
>> indicates that there is a freeing callback/owner_ops. We'll have to make
>> hugetlb not use folio->private and convert others to clear folio->private
>> before freeing.
>>
>> (b) Use bitX of folio->private to indicate that this has "owner_ops"
>> meaning. We'll have to make hugetlb not use folio->private and make others
>> not use bitX. Might be harder and overkill, because right now we only really
>> need the callback when refcount==0.
>>
>> (c) Use some other indication that folio->private contains folio_ops.
>
> I really don't want to use folio_ops / folio_owner_ops.
Yes, and I understand your reasoning. It was one approach to work around
the page_type vs. mapcount issue and avoid more checks on the freeing path.
If we manage to make the page type fly, the following could work and leave
the ordinary folio freeing path as fast as before (and allow me to add the
PGTY_offline handling :) ):
From 5a55e4bcf4d6cfa64d3383a7cf6649042cedcecb Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 13 Nov 2024 12:20:58 +0100
Subject: [PATCH] tmp
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/page-flags.h | 11 +++++++++++
mm/swap.c | 27 ++++++++++++++++++++++-----
2 files changed, 33 insertions(+), 5 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e80665bc51fac..ebf89075eeb5f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -950,6 +950,7 @@ enum pagetype {
PGTY_slab = 0xf5,
PGTY_zsmalloc = 0xf6,
PGTY_unaccepted = 0xf7,
+ PGTY_guestmem = 0xf8,
PGTY_mapcount_underflow = 0xff
};
@@ -970,6 +971,16 @@ static inline bool page_has_type(const struct page *page)
return page_mapcount_is_type(data_race(page->page_type));
}
+static inline bool folio_has_type(const struct folio *folio)
+{
+ return page_has_type(&folio->page);
+}
+
+static inline int folio_get_type(const struct folio *folio)
+{
+ return folio->page.page_type >> 24;
+}
+
#define FOLIO_TYPE_OPS(lname, fname) \
static __always_inline bool folio_test_##fname(const struct folio *folio) \
{ \
diff --git a/mm/swap.c b/mm/swap.c
index 10decd9dffa17..bf4efc7bba18a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -94,6 +94,22 @@ static void page_cache_release(struct folio *folio)
unlock_page_lruvec_irqrestore(lruvec, flags);
}
+static void free_typed_folio(struct folio *folio)
+{
+ switch (folio_get_type(folio)) {
+ case PGTY_hugetlb:
+ free_huge_folio(folio);
+ return;
+ case PGTY_offline:
+ /* Nothing to do, it's offline. */
+ return;
+ case PGTY_guestmem:
+ // free_guestmem_folio(folio);
+ default:
+ WARN_ON_ONCE(1);
+ }
+}
+
void __folio_put(struct folio *folio)
{
if (unlikely(folio_is_zone_device(folio))) {
@@ -101,8 +117,8 @@ void __folio_put(struct folio *folio)
return;
}
- if (folio_test_hugetlb(folio)) {
- free_huge_folio(folio);
+ if (unlikely(folio_has_type(folio))) {
+ free_typed_folio(folio);
return;
}
@@ -934,15 +950,16 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
if (!folio_ref_sub_and_test(folio, nr_refs))
continue;
- /* hugetlb has its own memcg */
- if (folio_test_hugetlb(folio)) {
+ if (unlikely(folio_has_type(folio))) {
+ /* typed folios have their own memcg, if any */
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- free_huge_folio(folio);
+ free_typed_folio(folio);
continue;
}
+
folio_unqueue_deferred_split(folio);
__page_cache_release(folio, &lruvec, &flags);
--
2.47.0
> I read
> https://lore.kernel.org/all/CAGtprH_JP2w-4rq02h_Ugvq5KuHX7TUvegOS7xUs_iy5hriE7g@mail.gmail.com/
> and I still don't understand what you're trying to do.
It's a bunch of related issues (e.g., one idea is also having another "global" pool
of large pages that are not owned by hugetlb), but the biggest thing we want to sort
out is the following:
We allocated a large folio from somewhere (hugetlb, another global pool). We might
have to split this folio internally in some cases (for example, because we need
precise per-page refcounts/mapcounts) and we:
(a) Really have to return the whole re-collapsed folio to the original allocator.
(b) Might have to clean up some stuff (e.g., restore original page type) before handing
the folio back to the original allocator.
(c) Want to re-collapse (parts of?) the folio when we don't need the split anymore.
Re-collapsing is one problem because of possible GUP pins (for parts that were
exposed to user space) and speculative references.
As an example during (a): we might have truncated the whole split thing, but some
reference on a split folio prevents us from re-collapsing the whole thing
synchronously during truncation. As soon as that last reference is put,
we can collapse the thing and return it to the original allocator.
Similar during (c), just that we want to "freeze" all refcounts as the last
references are put, so we can just collapse it once we are notified that
it is now possible -- and that no other speculative references can show up as
the refcount is 0.
>
> Would it work to use aops->free_folio() to notify you when the folio is
> being removed from the address space?
We really have to know when the last reference is gone, and intercept that
to clean up and prepare the large folio to go back to its original allocator.
As always, thanks for taking the time and for the detailed response, Willy!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops
2024-11-13 4:57 ` Matthew Wilcox
2024-11-13 11:27 ` David Hildenbrand
@ 2024-11-14 4:02 ` John Hubbard
1 sibling, 0 replies; 23+ messages in thread
From: John Hubbard @ 2024-11-14 4:02 UTC (permalink / raw)
To: Matthew Wilcox, David Hildenbrand
Cc: Jason Gunthorpe, Fuad Tabba, linux-mm, kvm, nouveau, dri-devel,
rppt, jglisse, akpm, muchun.song, simona, airlied, pbonzini,
seanjc, ackerleytng, vannapurve, mail, kirill.shutemov,
quic_eberman, maz, will, qperret, keirf, roypat
On 11/12/24 8:57 PM, Matthew Wilcox wrote:
> On Tue, Nov 12, 2024 at 03:22:46PM +0100, David Hildenbrand wrote:
>> On 12.11.24 14:53, Jason Gunthorpe wrote:
>>> On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
>>>> On 12.11.24 06:26, Matthew Wilcox wrote:
...
> I've certainly considered going so far as a per-fs folio. So we'd
> have an ext4_folio, an btrfs_folio, an iomap_folio, etc. That'd let us
> get rid of folio->private, but I'm not sure that C's type system can
> really handle this nicely. Maybe in a Rust world ;-)
>
> What I'm thinking about is that I'd really like to be able to declare
> that all the functions in ext4_aops only accept pointers to ext4_folio,
> so ext4_dirty_folio() can't be called with pointers to _any_ folio,
> but specifically folios which were previously allocated for ext4.
>
> I don't know if Rust lets you do something like that.
>
As a Rust-for-Linux student, I can answer that one: "yes".
Some combination of "newtypes" and Traits will provide exactly what you
need here. Newtypes provide a zero-overhead, type-safe way of specifying
a type, and Traits can be used to require that only types that support
specific operations are accepted in foo().
(Rust at the language level looks a lot more like a replacement for C++
than a replacement for C, imho. By which I mean, it has lots of goodies
for expressing things, built right into the language.)
thanks,
--
John Hubbard
^ permalink raw reply [flat|nested] 23+ messages in thread