* [PATCH v3 0/6] mm: split underutilized THPs
@ 2024-08-13 12:02 Usama Arif
2024-08-13 12:02 ` [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
` (7 more replies)
0 siblings, 8 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-13 12:02 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Usama Arif
The current upstream default policy for THP is always. However, Meta
uses madvise in production as the current THP=always policy vastly
overprovisions THPs in sparsely accessed memory areas, resulting in
excessive memory pressure and premature OOM killing.
Using madvise + relying on khugepaged has certain drawbacks over
THP=always. Using madvise hints means THPs aren't "transparent" and
require userspace changes. Waiting for khugepaged to scan memory and
collapse pages into THPs can be slow and unpredictable in terms of
performance (i.e. you don't know when the collapse will happen), while
production environments require predictable performance. If there is
enough memory available, it is better for both performance and
predictability to have THPs from fault time, i.e. THP=always, rather
than wait for khugepaged to collapse them, and to deal with sparsely
populated THPs when the system is running out of memory.
This patch series is an attempt to mitigate the issue of running out of
memory when THP is always enabled. During runtime, whenever a THP is
faulted in or collapsed by khugepaged, the THP is added to a list.
Whenever memory reclaim happens, the kernel runs the deferred_split
shrinker, which goes through the list and checks if the THP is
underutilized, i.e. how many of the base 4K pages of the entire THP are
zero-filled. If this number goes above a certain threshold, the shrinker
will attempt to split that THP. Then at remap time, the pages that were
zero-filled are mapped to the shared zeropage, hence saving memory. This
method avoids the downside of wasting memory in areas where THP is
sparsely filled when THP is always enabled, while still providing the
upsides of THPs, like reduced TLB misses, without having to use madvise.
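For reference, below is a condensed sketch of the underutilization check
that patch 5 adds as thp_underutilized(). It is simplified here, with the
khugepaged max_ptes_none threshold passed in explicitly rather than read
from the global, and is not the exact kernel code:

static bool thp_is_underutilized(struct folio *folio, unsigned int max_ptes_none)
{
	int num_zero_pages = 0, num_filled_pages = 0;
	long i;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
		bool is_zero = !memchr_inv(kaddr, 0, PAGE_SIZE);

		kunmap_local(kaddr);
		if (is_zero) {
			/* too many zero-filled subpages: worth splitting */
			if (++num_zero_pages > max_ptes_none)
				return true;
		} else {
			/* enough data-carrying subpages: bail out early */
			if (++num_filled_pages >= HPAGE_PMD_NR - max_ptes_none)
				return false;
		}
	}
	return false;
}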
Meta production workloads that were CPU bound (>99% CPU utilization) were
tested with the THP shrinker. The results after 2 hours are as follows:
                        | THP=madvise | THP=always    | THP=always
                        |             |               | + shrinker series
                        |             |               | + max_ptes_none=409
-----------------------------------------------------------------------------
Performance improvement |      -      |    +1.8%      |     +1.7%
(over THP=madvise)      |             |               |
-----------------------------------------------------------------------------
Memory usage            |    54.6G    | 58.8G (+7.7%) | 55.9G (+2.4%)
-----------------------------------------------------------------------------
max_ptes_none=409 means that any THP that has more than 409 out of its 512
(80%) base pages zero-filled will be split.
To test out the patches, the commands below will invoke the OOM killer
immediately and kill stress without the shrinker, but will not fail with
the shrinker:
echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
echo 20M > /sys/fs/cgroup/test/memory.max
echo 0 > /sys/fs/cgroup/test/memory.swap.max
# allocate twice memory.max for each stress worker and touch 40/512 of
# each THP, i.e. vm-stride 50K.
# With the shrinker, max_ptes_none of 470 and below won't invoke OOM
# killer.
# Without the shrinker, OOM killer is invoked immediately irrespective
# of max_ptes_none value and kills stress.
stress --vm 1 --vm-bytes 40M --vm-stride 50K
v2 -> v3:
- Use my_zero_pfn instead of page_to_pfn(ZERO_PAGE(..)) (Johannes)
- Use flags argument instead of bools in remove_migration_ptes (Johannes)
- Use a new flag in folio->_flags_1 instead of folio->_partially_mapped
(David Hildenbrand).
- Split out the last patch of v2 into 3, one for introducing the flag,
one for splitting underutilized THPs on _deferred_list and one for adding
sysfs entry to disable splitting (David Hildenbrand).
v1 -> v2:
- Turn page checks and operations to folio versions in __split_huge_page.
This means patches 1 and 2 from v1 are no longer needed.
(David Hildenbrand)
- Map to shared zeropage in all cases if the base page is zero-filled.
The uffd selftest was removed.
(David Hildenbrand).
- rename 'dirty' to 'contains_data' in try_to_map_unused_to_zeropage
(Rik van Riel).
- Use unsigned long instead of uint64_t (kernel test robot).
Alexander Zhu (1):
mm: selftest to verify zero-filled pages are mapped to zeropage
Usama Arif (3):
mm: Introduce a pageflag for partially mapped folios
mm: split underutilized THPs
mm: add sysfs entry to disable splitting underutilized THPs
Yu Zhao (2):
mm: free zapped tail pages when splitting isolated thp
mm: remap unused subpages to shared zeropage when splitting isolated
thp
Documentation/admin-guide/mm/transhuge.rst | 6 +
include/linux/huge_mm.h | 4 +-
include/linux/khugepaged.h | 1 +
include/linux/page-flags.h | 3 +
include/linux/rmap.h | 7 +-
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 156 ++++++++++++++++--
mm/hugetlb.c | 1 +
mm/internal.h | 4 +-
mm/khugepaged.c | 3 +-
mm/memcontrol.c | 3 +-
mm/migrate.c | 74 +++++++--
mm/migrate_device.c | 4 +-
mm/page_alloc.c | 5 +-
mm/rmap.c | 3 +-
mm/vmscan.c | 3 +-
mm/vmstat.c | 1 +
.../selftests/mm/split_huge_page_test.c | 71 ++++++++
tools/testing/selftests/mm/vm_util.c | 22 +++
tools/testing/selftests/mm/vm_util.h | 1 +
20 files changed, 333 insertions(+), 40 deletions(-)
--
2.43.5
* [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp
2024-08-13 12:02 [PATCH v3 0/6] mm: split underutilized THPs Usama Arif
@ 2024-08-13 12:02 ` Usama Arif
2024-08-15 18:47 ` Kairui Song
2024-08-13 12:02 ` [PATCH v3 2/6] mm: remap unused subpages to shared zeropage " Usama Arif
` (6 subsequent siblings)
7 siblings, 1 reply; 42+ messages in thread
From: Usama Arif @ 2024-08-13 12:02 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai, Usama Arif
From: Yu Zhao <yuzhao@google.com>
If a tail page has only two references left, one inherited from the
isolation of its head and the other from lru_add_page_tail() which we
are about to drop, it means this tail page was concurrently zapped.
Then we can safely free it and save page reclaim or migration the
trouble of trying it.
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Shuang Zhai <zhais@google.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/huge_memory.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 04ee8abd6475..85a424e954be 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3059,7 +3059,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
unsigned int new_nr = 1 << new_order;
int order = folio_order(folio);
unsigned int nr = 1 << order;
+ struct folio_batch free_folios;
+ folio_batch_init(&free_folios);
/* complete memcg works before add pages to LRU */
split_page_memcg(head, order, new_order);
@@ -3143,6 +3145,26 @@ static void __split_huge_page(struct page *page, struct list_head *list,
if (subpage == page)
continue;
folio_unlock(new_folio);
+ /*
+ * If a folio has only two references left, one inherited
+ * from the isolation of its head and the other from
+ * lru_add_page_tail() which we are about to drop, it means this
+ * folio was concurrently zapped. Then we can safely free it
+ * and save page reclaim or migration the trouble of trying it.
+ */
+ if (list && folio_ref_freeze(new_folio, 2)) {
+ VM_WARN_ON_ONCE_FOLIO(folio_test_lru(new_folio), new_folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_large(new_folio), new_folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_mapped(new_folio), new_folio);
+
+ folio_clear_active(new_folio);
+ folio_clear_unevictable(new_folio);
+ if (!folio_batch_add(&free_folios, folio)) {
+ mem_cgroup_uncharge_folios(&free_folios);
+ free_unref_folios(&free_folios);
+ }
+ continue;
+ }
/*
* Subpages may be freed if there wasn't any mapping
@@ -3153,6 +3175,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
*/
free_page_and_swap_cache(subpage);
}
+
+ if (free_folios.nr) {
+ mem_cgroup_uncharge_folios(&free_folios);
+ free_unref_folios(&free_folios);
+ }
}
/* Racy check whether the huge page can be split */
--
2.43.5
* [PATCH v3 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
2024-08-13 12:02 [PATCH v3 0/6] mm: split underutilized THPs Usama Arif
2024-08-13 12:02 ` [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
@ 2024-08-13 12:02 ` Usama Arif
2024-08-13 12:02 ` [PATCH v3 3/6] mm: selftest to verify zero-filled pages are mapped to zeropage Usama Arif
` (5 subsequent siblings)
7 siblings, 0 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-13 12:02 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai, Usama Arif
From: Yu Zhao <yuzhao@google.com>
Here being unused means containing only zeros and inaccessible to
userspace. When splitting an isolated thp under reclaim or migration,
the unused subpages can be mapped to the shared zeropage, hence saving
memory. This is particularly helpful when the internal
fragmentation of a thp is high, i.e. it has many untouched subpages.
This is also a prerequisite for the THP low utilization shrinker which will
be introduced in later patches, where underutilized THPs are split and
the zero-filled pages are freed, saving memory.
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Shuang Zhai <zhais@google.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/linux/rmap.h | 7 ++++-
mm/huge_memory.c | 8 ++---
mm/migrate.c | 71 ++++++++++++++++++++++++++++++++++++++------
mm/migrate_device.c | 4 +--
4 files changed, 74 insertions(+), 16 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0978c64f49d8..07854d1f9ad6 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
struct vm_area_struct *vma);
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
+enum rmp_flags {
+ RMP_LOCKED = 1 << 0,
+ RMP_USE_SHARED_ZEROPAGE = 1 << 1,
+};
+
+void remove_migration_ptes(struct folio *src, struct folio *dst, int flags);
/*
* rmap_walk_control: To control rmap traversing for specific needs
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 85a424e954be..6df0e9f4f56c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2911,7 +2911,7 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
return false;
}
-static void remap_page(struct folio *folio, unsigned long nr)
+static void remap_page(struct folio *folio, unsigned long nr, int flags)
{
int i = 0;
@@ -2919,7 +2919,7 @@ static void remap_page(struct folio *folio, unsigned long nr)
if (!folio_test_anon(folio))
return;
for (;;) {
- remove_migration_ptes(folio, folio, true);
+ remove_migration_ptes(folio, folio, RMP_LOCKED | flags);
i += folio_nr_pages(folio);
if (i >= nr)
break;
@@ -3129,7 +3129,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
if (nr_dropped)
shmem_uncharge(folio->mapping->host, nr_dropped);
- remap_page(folio, nr);
+ remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
/*
* set page to its compound_head when split to non order-0 pages, so
@@ -3424,7 +3424,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
if (mapping)
xas_unlock(&xas);
local_irq_enable();
- remap_page(folio, folio_nr_pages(folio));
+ remap_page(folio, folio_nr_pages(folio), 0);
ret = -EAGAIN;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 66a5f73ebfdf..3288ac041d03 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -178,13 +178,56 @@ void putback_movable_pages(struct list_head *l)
}
}
+static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
+ struct folio *folio,
+ unsigned long idx)
+{
+ struct page *page = folio_page(folio, idx);
+ bool contains_data;
+ pte_t newpte;
+ void *addr;
+
+ VM_BUG_ON_PAGE(PageCompound(page), page);
+ VM_BUG_ON_PAGE(!PageAnon(page), page);
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
+
+ if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED))
+ return false;
+
+ /*
+ * The pmd entry mapping the old thp was flushed and the pte mapping
+ * this subpage has been non present. If the subpage is only zero-filled
+ * then map it to the shared zeropage.
+ */
+ addr = kmap_local_page(page);
+ contains_data = memchr_inv(addr, 0, PAGE_SIZE);
+ kunmap_local(addr);
+
+ if (contains_data || mm_forbids_zeropage(pvmw->vma->vm_mm))
+ return false;
+
+ newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
+ pvmw->vma->vm_page_prot));
+ set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
+
+ dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
+ return true;
+}
+
+struct rmap_walk_arg {
+ struct folio *folio;
+ bool map_unused_to_zeropage;
+};
+
/*
* Restore a potential migration pte to a working pte entry
*/
static bool remove_migration_pte(struct folio *folio,
- struct vm_area_struct *vma, unsigned long addr, void *old)
+ struct vm_area_struct *vma, unsigned long addr, void *arg)
{
- DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+ struct rmap_walk_arg *rmap_walk_arg = arg;
+ DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
while (page_vma_mapped_walk(&pvmw)) {
rmap_t rmap_flags = RMAP_NONE;
@@ -208,6 +251,9 @@ static bool remove_migration_pte(struct folio *folio,
continue;
}
#endif
+ if (rmap_walk_arg->map_unused_to_zeropage &&
+ try_to_map_unused_to_zeropage(&pvmw, folio, idx))
+ continue;
folio_get(folio);
pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
@@ -286,14 +332,21 @@ static bool remove_migration_pte(struct folio *folio,
* Get rid of all migration entries and replace them by
* references to the indicated page.
*/
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
+void remove_migration_ptes(struct folio *src, struct folio *dst, int flags)
{
+ struct rmap_walk_arg rmap_walk_arg = {
+ .folio = src,
+ .map_unused_to_zeropage = flags & RMP_USE_SHARED_ZEROPAGE,
+ };
+
struct rmap_walk_control rwc = {
.rmap_one = remove_migration_pte,
- .arg = src,
+ .arg = &rmap_walk_arg,
};
- if (locked)
+ VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src != dst), src);
+
+ if (flags & RMP_LOCKED)
rmap_walk_locked(dst, &rwc);
else
rmap_walk(dst, &rwc);
@@ -903,7 +956,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
* At this point we know that the migration attempt cannot
* be successful.
*/
- remove_migration_ptes(folio, folio, false);
+ remove_migration_ptes(folio, folio, 0);
rc = mapping->a_ops->writepage(&folio->page, &wbc);
@@ -1067,7 +1120,7 @@ static void migrate_folio_undo_src(struct folio *src,
struct list_head *ret)
{
if (page_was_mapped)
- remove_migration_ptes(src, src, false);
+ remove_migration_ptes(src, src, 0);
/* Drop an anon_vma reference if we took one */
if (anon_vma)
put_anon_vma(anon_vma);
@@ -1305,7 +1358,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
lru_add_drain();
if (old_page_state & PAGE_WAS_MAPPED)
- remove_migration_ptes(src, dst, false);
+ remove_migration_ptes(src, dst, 0);
out_unlock_both:
folio_unlock(dst);
@@ -1443,7 +1496,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
if (page_was_mapped)
remove_migration_ptes(src,
- rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
+ rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
unlock_put_anon:
folio_unlock(dst);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6d66dc1c6ffa..8f875636b35b 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -424,7 +424,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
continue;
folio = page_folio(page);
- remove_migration_ptes(folio, folio, false);
+ remove_migration_ptes(folio, folio, 0);
src_pfns[i] = 0;
folio_unlock(folio);
@@ -837,7 +837,7 @@ void migrate_device_finalize(unsigned long *src_pfns,
src = page_folio(page);
dst = page_folio(newpage);
- remove_migration_ptes(src, dst, false);
+ remove_migration_ptes(src, dst, 0);
folio_unlock(src);
if (is_zone_device_page(page))
--
2.43.5
* [PATCH v3 3/6] mm: selftest to verify zero-filled pages are mapped to zeropage
2024-08-13 12:02 [PATCH v3 0/6] mm: split underutilized THPs Usama Arif
2024-08-13 12:02 ` [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
2024-08-13 12:02 ` [PATCH v3 2/6] mm: remap unused subpages to shared zeropage " Usama Arif
@ 2024-08-13 12:02 ` Usama Arif
2024-08-13 12:02 ` [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
` (4 subsequent siblings)
7 siblings, 0 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-13 12:02 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Alexander Zhu, Usama Arif
From: Alexander Zhu <alexlzhu@fb.com>
When a THP is split, any subpage that is zero-filled will be mapped
to the shared zeropage, hence saving memory. Add a selftest to verify
this by allocating zero-filled THPs and comparing RssAnon before and
after the split.
Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
.../selftests/mm/split_huge_page_test.c | 71 +++++++++++++++++++
tools/testing/selftests/mm/vm_util.c | 22 ++++++
tools/testing/selftests/mm/vm_util.h | 1 +
3 files changed, 94 insertions(+)
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index e5e8dafc9d94..eb6d1b9fc362 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -84,6 +84,76 @@ static void write_debugfs(const char *fmt, ...)
write_file(SPLIT_DEBUGFS, input, ret + 1);
}
+static char *allocate_zero_filled_hugepage(size_t len)
+{
+ char *result;
+ size_t i;
+
+ result = memalign(pmd_pagesize, len);
+ if (!result) {
+ printf("Fail to allocate memory\n");
+ exit(EXIT_FAILURE);
+ }
+
+ madvise(result, len, MADV_HUGEPAGE);
+
+ for (i = 0; i < len; i++)
+ result[i] = (char)0;
+
+ return result;
+}
+
+static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hpages, size_t len)
+{
+ unsigned long rss_anon_before, rss_anon_after;
+ size_t i;
+
+ if (!check_huge_anon(one_page, 4, pmd_pagesize)) {
+ printf("No THP is allocated\n");
+ exit(EXIT_FAILURE);
+ }
+
+ rss_anon_before = rss_anon();
+ if (!rss_anon_before) {
+ printf("No RssAnon is allocated before split\n");
+ exit(EXIT_FAILURE);
+ }
+
+ /* split all THPs */
+ write_debugfs(PID_FMT, getpid(), (uint64_t)one_page,
+ (uint64_t)one_page + len, 0);
+
+ for (i = 0; i < len; i++)
+ if (one_page[i] != (char)0) {
+ printf("%ld byte corrupted\n", i);
+ exit(EXIT_FAILURE);
+ }
+
+ if (!check_huge_anon(one_page, 0, pmd_pagesize)) {
+ printf("Still AnonHugePages not split\n");
+ exit(EXIT_FAILURE);
+ }
+
+ rss_anon_after = rss_anon();
+ if (rss_anon_after >= rss_anon_before) {
+ printf("Incorrect RssAnon value. Before: %ld After: %ld\n",
+ rss_anon_before, rss_anon_after);
+ exit(EXIT_FAILURE);
+ }
+}
+
+void split_pmd_zero_pages(void)
+{
+ char *one_page;
+ int nr_hpages = 4;
+ size_t len = nr_hpages * pmd_pagesize;
+
+ one_page = allocate_zero_filled_hugepage(len);
+ verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
+ printf("Split zero filled huge pages successful\n");
+ free(one_page);
+}
+
void split_pmd_thp(void)
{
char *one_page;
@@ -431,6 +501,7 @@ int main(int argc, char **argv)
fd_size = 2 * pmd_pagesize;
+ split_pmd_zero_pages();
split_pmd_thp();
split_pte_mapped_thp();
split_file_backed_thp();
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index 5a62530da3b5..d8d0cf04bb57 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -12,6 +12,7 @@
#define PMD_SIZE_FILE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
#define SMAP_FILE_PATH "/proc/self/smaps"
+#define STATUS_FILE_PATH "/proc/self/status"
#define MAX_LINE_LENGTH 500
unsigned int __page_size;
@@ -171,6 +172,27 @@ uint64_t read_pmd_pagesize(void)
return strtoul(buf, NULL, 10);
}
+unsigned long rss_anon(void)
+{
+ unsigned long rss_anon = 0;
+ FILE *fp;
+ char buffer[MAX_LINE_LENGTH];
+
+ fp = fopen(STATUS_FILE_PATH, "r");
+ if (!fp)
+ ksft_exit_fail_msg("%s: Failed to open file %s\n", __func__, STATUS_FILE_PATH);
+
+ if (!check_for_pattern(fp, "RssAnon:", buffer, sizeof(buffer)))
+ goto err_out;
+
+ if (sscanf(buffer, "RssAnon:%10lu kB", &rss_anon) != 1)
+ ksft_exit_fail_msg("Reading status error\n");
+
+err_out:
+ fclose(fp);
+ return rss_anon;
+}
+
bool __check_huge(void *addr, char *pattern, int nr_hpages,
uint64_t hpage_size)
{
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 9007c420d52c..71b75429f4a5 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -39,6 +39,7 @@ unsigned long pagemap_get_pfn(int fd, char *start);
void clear_softdirty(void);
bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len);
uint64_t read_pmd_pagesize(void);
+uint64_t rss_anon(void);
bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size);
bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size);
bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size);
--
2.43.5
* [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-13 12:02 [PATCH v3 0/6] mm: split underutilized THPs Usama Arif
` (2 preceding siblings ...)
2024-08-13 12:02 ` [PATCH v3 3/6] mm: selftest to verify zero-filled pages are mapped to zeropage Usama Arif
@ 2024-08-13 12:02 ` Usama Arif
2024-08-14 3:30 ` Yu Zhao
` (4 more replies)
2024-08-13 12:02 ` [PATCH v3 5/6] mm: split underutilized THPs Usama Arif
` (3 subsequent siblings)
7 siblings, 5 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-13 12:02 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Usama Arif
Currently folio->_deferred_list is used to keep track of
partially mapped folios that are going to be split under memory
pressure. In the next patch, all THPs that are faulted in and collapsed
by khugepaged are also going to be tracked using _deferred_list.
This patch introduces a pageflag to be able to distinguish between
partially mapped folios and others in the deferred_list at split time in
deferred_split_scan. It's needed because __folio_remove_rmap decrements
_mapcount, _large_mapcount and _entire_mapcount, so it won't otherwise be
possible to distinguish between partially mapped folios and others in
deferred_split_scan.
Even though it introduces an extra flag to track whether the folio is
partially mapped, there is no functional change intended with this
patch and the flag is not useful in this patch itself; it will
become useful in the next patch, when _deferred_list contains folios
that are not partially mapped.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/linux/huge_mm.h | 4 ++--
include/linux/page-flags.h | 3 +++
mm/huge_memory.c | 21 +++++++++++++--------
mm/hugetlb.c | 1 +
mm/internal.h | 4 +++-
mm/memcontrol.c | 3 ++-
mm/migrate.c | 3 ++-
mm/page_alloc.c | 5 +++--
mm/rmap.c | 3 ++-
mm/vmscan.c | 3 ++-
10 files changed, 33 insertions(+), 17 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c32058cacfe..969f11f360d2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
{
return split_huge_page_to_list_to_order(page, NULL, 0);
}
-void deferred_split_folio(struct folio *folio);
+void deferred_split_folio(struct folio *folio, bool partially_mapped);
void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address, bool freeze, struct folio *folio);
@@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-static inline void deferred_split_folio(struct folio *folio) {}
+static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a0a29bd092f8..cecc1bad7910 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -182,6 +182,7 @@ enum pageflags {
/* At least one page in this folio has the hwpoison flag set */
PG_has_hwpoisoned = PG_active,
PG_large_rmappable = PG_workingset, /* anon or file-backed */
+ PG_partially_mapped, /* was identified to be partially mapped */
};
#define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
@@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
ClearPageHead(page);
}
FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
+FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
#else
FOLIO_FLAG_FALSE(large_rmappable)
+FOLIO_FLAG_FALSE(partially_mapped)
#endif
#define PG_head_mask ((1UL << PG_head))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6df0e9f4f56c..c024ab0f745c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
* page_deferred_list.
*/
list_del_init(&folio->_deferred_list);
+ folio_clear_partially_mapped(folio);
}
spin_unlock(&ds_queue->split_queue_lock);
if (mapping) {
@@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
if (!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
list_del_init(&folio->_deferred_list);
+ folio_clear_partially_mapped(folio);
}
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}
-void deferred_split_folio(struct folio *folio)
+void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
#ifdef CONFIG_MEMCG
@@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
if (folio_test_swapcache(folio))
return;
- if (!list_empty(&folio->_deferred_list))
- return;
-
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+ if (partially_mapped)
+ folio_set_partially_mapped(folio);
+ else
+ folio_clear_partially_mapped(folio);
if (list_empty(&folio->_deferred_list)) {
- if (folio_test_pmd_mappable(folio))
- count_vm_event(THP_DEFERRED_SPLIT_PAGE);
- count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ if (partially_mapped) {
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+ count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ }
list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
@@ -3541,6 +3546,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
} else {
/* We lost race with folio_put() */
list_del_init(&folio->_deferred_list);
+ folio_clear_partially_mapped(folio);
ds_queue->split_queue_len--;
}
if (!--sc->nr_to_scan)
@@ -3558,7 +3564,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
next:
folio_put(folio);
}
-
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
list_splice_tail(&list, &ds_queue->split_queue);
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1fdd9eab240c..2ae2d9a18e40 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
free_gigantic_folio(folio, huge_page_order(h));
} else {
INIT_LIST_HEAD(&folio->_deferred_list);
+ folio_clear_partially_mapped(folio);
folio_put(folio);
}
}
diff --git a/mm/internal.h b/mm/internal.h
index 52f7fc4e8ac3..d64546b8d377 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -662,8 +662,10 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
atomic_set(&folio->_entire_mapcount, -1);
atomic_set(&folio->_nr_pages_mapped, 0);
atomic_set(&folio->_pincount, 0);
- if (order > 1)
+ if (order > 1) {
INIT_LIST_HEAD(&folio->_deferred_list);
+ folio_clear_partially_mapped(folio);
+ }
}
static inline void prep_compound_tail(struct page *head, int tail_idx)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e1ffd2950393..0fd95daecf9a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4669,7 +4669,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
!folio_test_hugetlb(folio) &&
- !list_empty(&folio->_deferred_list), folio);
+ !list_empty(&folio->_deferred_list) &&
+ folio_test_partially_mapped(folio), folio);
/*
* Nobody should be changing or seriously looking at
diff --git a/mm/migrate.c b/mm/migrate.c
index 3288ac041d03..6e32098ac2dc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1734,7 +1734,8 @@ static int migrate_pages_batch(struct list_head *from,
* use _deferred_list.
*/
if (nr_pages > 2 &&
- !list_empty(&folio->_deferred_list)) {
+ !list_empty(&folio->_deferred_list) &&
+ folio_test_partially_mapped(folio)) {
if (!try_split_folio(folio, split_folios, mode)) {
nr_failed++;
stats->nr_thp_failed += is_thp;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 408ef3d25cf5..a145c550dd2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -957,8 +957,9 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
break;
case 2:
/* the second tail page: deferred_list overlaps ->mapping */
- if (unlikely(!list_empty(&folio->_deferred_list))) {
- bad_page(page, "on deferred list");
+ if (unlikely(!list_empty(&folio->_deferred_list) &&
+ folio_test_partially_mapped(folio))) {
+ bad_page(page, "partially mapped folio on deferred list");
goto out;
}
break;
diff --git a/mm/rmap.c b/mm/rmap.c
index a6b9cd0b2b18..9ad558c2bad0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1579,7 +1579,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
*/
if (partially_mapped && folio_test_anon(folio) &&
list_empty(&folio->_deferred_list))
- deferred_split_folio(folio);
+ deferred_split_folio(folio, true);
+
__folio_mod_stat(folio, -nr, -nr_pmdmapped);
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 25e43bb3b574..25f4e8403f41 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1233,7 +1233,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* Split partially mapped folios right away.
* We can free the unmapped pages without IO.
*/
- if (data_race(!list_empty(&folio->_deferred_list)) &&
+ if (data_race(!list_empty(&folio->_deferred_list) &&
+ folio_test_partially_mapped(folio)) &&
split_folio_to_list(folio, folio_list))
goto activate_locked;
}
--
2.43.5
* [PATCH v3 5/6] mm: split underutilized THPs
2024-08-13 12:02 [PATCH v3 0/6] mm: split underutilized THPs Usama Arif
` (3 preceding siblings ...)
2024-08-13 12:02 ` [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
@ 2024-08-13 12:02 ` Usama Arif
2024-08-13 12:02 ` [PATCH v3 6/6] mm: add sysfs entry to disable splitting " Usama Arif
` (2 subsequent siblings)
7 siblings, 0 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-13 12:02 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Usama Arif
This is an attempt to mitigate the issue of running out of memory when
THP is always enabled. During runtime whenever a THP is being faulted in
(__do_huge_pmd_anonymous_page) or collapsed by khugepaged
(collapse_huge_page), the THP is added to _deferred_list. Whenever
memory reclaim happens in Linux, the kernel runs the deferred_split
shrinker which goes through the _deferred_list.
If the folio was partially mapped, the shrinker attempts to split it.
If the folio is not partially mapped, the shrinker checks if the THP
was underutilized, i.e. how many of the base 4K pages of the entire THP
were zero-filled. If this number goes above a certain threshold (decided by
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the
shrinker will attempt to split that THP. Then at remap time, the pages
that were zero-filled are mapped to the shared zeropage, hence saving
memory.
Suggested-by: Rik van Riel <riel@surriel.com>
Co-authored-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
Documentation/admin-guide/mm/transhuge.rst | 6 ++
include/linux/khugepaged.h | 1 +
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 76 ++++++++++++++++++++--
mm/khugepaged.c | 3 +-
mm/vmstat.c | 1 +
6 files changed, 80 insertions(+), 8 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 058485daf186..60522f49178b 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -447,6 +447,12 @@ thp_deferred_split_page
splitting it would free up some memory. Pages on split queue are
going to be split under memory pressure.
+thp_underutilized_split_page
+ is incremented when a huge page on the split queue was split
+ because it was underutilized. A THP is underutilized if the
+ number of zero pages in the THP is above a certain threshold
+ (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none).
+
thp_split_pmd
is incremented every time a PMD split into table of PTEs.
This can happen, for instance, when application calls mprotect() or
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index f68865e19b0b..30baae91b225 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,6 +4,7 @@
#include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
+extern unsigned int khugepaged_max_ptes_none __read_mostly;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern struct attribute_group khugepaged_attr_group;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index aae5c7c5cfb4..bf1470a7a737 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -105,6 +105,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED,
THP_DEFERRED_SPLIT_PAGE,
+ THP_UNDERUTILIZED_SPLIT_PAGE,
THP_SPLIT_PMD,
THP_SCAN_EXCEED_NONE_PTE,
THP_SCAN_EXCEED_SWAP_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c024ab0f745c..6b32b2d4ab1e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1087,6 +1087,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(vma->vm_mm);
+ deferred_split_folio(folio, false);
spin_unlock(vmf->ptl);
count_vm_event(THP_FAULT_ALLOC);
count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
@@ -3522,6 +3523,39 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
return READ_ONCE(ds_queue->split_queue_len);
}
+static bool thp_underutilized(struct folio *folio)
+{
+ int num_zero_pages = 0, num_filled_pages = 0;
+ void *kaddr;
+ int i;
+
+ if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
+ return false;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
+ if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
+ num_zero_pages++;
+ if (num_zero_pages > khugepaged_max_ptes_none) {
+ kunmap_local(kaddr);
+ return true;
+ }
+ } else {
+ /*
+ * Another path for early exit once the number
+ * of non-zero filled pages exceeds threshold.
+ */
+ num_filled_pages++;
+ if (num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+ kunmap_local(kaddr);
+ return false;
+ }
+ }
+ kunmap_local(kaddr);
+ }
+ return false;
+}
+
static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
@@ -3555,17 +3589,45 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+ bool did_split = false;
+ bool underutilized = false;
+
+ if (folio_test_partially_mapped(folio))
+ goto split;
+ underutilized = thp_underutilized(folio);
+ if (underutilized)
+ goto split;
+ continue;
+split:
if (!folio_trylock(folio))
- goto next;
- /* split_huge_page() removes page from list on success */
- if (!split_folio(folio))
- split++;
+ continue;
+ did_split = !split_folio(folio);
folio_unlock(folio);
-next:
- folio_put(folio);
+ if (did_split) {
+ /* Splitting removed folio from the list, drop reference here */
+ folio_put(folio);
+ if (underutilized)
+ count_vm_event(THP_UNDERUTILIZED_SPLIT_PAGE);
+ split++;
+ }
}
+
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
- list_splice_tail(&list, &ds_queue->split_queue);
+ /*
+ * Only add back to the queue if folio is partially mapped.
+ * If thp_underutilized returns false, or if split_folio fails in
+ * the case it was underutilized, then consider it used and don't
+ * add it back to split_queue.
+ */
+ list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+ if (folio_test_partially_mapped(folio))
+ list_move(&folio->_deferred_list, &ds_queue->split_queue);
+ else {
+ list_del_init(&folio->_deferred_list);
+ ds_queue->split_queue_len--;
+ }
+ folio_put(folio);
+ }
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cdd1d8655a76..02e1463e1a79 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -85,7 +85,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
*
* Note that these are only respected if collapse was initiated by khugepaged.
*/
-static unsigned int khugepaged_max_ptes_none __read_mostly;
+unsigned int khugepaged_max_ptes_none __read_mostly;
static unsigned int khugepaged_max_ptes_swap __read_mostly;
static unsigned int khugepaged_max_ptes_shared __read_mostly;
@@ -1235,6 +1235,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
+ deferred_split_folio(folio, false);
spin_unlock(pmd_ptl);
folio = NULL;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c3a402ea91f0..91cd7d4d482b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1384,6 +1384,7 @@ const char * const vmstat_text[] = {
"thp_split_page",
"thp_split_page_failed",
"thp_deferred_split_page",
+ "thp_underutilized_split_page",
"thp_split_pmd",
"thp_scan_exceed_none_pte",
"thp_scan_exceed_swap_pte",
--
2.43.5
* [PATCH v3 6/6] mm: add sysfs entry to disable splitting underutilized THPs
2024-08-13 12:02 [PATCH v3 0/6] mm: split underutilized THPs Usama Arif
` (4 preceding siblings ...)
2024-08-13 12:02 ` [PATCH v3 5/6] mm: split underutilized THPs Usama Arif
@ 2024-08-13 12:02 ` Usama Arif
2024-08-13 17:22 ` [PATCH v3 0/6] mm: split " Andi Kleen
2024-08-18 5:13 ` Hugh Dickins
7 siblings, 0 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-13 12:02 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Usama Arif
If disabled, THPs faulted in or collapsed will not be added to
_deferred_list, and therefore won't be considered for splitting under
memory pressure if underutilized.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
mm/huge_memory.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6b32b2d4ab1e..b4d72479330d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -74,6 +74,7 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
struct shrink_control *sc);
static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc);
+static bool split_underutilized_thp = true;
static atomic_t huge_zero_refcount;
struct folio *huge_zero_folio __read_mostly;
@@ -439,6 +440,27 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj,
static struct kobj_attribute hpage_pmd_size_attr =
__ATTR_RO(hpage_pmd_size);
+static ssize_t split_underutilized_thp_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", split_underutilized_thp);
+}
+
+static ssize_t split_underutilized_thp_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err = kstrtobool(buf, &split_underutilized_thp);
+
+ if (err < 0)
+ return err;
+
+ return count;
+}
+
+static struct kobj_attribute split_underutilized_thp_attr = __ATTR(
+ thp_low_util_shrinker, 0644, split_underutilized_thp_show, split_underutilized_thp_store);
+
static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
@@ -447,6 +469,7 @@ static struct attribute *hugepage_attr[] = {
#ifdef CONFIG_SHMEM
&shmem_enabled_attr.attr,
#endif
+ &split_underutilized_thp_attr.attr,
NULL,
};
@@ -3475,6 +3498,9 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
if (folio_order(folio) <= 1)
return;
+ if (!partially_mapped && !split_underutilized_thp)
+ return;
+
/*
* The try_to_unmap() in page reclaim path might reach here too,
* this may cause a race condition to corrupt deferred split queue.
--
2.43.5
* Re: [PATCH v3 0/6] mm: split underutilized THPs
2024-08-13 12:02 [PATCH v3 0/6] mm: split underutilized THPs Usama Arif
` (5 preceding siblings ...)
2024-08-13 12:02 ` [PATCH v3 6/6] mm: add sysfs entry to disable splitting " Usama Arif
@ 2024-08-13 17:22 ` Andi Kleen
2024-08-14 10:13 ` Usama Arif
2024-08-18 5:13 ` Hugh Dickins
7 siblings, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2024-08-13 17:22 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, willy,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team
Usama Arif <usamaarif642@gmail.com> writes:
>
> This patch-series is an attempt to mitigate the issue of running out of
> memory when THP is always enabled. During runtime whenever a THP is being
> faulted in or collapsed by khugepaged, the THP is added to a list.
> Whenever memory reclaim happens, the kernel runs the deferred_split
> shrinker which goes through the list and checks if the THP was underutilized,
> i.e. how many of the base 4K pages of the entire THP were zero-filled.
Sometimes when writing a benchmark I fill things with zero explicitly
to avoid faults later. For example if you want to measure memory
read bandwidth you need to fault the pages first, but that fault
pattern may well be zero.
With your patch if there is memory pressure there are two effects:
- If things are remapped to the zero page the benchmark
reading memory may give unrealistically good results because
what it thinks is a big memory area is actually only backed
by a single page.
- If I expect to write I may end up with an unexpected zeropage->real
memory fault if the pages got remapped.
I expect such patterns can happen without benchmarking too.
I could see it being a problem for latency sensitive applications.
Now you could argue that this all should only happen under memory
pressure and when that happens things may be slow anyways and your
patch will still be an improvement.
Maybe that's true but there might still be corner cases
which are negatively impacted by this. I don't have a good solution
other than a tunable, but I expect it will cause problems for someone.
The other problem I have with your patch is that it may cause the kernel
to pollute CPU caches in the background, which again will cause noise in
the system. Instead of plain memchr_inv, you should probably use some
primitive that bypasses caches, or at least an NTA prefetch hint.
-Andi
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-13 12:02 ` [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
@ 2024-08-14 3:30 ` Yu Zhao
2024-08-14 10:20 ` Usama Arif
2024-08-14 10:44 ` Barry Song
` (3 subsequent siblings)
4 siblings, 1 reply; 42+ messages in thread
From: Yu Zhao @ 2024-08-14 3:30 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
david, baohua, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On Tue, Aug 13, 2024 at 6:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> Currently folio->_deferred_list is used to keep track of
> partially_mapped folios that are going to be split under memory
> pressure. In the next patch, all THPs that are faulted in and collapsed
> by khugepaged are also going to be tracked using _deferred_list.
>
> This patch introduces a pageflag to be able to distinguish between
> partially mapped folios and others in the deferred_list at split time in
> deferred_split_scan. Its needed as __folio_remove_rmap decrements
> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
> possible to distinguish between partially mapped folios and others in
> deferred_split_scan.
>
> Eventhough it introduces an extra flag to track if the folio is
> partially mapped, there is no functional change intended with this
> patch and the flag is not useful in this patch itself, it will
> become useful in the next patch when _deferred_list has non partially
> mapped folios.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
> include/linux/huge_mm.h | 4 ++--
> include/linux/page-flags.h | 3 +++
> mm/huge_memory.c | 21 +++++++++++++--------
> mm/hugetlb.c | 1 +
> mm/internal.h | 4 +++-
> mm/memcontrol.c | 3 ++-
> mm/migrate.c | 3 ++-
> mm/page_alloc.c | 5 +++--
> mm/rmap.c | 3 ++-
> mm/vmscan.c | 3 ++-
> 10 files changed, 33 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4c32058cacfe..969f11f360d2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
> {
> return split_huge_page_to_list_to_order(page, NULL, 0);
> }
> -void deferred_split_folio(struct folio *folio);
> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
>
> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long address, bool freeze, struct folio *folio);
> @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
> {
> return 0;
> }
> -static inline void deferred_split_folio(struct folio *folio) {}
> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> #define split_huge_pmd(__vma, __pmd, __address) \
> do { } while (0)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index a0a29bd092f8..cecc1bad7910 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -182,6 +182,7 @@ enum pageflags {
> /* At least one page in this folio has the hwpoison flag set */
> PG_has_hwpoisoned = PG_active,
> PG_large_rmappable = PG_workingset, /* anon or file-backed */
> + PG_partially_mapped, /* was identified to be partially mapped */
> };
>
> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
> @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
> ClearPageHead(page);
> }
> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
> #else
> FOLIO_FLAG_FALSE(large_rmappable)
> +FOLIO_FLAG_FALSE(partially_mapped)
> #endif
>
> #define PG_head_mask ((1UL << PG_head))
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6df0e9f4f56c..c024ab0f745c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> * page_deferred_list.
> */
> list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> }
> spin_unlock(&ds_queue->split_queue_lock);
> if (mapping) {
> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
> if (!list_empty(&folio->_deferred_list)) {
> ds_queue->split_queue_len--;
> list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> }
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> }
>
> -void deferred_split_folio(struct folio *folio)
> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> #ifdef CONFIG_MEMCG
> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
> if (folio_test_swapcache(folio))
> return;
>
> - if (!list_empty(&folio->_deferred_list))
> - return;
> -
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> + if (partially_mapped)
> + folio_set_partially_mapped(folio);
> + else
> + folio_clear_partially_mapped(folio);
> if (list_empty(&folio->_deferred_list)) {
> - if (folio_test_pmd_mappable(folio))
> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> + if (partially_mapped) {
> + if (folio_test_pmd_mappable(folio))
> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> + }
> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> ds_queue->split_queue_len++;
> #ifdef CONFIG_MEMCG
> @@ -3541,6 +3546,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> } else {
> /* We lost race with folio_put() */
> list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> ds_queue->split_queue_len--;
> }
> if (!--sc->nr_to_scan)
> @@ -3558,7 +3564,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> next:
> folio_put(folio);
> }
> -
Unintentional change above?
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> list_splice_tail(&list, &ds_queue->split_queue);
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1fdd9eab240c..2ae2d9a18e40 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> free_gigantic_folio(folio, huge_page_order(h));
> } else {
> INIT_LIST_HEAD(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
Why does it need to clear a flag that should never be set on hugeTLB folios?
HugeTLB doesn't really use _deferred_list -- it clears it only to avoid
bad_page() because of the overlapping fields:
void *_hugetlb_subpool;
void *_hugetlb_cgroup;
* Re: [PATCH v3 0/6] mm: split underutilized THPs
2024-08-13 17:22 ` [PATCH v3 0/6] mm: split " Andi Kleen
@ 2024-08-14 10:13 ` Usama Arif
0 siblings, 0 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-14 10:13 UTC (permalink / raw)
To: Andi Kleen
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, willy,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team
On 13/08/2024 18:22, Andi Kleen wrote:
> Usama Arif <usamaarif642@gmail.com> writes:
>>
>> This patch-series is an attempt to mitigate the issue of running out of
>> memory when THP is always enabled. During runtime whenever a THP is being
>> faulted in or collapsed by khugepaged, the THP is added to a list.
>> Whenever memory reclaim happens, the kernel runs the deferred_split
>> shrinker which goes through the list and checks if the THP was underutilized,
>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>
> Sometimes when writing a benchmark I fill things with zero explictly
> to avoid faults later. For example if you want to measure memory
> read bandwidth you need to fault the pages first, but that fault
> pattern may well be zero.
>
> With your patch if there is memory pressure there are two effects:
>
> - If things are remapped to the zero page the benchmark
> reading memory may give unrealistically good results because
> what is thinks is a big memory area is actually only backed
> by a single page.
>
> - If I expect to write I may end up with an unexpected zeropage->real
> memory fault if the pages got remapped.
>
> I expect such patterns can happen without benchmarking too.
> I could see it being a problem for latency sensitive applications.
>
> Now you could argue that this all should only happen under memory
> pressure and when that happens things may be slow anyways and your
> patch will still be an improvement.
>
> Maybe that's true but there might be still corner cases
> which are negatively impacted by this. I don't have a good solution
> other than a tunable, but I expect it will cause problems for someone.
>
There are currently two knobs to control the behaviour of the THP low utilization shrinker introduced in this series.
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none:
The current default value for this is HPAGE_PMD_NR - 1 (511 for x86). If set to 511, the shrinker will immediately remove the folio from the deferred_list (please see the first if statement in thp_underutilized in patch 5) and a split is not attempted. Not a single page is checked at this point and there are no memory accesses done to impact performance.
If someone sets its to 510, it will exit as soon as a single page containing non-zero data is encountered (the else part in thp_underutilized).
/sys/kernel/mm/transparent_hugepage/thp_low_util_shrinker:
Introduced in patch 6, if someone really doesn't want to enable the shrinker, they can set this to false. The folio will not be added to the _deferred_list at fault or collapse time, and it will be as if these patches didn't exist. Personally, I don't think it's absolutely necessary to have this, but I added it in case someone comes up with some corner case.
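To make this concrete, here is a minimal sketch of how the two knobs gate the work, using the names introduced in patches 5 and 6. shrinker_gates_allow_scan is a hypothetical helper for illustration only; in the actual patches the first check lives in deferred_split_folio() and the second in thp_underutilized():

static bool shrinker_gates_allow_scan(bool partially_mapped)
{
	/* thp_low_util_shrinker=false: fully mapped THPs are never queued */
	if (!partially_mapped && !split_underutilized_thp)
		return false;

	/*
	 * max_ptes_none at its default of HPAGE_PMD_NR - 1 (511 on x86)
	 * makes the underutilization check bail out before touching a
	 * single subpage, so nothing is read at shrink time.
	 */
	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
		return false;

	return true;
}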
For the first effect you mentioned, with the default behaviour of the patches with max_ptes_none set to 511, there will be no splitting of THPs, so you will get the same performance as without the series.
If there is some benchmark that allocates all of the system memory as zero pages (causing the shrinker to run), and someone has changed max_ptes_none, kept thp_low_util_shrinker enabled, and all the benchmark does is read those pages, thus giving good memory results, then that benchmark is not really useful; the good results it gives are not unrealistic but a result of these patches. The stress example I have in the cover letter is one such case. With these patches you can run stress, or any other benchmark that behaves like this, and still run other applications that consume memory at the same time, so the improvement is not unrealistic.
For the second effect, of memory faults affecting latency sensitive applications: if THP is always enabled and such applications are running out of memory, causing the shrinker to run, then the higher priority should be to have memory to run (which the shrinker will provide) rather than stalling for memory, which creates memory pressure that will result in latency spikes and possibly the OOM killer being invoked and killing the application.
I think we should focus on real world applications for which I have posted numbers in the cover letter and not tailor this for some benchmarks. If there is some real world low latency application where you could show these patches causing an issue, I would be happy to look into it. But again, with the default max_ptes_none of 511, it wouldn't.
> The other problem I have with your patch is that it may cause the kernel
> to pollute CPU caches in the background, which again will cause noise in
> the system. Instead of plain memchr_inv, you should probably use some
> primitive to bypass caches or use a NTA prefetch hint at least.
>
A few points on this:
- the page is checked in 2 places, at shrink time and at split time, so having the page in cache is useful and needed.
- something like this is already done in the kernel when there is memory pressure, e.g. at swap time [1]. It's not memchr_inv, but it does exactly the same thing as memchr_inv.
- At the time the shrinker runs, one of the highest priorities of the kernel/system is to get free memory. We should not try to make this slower by messing around with caches.
I think the current behaviour in the patches is good because of the above points. But also, I don't think there is a standard way of doing an NTA prefetch across all architectures: the x86 prefetch does it [2], but the arm64 prefetch [3] does pldl1keep, i.e. keeps the data in L1 cache, which is the opposite of what an NTA prefetch is intended to do (a rough sketch of this portability point follows the links below).
[1] https://elixir.bootlin.com/linux/v6.10.4/source/mm/zswap.c#L1390
[2] https://elixir.bootlin.com/linux/v6.10.4/source/arch/x86/include/asm/processor.h#L614
[3] https://elixir.bootlin.com/linux/v6.10.4/source/arch/arm64/include/asm/processor.h#L360
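To illustrate that, here is a rough sketch of a zero-page scan built around the kernel's generic prefetch() hint; the helper name and loop structure are made up for illustration, and the hint's effect is architecture dependent (prefetchnta on x86 [2], pldl1keep on arm64 [3]), which is exactly why it cannot be relied upon to bypass the caches:

#include <linux/mm.h>
#include <linux/prefetch.h>
#include <linux/string.h>

/* Sketch only: scan a kmapped page for non-zero data, hinting ahead. */
static bool page_is_zero_filled_sketch(const void *kaddr)
{
	const unsigned long *p = kaddr;
	size_t i, words = PAGE_SIZE / sizeof(*p);

	for (i = 0; i < words; i++) {
		if (i + 8 < words)
			prefetch(&p[i + 8]);	/* arch-dependent: NTA on x86, L1-keep on arm64 */
		if (p[i])
			return false;
	}
	return true;
}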
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-14 3:30 ` Yu Zhao
@ 2024-08-14 10:20 ` Usama Arif
0 siblings, 0 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-14 10:20 UTC (permalink / raw)
To: Yu Zhao
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
david, baohua, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On 14/08/2024 04:30, Yu Zhao wrote:
>> @@ -3558,7 +3564,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>> next:
>> folio_put(folio);
>> }
>> -
>
> Unintentional change above?
Yes, unintended new line, will fix it.
>
>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> list_splice_tail(&list, &ds_queue->split_queue);
>> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 1fdd9eab240c..2ae2d9a18e40 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>> free_gigantic_folio(folio, huge_page_order(h));
>> } else {
>> INIT_LIST_HEAD(&folio->_deferred_list);
>> + folio_clear_partially_mapped(folio);
>
> Why does it need to clear a flag that should never be set on hugeTLB folios?
>
> HugeTLB doesn't really use _deferred_list -- it clears it only to avoid
> bad_page() because of the overlapping fields:
> void *_hugetlb_subpool;
> void *_hugetlb_cgroup;
Yes, that's right, will remove it. Thanks!
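For reference, the overlap Yu is describing looks roughly like this in struct folio (an abbreviated, paraphrased excerpt, not a verbatim copy of include/linux/mm_types.h):

struct folio {
	/* ... first and second struct page areas elided ... */
	union {
		struct {
			unsigned long _flags_2;
			unsigned long _head_2;
			void *_hugetlb_subpool;		/* overlaps _deferred_list.next */
			void *_hugetlb_cgroup;		/* overlaps _deferred_list.prev */
			void *_hugetlb_cgroup_rsvd;
			void *_hugetlb_hwpoison;
		};
		struct {
			unsigned long _flags_2a;
			unsigned long _head_2a;
			struct list_head _deferred_list;
		};
		struct page __page_2;
	};
};

which is why hugeTLB re-initializes _deferred_list before freeing, while the partially_mapped page flag (kept in the second page's flags, not in this union) is never set for hugeTLB folios and so does not need clearing here.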
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-13 12:02 ` [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
2024-08-14 3:30 ` Yu Zhao
@ 2024-08-14 10:44 ` Barry Song
2024-08-14 10:52 ` Barry Song
2024-08-14 11:11 ` Usama Arif
2024-08-14 11:10 ` Barry Song
` (2 subsequent siblings)
4 siblings, 2 replies; 42+ messages in thread
From: Barry Song @ 2024-08-14 10:44 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On Wed, Aug 14, 2024 at 12:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> Currently folio->_deferred_list is used to keep track of
> partially_mapped folios that are going to be split under memory
> pressure. In the next patch, all THPs that are faulted in and collapsed
> by khugepaged are also going to be tracked using _deferred_list.
>
> This patch introduces a pageflag to be able to distinguish between
> partially mapped folios and others in the deferred_list at split time in
> deferred_split_scan. Its needed as __folio_remove_rmap decrements
> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
> possible to distinguish between partially mapped folios and others in
> deferred_split_scan.
>
> Eventhough it introduces an extra flag to track if the folio is
> partially mapped, there is no functional change intended with this
> patch and the flag is not useful in this patch itself, it will
> become useful in the next patch when _deferred_list has non partially
> mapped folios.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
> include/linux/huge_mm.h | 4 ++--
> include/linux/page-flags.h | 3 +++
> mm/huge_memory.c | 21 +++++++++++++--------
> mm/hugetlb.c | 1 +
> mm/internal.h | 4 +++-
> mm/memcontrol.c | 3 ++-
> mm/migrate.c | 3 ++-
> mm/page_alloc.c | 5 +++--
> mm/rmap.c | 3 ++-
> mm/vmscan.c | 3 ++-
> 10 files changed, 33 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4c32058cacfe..969f11f360d2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
> {
> return split_huge_page_to_list_to_order(page, NULL, 0);
> }
> -void deferred_split_folio(struct folio *folio);
> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
>
> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long address, bool freeze, struct folio *folio);
> @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
> {
> return 0;
> }
> -static inline void deferred_split_folio(struct folio *folio) {}
> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> #define split_huge_pmd(__vma, __pmd, __address) \
> do { } while (0)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index a0a29bd092f8..cecc1bad7910 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -182,6 +182,7 @@ enum pageflags {
> /* At least one page in this folio has the hwpoison flag set */
> PG_has_hwpoisoned = PG_active,
> PG_large_rmappable = PG_workingset, /* anon or file-backed */
> + PG_partially_mapped, /* was identified to be partially mapped */
> };
>
> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
> @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
> ClearPageHead(page);
> }
> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
> #else
> FOLIO_FLAG_FALSE(large_rmappable)
> +FOLIO_FLAG_FALSE(partially_mapped)
> #endif
>
> #define PG_head_mask ((1UL << PG_head))
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6df0e9f4f56c..c024ab0f745c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> * page_deferred_list.
> */
> list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> }
> spin_unlock(&ds_queue->split_queue_lock);
> if (mapping) {
> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
> if (!list_empty(&folio->_deferred_list)) {
> ds_queue->split_queue_len--;
> list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> }
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> }
>
> -void deferred_split_folio(struct folio *folio)
> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> #ifdef CONFIG_MEMCG
> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
> if (folio_test_swapcache(folio))
> return;
>
> - if (!list_empty(&folio->_deferred_list))
> - return;
> -
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> + if (partially_mapped)
> + folio_set_partially_mapped(folio);
> + else
> + folio_clear_partially_mapped(folio);
Hi Usama,
Do we need this? When can a partially_mapped folio on deferred_list become
non-partially-mapped and need a clear? I understand transferring from
entirely_mapped to partially_mapped is a one-way process? partially_mapped
folios can't go back to entirely_mapped?
I am trying to rebase my NR_SPLIT_DEFERRED counter on top of your work,
but this "clear" makes that job quite tricky, as I am not sure if this
patch is going to clear the partially_mapped flag for folios on the
deferred_list.
Otherwise, I can simply do the below whenever folio is leaving deferred_list
ds_queue->split_queue_len--;
list_del_init(&folio->_deferred_list);
if (folio_test_clear_partially_mapped(folio))
mod_mthp_stat(folio_order(folio),
MTHP_STAT_NR_SPLIT_DEFERRED, -1);
> if (list_empty(&folio->_deferred_list)) {
> - if (folio_test_pmd_mappable(folio))
> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> + if (partially_mapped) {
> + if (folio_test_pmd_mappable(folio))
> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> + }
> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> ds_queue->split_queue_len++;
> #ifdef CONFIG_MEMCG
> @@ -3541,6 +3546,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> } else {
> /* We lost race with folio_put() */
> list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> ds_queue->split_queue_len--;
> }
> if (!--sc->nr_to_scan)
> @@ -3558,7 +3564,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> next:
> folio_put(folio);
> }
> -
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> list_splice_tail(&list, &ds_queue->split_queue);
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1fdd9eab240c..2ae2d9a18e40 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> free_gigantic_folio(folio, huge_page_order(h));
> } else {
> INIT_LIST_HEAD(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> folio_put(folio);
> }
> }
> diff --git a/mm/internal.h b/mm/internal.h
> index 52f7fc4e8ac3..d64546b8d377 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -662,8 +662,10 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
> atomic_set(&folio->_entire_mapcount, -1);
> atomic_set(&folio->_nr_pages_mapped, 0);
> atomic_set(&folio->_pincount, 0);
> - if (order > 1)
> + if (order > 1) {
> INIT_LIST_HEAD(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> + }
> }
>
> static inline void prep_compound_tail(struct page *head, int tail_idx)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e1ffd2950393..0fd95daecf9a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4669,7 +4669,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> !folio_test_hugetlb(folio) &&
> - !list_empty(&folio->_deferred_list), folio);
> + !list_empty(&folio->_deferred_list) &&
> + folio_test_partially_mapped(folio), folio);
>
> /*
> * Nobody should be changing or seriously looking at
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 3288ac041d03..6e32098ac2dc 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1734,7 +1734,8 @@ static int migrate_pages_batch(struct list_head *from,
> * use _deferred_list.
> */
> if (nr_pages > 2 &&
> - !list_empty(&folio->_deferred_list)) {
> + !list_empty(&folio->_deferred_list) &&
> + folio_test_partially_mapped(folio)) {
> if (!try_split_folio(folio, split_folios, mode)) {
> nr_failed++;
> stats->nr_thp_failed += is_thp;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 408ef3d25cf5..a145c550dd2a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -957,8 +957,9 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
> break;
> case 2:
> /* the second tail page: deferred_list overlaps ->mapping */
> - if (unlikely(!list_empty(&folio->_deferred_list))) {
> - bad_page(page, "on deferred list");
> + if (unlikely(!list_empty(&folio->_deferred_list) &&
> + folio_test_partially_mapped(folio))) {
> + bad_page(page, "partially mapped folio on deferred list");
> goto out;
> }
> break;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a6b9cd0b2b18..9ad558c2bad0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1579,7 +1579,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> */
> if (partially_mapped && folio_test_anon(folio) &&
> list_empty(&folio->_deferred_list))
> - deferred_split_folio(folio);
> + deferred_split_folio(folio, true);
> +
> __folio_mod_stat(folio, -nr, -nr_pmdmapped);
>
> /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 25e43bb3b574..25f4e8403f41 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1233,7 +1233,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> * Split partially mapped folios right away.
> * We can free the unmapped pages without IO.
> */
> - if (data_race(!list_empty(&folio->_deferred_list)) &&
> + if (data_race(!list_empty(&folio->_deferred_list) &&
> + folio_test_partially_mapped(folio)) &&
> split_folio_to_list(folio, folio_list))
> goto activate_locked;
> }
> --
> 2.43.5
>
Thanks
Barry
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-14 10:44 ` Barry Song
@ 2024-08-14 10:52 ` Barry Song
2024-08-14 11:11 ` Usama Arif
1 sibling, 0 replies; 42+ messages in thread
From: Barry Song @ 2024-08-14 10:52 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On Wed, Aug 14, 2024 at 10:44 PM Barry Song <baohua@kernel.org> wrote:
>
> On Wed, Aug 14, 2024 at 12:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >
> > Currently folio->_deferred_list is used to keep track of
> > partially_mapped folios that are going to be split under memory
> > pressure. In the next patch, all THPs that are faulted in and collapsed
> > by khugepaged are also going to be tracked using _deferred_list.
> >
> > This patch introduces a pageflag to be able to distinguish between
> > partially mapped folios and others in the deferred_list at split time in
> > deferred_split_scan. Its needed as __folio_remove_rmap decrements
> > _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
> > possible to distinguish between partially mapped folios and others in
> > deferred_split_scan.
> >
> > Eventhough it introduces an extra flag to track if the folio is
> > partially mapped, there is no functional change intended with this
> > patch and the flag is not useful in this patch itself, it will
> > become useful in the next patch when _deferred_list has non partially
> > mapped folios.
> >
> > Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> > ---
> > include/linux/huge_mm.h | 4 ++--
> > include/linux/page-flags.h | 3 +++
> > mm/huge_memory.c | 21 +++++++++++++--------
> > mm/hugetlb.c | 1 +
> > mm/internal.h | 4 +++-
> > mm/memcontrol.c | 3 ++-
> > mm/migrate.c | 3 ++-
> > mm/page_alloc.c | 5 +++--
> > mm/rmap.c | 3 ++-
> > mm/vmscan.c | 3 ++-
> > 10 files changed, 33 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 4c32058cacfe..969f11f360d2 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
> > {
> > return split_huge_page_to_list_to_order(page, NULL, 0);
> > }
> > -void deferred_split_folio(struct folio *folio);
> > +void deferred_split_folio(struct folio *folio, bool partially_mapped);
> >
> > void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > unsigned long address, bool freeze, struct folio *folio);
> > @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
> > {
> > return 0;
> > }
> > -static inline void deferred_split_folio(struct folio *folio) {}
> > +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> > #define split_huge_pmd(__vma, __pmd, __address) \
> > do { } while (0)
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index a0a29bd092f8..cecc1bad7910 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -182,6 +182,7 @@ enum pageflags {
> > /* At least one page in this folio has the hwpoison flag set */
> > PG_has_hwpoisoned = PG_active,
> > PG_large_rmappable = PG_workingset, /* anon or file-backed */
> > + PG_partially_mapped, /* was identified to be partially mapped */
> > };
> >
> > #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
> > @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
> > ClearPageHead(page);
> > }
> > FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
> > +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
> > #else
> > FOLIO_FLAG_FALSE(large_rmappable)
> > +FOLIO_FLAG_FALSE(partially_mapped)
> > #endif
> >
> > #define PG_head_mask ((1UL << PG_head))
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 6df0e9f4f56c..c024ab0f745c 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> > * page_deferred_list.
> > */
> > list_del_init(&folio->_deferred_list);
> > + folio_clear_partially_mapped(folio);
> > }
> > spin_unlock(&ds_queue->split_queue_lock);
> > if (mapping) {
> > @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
> > if (!list_empty(&folio->_deferred_list)) {
> > ds_queue->split_queue_len--;
> > list_del_init(&folio->_deferred_list);
> > + folio_clear_partially_mapped(folio);
> > }
> > spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> > }
> >
> > -void deferred_split_folio(struct folio *folio)
> > +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> > {
> > struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> > #ifdef CONFIG_MEMCG
> > @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
> > if (folio_test_swapcache(folio))
> > return;
> >
> > - if (!list_empty(&folio->_deferred_list))
> > - return;
> > -
> > spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > + if (partially_mapped)
> > + folio_set_partially_mapped(folio);
> > + else
> > + folio_clear_partially_mapped(folio);
>
> Hi Usama,
>
> Do we need this? When can a partially_mapped folio on deferred_list become
> non-partially-mapped and need a clear? I understand transferring from
> entirely_map
> to partially_mapped is a one way process? partially_mapped folios can't go back
> to entirely_mapped?
>
> I am trying to rebase my NR_SPLIT_DEFERRED counter on top of your
> work, but this "clear" makes that job quite tricky. as I am not sure
> if this patch
> is going to clear the partially_mapped flag for folios on deferred_list.
>
> Otherwise, I can simply do the below whenever folio is leaving deferred_list
>
> ds_queue->split_queue_len--;
> list_del_init(&folio->_deferred_list);
> if (folio_test_clear_partially_mapped(folio))
> mod_mthp_stat(folio_order(folio),
> MTHP_STAT_NR_SPLIT_DEFERRED, -1);
On the other hand, I can still make things correct with the below code,
but it looks much more tricky. I believe we at least need the first one,
folio_test_set_partially_mapped(), because folios on the deferred_list
can become partially_mapped from entirely_mapped.
Not quite sure if we need the second folio_test_clear_partially_mapped(folio)
in deferred_split_folio(). My understanding is that it is impossible and
this folio_clear_partially_mapped() is probably redundant.
@@ -3515,10 +3522,13 @@ void deferred_split_folio(struct folio *folio,
bool partially_mapped)
return;
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
- if (partially_mapped)
- folio_set_partially_mapped(folio);
- else
- folio_clear_partially_mapped(folio);
+ if (partially_mapped) {
+ if (!folio_test_set_partially_mapped(folio))
+ mod_mthp_stat(folio_order(folio),
+ MTHP_STAT_NR_SPLIT_DEFERRED, 1);
+ } else if (folio_test_clear_partially_mapped(folio)) {
+ mod_mthp_stat(folio_order(folio),
MTHP_STAT_NR_SPLIT_DEFERRED, -1);
+ }
if (list_empty(&folio->_deferred_list)) {
if (partially_mapped) {
if (folio_test_pmd_mappable(folio))
>
> > if (list_empty(&folio->_deferred_list)) {
> > - if (folio_test_pmd_mappable(folio))
> > - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> > - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> > + if (partially_mapped) {
> > + if (folio_test_pmd_mappable(folio))
> > + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> > + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> > + }
> > list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> > ds_queue->split_queue_len++;
> > #ifdef CONFIG_MEMCG
> > @@ -3541,6 +3546,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> > } else {
> > /* We lost race with folio_put() */
> > list_del_init(&folio->_deferred_list);
> > + folio_clear_partially_mapped(folio);
> > ds_queue->split_queue_len--;
> > }
> > if (!--sc->nr_to_scan)
> > @@ -3558,7 +3564,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> > next:
> > folio_put(folio);
> > }
> > -
> > spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > list_splice_tail(&list, &ds_queue->split_queue);
> > spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 1fdd9eab240c..2ae2d9a18e40 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> > free_gigantic_folio(folio, huge_page_order(h));
> > } else {
> > INIT_LIST_HEAD(&folio->_deferred_list);
> > + folio_clear_partially_mapped(folio);
> > folio_put(folio);
> > }
> > }
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 52f7fc4e8ac3..d64546b8d377 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -662,8 +662,10 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
> > atomic_set(&folio->_entire_mapcount, -1);
> > atomic_set(&folio->_nr_pages_mapped, 0);
> > atomic_set(&folio->_pincount, 0);
> > - if (order > 1)
> > + if (order > 1) {
> > INIT_LIST_HEAD(&folio->_deferred_list);
> > + folio_clear_partially_mapped(folio);
> > + }
> > }
> >
> > static inline void prep_compound_tail(struct page *head, int tail_idx)
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e1ffd2950393..0fd95daecf9a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4669,7 +4669,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> > VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> > VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> > !folio_test_hugetlb(folio) &&
> > - !list_empty(&folio->_deferred_list), folio);
> > + !list_empty(&folio->_deferred_list) &&
> > + folio_test_partially_mapped(folio), folio);
> >
> > /*
> > * Nobody should be changing or seriously looking at
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 3288ac041d03..6e32098ac2dc 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1734,7 +1734,8 @@ static int migrate_pages_batch(struct list_head *from,
> > * use _deferred_list.
> > */
> > if (nr_pages > 2 &&
> > - !list_empty(&folio->_deferred_list)) {
> > + !list_empty(&folio->_deferred_list) &&
> > + folio_test_partially_mapped(folio)) {
> > if (!try_split_folio(folio, split_folios, mode)) {
> > nr_failed++;
> > stats->nr_thp_failed += is_thp;
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 408ef3d25cf5..a145c550dd2a 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -957,8 +957,9 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
> > break;
> > case 2:
> > /* the second tail page: deferred_list overlaps ->mapping */
> > - if (unlikely(!list_empty(&folio->_deferred_list))) {
> > - bad_page(page, "on deferred list");
> > + if (unlikely(!list_empty(&folio->_deferred_list) &&
> > + folio_test_partially_mapped(folio))) {
> > + bad_page(page, "partially mapped folio on deferred list");
> > goto out;
> > }
> > break;
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index a6b9cd0b2b18..9ad558c2bad0 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1579,7 +1579,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> > */
> > if (partially_mapped && folio_test_anon(folio) &&
> > list_empty(&folio->_deferred_list))
> > - deferred_split_folio(folio);
> > + deferred_split_folio(folio, true);
> > +
> > __folio_mod_stat(folio, -nr, -nr_pmdmapped);
> >
> > /*
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 25e43bb3b574..25f4e8403f41 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1233,7 +1233,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > * Split partially mapped folios right away.
> > * We can free the unmapped pages without IO.
> > */
> > - if (data_race(!list_empty(&folio->_deferred_list)) &&
> > + if (data_race(!list_empty(&folio->_deferred_list) &&
> > + folio_test_partially_mapped(folio)) &&
> > split_folio_to_list(folio, folio_list))
> > goto activate_locked;
> > }
> > --
> > 2.43.5
> >
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-13 12:02 ` [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
2024-08-14 3:30 ` Yu Zhao
2024-08-14 10:44 ` Barry Song
@ 2024-08-14 11:10 ` Barry Song
2024-08-14 11:20 ` Usama Arif
2024-08-15 16:33 ` David Hildenbrand
2024-08-16 15:44 ` Matthew Wilcox
4 siblings, 1 reply; 42+ messages in thread
From: Barry Song @ 2024-08-14 11:10 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On Wed, Aug 14, 2024 at 12:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> Currently folio->_deferred_list is used to keep track of
> partially_mapped folios that are going to be split under memory
> pressure. In the next patch, all THPs that are faulted in and collapsed
> by khugepaged are also going to be tracked using _deferred_list.
>
> This patch introduces a pageflag to be able to distinguish between
> partially mapped folios and others in the deferred_list at split time in
> deferred_split_scan. Its needed as __folio_remove_rmap decrements
> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
> possible to distinguish between partially mapped folios and others in
> deferred_split_scan.
>
> Eventhough it introduces an extra flag to track if the folio is
> partially mapped, there is no functional change intended with this
> patch and the flag is not useful in this patch itself, it will
> become useful in the next patch when _deferred_list has non partially
> mapped folios.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
> include/linux/huge_mm.h | 4 ++--
> include/linux/page-flags.h | 3 +++
> mm/huge_memory.c | 21 +++++++++++++--------
> mm/hugetlb.c | 1 +
> mm/internal.h | 4 +++-
> mm/memcontrol.c | 3 ++-
> mm/migrate.c | 3 ++-
> mm/page_alloc.c | 5 +++--
> mm/rmap.c | 3 ++-
> mm/vmscan.c | 3 ++-
> 10 files changed, 33 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4c32058cacfe..969f11f360d2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
> {
> return split_huge_page_to_list_to_order(page, NULL, 0);
> }
> -void deferred_split_folio(struct folio *folio);
> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
>
> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long address, bool freeze, struct folio *folio);
> @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
> {
> return 0;
> }
> -static inline void deferred_split_folio(struct folio *folio) {}
> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> #define split_huge_pmd(__vma, __pmd, __address) \
> do { } while (0)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index a0a29bd092f8..cecc1bad7910 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -182,6 +182,7 @@ enum pageflags {
> /* At least one page in this folio has the hwpoison flag set */
> PG_has_hwpoisoned = PG_active,
> PG_large_rmappable = PG_workingset, /* anon or file-backed */
> + PG_partially_mapped, /* was identified to be partially mapped */
> };
>
> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
> @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
> ClearPageHead(page);
> }
> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
> #else
> FOLIO_FLAG_FALSE(large_rmappable)
> +FOLIO_FLAG_FALSE(partially_mapped)
> #endif
>
> #define PG_head_mask ((1UL << PG_head))
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6df0e9f4f56c..c024ab0f745c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> * page_deferred_list.
> */
> list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> }
> spin_unlock(&ds_queue->split_queue_lock);
> if (mapping) {
> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
> if (!list_empty(&folio->_deferred_list)) {
> ds_queue->split_queue_len--;
> list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> }
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> }
>
> -void deferred_split_folio(struct folio *folio)
> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> #ifdef CONFIG_MEMCG
> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
> if (folio_test_swapcache(folio))
> return;
>
> - if (!list_empty(&folio->_deferred_list))
> - return;
> -
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> + if (partially_mapped)
> + folio_set_partially_mapped(folio);
> + else
> + folio_clear_partially_mapped(folio);
> if (list_empty(&folio->_deferred_list)) {
> - if (folio_test_pmd_mappable(folio))
> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> + if (partially_mapped) {
> + if (folio_test_pmd_mappable(folio))
> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
This code completely broke MTHP_STAT_SPLIT_DEFERRED for PMD_ORDER. It
added the folio to the deferred_list as entirely_mapped
(partially_mapped == false). However, when partially_mapped later becomes
true, there's no opportunity to add (and count) it again, as it is already
on the list. Are you consistently seeing the counter for PMD_ORDER as 0?
> + }
> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> ds_queue->split_queue_len++;
> #ifdef CONFIG_MEMCG
> @@ -3541,6 +3546,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> } else {
> /* We lost race with folio_put() */
> list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> ds_queue->split_queue_len--;
> }
> if (!--sc->nr_to_scan)
> @@ -3558,7 +3564,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> next:
> folio_put(folio);
> }
> -
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> list_splice_tail(&list, &ds_queue->split_queue);
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1fdd9eab240c..2ae2d9a18e40 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> free_gigantic_folio(folio, huge_page_order(h));
> } else {
> INIT_LIST_HEAD(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> folio_put(folio);
> }
> }
> diff --git a/mm/internal.h b/mm/internal.h
> index 52f7fc4e8ac3..d64546b8d377 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -662,8 +662,10 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
> atomic_set(&folio->_entire_mapcount, -1);
> atomic_set(&folio->_nr_pages_mapped, 0);
> atomic_set(&folio->_pincount, 0);
> - if (order > 1)
> + if (order > 1) {
> INIT_LIST_HEAD(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> + }
> }
>
> static inline void prep_compound_tail(struct page *head, int tail_idx)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e1ffd2950393..0fd95daecf9a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4669,7 +4669,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> !folio_test_hugetlb(folio) &&
> - !list_empty(&folio->_deferred_list), folio);
> + !list_empty(&folio->_deferred_list) &&
> + folio_test_partially_mapped(folio), folio);
>
> /*
> * Nobody should be changing or seriously looking at
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 3288ac041d03..6e32098ac2dc 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1734,7 +1734,8 @@ static int migrate_pages_batch(struct list_head *from,
> * use _deferred_list.
> */
> if (nr_pages > 2 &&
> - !list_empty(&folio->_deferred_list)) {
> + !list_empty(&folio->_deferred_list) &&
> + folio_test_partially_mapped(folio)) {
> if (!try_split_folio(folio, split_folios, mode)) {
> nr_failed++;
> stats->nr_thp_failed += is_thp;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 408ef3d25cf5..a145c550dd2a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -957,8 +957,9 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
> break;
> case 2:
> /* the second tail page: deferred_list overlaps ->mapping */
> - if (unlikely(!list_empty(&folio->_deferred_list))) {
> - bad_page(page, "on deferred list");
> + if (unlikely(!list_empty(&folio->_deferred_list) &&
> + folio_test_partially_mapped(folio))) {
> + bad_page(page, "partially mapped folio on deferred list");
> goto out;
> }
> break;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a6b9cd0b2b18..9ad558c2bad0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1579,7 +1579,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> */
> if (partially_mapped && folio_test_anon(folio) &&
> list_empty(&folio->_deferred_list))
> - deferred_split_folio(folio);
> + deferred_split_folio(folio, true);
> +
> __folio_mod_stat(folio, -nr, -nr_pmdmapped);
>
> /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 25e43bb3b574..25f4e8403f41 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1233,7 +1233,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> * Split partially mapped folios right away.
> * We can free the unmapped pages without IO.
> */
> - if (data_race(!list_empty(&folio->_deferred_list)) &&
> + if (data_race(!list_empty(&folio->_deferred_list) &&
> + folio_test_partially_mapped(folio)) &&
> split_folio_to_list(folio, folio_list))
> goto activate_locked;
> }
> --
> 2.43.5
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-14 10:44 ` Barry Song
2024-08-14 10:52 ` Barry Song
@ 2024-08-14 11:11 ` Usama Arif
2024-08-14 11:20 ` Barry Song
1 sibling, 1 reply; 42+ messages in thread
From: Usama Arif @ 2024-08-14 11:11 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On 14/08/2024 11:44, Barry Song wrote:
> On Wed, Aug 14, 2024 at 12:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> Currently folio->_deferred_list is used to keep track of
>> partially_mapped folios that are going to be split under memory
>> pressure. In the next patch, all THPs that are faulted in and collapsed
>> by khugepaged are also going to be tracked using _deferred_list.
>>
>> This patch introduces a pageflag to be able to distinguish between
>> partially mapped folios and others in the deferred_list at split time in
>> deferred_split_scan. Its needed as __folio_remove_rmap decrements
>> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
>> possible to distinguish between partially mapped folios and others in
>> deferred_split_scan.
>>
>> Eventhough it introduces an extra flag to track if the folio is
>> partially mapped, there is no functional change intended with this
>> patch and the flag is not useful in this patch itself, it will
>> become useful in the next patch when _deferred_list has non partially
>> mapped folios.
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>> include/linux/huge_mm.h | 4 ++--
>> include/linux/page-flags.h | 3 +++
>> mm/huge_memory.c | 21 +++++++++++++--------
>> mm/hugetlb.c | 1 +
>> mm/internal.h | 4 +++-
>> mm/memcontrol.c | 3 ++-
>> mm/migrate.c | 3 ++-
>> mm/page_alloc.c | 5 +++--
>> mm/rmap.c | 3 ++-
>> mm/vmscan.c | 3 ++-
>> 10 files changed, 33 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 4c32058cacfe..969f11f360d2 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
>> {
>> return split_huge_page_to_list_to_order(page, NULL, 0);
>> }
>> -void deferred_split_folio(struct folio *folio);
>> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
>>
>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> unsigned long address, bool freeze, struct folio *folio);
>> @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
>> {
>> return 0;
>> }
>> -static inline void deferred_split_folio(struct folio *folio) {}
>> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>> #define split_huge_pmd(__vma, __pmd, __address) \
>> do { } while (0)
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index a0a29bd092f8..cecc1bad7910 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -182,6 +182,7 @@ enum pageflags {
>> /* At least one page in this folio has the hwpoison flag set */
>> PG_has_hwpoisoned = PG_active,
>> PG_large_rmappable = PG_workingset, /* anon or file-backed */
>> + PG_partially_mapped, /* was identified to be partially mapped */
>> };
>>
>> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
>> @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
>> ClearPageHead(page);
>> }
>> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
>> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
>> #else
>> FOLIO_FLAG_FALSE(large_rmappable)
>> +FOLIO_FLAG_FALSE(partially_mapped)
>> #endif
>>
>> #define PG_head_mask ((1UL << PG_head))
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 6df0e9f4f56c..c024ab0f745c 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> * page_deferred_list.
>> */
>> list_del_init(&folio->_deferred_list);
>> + folio_clear_partially_mapped(folio);
>> }
>> spin_unlock(&ds_queue->split_queue_lock);
>> if (mapping) {
>> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
>> if (!list_empty(&folio->_deferred_list)) {
>> ds_queue->split_queue_len--;
>> list_del_init(&folio->_deferred_list);
>> + folio_clear_partially_mapped(folio);
>> }
>> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> }
>>
>> -void deferred_split_folio(struct folio *folio)
>> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
>> {
>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> #ifdef CONFIG_MEMCG
>> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
>> if (folio_test_swapcache(folio))
>> return;
>>
>> - if (!list_empty(&folio->_deferred_list))
>> - return;
>> -
>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> + if (partially_mapped)
>> + folio_set_partially_mapped(folio);
>> + else
>> + folio_clear_partially_mapped(folio);
>
> Hi Usama,
>
> Do we need this? When can a partially_mapped folio on deferred_list become
> non-partially-mapped and need a clear? I understand transferring from
> entirely_map
> to partially_mapped is a one way process? partially_mapped folios can't go back
> to entirely_mapped?
>
Hi Barry,
The deferred_split_folio function is called in 3 places after this series: at fault, at collapse, and on partial mapping. Partial mapping can only happen after fault/collapse, and we have FOLIO_FLAG_FALSE(partially_mapped), i.e. the flag is initialized to false, so technically the clear is not needed. A partially_mapped folio on the deferred list won't become non-partially-mapped.
I just did it as a precaution in case someone ever changes the kernel and calls deferred_split_folio with partially_mapped set to false after it had been true. The function arguments of deferred_split_folio make it seem that passing partially_mapped=false would clear the flag, which is why I cleared it as well. I could change the patch to something like the below if it makes things better, i.e. add a comment at the top of the function:
-void deferred_split_folio(struct folio *folio)
+/* partially_mapped=false won't clear PG_partially_mapped folio flag */
+void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
#ifdef CONFIG_MEMCG
@@ -3485,14 +3488,15 @@ void deferred_split_folio(struct folio *folio)
if (folio_test_swapcache(folio))
return;
- if (!list_empty(&folio->_deferred_list))
- return;
-
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+ if (partially_mapped)
+ folio_set_partially_mapped(folio);
if (list_empty(&folio->_deferred_list)) {
- if (folio_test_pmd_mappable(folio))
- count_vm_event(THP_DEFERRED_SPLIT_PAGE);
- count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ if (partially_mapped) {
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+ count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ }
list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
> I am trying to rebase my NR_SPLIT_DEFERRED counter on top of your
> work, but this "clear" makes that job quite tricky. as I am not sure
> if this patch
> is going to clear the partially_mapped flag for folios on deferred_list.
>
> Otherwise, I can simply do the below whenever folio is leaving deferred_list
>
> ds_queue->split_queue_len--;
> list_del_init(&folio->_deferred_list);
> if (folio_test_clear_partially_mapped(folio))
> mod_mthp_stat(folio_order(folio),
> MTHP_STAT_NR_SPLIT_DEFERRED, -1);
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-14 11:11 ` Usama Arif
@ 2024-08-14 11:20 ` Barry Song
2024-08-14 11:26 ` Barry Song
2024-08-14 11:30 ` Usama Arif
0 siblings, 2 replies; 42+ messages in thread
From: Barry Song @ 2024-08-14 11:20 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On Wed, Aug 14, 2024 at 11:11 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 14/08/2024 11:44, Barry Song wrote:
> > On Wed, Aug 14, 2024 at 12:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >> Currently folio->_deferred_list is used to keep track of
> >> partially_mapped folios that are going to be split under memory
> >> pressure. In the next patch, all THPs that are faulted in and collapsed
> >> by khugepaged are also going to be tracked using _deferred_list.
> >>
> >> This patch introduces a pageflag to be able to distinguish between
> >> partially mapped folios and others in the deferred_list at split time in
> >> deferred_split_scan. Its needed as __folio_remove_rmap decrements
> >> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
> >> possible to distinguish between partially mapped folios and others in
> >> deferred_split_scan.
> >>
> >> Eventhough it introduces an extra flag to track if the folio is
> >> partially mapped, there is no functional change intended with this
> >> patch and the flag is not useful in this patch itself, it will
> >> become useful in the next patch when _deferred_list has non partially
> >> mapped folios.
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >> ---
> >> include/linux/huge_mm.h | 4 ++--
> >> include/linux/page-flags.h | 3 +++
> >> mm/huge_memory.c | 21 +++++++++++++--------
> >> mm/hugetlb.c | 1 +
> >> mm/internal.h | 4 +++-
> >> mm/memcontrol.c | 3 ++-
> >> mm/migrate.c | 3 ++-
> >> mm/page_alloc.c | 5 +++--
> >> mm/rmap.c | 3 ++-
> >> mm/vmscan.c | 3 ++-
> >> 10 files changed, 33 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 4c32058cacfe..969f11f360d2 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
> >> {
> >> return split_huge_page_to_list_to_order(page, NULL, 0);
> >> }
> >> -void deferred_split_folio(struct folio *folio);
> >> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
> >>
> >> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >> unsigned long address, bool freeze, struct folio *folio);
> >> @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
> >> {
> >> return 0;
> >> }
> >> -static inline void deferred_split_folio(struct folio *folio) {}
> >> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> >> #define split_huge_pmd(__vma, __pmd, __address) \
> >> do { } while (0)
> >>
> >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >> index a0a29bd092f8..cecc1bad7910 100644
> >> --- a/include/linux/page-flags.h
> >> +++ b/include/linux/page-flags.h
> >> @@ -182,6 +182,7 @@ enum pageflags {
> >> /* At least one page in this folio has the hwpoison flag set */
> >> PG_has_hwpoisoned = PG_active,
> >> PG_large_rmappable = PG_workingset, /* anon or file-backed */
> >> + PG_partially_mapped, /* was identified to be partially mapped */
> >> };
> >>
> >> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
> >> @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
> >> ClearPageHead(page);
> >> }
> >> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
> >> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
> >> #else
> >> FOLIO_FLAG_FALSE(large_rmappable)
> >> +FOLIO_FLAG_FALSE(partially_mapped)
> >> #endif
> >>
> >> #define PG_head_mask ((1UL << PG_head))
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 6df0e9f4f56c..c024ab0f745c 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >> * page_deferred_list.
> >> */
> >> list_del_init(&folio->_deferred_list);
> >> + folio_clear_partially_mapped(folio);
> >> }
> >> spin_unlock(&ds_queue->split_queue_lock);
> >> if (mapping) {
> >> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
> >> if (!list_empty(&folio->_deferred_list)) {
> >> ds_queue->split_queue_len--;
> >> list_del_init(&folio->_deferred_list);
> >> + folio_clear_partially_mapped(folio);
> >> }
> >> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> >> }
> >>
> >> -void deferred_split_folio(struct folio *folio)
> >> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> >> {
> >> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >> #ifdef CONFIG_MEMCG
> >> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
> >> if (folio_test_swapcache(folio))
> >> return;
> >>
> >> - if (!list_empty(&folio->_deferred_list))
> >> - return;
> >> -
> >> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >> + if (partially_mapped)
> >> + folio_set_partially_mapped(folio);
> >> + else
> >> + folio_clear_partially_mapped(folio);
> >
> > Hi Usama,
> >
> > Do we need this? When can a partially_mapped folio on deferred_list become
> > non-partially-mapped and need a clear? I understand transferring from
> > entirely_map
> > to partially_mapped is a one way process? partially_mapped folios can't go back
> > to entirely_mapped?
> >
> Hi Barry,
>
> deferred_split_folio function is called in 3 places after this series, at fault, collapse and partial mapping. partial mapping can only happen after fault/collapse, and we have FOLIO_FLAG_FALSE(partially_mapped), i.e. flag initialized to false, so technically its not needed. A partially_mapped folio on deferred list wont become non-partially mapped.
>
> I just did it as a precaution if someone ever changes the kernel and calls deferred_split_folio with partially_mapped set to false after it had been true. The function arguments of deferred_split_folio make it seem that passing partially_mapped=false as an argument would clear it, which is why I cleared it as well. I could change the patch to something like below if it makes things better? i.e. add a comment at the top of the function:
>
To me, it seems a BUG to call this with false after a folio has been
partially_mapped. So I'd rather have:
VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
The below should also fix the MTHP_STAT_SPLIT_DEFERRED counter this
patch is breaking?
@@ -3515,16 +3522,18 @@ void deferred_split_folio(struct folio *folio,
bool partially_mapped)
return;
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
- if (partially_mapped)
- folio_set_partially_mapped(folio);
- else
- folio_clear_partially_mapped(folio);
- if (list_empty(&folio->_deferred_list)) {
- if (partially_mapped) {
+ if (partially_mapped) {
+ if (!folio_test_set_partially_mapped(folio)) {
+ mod_mthp_stat(folio_order(folio),
+ MTHP_STAT_NR_SPLIT_DEFERRED, 1);
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_DEFERRED_SPLIT_PAGE);
count_mthp_stat(folio_order(folio),
MTHP_STAT_SPLIT_DEFERRED);
}
+ }
+ VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
+
+ if (list_empty(&folio->_deferred_list)) {
list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
>
> -void deferred_split_folio(struct folio *folio)
> +/* partially_mapped=false won't clear PG_partially_mapped folio flag */
> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> #ifdef CONFIG_MEMCG
> @@ -3485,14 +3488,15 @@ void deferred_split_folio(struct folio *folio)
> if (folio_test_swapcache(folio))
> return;
>
> - if (!list_empty(&folio->_deferred_list))
> - return;
> -
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> + if (partially_mapped)
> + folio_set_partially_mapped(folio);
> if (list_empty(&folio->_deferred_list)) {
> - if (folio_test_pmd_mappable(folio))
> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> + if (partially_mapped) {
> + if (folio_test_pmd_mappable(folio))
> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> + }
> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> ds_queue->split_queue_len++;
> #ifdef CONFIG_MEMCG
>
>
> > I am trying to rebase my NR_SPLIT_DEFERRED counter on top of your
> > work, but this "clear" makes that job quite tricky. as I am not sure
> > if this patch
> > is going to clear the partially_mapped flag for folios on deferred_list.
> >
> > Otherwise, I can simply do the below whenever folio is leaving deferred_list
> >
> > ds_queue->split_queue_len--;
> > list_del_init(&folio->_deferred_list);
> > if (folio_test_clear_partially_mapped(folio))
> > mod_mthp_stat(folio_order(folio),
> > MTHP_STAT_NR_SPLIT_DEFERRED, -1);
> >
>
Thanks
Barry
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-14 11:10 ` Barry Song
@ 2024-08-14 11:20 ` Usama Arif
2024-08-14 11:23 ` Barry Song
0 siblings, 1 reply; 42+ messages in thread
From: Usama Arif @ 2024-08-14 11:20 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On 14/08/2024 12:10, Barry Song wrote:
> On Wed, Aug 14, 2024 at 12:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> Currently folio->_deferred_list is used to keep track of
>> partially_mapped folios that are going to be split under memory
>> pressure. In the next patch, all THPs that are faulted in and collapsed
>> by khugepaged are also going to be tracked using _deferred_list.
>>
>> This patch introduces a pageflag to be able to distinguish between
>> partially mapped folios and others in the deferred_list at split time in
>> deferred_split_scan. Its needed as __folio_remove_rmap decrements
>> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
>> possible to distinguish between partially mapped folios and others in
>> deferred_split_scan.
>>
>> Eventhough it introduces an extra flag to track if the folio is
>> partially mapped, there is no functional change intended with this
>> patch and the flag is not useful in this patch itself, it will
>> become useful in the next patch when _deferred_list has non partially
>> mapped folios.
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>> include/linux/huge_mm.h | 4 ++--
>> include/linux/page-flags.h | 3 +++
>> mm/huge_memory.c | 21 +++++++++++++--------
>> mm/hugetlb.c | 1 +
>> mm/internal.h | 4 +++-
>> mm/memcontrol.c | 3 ++-
>> mm/migrate.c | 3 ++-
>> mm/page_alloc.c | 5 +++--
>> mm/rmap.c | 3 ++-
>> mm/vmscan.c | 3 ++-
>> 10 files changed, 33 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 4c32058cacfe..969f11f360d2 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
>> {
>> return split_huge_page_to_list_to_order(page, NULL, 0);
>> }
>> -void deferred_split_folio(struct folio *folio);
>> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
>>
>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> unsigned long address, bool freeze, struct folio *folio);
>> @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
>> {
>> return 0;
>> }
>> -static inline void deferred_split_folio(struct folio *folio) {}
>> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>> #define split_huge_pmd(__vma, __pmd, __address) \
>> do { } while (0)
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index a0a29bd092f8..cecc1bad7910 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -182,6 +182,7 @@ enum pageflags {
>> /* At least one page in this folio has the hwpoison flag set */
>> PG_has_hwpoisoned = PG_active,
>> PG_large_rmappable = PG_workingset, /* anon or file-backed */
>> + PG_partially_mapped, /* was identified to be partially mapped */
>> };
>>
>> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
>> @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
>> ClearPageHead(page);
>> }
>> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
>> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
>> #else
>> FOLIO_FLAG_FALSE(large_rmappable)
>> +FOLIO_FLAG_FALSE(partially_mapped)
>> #endif
>>
>> #define PG_head_mask ((1UL << PG_head))
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 6df0e9f4f56c..c024ab0f745c 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> * page_deferred_list.
>> */
>> list_del_init(&folio->_deferred_list);
>> + folio_clear_partially_mapped(folio);
>> }
>> spin_unlock(&ds_queue->split_queue_lock);
>> if (mapping) {
>> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
>> if (!list_empty(&folio->_deferred_list)) {
>> ds_queue->split_queue_len--;
>> list_del_init(&folio->_deferred_list);
>> + folio_clear_partially_mapped(folio);
>> }
>> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> }
>>
>> -void deferred_split_folio(struct folio *folio)
>> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
>> {
>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> #ifdef CONFIG_MEMCG
>> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
>> if (folio_test_swapcache(folio))
>> return;
>>
>> - if (!list_empty(&folio->_deferred_list))
>> - return;
>> -
>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> + if (partially_mapped)
>> + folio_set_partially_mapped(folio);
>> + else
>> + folio_clear_partially_mapped(folio);
>> if (list_empty(&folio->_deferred_list)) {
>> - if (folio_test_pmd_mappable(folio))
>> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
>> + if (partially_mapped) {
>> + if (folio_test_pmd_mappable(folio))
>> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
>
> This code completely broke MTHP_STAT_SPLIT_DEFERRED for PMD_ORDER. It
> added the folio to the deferred_list as entirely_mapped
> (partially_mapped == false).
> However, when partially_mapped becomes true, there's no opportunity to
> add it again
> as it has been there on the list. Are you consistently seeing the counter for
> PMD_ORDER as 0?
>
Ah I see it, this should fix it?
-void deferred_split_folio(struct folio *folio)
+/* partially_mapped=false won't clear PG_partially_mapped folio flag */
+void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
#ifdef CONFIG_MEMCG
@@ -3485,14 +3488,14 @@ void deferred_split_folio(struct folio *folio)
if (folio_test_swapcache(folio))
return;
- if (!list_empty(&folio->_deferred_list))
- return;
-
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
- if (list_empty(&folio->_deferred_list)) {
+ if (partially_mapped) {
+ folio_set_partially_mapped(folio);
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_DEFERRED_SPLIT_PAGE);
count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ }
+ if (list_empty(&folio->_deferred_list)) {
list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-14 11:20 ` Usama Arif
@ 2024-08-14 11:23 ` Barry Song
2024-08-14 12:36 ` Usama Arif
0 siblings, 1 reply; 42+ messages in thread
From: Barry Song @ 2024-08-14 11:23 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On Wed, Aug 14, 2024 at 11:20 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 14/08/2024 12:10, Barry Song wrote:
> > On Wed, Aug 14, 2024 at 12:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >> Currently folio->_deferred_list is used to keep track of
> >> partially_mapped folios that are going to be split under memory
> >> pressure. In the next patch, all THPs that are faulted in and collapsed
> >> by khugepaged are also going to be tracked using _deferred_list.
> >>
> >> This patch introduces a pageflag to be able to distinguish between
> >> partially mapped folios and others in the deferred_list at split time in
> >> deferred_split_scan. It's needed as __folio_remove_rmap decrements
> >> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
> >> possible to distinguish between partially mapped folios and others in
> >> deferred_split_scan.
> >>
> >> Even though it introduces an extra flag to track if the folio is
> >> partially mapped, there is no functional change intended with this
> >> patch and the flag is not useful in this patch itself, it will
> >> become useful in the next patch when _deferred_list has non partially
> >> mapped folios.
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >> ---
> >> include/linux/huge_mm.h | 4 ++--
> >> include/linux/page-flags.h | 3 +++
> >> mm/huge_memory.c | 21 +++++++++++++--------
> >> mm/hugetlb.c | 1 +
> >> mm/internal.h | 4 +++-
> >> mm/memcontrol.c | 3 ++-
> >> mm/migrate.c | 3 ++-
> >> mm/page_alloc.c | 5 +++--
> >> mm/rmap.c | 3 ++-
> >> mm/vmscan.c | 3 ++-
> >> 10 files changed, 33 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 4c32058cacfe..969f11f360d2 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
> >> {
> >> return split_huge_page_to_list_to_order(page, NULL, 0);
> >> }
> >> -void deferred_split_folio(struct folio *folio);
> >> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
> >>
> >> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >> unsigned long address, bool freeze, struct folio *folio);
> >> @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
> >> {
> >> return 0;
> >> }
> >> -static inline void deferred_split_folio(struct folio *folio) {}
> >> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> >> #define split_huge_pmd(__vma, __pmd, __address) \
> >> do { } while (0)
> >>
> >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >> index a0a29bd092f8..cecc1bad7910 100644
> >> --- a/include/linux/page-flags.h
> >> +++ b/include/linux/page-flags.h
> >> @@ -182,6 +182,7 @@ enum pageflags {
> >> /* At least one page in this folio has the hwpoison flag set */
> >> PG_has_hwpoisoned = PG_active,
> >> PG_large_rmappable = PG_workingset, /* anon or file-backed */
> >> + PG_partially_mapped, /* was identified to be partially mapped */
> >> };
> >>
> >> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
> >> @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
> >> ClearPageHead(page);
> >> }
> >> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
> >> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
> >> #else
> >> FOLIO_FLAG_FALSE(large_rmappable)
> >> +FOLIO_FLAG_FALSE(partially_mapped)
> >> #endif
> >>
> >> #define PG_head_mask ((1UL << PG_head))
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 6df0e9f4f56c..c024ab0f745c 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >> * page_deferred_list.
> >> */
> >> list_del_init(&folio->_deferred_list);
> >> + folio_clear_partially_mapped(folio);
> >> }
> >> spin_unlock(&ds_queue->split_queue_lock);
> >> if (mapping) {
> >> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
> >> if (!list_empty(&folio->_deferred_list)) {
> >> ds_queue->split_queue_len--;
> >> list_del_init(&folio->_deferred_list);
> >> + folio_clear_partially_mapped(folio);
> >> }
> >> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> >> }
> >>
> >> -void deferred_split_folio(struct folio *folio)
> >> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> >> {
> >> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >> #ifdef CONFIG_MEMCG
> >> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
> >> if (folio_test_swapcache(folio))
> >> return;
> >>
> >> - if (!list_empty(&folio->_deferred_list))
> >> - return;
> >> -
> >> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >> + if (partially_mapped)
> >> + folio_set_partially_mapped(folio);
> >> + else
> >> + folio_clear_partially_mapped(folio);
> >> if (list_empty(&folio->_deferred_list)) {
> >> - if (folio_test_pmd_mappable(folio))
> >> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> >> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> >> + if (partially_mapped) {
> >> + if (folio_test_pmd_mappable(folio))
> >> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> >> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> >
> > This code completely broke MTHP_STAT_SPLIT_DEFERRED for PMD_ORDER. It
> > added the folio to the deferred_list as entirely_mapped
> > (partially_mapped == false).
> > However, when partially_mapped becomes true, there's no opportunity to
> > add it again
> > as it has been there on the list. Are you consistently seeing the counter for
> > PMD_ORDER as 0?
> >
>
> Ah I see it, this should fix it?
>
> -void deferred_split_folio(struct folio *folio)
> +/* partially_mapped=false won't clear PG_partially_mapped folio flag */
> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> #ifdef CONFIG_MEMCG
> @@ -3485,14 +3488,14 @@ void deferred_split_folio(struct folio *folio)
> if (folio_test_swapcache(folio))
> return;
>
> - if (!list_empty(&folio->_deferred_list))
> - return;
> -
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> - if (list_empty(&folio->_deferred_list)) {
> + if (partially_mapped) {
> + folio_set_partially_mapped(folio);
> if (folio_test_pmd_mappable(folio))
> count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> + }
> + if (list_empty(&folio->_deferred_list)) {
> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> ds_queue->split_queue_len++;
> #ifdef CONFIG_MEMCG
>
not enough, as deferred_split_folio(true) won't be called if the folio is
already on the deferred_list; see __folio_remove_rmap():
if (partially_mapped && folio_test_anon(folio) &&
list_empty(&folio->_deferred_list))
deferred_split_folio(folio, true);
so you will still see 0.
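i.e. the sequence I'm worried about, roughly (sketched as comments, going by
the v3 series as-is):

/*
 *   fault or khugepaged collapse:
 *     deferred_split_folio(folio, false)
 *       -> folio is added to _deferred_list, no DEFERRED_SPLIT stat bump
 *
 *   later partial unmap:
 *     __folio_remove_rmap()
 *       -> list_empty(&folio->_deferred_list) is false
 *       -> deferred_split_folio(folio, true) is never called
 *       -> THP_DEFERRED_SPLIT_PAGE / MTHP_STAT_SPLIT_DEFERRED stay at 0
 */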
Thanks
Barry
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-14 11:20 ` Barry Song
@ 2024-08-14 11:26 ` Barry Song
2024-08-14 11:30 ` Usama Arif
1 sibling, 0 replies; 42+ messages in thread
From: Barry Song @ 2024-08-14 11:26 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On Wed, Aug 14, 2024 at 11:20 PM Barry Song <baohua@kernel.org> wrote:
>
> On Wed, Aug 14, 2024 at 11:11 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >
> >
> >
> > On 14/08/2024 11:44, Barry Song wrote:
> > > On Wed, Aug 14, 2024 at 12:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > >>
> > >> Currently folio->_deferred_list is used to keep track of
> > >> partially_mapped folios that are going to be split under memory
> > >> pressure. In the next patch, all THPs that are faulted in and collapsed
> > >> by khugepaged are also going to be tracked using _deferred_list.
> > >>
> > >> This patch introduces a pageflag to be able to distinguish between
> > >> partially mapped folios and others in the deferred_list at split time in
> > >> deferred_split_scan. It's needed as __folio_remove_rmap decrements
> > >> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
> > >> possible to distinguish between partially mapped folios and others in
> > >> deferred_split_scan.
> > >>
> > >> Even though it introduces an extra flag to track if the folio is
> > >> partially mapped, there is no functional change intended with this
> > >> patch and the flag is not useful in this patch itself, it will
> > >> become useful in the next patch when _deferred_list has non partially
> > >> mapped folios.
> > >>
> > >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> > >> ---
> > >> include/linux/huge_mm.h | 4 ++--
> > >> include/linux/page-flags.h | 3 +++
> > >> mm/huge_memory.c | 21 +++++++++++++--------
> > >> mm/hugetlb.c | 1 +
> > >> mm/internal.h | 4 +++-
> > >> mm/memcontrol.c | 3 ++-
> > >> mm/migrate.c | 3 ++-
> > >> mm/page_alloc.c | 5 +++--
> > >> mm/rmap.c | 3 ++-
> > >> mm/vmscan.c | 3 ++-
> > >> 10 files changed, 33 insertions(+), 17 deletions(-)
> > >>
> > >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > >> index 4c32058cacfe..969f11f360d2 100644
> > >> --- a/include/linux/huge_mm.h
> > >> +++ b/include/linux/huge_mm.h
> > >> @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
> > >> {
> > >> return split_huge_page_to_list_to_order(page, NULL, 0);
> > >> }
> > >> -void deferred_split_folio(struct folio *folio);
> > >> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
> > >>
> > >> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > >> unsigned long address, bool freeze, struct folio *folio);
> > >> @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
> > >> {
> > >> return 0;
> > >> }
> > >> -static inline void deferred_split_folio(struct folio *folio) {}
> > >> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> > >> #define split_huge_pmd(__vma, __pmd, __address) \
> > >> do { } while (0)
> > >>
> > >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > >> index a0a29bd092f8..cecc1bad7910 100644
> > >> --- a/include/linux/page-flags.h
> > >> +++ b/include/linux/page-flags.h
> > >> @@ -182,6 +182,7 @@ enum pageflags {
> > >> /* At least one page in this folio has the hwpoison flag set */
> > >> PG_has_hwpoisoned = PG_active,
> > >> PG_large_rmappable = PG_workingset, /* anon or file-backed */
> > >> + PG_partially_mapped, /* was identified to be partially mapped */
> > >> };
> > >>
> > >> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
> > >> @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
> > >> ClearPageHead(page);
> > >> }
> > >> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
> > >> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
> > >> #else
> > >> FOLIO_FLAG_FALSE(large_rmappable)
> > >> +FOLIO_FLAG_FALSE(partially_mapped)
> > >> #endif
> > >>
> > >> #define PG_head_mask ((1UL << PG_head))
> > >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > >> index 6df0e9f4f56c..c024ab0f745c 100644
> > >> --- a/mm/huge_memory.c
> > >> +++ b/mm/huge_memory.c
> > >> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> > >> * page_deferred_list.
> > >> */
> > >> list_del_init(&folio->_deferred_list);
> > >> + folio_clear_partially_mapped(folio);
> > >> }
> > >> spin_unlock(&ds_queue->split_queue_lock);
> > >> if (mapping) {
> > >> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
> > >> if (!list_empty(&folio->_deferred_list)) {
> > >> ds_queue->split_queue_len--;
> > >> list_del_init(&folio->_deferred_list);
> > >> + folio_clear_partially_mapped(folio);
> > >> }
> > >> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> > >> }
> > >>
> > >> -void deferred_split_folio(struct folio *folio)
> > >> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> > >> {
> > >> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> > >> #ifdef CONFIG_MEMCG
> > >> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
> > >> if (folio_test_swapcache(folio))
> > >> return;
> > >>
> > >> - if (!list_empty(&folio->_deferred_list))
> > >> - return;
> > >> -
> > >> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > >> + if (partially_mapped)
> > >> + folio_set_partially_mapped(folio);
> > >> + else
> > >> + folio_clear_partially_mapped(folio);
> > >
> > > Hi Usama,
> > >
> > > Do we need this? When can a partially_mapped folio on deferred_list become
> > > non-partially-mapped and need a clear? I understand transferring from
> > > entirely_map
> > > to partially_mapped is a one way process? partially_mapped folios can't go back
> > > to entirely_mapped?
> > >
> > Hi Barry,
> >
> > deferred_split_folio function is called in 3 places after this series: at fault, collapse and partial mapping. Partial mapping can only happen after fault/collapse, and we have FOLIO_FLAG_FALSE(partially_mapped), i.e. the flag is initialized to false, so technically it's not needed. A partially_mapped folio on the deferred list won't become non-partially mapped.
> >
> > I just did it as a precaution if someone ever changes the kernel and calls deferred_split_folio with partially_mapped set to false after it had been true. The function arguments of deferred_split_folio make it seem that passing partially_mapped=false as an argument would clear it, which is why I cleared it as well. I could change the patch to something like below if it makes things better? i.e. add a comment at the top of the function:
> >
>
> to me, it seems a BUG to call with false after a folio has been
> partially_mapped. So I'd rather
> VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
>
> The below should also fix the MTHP_STAT_SPLIT_DEFERRED
> counter this patch is breaking?
>
> @@ -3515,16 +3522,18 @@ void deferred_split_folio(struct folio *folio,
> bool partially_mapped)
> return;
>
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> - if (partially_mapped)
> - folio_set_partially_mapped(folio);
> - else
> - folio_clear_partially_mapped(folio);
> - if (list_empty(&folio->_deferred_list)) {
> - if (partially_mapped) {
> + if (partially_mapped) {
> + if (!folio_test_set_partially_mapped(folio)) {
> + mod_mthp_stat(folio_order(folio),
> + MTHP_STAT_NR_SPLIT_DEFERRED, 1);
> if (folio_test_pmd_mappable(folio))
> count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> count_mthp_stat(folio_order(folio),
> MTHP_STAT_SPLIT_DEFERRED);
> }
> + }
> + VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
sorry, I mean:
VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio) &&
!partially_mapped, folio);
> +
> + if (list_empty(&folio->_deferred_list)) {
> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> ds_queue->split_queue_len++;
> #ifdef CONFIG_MEMCG
>
>
> >
> > -void deferred_split_folio(struct folio *folio)
> > +/* partially_mapped=false won't clear PG_partially_mapped folio flag */
> > +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> > {
> > struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> > #ifdef CONFIG_MEMCG
> > @@ -3485,14 +3488,15 @@ void deferred_split_folio(struct folio *folio)
> > if (folio_test_swapcache(folio))
> > return;
> >
> > - if (!list_empty(&folio->_deferred_list))
> > - return;
> > -
> > spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > + if (partially_mapped)
> > + folio_set_partially_mapped(folio);
> > if (list_empty(&folio->_deferred_list)) {
> > - if (folio_test_pmd_mappable(folio))
> > - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> > - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> > + if (partially_mapped) {
> > + if (folio_test_pmd_mappable(folio))
> > + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> > + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> > + }
> > list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> > ds_queue->split_queue_len++;
> > #ifdef CONFIG_MEMCG
> >
> >
> > > I am trying to rebase my NR_SPLIT_DEFERRED counter on top of your
> > > work, but this "clear" makes that job quite tricky. as I am not sure
> > > if this patch
> > > is going to clear the partially_mapped flag for folios on deferred_list.
> > >
> > > Otherwise, I can simply do the below whenever folio is leaving deferred_list
> > >
> > > ds_queue->split_queue_len--;
> > > list_del_init(&folio->_deferred_list);
> > > if (folio_test_clear_partially_mapped(folio))
> > > mod_mthp_stat(folio_order(folio),
> > > MTHP_STAT_NR_SPLIT_DEFERRED, -1);
> > >
> >
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-14 11:20 ` Barry Song
2024-08-14 11:26 ` Barry Song
@ 2024-08-14 11:30 ` Usama Arif
1 sibling, 0 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-14 11:30 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On 14/08/2024 12:20, Barry Song wrote:
> On Wed, Aug 14, 2024 at 11:11 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 14/08/2024 11:44, Barry Song wrote:
>>> On Wed, Aug 14, 2024 at 12:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>> Currently folio->_deferred_list is used to keep track of
>>>> partially_mapped folios that are going to be split under memory
>>>> pressure. In the next patch, all THPs that are faulted in and collapsed
>>>> by khugepaged are also going to be tracked using _deferred_list.
>>>>
>>>> This patch introduces a pageflag to be able to distinguish between
>>>> partially mapped folios and others in the deferred_list at split time in
>>>> deferred_split_scan. It's needed as __folio_remove_rmap decrements
>>>> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
>>>> possible to distinguish between partially mapped folios and others in
>>>> deferred_split_scan.
>>>>
>>>> Even though it introduces an extra flag to track if the folio is
>>>> partially mapped, there is no functional change intended with this
>>>> patch and the flag is not useful in this patch itself, it will
>>>> become useful in the next patch when _deferred_list has non partially
>>>> mapped folios.
>>>>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>> ---
>>>> include/linux/huge_mm.h | 4 ++--
>>>> include/linux/page-flags.h | 3 +++
>>>> mm/huge_memory.c | 21 +++++++++++++--------
>>>> mm/hugetlb.c | 1 +
>>>> mm/internal.h | 4 +++-
>>>> mm/memcontrol.c | 3 ++-
>>>> mm/migrate.c | 3 ++-
>>>> mm/page_alloc.c | 5 +++--
>>>> mm/rmap.c | 3 ++-
>>>> mm/vmscan.c | 3 ++-
>>>> 10 files changed, 33 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 4c32058cacfe..969f11f360d2 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
>>>> {
>>>> return split_huge_page_to_list_to_order(page, NULL, 0);
>>>> }
>>>> -void deferred_split_folio(struct folio *folio);
>>>> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
>>>>
>>>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>>> unsigned long address, bool freeze, struct folio *folio);
>>>> @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
>>>> {
>>>> return 0;
>>>> }
>>>> -static inline void deferred_split_folio(struct folio *folio) {}
>>>> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>>>> #define split_huge_pmd(__vma, __pmd, __address) \
>>>> do { } while (0)
>>>>
>>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>>> index a0a29bd092f8..cecc1bad7910 100644
>>>> --- a/include/linux/page-flags.h
>>>> +++ b/include/linux/page-flags.h
>>>> @@ -182,6 +182,7 @@ enum pageflags {
>>>> /* At least one page in this folio has the hwpoison flag set */
>>>> PG_has_hwpoisoned = PG_active,
>>>> PG_large_rmappable = PG_workingset, /* anon or file-backed */
>>>> + PG_partially_mapped, /* was identified to be partially mapped */
>>>> };
>>>>
>>>> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
>>>> @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
>>>> ClearPageHead(page);
>>>> }
>>>> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
>>>> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
>>>> #else
>>>> FOLIO_FLAG_FALSE(large_rmappable)
>>>> +FOLIO_FLAG_FALSE(partially_mapped)
>>>> #endif
>>>>
>>>> #define PG_head_mask ((1UL << PG_head))
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 6df0e9f4f56c..c024ab0f745c 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> * page_deferred_list.
>>>> */
>>>> list_del_init(&folio->_deferred_list);
>>>> + folio_clear_partially_mapped(folio);
>>>> }
>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>> if (mapping) {
>>>> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
>>>> if (!list_empty(&folio->_deferred_list)) {
>>>> ds_queue->split_queue_len--;
>>>> list_del_init(&folio->_deferred_list);
>>>> + folio_clear_partially_mapped(folio);
>>>> }
>>>> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>>> }
>>>>
>>>> -void deferred_split_folio(struct folio *folio)
>>>> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>>> {
>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>> #ifdef CONFIG_MEMCG
>>>> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
>>>> if (folio_test_swapcache(folio))
>>>> return;
>>>>
>>>> - if (!list_empty(&folio->_deferred_list))
>>>> - return;
>>>> -
>>>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>> + if (partially_mapped)
>>>> + folio_set_partially_mapped(folio);
>>>> + else
>>>> + folio_clear_partially_mapped(folio);
>>>
>>> Hi Usama,
>>>
>>> Do we need this? When can a partially_mapped folio on deferred_list become
>>> non-partially-mapped and need a clear? I understand transferring from
>>> entirely_map
>>> to partially_mapped is a one way process? partially_mapped folios can't go back
>>> to entirely_mapped?
>>>
>> Hi Barry,
>>
>> deferred_split_folio function is called in 3 places after this series: at fault, collapse and partial mapping. Partial mapping can only happen after fault/collapse, and we have FOLIO_FLAG_FALSE(partially_mapped), i.e. the flag is initialized to false, so technically it's not needed. A partially_mapped folio on the deferred list won't become non-partially mapped.
>>
>> I just did it as a precaution if someone ever changes the kernel and calls deferred_split_folio with partially_mapped set to false after it had been true. The function arguments of deferred_split_folio make it seem that passing partially_mapped=false as an argument would clear it, which is why I cleared it as well. I could change the patch to something like below if it makes things better? i.e. add a comment at the top of the function:
>>
>
> to me, it seems a BUG to call with false after a folio has been
> partially_mapped. So I'd rather
> VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
>
> The below should also fix the MTHP_STAT_SPLIT_DEFERRED
> counter this patch is breaking?
>
> @@ -3515,16 +3522,18 @@ void deferred_split_folio(struct folio *folio,
> bool partially_mapped)
> return;
>
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> - if (partially_mapped)
> - folio_set_partially_mapped(folio);
> - else
> - folio_clear_partially_mapped(folio);
> - if (list_empty(&folio->_deferred_list)) {
> - if (partially_mapped) {
> + if (partially_mapped) {
> + if (!folio_test_set_partially_mapped(folio)) {
> + mod_mthp_stat(folio_order(folio),
> + MTHP_STAT_NR_SPLIT_DEFERRED, 1);
> if (folio_test_pmd_mappable(folio))
> count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> count_mthp_stat(folio_order(folio),
> MTHP_STAT_SPLIT_DEFERRED);
> }
> + }
> + VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
> +
> + if (list_empty(&folio->_deferred_list)) {
> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> ds_queue->split_queue_len++;
> #ifdef CONFIG_MEMCG
>
>
So I had sent the below without the VM_WARN_ON_FOLIO as a reply to the other email; below is the version with the VM_WARN.
-void deferred_split_folio(struct folio *folio)
+/* partially_mapped=false won't clear PG_partially_mapped folio flag */
+void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
#ifdef CONFIG_MEMCG
@@ -3485,14 +3488,17 @@ void deferred_split_folio(struct folio *folio)
if (folio_test_swapcache(folio))
return;
- if (!list_empty(&folio->_deferred_list))
- return;
-
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
- if (list_empty(&folio->_deferred_list)) {
+ if (partially_mapped) {
+ folio_set_partially_mapped(folio);
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_DEFERRED_SPLIT_PAGE);
count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ } else {
+ /* partially mapped folios cannot become partially unmapped */
+ VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
+ }
+ if (list_empty(&folio->_deferred_list)) {
list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
Thanks
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-14 11:23 ` Barry Song
@ 2024-08-14 12:36 ` Usama Arif
2024-08-14 23:05 ` Barry Song
0 siblings, 1 reply; 42+ messages in thread
From: Usama Arif @ 2024-08-14 12:36 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On 14/08/2024 12:23, Barry Song wrote:
> On Wed, Aug 14, 2024 at 11:20 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 14/08/2024 12:10, Barry Song wrote:
>>> On Wed, Aug 14, 2024 at 12:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>> Currently folio->_deferred_list is used to keep track of
>>>> partially_mapped folios that are going to be split under memory
>>>> pressure. In the next patch, all THPs that are faulted in and collapsed
>>>> by khugepaged are also going to be tracked using _deferred_list.
>>>>
>>>> This patch introduces a pageflag to be able to distinguish between
>>>> partially mapped folios and others in the deferred_list at split time in
>>>> deferred_split_scan. It's needed as __folio_remove_rmap decrements
>>>> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
>>>> possible to distinguish between partially mapped folios and others in
>>>> deferred_split_scan.
>>>>
>>>> Even though it introduces an extra flag to track if the folio is
>>>> partially mapped, there is no functional change intended with this
>>>> patch and the flag is not useful in this patch itself, it will
>>>> become useful in the next patch when _deferred_list has non partially
>>>> mapped folios.
>>>>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>> ---
>>>> include/linux/huge_mm.h | 4 ++--
>>>> include/linux/page-flags.h | 3 +++
>>>> mm/huge_memory.c | 21 +++++++++++++--------
>>>> mm/hugetlb.c | 1 +
>>>> mm/internal.h | 4 +++-
>>>> mm/memcontrol.c | 3 ++-
>>>> mm/migrate.c | 3 ++-
>>>> mm/page_alloc.c | 5 +++--
>>>> mm/rmap.c | 3 ++-
>>>> mm/vmscan.c | 3 ++-
>>>> 10 files changed, 33 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 4c32058cacfe..969f11f360d2 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
>>>> {
>>>> return split_huge_page_to_list_to_order(page, NULL, 0);
>>>> }
>>>> -void deferred_split_folio(struct folio *folio);
>>>> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
>>>>
>>>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>>> unsigned long address, bool freeze, struct folio *folio);
>>>> @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
>>>> {
>>>> return 0;
>>>> }
>>>> -static inline void deferred_split_folio(struct folio *folio) {}
>>>> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>>>> #define split_huge_pmd(__vma, __pmd, __address) \
>>>> do { } while (0)
>>>>
>>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>>> index a0a29bd092f8..cecc1bad7910 100644
>>>> --- a/include/linux/page-flags.h
>>>> +++ b/include/linux/page-flags.h
>>>> @@ -182,6 +182,7 @@ enum pageflags {
>>>> /* At least one page in this folio has the hwpoison flag set */
>>>> PG_has_hwpoisoned = PG_active,
>>>> PG_large_rmappable = PG_workingset, /* anon or file-backed */
>>>> + PG_partially_mapped, /* was identified to be partially mapped */
>>>> };
>>>>
>>>> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
>>>> @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
>>>> ClearPageHead(page);
>>>> }
>>>> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
>>>> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
>>>> #else
>>>> FOLIO_FLAG_FALSE(large_rmappable)
>>>> +FOLIO_FLAG_FALSE(partially_mapped)
>>>> #endif
>>>>
>>>> #define PG_head_mask ((1UL << PG_head))
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 6df0e9f4f56c..c024ab0f745c 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> * page_deferred_list.
>>>> */
>>>> list_del_init(&folio->_deferred_list);
>>>> + folio_clear_partially_mapped(folio);
>>>> }
>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>> if (mapping) {
>>>> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
>>>> if (!list_empty(&folio->_deferred_list)) {
>>>> ds_queue->split_queue_len--;
>>>> list_del_init(&folio->_deferred_list);
>>>> + folio_clear_partially_mapped(folio);
>>>> }
>>>> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>>> }
>>>>
>>>> -void deferred_split_folio(struct folio *folio)
>>>> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>>> {
>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>> #ifdef CONFIG_MEMCG
>>>> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
>>>> if (folio_test_swapcache(folio))
>>>> return;
>>>>
>>>> - if (!list_empty(&folio->_deferred_list))
>>>> - return;
>>>> -
>>>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>> + if (partially_mapped)
>>>> + folio_set_partially_mapped(folio);
>>>> + else
>>>> + folio_clear_partially_mapped(folio);
>>>> if (list_empty(&folio->_deferred_list)) {
>>>> - if (folio_test_pmd_mappable(folio))
>>>> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>>>> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
>>>> + if (partially_mapped) {
>>>> + if (folio_test_pmd_mappable(folio))
>>>> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>>>> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
>>>
>>> This code completely broke MTHP_STAT_SPLIT_DEFERRED for PMD_ORDER. It
>>> added the folio to the deferred_list as entirely_mapped
>>> (partially_mapped == false).
>>> However, when partially_mapped becomes true, there's no opportunity to
>>> add it again
>>> as it has been there on the list. Are you consistently seeing the counter for
>>> PMD_ORDER as 0?
>>>
>>
>> Ah I see it, this should fix it?
>>
>> -void deferred_split_folio(struct folio *folio)
>> +/* partially_mapped=false won't clear PG_partially_mapped folio flag */
>> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
>> {
>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> #ifdef CONFIG_MEMCG
>> @@ -3485,14 +3488,14 @@ void deferred_split_folio(struct folio *folio)
>> if (folio_test_swapcache(folio))
>> return;
>>
>> - if (!list_empty(&folio->_deferred_list))
>> - return;
>> -
>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> - if (list_empty(&folio->_deferred_list)) {
>> + if (partially_mapped) {
>> + folio_set_partially_mapped(folio);
>> if (folio_test_pmd_mappable(folio))
>> count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>> count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
>> + }
>> + if (list_empty(&folio->_deferred_list)) {
>> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
>> ds_queue->split_queue_len++;
>> #ifdef CONFIG_MEMCG
>>
>
> not enough. as deferred_split_folio(true) won't be called if folio has been
> deferred_list in __folio_remove_rmap():
>
> if (partially_mapped && folio_test_anon(folio) &&
> list_empty(&folio->_deferred_list))
> deferred_split_folio(folio, true);
>
> so you will still see 0.
>
ah yes, Thanks.
So below diff over the current v3 series should work for all cases:
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b4d72479330d..482e3ab60911 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3483,6 +3483,7 @@ void __folio_undo_large_rmappable(struct folio *folio)
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}
+/* partially_mapped=false won't clear PG_partially_mapped folio flag */
void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
@@ -3515,16 +3516,16 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
return;
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
- if (partially_mapped)
+ if (partially_mapped) {
folio_set_partially_mapped(folio);
- else
- folio_clear_partially_mapped(folio);
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+ count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ } else {
+ /* partially mapped folios cannot become partially unmapped */
+ VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
+ }
if (list_empty(&folio->_deferred_list)) {
- if (partially_mapped) {
- if (folio_test_pmd_mappable(folio))
- count_vm_event(THP_DEFERRED_SPLIT_PAGE);
- count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
- }
list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
diff --git a/mm/rmap.c b/mm/rmap.c
index 9ad558c2bad0..4c330635aa4e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1578,7 +1578,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
* Check partially_mapped first to ensure it is a large folio.
*/
if (partially_mapped && folio_test_anon(folio) &&
- list_empty(&folio->_deferred_list))
+ !folio_test_partially_mapped(folio))
deferred_split_folio(folio, true);
__folio_mod_stat(folio, -nr, -nr_pmdmapped);
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-14 12:36 ` Usama Arif
@ 2024-08-14 23:05 ` Barry Song
2024-08-15 15:25 ` Usama Arif
0 siblings, 1 reply; 42+ messages in thread
From: Barry Song @ 2024-08-14 23:05 UTC (permalink / raw)
To: usamaarif642
Cc: akpm, baohua, cerasuolodomenico, corbet, david, hannes,
kernel-team, linux-doc, linux-kernel, linux-mm, riel,
roman.gushchin, rppt, ryan.roberts, shakeel.butt, willy, yuzhao
On Thu, Aug 15, 2024 at 12:37 AM Usama Arif <usamaarif642@gmail.com> wrote:
[snip]
> >>>>
> >>>> -void deferred_split_folio(struct folio *folio)
> >>>> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> >>>> {
> >>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>> #ifdef CONFIG_MEMCG
> >>>> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
> >>>> if (folio_test_swapcache(folio))
> >>>> return;
> >>>>
> >>>> - if (!list_empty(&folio->_deferred_list))
> >>>> - return;
> >>>> -
> >>>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >>>> + if (partially_mapped)
> >>>> + folio_set_partially_mapped(folio);
> >>>> + else
> >>>> + folio_clear_partially_mapped(folio);
> >>>> if (list_empty(&folio->_deferred_list)) {
> >>>> - if (folio_test_pmd_mappable(folio))
> >>>> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> >>>> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> >>>> + if (partially_mapped) {
> >>>> + if (folio_test_pmd_mappable(folio))
> >>>> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> >>>> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> >>>
> >>> This code completely broke MTHP_STAT_SPLIT_DEFERRED for PMD_ORDER. It
> >>> added the folio to the deferred_list as entirely_mapped
> >>> (partially_mapped == false).
> >>> However, when partially_mapped becomes true, there's no opportunity to
> >>> add it again
> >>> as it has been there on the list. Are you consistently seeing the counter for
> >>> PMD_ORDER as 0?
> >>>
> >>
> >> Ah I see it, this should fix it?
> >>
> >> -void deferred_split_folio(struct folio *folio)
> >> +/* partially_mapped=false won't clear PG_partially_mapped folio flag */
> >> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> >> {
> >> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >> #ifdef CONFIG_MEMCG
> >> @@ -3485,14 +3488,14 @@ void deferred_split_folio(struct folio *folio)
> >> if (folio_test_swapcache(folio))
> >> return;
> >>
> >> - if (!list_empty(&folio->_deferred_list))
> >> - return;
> >> -
> >> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >> - if (list_empty(&folio->_deferred_list)) {
> >> + if (partially_mapped) {
> >> + folio_set_partially_mapped(folio);
> >> if (folio_test_pmd_mappable(folio))
> >> count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> >> count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> >> + }
> >> + if (list_empty(&folio->_deferred_list)) {
> >> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> >> ds_queue->split_queue_len++;
> >> #ifdef CONFIG_MEMCG
> >>
> >
> > not enough. as deferred_split_folio(true) won't be called if folio has been
> > deferred_list in __folio_remove_rmap():
> >
> > if (partially_mapped && folio_test_anon(folio) &&
> > list_empty(&folio->_deferred_list))
> > deferred_split_folio(folio, true);
> >
> > so you will still see 0.
> >
>
> ah yes, Thanks.
>
> So below diff over the current v3 series should work for all cases:
>
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b4d72479330d..482e3ab60911 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3483,6 +3483,7 @@ void __folio_undo_large_rmappable(struct folio *folio)
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> }
>
> +/* partially_mapped=false won't clear PG_partially_mapped folio flag */
> void deferred_split_folio(struct folio *folio, bool partially_mapped)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> @@ -3515,16 +3516,16 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
> return;
>
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> - if (partially_mapped)
> + if (partially_mapped) {
> folio_set_partially_mapped(folio);
> - else
> - folio_clear_partially_mapped(folio);
> + if (folio_test_pmd_mappable(folio))
> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> + } else {
> + /* partially mapped folios cannot become partially unmapped */
> + VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
> + }
> if (list_empty(&folio->_deferred_list)) {
> - if (partially_mapped) {
> - if (folio_test_pmd_mappable(folio))
> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> - }
> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> ds_queue->split_queue_len++;
> #ifdef CONFIG_MEMCG
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9ad558c2bad0..4c330635aa4e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1578,7 +1578,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> * Check partially_mapped first to ensure it is a large folio.
> */
> if (partially_mapped && folio_test_anon(folio) &&
> - list_empty(&folio->_deferred_list))
> + !folio_test_partially_mapped(folio))
> deferred_split_folio(folio, true);
>
> __folio_mod_stat(folio, -nr, -nr_pmdmapped);
>
This is an improvement, but there's still an issue. Two or more threads can
execute deferred_split_folio() simultaneously, which could lead to
DEFERRED_SPLIT being added multiple times. We should double-check
the status after acquiring the spinlock.
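Roughly this kind of interleaving (just a sketch; both CPUs can pass the
unlocked !folio_test_partially_mapped() check in __folio_remove_rmap()
before either of them has set the flag):

/*
 *   CPU0                                    CPU1
 *   deferred_split_folio(folio, true)       deferred_split_folio(folio, true)
 *   spin_lock(&ds_queue->split_queue_lock)
 *   folio_set_partially_mapped(folio)
 *   count THP_DEFERRED_SPLIT_PAGE etc.
 *   spin_unlock(...)
 *                                           spin_lock(&ds_queue->split_queue_lock)
 *                                           folio_set_partially_mapped(folio)   (already set)
 *                                           count THP_DEFERRED_SPLIT_PAGE etc.  (counted twice)
 *                                           spin_unlock(...)
 */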
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f401ceded697..3d247826fb95 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3607,10 +3607,12 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
if (partially_mapped) {
- folio_set_partially_mapped(folio);
- if (folio_test_pmd_mappable(folio))
- count_vm_event(THP_DEFERRED_SPLIT_PAGE);
- count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ if (!folio_test_partially_mapped(folio)) {
+ folio_set_partially_mapped(folio);
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+ count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ }
} else {
/* partially mapped folios cannot become partially unmapped */
VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-14 23:05 ` Barry Song
@ 2024-08-15 15:25 ` Usama Arif
2024-08-15 23:30 ` Andrew Morton
0 siblings, 1 reply; 42+ messages in thread
From: Usama Arif @ 2024-08-15 15:25 UTC (permalink / raw)
To: Barry Song, akpm
Cc: baohua, cerasuolodomenico, corbet, david, hannes, kernel-team,
linux-doc, linux-kernel, linux-mm, riel, roman.gushchin, rppt,
ryan.roberts, shakeel.butt, willy, yuzhao
On 15/08/2024 00:05, Barry Song wrote:
>
> On Thu, Aug 15, 2024 at 12:37 AM Usama Arif <usamaarif642@gmail.com> wrote:
> [snip]
>>>>>>
>>>>>> -void deferred_split_folio(struct folio *folio)
>>>>>> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>>>>> {
>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>> #ifdef CONFIG_MEMCG
>>>>>> @@ -3485,14 +3487,17 @@ void deferred_split_folio(struct folio *folio)
>>>>>> if (folio_test_swapcache(folio))
>>>>>> return;
>>>>>>
>>>>>> - if (!list_empty(&folio->_deferred_list))
>>>>>> - return;
>>>>>> -
>>>>>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>>>> + if (partially_mapped)
>>>>>> + folio_set_partially_mapped(folio);
>>>>>> + else
>>>>>> + folio_clear_partially_mapped(folio);
>>>>>> if (list_empty(&folio->_deferred_list)) {
>>>>>> - if (folio_test_pmd_mappable(folio))
>>>>>> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>>>>>> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
>>>>>> + if (partially_mapped) {
>>>>>> + if (folio_test_pmd_mappable(folio))
>>>>>> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>>>>>> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
>>>>>
>>>>> This code completely broke MTHP_STAT_SPLIT_DEFERRED for PMD_ORDER. It
>>>>> added the folio to the deferred_list as entirely_mapped
>>>>> (partially_mapped == false).
>>>>> However, when partially_mapped becomes true, there's no opportunity to
>>>>> add it again
>>>>> as it has been there on the list. Are you consistently seeing the counter for
>>>>> PMD_ORDER as 0?
>>>>>
>>>>
>>>> Ah I see it, this should fix it?
>>>>
>>>> -void deferred_split_folio(struct folio *folio)
>>>> +/* partially_mapped=false won't clear PG_partially_mapped folio flag */
>>>> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>>> {
>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>> #ifdef CONFIG_MEMCG
>>>> @@ -3485,14 +3488,14 @@ void deferred_split_folio(struct folio *folio)
>>>> if (folio_test_swapcache(folio))
>>>> return;
>>>>
>>>> - if (!list_empty(&folio->_deferred_list))
>>>> - return;
>>>> -
>>>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>> - if (list_empty(&folio->_deferred_list)) {
>>>> + if (partially_mapped) {
>>>> + folio_set_partially_mapped(folio);
>>>> if (folio_test_pmd_mappable(folio))
>>>> count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>>>> count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
>>>> + }
>>>> + if (list_empty(&folio->_deferred_list)) {
>>>> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
>>>> ds_queue->split_queue_len++;
>>>> #ifdef CONFIG_MEMCG
>>>>
>>>
>>> not enough. as deferred_split_folio(true) won't be called if folio has been
>>> deferred_list in __folio_remove_rmap():
>>>
>>> if (partially_mapped && folio_test_anon(folio) &&
>>> list_empty(&folio->_deferred_list))
>>> deferred_split_folio(folio, true);
>>>
>>> so you will still see 0.
>>>
>>
>> ah yes, Thanks.
>>
>> So below diff over the current v3 series should work for all cases:
>>
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index b4d72479330d..482e3ab60911 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3483,6 +3483,7 @@ void __folio_undo_large_rmappable(struct folio *folio)
>> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> }
>>
>> +/* partially_mapped=false won't clear PG_partially_mapped folio flag */
>> void deferred_split_folio(struct folio *folio, bool partially_mapped)
>> {
>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> @@ -3515,16 +3516,16 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>> return;
>>
>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> - if (partially_mapped)
>> + if (partially_mapped) {
>> folio_set_partially_mapped(folio);
>> - else
>> - folio_clear_partially_mapped(folio);
>> + if (folio_test_pmd_mappable(folio))
>> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
>> + } else {
>> + /* partially mapped folios cannot become partially unmapped */
>> + VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
>> + }
>> if (list_empty(&folio->_deferred_list)) {
>> - if (partially_mapped) {
>> - if (folio_test_pmd_mappable(folio))
>> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
>> - }
>> list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
>> ds_queue->split_queue_len++;
>> #ifdef CONFIG_MEMCG
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 9ad558c2bad0..4c330635aa4e 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1578,7 +1578,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>> * Check partially_mapped first to ensure it is a large folio.
>> */
>> if (partially_mapped && folio_test_anon(folio) &&
>> - list_empty(&folio->_deferred_list))
>> + !folio_test_partially_mapped(folio))
>> deferred_split_folio(folio, true);
>>
>> __folio_mod_stat(folio, -nr, -nr_pmdmapped);
>>
>
> This is an improvement, but there's still an issue. Two or more threads can
> execute deferred_split_folio() simultaneously, which could lead to
> DEFERRED_SPLIT being added multiple times. We should double-check
> the status after acquiring the spinlock.
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f401ceded697..3d247826fb95 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3607,10 +3607,12 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> if (partially_mapped) {
> - folio_set_partially_mapped(folio);
> - if (folio_test_pmd_mappable(folio))
> - count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> + if (!folio_test_partially_mapped(folio)) {
> + folio_set_partially_mapped(folio);
> + if (folio_test_pmd_mappable(folio))
> + count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
> + }
> } else {
> /* partially mapped folios cannot become partially unmapped */
> VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
Actually, the above is still not thread safe: multiple threads can test partially_mapped, see that it is false at the same time, and all of them would then increment the stats. I believe !folio_test_set_partially_mapped would be best. Hopefully the below diff over v3 should cover all the fixes that have come up until now.
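Just to spell out the property I'm relying on with folio_test_set_partially_mapped -- a rough userspace sketch, not kernel code, but the same "only the thread that flips the bit gets to count" idea:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool partially_mapped;
static atomic_long split_deferred;

/* atomic_exchange() returns the old value, like test_and_set_bit() */
static void *mark_partially_mapped(void *arg)
{
        if (!atomic_exchange(&partially_mapped, true))
                atomic_fetch_add(&split_deferred, 1);   /* bumped exactly once */
        return NULL;
}

int main(void)
{
        pthread_t t[4];
        int i;

        for (i = 0; i < 4; i++)
                pthread_create(&t[i], NULL, mark_partially_mapped, NULL);
        for (i = 0; i < 4; i++)
                pthread_join(t[i], NULL);

        /* always prints 1, no matter how the threads race */
        printf("split_deferred = %ld\n", atomic_load(&split_deferred));
        return 0;
}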
Independent of this series, I also think it's a good idea to add a selftest for this deferred_split counter. I will send a separate patch for it that just maps a THP, unmaps a small part of it and checks the counter. I think split_huge_page_test.c is probably the right place for it.
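Something along these lines is what I have in mind for the selftest -- only a rough standalone sketch, not the final test (it assumes the thp_deferred_split_page counter in /proc/vmstat and that the 2M anon mapping actually gets a THP at fault time):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE (2UL * 1024 * 1024)

static long read_deferred_split(void)
{
        char line[256];
        long val = -1;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "thp_deferred_split_page %ld", &val) == 1)
                        break;
        fclose(f);
        return val;
}

int main(void)
{
        long before, after;
        char *map, *thp;

        /* over-allocate so the buffer can be aligned to a 2M boundary */
        map = mmap(NULL, 2 * THP_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map == MAP_FAILED)
                return 1;
        thp = (char *)(((uintptr_t)map + THP_SIZE - 1) & ~(THP_SIZE - 1));

        madvise(thp, THP_SIZE, MADV_HUGEPAGE);
        memset(thp, 1, THP_SIZE);               /* fault the THP in */

        before = read_deferred_split();
        madvise(thp, 4096, MADV_DONTNEED);      /* unmap one base page of it */
        after = read_deferred_split();

        printf("thp_deferred_split_page: %ld -> %ld\n", before, after);
        return after > before ? 0 : 1;
}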
If everyone is happy with it, Andrew could replace the original fix patch in [1] with the below.
[1] https://lore.kernel.org/all/20240814200220.F1964C116B1@smtp.kernel.org/
commit c627655548fa09b59849e942da4decc84fa0b0f2
Author: Usama Arif <usamaarif642@gmail.com>
Date: Thu Aug 15 16:07:20 2024 +0100
mm: Introduce a pageflag for partially mapped folios fix
Fixes the original commit by not clearing the partially mapped bit
in hugetlb folios and by fixing the deferred split THP stats.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index cecc1bad7910..7bee743ede40 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -863,6 +863,7 @@ static inline void ClearPageCompound(struct page *page)
}
FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
+FOLIO_TEST_SET_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
#else
FOLIO_FLAG_FALSE(large_rmappable)
FOLIO_FLAG_FALSE(partially_mapped)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c024ab0f745c..24371e4ef19b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3459,6 +3459,7 @@ void __folio_undo_large_rmappable(struct folio *folio)
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}
+/* partially_mapped=false won't clear PG_partially_mapped folio flag */
void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
@@ -3488,16 +3489,17 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
return;
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
- if (partially_mapped)
- folio_set_partially_mapped(folio);
- else
- folio_clear_partially_mapped(folio);
- if (list_empty(&folio->_deferred_list)) {
- if (partially_mapped) {
+ if (partially_mapped) {
+ if (!folio_test_set_partially_mapped(folio)) {
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_DEFERRED_SPLIT_PAGE);
count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
}
+ } else {
+ /* partially mapped folios cannot become non-partially mapped */
+ VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
+ }
+ if (list_empty(&folio->_deferred_list)) {
list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2ae2d9a18e40..1fdd9eab240c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1758,7 +1758,6 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
free_gigantic_folio(folio, huge_page_order(h));
} else {
INIT_LIST_HEAD(&folio->_deferred_list);
- folio_clear_partially_mapped(folio);
folio_put(folio);
}
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 9ad558c2bad0..4c330635aa4e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1578,7 +1578,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
* Check partially_mapped first to ensure it is a large folio.
*/
if (partially_mapped && folio_test_anon(folio) &&
- list_empty(&folio->_deferred_list))
+ !folio_test_partially_mapped(folio))
deferred_split_folio(folio, true);
__folio_mod_stat(folio, -nr, -nr_pmdmapped);
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-13 12:02 ` [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
` (2 preceding siblings ...)
2024-08-14 11:10 ` Barry Song
@ 2024-08-15 16:33 ` David Hildenbrand
2024-08-15 17:10 ` Usama Arif
2024-08-16 15:44 ` Matthew Wilcox
4 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2024-08-15 16:33 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 13.08.24 14:02, Usama Arif wrote:
> Currently folio->_deferred_list is used to keep track of
> partially_mapped folios that are going to be split under memory
> pressure. In the next patch, all THPs that are faulted in and collapsed
> by khugepaged are also going to be tracked using _deferred_list.
>
> This patch introduces a pageflag to be able to distinguish between
> partially mapped folios and others in the deferred_list at split time in
> deferred_split_scan. It's needed as __folio_remove_rmap decrements
> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
> possible to distinguish between partially mapped folios and others in
> deferred_split_scan.
>
> Even though it introduces an extra flag to track if the folio is
> partially mapped, there is no functional change intended with this
> patch and the flag is not useful in this patch itself, it will
> become useful in the next patch when _deferred_list has non partially
> mapped folios.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
> include/linux/huge_mm.h | 4 ++--
> include/linux/page-flags.h | 3 +++
> mm/huge_memory.c | 21 +++++++++++++--------
> mm/hugetlb.c | 1 +
> mm/internal.h | 4 +++-
> mm/memcontrol.c | 3 ++-
> mm/migrate.c | 3 ++-
> mm/page_alloc.c | 5 +++--
> mm/rmap.c | 3 ++-
> mm/vmscan.c | 3 ++-
> 10 files changed, 33 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4c32058cacfe..969f11f360d2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
> {
> return split_huge_page_to_list_to_order(page, NULL, 0);
> }
> -void deferred_split_folio(struct folio *folio);
> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
>
> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long address, bool freeze, struct folio *folio);
> @@ -495,7 +495,7 @@ static inline int split_huge_page(struct page *page)
> {
> return 0;
> }
> -static inline void deferred_split_folio(struct folio *folio) {}
> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> #define split_huge_pmd(__vma, __pmd, __address) \
> do { } while (0)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index a0a29bd092f8..cecc1bad7910 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -182,6 +182,7 @@ enum pageflags {
> /* At least one page in this folio has the hwpoison flag set */
> PG_has_hwpoisoned = PG_active,
> PG_large_rmappable = PG_workingset, /* anon or file-backed */
> + PG_partially_mapped, /* was identified to be partially mapped */
> };
>
> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
> @@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
> ClearPageHead(page);
> }
> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
> #else
> FOLIO_FLAG_FALSE(large_rmappable)
> +FOLIO_FLAG_FALSE(partially_mapped)
> #endif
>
> #define PG_head_mask ((1UL << PG_head))
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6df0e9f4f56c..c024ab0f745c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> * page_deferred_list.
> */
> list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> }
> spin_unlock(&ds_queue->split_queue_lock);
> if (mapping) {
> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
> if (!list_empty(&folio->_deferred_list)) {
> ds_queue->split_queue_len--;
> list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> }
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> }
>
> -void deferred_split_folio(struct folio *folio)
> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> {
/* We lost race with folio_put() */>
list_del_init(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> ds_queue->split_queue_len--;
> }
> if (!--sc->nr_to_scan)
> @@ -3558,7 +3564,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> next:
> folio_put(folio);
> }
> -
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> list_splice_tail(&list, &ds_queue->split_queue);
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1fdd9eab240c..2ae2d9a18e40 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> free_gigantic_folio(folio, huge_page_order(h));
> } else {
> INIT_LIST_HEAD(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
> folio_put(folio);
> }
> }
> diff --git a/mm/internal.h b/mm/internal.h
> index 52f7fc4e8ac3..d64546b8d377 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -662,8 +662,10 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
> atomic_set(&folio->_entire_mapcount, -1);
> atomic_set(&folio->_nr_pages_mapped, 0);
> atomic_set(&folio->_pincount, 0);
> - if (order > 1)
> + if (order > 1) {
> INIT_LIST_HEAD(&folio->_deferred_list);
> + folio_clear_partially_mapped(folio);
Can we use the non-atomic version here?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-15 16:33 ` David Hildenbrand
@ 2024-08-15 17:10 ` Usama Arif
2024-08-15 21:06 ` Barry Song
2024-08-15 21:08 ` David Hildenbrand
0 siblings, 2 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-15 17:10 UTC (permalink / raw)
To: David Hildenbrand, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 15/08/2024 17:33, David Hildenbrand wrote:
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 6df0e9f4f56c..c024ab0f745c 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> * page_deferred_list.
>> */
>> list_del_init(&folio->_deferred_list);
>> + folio_clear_partially_mapped(folio);
>> }
>> spin_unlock(&ds_queue->split_queue_lock);
>> if (mapping) {
>> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
>> if (!list_empty(&folio->_deferred_list)) {
>> ds_queue->split_queue_len--;
>> list_del_init(&folio->_deferred_list);
>> + folio_clear_partially_mapped(folio);
>> }
>> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> }
>> -void deferred_split_folio(struct folio *folio)
>> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
>> {
> /* We lost race with folio_put() */> list_del_init(&folio->_deferred_list);
Was there some comment here? I just see ">" removed from the start of /* We lost race with folio_put() */.
>> + folio_clear_partially_mapped(folio);
>> ds_queue->split_queue_len--;
>> }
>> if (!--sc->nr_to_scan)
>> @@ -3558,7 +3564,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>> next:
>> folio_put(folio);
>> }
>> -
>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> list_splice_tail(&list, &ds_queue->split_queue);
>> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 1fdd9eab240c..2ae2d9a18e40 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>> free_gigantic_folio(folio, huge_page_order(h));
>> } else {
>> INIT_LIST_HEAD(&folio->_deferred_list);
>> + folio_clear_partially_mapped(folio);
>> folio_put(folio);
>> }
>> }
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 52f7fc4e8ac3..d64546b8d377 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -662,8 +662,10 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>> atomic_set(&folio->_entire_mapcount, -1);
>> atomic_set(&folio->_nr_pages_mapped, 0);
>> atomic_set(&folio->_pincount, 0);
>> - if (order > 1)
>> + if (order > 1) {
>> INIT_LIST_HEAD(&folio->_deferred_list);
>> + folio_clear_partially_mapped(folio);
>
> Can we use the non-atomic version here?
>
I believe we can use the non-atomic version in all places where set/clear is done, as all the set/clear calls are protected by ds_queue->split_queue_lock. So basically we could replace all folio_set/clear_partially_mapped with __folio_set/clear_partially_mapped.
But I guess it's likely not going to make much difference? I will do it anyway in the next revision, rather than sending a fix patch. There haven't been any reviews for patch 5, so I will wait a few days for any comments on that.
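Roughly, that would look like the below in __folio_undo_large_rmappable() (just a sketch, assuming a non-atomic __folio_clear_partially_mapped() helper gets generated for this flag, e.g. via __FOLIO_CLEAR_FLAG; everything else stays as in the patch):

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (!list_empty(&folio->_deferred_list)) {
		ds_queue->split_queue_len--;
		list_del_init(&folio->_deferred_list);
		/* split_queue_lock is held, so the non-atomic clear is safe */
		__folio_clear_partially_mapped(folio);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);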
Thanks
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp
2024-08-13 12:02 ` [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
@ 2024-08-15 18:47 ` Kairui Song
2024-08-15 19:16 ` Usama Arif
0 siblings, 1 reply; 42+ messages in thread
From: Kairui Song @ 2024-08-15 18:47 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, willy,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team,
Shuang Zhai
On Tue, Aug 13, 2024 at 8:03 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
> From: Yu Zhao <yuzhao@google.com>
>
> If a tail page has only two references left, one inherited from the
> isolation of its head and the other from lru_add_page_tail() which we
> are about to drop, it means this tail page was concurrently zapped.
> Then we can safely free it and save page reclaim or migration the
> trouble of trying it.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Tested-by: Shuang Zhai <zhais@google.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> mm/huge_memory.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
Hi, Usama, Yu
This commit is causing the kernel to panic very quickly with build
kernel test on top of tmpfs with all mTHP enabled, the panic comes
after:
[ 207.147705] BUG: Bad page state in process tar pfn:14ae70
[ 207.149376] page: refcount:3 mapcount:2 mapping:0000000000000000
index:0x562d23b70 pfn:0x14ae70
[ 207.151750] flags:
0x17ffffc0020019(locked|uptodate|dirty|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
[ 207.154325] raw: 0017ffffc0020019 dead000000000100 dead000000000122
0000000000000000
[ 207.156442] raw: 0000000562d23b70 0000000000000000 0000000300000001
0000000000000000
[ 207.158561] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[ 207.160325] Modules linked in:
[ 207.161194] CPU: 22 UID: 0 PID: 2650 Comm: tar Not tainted
6.11.0-rc3.ptch+ #136
[ 207.163198] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
[ 207.164946] Call Trace:
[ 207.165636] <TASK>
[ 207.166226] dump_stack_lvl+0x53/0x70
[ 207.167241] bad_page+0x70/0x120
[ 207.168131] free_page_is_bad+0x5f/0x70
[ 207.169193] free_unref_folios+0x3a5/0x620
[ 207.170320] ? __mem_cgroup_uncharge_folios+0x7e/0xa0
[ 207.171705] __split_huge_page+0xb02/0xcf0
[ 207.172839] ? smp_call_function_many_cond+0x105/0x4b0
[ 207.174250] ? __pfx_flush_tlb_func+0x10/0x10
[ 207.175410] ? on_each_cpu_cond_mask+0x29/0x50
[ 207.176603] split_huge_page_to_list_to_order+0x857/0x9b0
[ 207.178052] shrink_folio_list+0x4e1/0x1200
[ 207.179198] evict_folios+0x468/0xab0
[ 207.180202] try_to_shrink_lruvec+0x1f3/0x280
[ 207.181394] shrink_lruvec+0x89/0x780
[ 207.182398] ? mem_cgroup_iter+0x66/0x290
[ 207.183488] shrink_node+0x243/0xb00
[ 207.184474] do_try_to_free_pages+0xbd/0x4e0
[ 207.185621] try_to_free_mem_cgroup_pages+0x107/0x230
[ 207.186994] try_charge_memcg+0x184/0x5d0
[ 207.188092] charge_memcg+0x3a/0x60
[ 207.189046] __mem_cgroup_charge+0x2c/0x80
[ 207.190162] shmem_alloc_and_add_folio+0x1a3/0x470
[ 207.191469] shmem_get_folio_gfp+0x24a/0x670
[ 207.192635] shmem_write_begin+0x56/0xd0
[ 207.193703] generic_perform_write+0x140/0x330
[ 207.194919] shmem_file_write_iter+0x89/0x90
[ 207.196082] vfs_write+0x2f3/0x420
[ 207.197019] ksys_write+0x5d/0xd0
[ 207.197914] do_syscall_64+0x47/0x110
[ 207.198915] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 207.200293] RIP: 0033:0x7f2e6099c784
[ 207.201278] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f
84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 08 0e 00 00 74 13 b8 01 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20
48 89
[ 207.206280] RSP: 002b:00007ffdb1a0e7d8 EFLAGS: 00000202 ORIG_RAX:
0000000000000001
[ 207.208312] RAX: ffffffffffffffda RBX: 00000000000005e7 RCX: 00007f2e6099c784
[ 207.210225] RDX: 00000000000005e7 RSI: 0000562d23b77000 RDI: 0000000000000004
[ 207.212145] RBP: 00007ffdb1a0e820 R08: 00000000000005e7 R09: 0000000000000007
[ 207.214064] R10: 0000000000000180 R11: 0000000000000202 R12: 0000562d23b77000
[ 207.215974] R13: 0000000000000004 R14: 00000000000005e7 R15: 0000000000000000
[ 207.217888] </TASK>
Test is done using ZRAM as SWAP, 1G memcg, and run:
cd /mnt/tmpfs
time tar zxf "$linux_src"
make -j64 clean
make defconfig
/usr/bin/time make -j64
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 04ee8abd6475..85a424e954be 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3059,7 +3059,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> unsigned int new_nr = 1 << new_order;
> int order = folio_order(folio);
> unsigned int nr = 1 << order;
> + struct folio_batch free_folios;
>
> + folio_batch_init(&free_folios);
> /* complete memcg works before add pages to LRU */
> split_page_memcg(head, order, new_order);
>
> @@ -3143,6 +3145,26 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> if (subpage == page)
> continue;
> folio_unlock(new_folio);
> + /*
> + * If a folio has only two references left, one inherited
> + * from the isolation of its head and the other from
> + * lru_add_page_tail() which we are about to drop, it means this
> + * folio was concurrently zapped. Then we can safely free it
> + * and save page reclaim or migration the trouble of trying it.
> + */
> + if (list && folio_ref_freeze(new_folio, 2)) {
> + VM_WARN_ON_ONCE_FOLIO(folio_test_lru(new_folio), new_folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_test_large(new_folio), new_folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_mapped(new_folio), new_folio);
> +
> + folio_clear_active(new_folio);
> + folio_clear_unevictable(new_folio);
> + if (!folio_batch_add(&free_folios, folio)) {
> + mem_cgroup_uncharge_folios(&free_folios);
> + free_unref_folios(&free_folios);
> + }
> + continue;
> + }
>
> /*
> * Subpages may be freed if there wasn't any mapping
> @@ -3153,6 +3175,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> */
> free_page_and_swap_cache(subpage);
> }
> +
> + if (free_folios.nr) {
> + mem_cgroup_uncharge_folios(&free_folios);
> + free_unref_folios(&free_folios);
> + }
> }
>
> /* Racy check whether the huge page can be split */
> --
> 2.43.5
>
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp
2024-08-15 18:47 ` Kairui Song
@ 2024-08-15 19:16 ` Usama Arif
2024-08-16 16:55 ` Kairui Song
0 siblings, 1 reply; 42+ messages in thread
From: Usama Arif @ 2024-08-15 19:16 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, willy,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team,
Shuang Zhai
On 15/08/2024 19:47, Kairui Song wrote:
> On Tue, Aug 13, 2024 at 8:03 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> From: Yu Zhao <yuzhao@google.com>
>>
>> If a tail page has only two references left, one inherited from the
>> isolation of its head and the other from lru_add_page_tail() which we
>> are about to drop, it means this tail page was concurrently zapped.
>> Then we can safely free it and save page reclaim or migration the
>> trouble of trying it.
>>
>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>> Tested-by: Shuang Zhai <zhais@google.com>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> ---
>> mm/huge_memory.c | 27 +++++++++++++++++++++++++++
>> 1 file changed, 27 insertions(+)
>
> Hi, Usama, Yu
>
> This commit is causing the kernel to panic very quickly with build
> kernel test on top of tmpfs with all mTHP enabled, the panic comes
> after:
>
Hi,
Thanks for pointing this out. It is a very silly bug I have introduced going from v1 page version to the folio version of the patch in v3.
Doing below over this patch will fix it:
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 907813102430..a6ca454e1168 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3183,7 +3183,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
folio_clear_active(new_folio);
folio_clear_unevictable(new_folio);
- if (!folio_batch_add(&free_folios, folio)) {
+ if (!folio_batch_add(&free_folios, new_folio)) {
mem_cgroup_uncharge_folios(&free_folios);
free_unref_folios(&free_folios);
}
I will include it in the next revision.
> [ 207.147705] BUG: Bad page state in process tar pfn:14ae70
> [ 207.149376] page: refcount:3 mapcount:2 mapping:0000000000000000
> index:0x562d23b70 pfn:0x14ae70
> [ 207.151750] flags:
> 0x17ffffc0020019(locked|uptodate|dirty|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
> [ 207.154325] raw: 0017ffffc0020019 dead000000000100 dead000000000122
> 0000000000000000
> [ 207.156442] raw: 0000000562d23b70 0000000000000000 0000000300000001
> 0000000000000000
> [ 207.158561] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [ 207.160325] Modules linked in:
> [ 207.161194] CPU: 22 UID: 0 PID: 2650 Comm: tar Not tainted
> 6.11.0-rc3.ptch+ #136
> [ 207.163198] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> [ 207.164946] Call Trace:
> [ 207.165636] <TASK>
> [ 207.166226] dump_stack_lvl+0x53/0x70
> [ 207.167241] bad_page+0x70/0x120
> [ 207.168131] free_page_is_bad+0x5f/0x70
> [ 207.169193] free_unref_folios+0x3a5/0x620
> [ 207.170320] ? __mem_cgroup_uncharge_folios+0x7e/0xa0
> [ 207.171705] __split_huge_page+0xb02/0xcf0
> [ 207.172839] ? smp_call_function_many_cond+0x105/0x4b0
> [ 207.174250] ? __pfx_flush_tlb_func+0x10/0x10
> [ 207.175410] ? on_each_cpu_cond_mask+0x29/0x50
> [ 207.176603] split_huge_page_to_list_to_order+0x857/0x9b0
> [ 207.178052] shrink_folio_list+0x4e1/0x1200
> [ 207.179198] evict_folios+0x468/0xab0
> [ 207.180202] try_to_shrink_lruvec+0x1f3/0x280
> [ 207.181394] shrink_lruvec+0x89/0x780
> [ 207.182398] ? mem_cgroup_iter+0x66/0x290
> [ 207.183488] shrink_node+0x243/0xb00
> [ 207.184474] do_try_to_free_pages+0xbd/0x4e0
> [ 207.185621] try_to_free_mem_cgroup_pages+0x107/0x230
> [ 207.186994] try_charge_memcg+0x184/0x5d0
> [ 207.188092] charge_memcg+0x3a/0x60
> [ 207.189046] __mem_cgroup_charge+0x2c/0x80
> [ 207.190162] shmem_alloc_and_add_folio+0x1a3/0x470
> [ 207.191469] shmem_get_folio_gfp+0x24a/0x670
> [ 207.192635] shmem_write_begin+0x56/0xd0
> [ 207.193703] generic_perform_write+0x140/0x330
> [ 207.194919] shmem_file_write_iter+0x89/0x90
> [ 207.196082] vfs_write+0x2f3/0x420
> [ 207.197019] ksys_write+0x5d/0xd0
> [ 207.197914] do_syscall_64+0x47/0x110
> [ 207.198915] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 207.200293] RIP: 0033:0x7f2e6099c784
> [ 207.201278] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f
> 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 08 0e 00 00 74 13 b8 01 00 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20
> 48 89
> [ 207.206280] RSP: 002b:00007ffdb1a0e7d8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000001
> [ 207.208312] RAX: ffffffffffffffda RBX: 00000000000005e7 RCX: 00007f2e6099c784
> [ 207.210225] RDX: 00000000000005e7 RSI: 0000562d23b77000 RDI: 0000000000000004
> [ 207.212145] RBP: 00007ffdb1a0e820 R08: 00000000000005e7 R09: 0000000000000007
> [ 207.214064] R10: 0000000000000180 R11: 0000000000000202 R12: 0000562d23b77000
> [ 207.215974] R13: 0000000000000004 R14: 00000000000005e7 R15: 0000000000000000
> [ 207.217888] </TASK>
>
> Test is done using ZRAM as SWAP, 1G memcg, and run:
> cd /mnt/tmpfs
> time tar zxf "$linux_src"
> make -j64 clean
> make defconfig
> /usr/bin/time make -j64
>
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 04ee8abd6475..85a424e954be 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3059,7 +3059,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>> unsigned int new_nr = 1 << new_order;
>> int order = folio_order(folio);
>> unsigned int nr = 1 << order;
>> + struct folio_batch free_folios;
>>
>> + folio_batch_init(&free_folios);
>> /* complete memcg works before add pages to LRU */
>> split_page_memcg(head, order, new_order);
>>
>> @@ -3143,6 +3145,26 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>> if (subpage == page)
>> continue;
>> folio_unlock(new_folio);
>> + /*
>> + * If a folio has only two references left, one inherited
>> + * from the isolation of its head and the other from
>> + * lru_add_page_tail() which we are about to drop, it means this
>> + * folio was concurrently zapped. Then we can safely free it
>> + * and save page reclaim or migration the trouble of trying it.
>> + */
>> + if (list && folio_ref_freeze(new_folio, 2)) {
>> + VM_WARN_ON_ONCE_FOLIO(folio_test_lru(new_folio), new_folio);
>> + VM_WARN_ON_ONCE_FOLIO(folio_test_large(new_folio), new_folio);
>> + VM_WARN_ON_ONCE_FOLIO(folio_mapped(new_folio), new_folio);
>> +
>> + folio_clear_active(new_folio);
>> + folio_clear_unevictable(new_folio);
>> + if (!folio_batch_add(&free_folios, folio)) {
>> + mem_cgroup_uncharge_folios(&free_folios);
>> + free_unref_folios(&free_folios);
>> + }
>> + continue;
>> + }
>>
>> /*
>> * Subpages may be freed if there wasn't any mapping
>> @@ -3153,6 +3175,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>> */
>> free_page_and_swap_cache(subpage);
>> }
>> +
>> + if (free_folios.nr) {
>> + mem_cgroup_uncharge_folios(&free_folios);
>> + free_unref_folios(&free_folios);
>> + }
>> }
>>
>> /* Racy check whether the huge page can be split */
>> --
>> 2.43.5
>>
>>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-15 17:10 ` Usama Arif
@ 2024-08-15 21:06 ` Barry Song
2024-08-15 21:08 ` David Hildenbrand
1 sibling, 0 replies; 42+ messages in thread
From: Barry Song @ 2024-08-15 21:06 UTC (permalink / raw)
To: Usama Arif
Cc: David Hildenbrand, akpm, linux-mm, hannes, riel, shakeel.butt,
roman.gushchin, yuzhao, ryan.roberts, rppt, willy,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team
On Fri, Aug 16, 2024 at 5:10 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 15/08/2024 17:33, David Hildenbrand wrote:
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 6df0e9f4f56c..c024ab0f745c 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -3397,6 +3397,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >> * page_deferred_list.
> >> */
> >> list_del_init(&folio->_deferred_list);
> >> + folio_clear_partially_mapped(folio);
> >> }
> >> spin_unlock(&ds_queue->split_queue_lock);
> >> if (mapping) {
> >> @@ -3453,11 +3454,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
> >> if (!list_empty(&folio->_deferred_list)) {
> >> ds_queue->split_queue_len--;
> >> list_del_init(&folio->_deferred_list);
> >> + folio_clear_partially_mapped(folio);
> >> }
> >> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> >> }
> >> -void deferred_split_folio(struct folio *folio)
> >> +void deferred_split_folio(struct folio *folio, bool partially_mapped)
> >> {
> > /* We lost race with folio_put() */> list_del_init(&folio->_deferred_list);
>
> Was there some comment here? I just see ">" remove from the start of /* We lost race with folio_put() */
>
> >> + folio_clear_partially_mapped(folio);
> >> ds_queue->split_queue_len--;
> >> }
> >> if (!--sc->nr_to_scan)
> >> @@ -3558,7 +3564,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> >> next:
> >> folio_put(folio);
> >> }
> >> -
> >> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >> list_splice_tail(&list, &ds_queue->split_queue);
> >> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index 1fdd9eab240c..2ae2d9a18e40 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> >> free_gigantic_folio(folio, huge_page_order(h));
> >> } else {
> >> INIT_LIST_HEAD(&folio->_deferred_list);
> >> + folio_clear_partially_mapped(folio);
> >> folio_put(folio);
> >> }
> >> }
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index 52f7fc4e8ac3..d64546b8d377 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -662,8 +662,10 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
> >> atomic_set(&folio->_entire_mapcount, -1);
> >> atomic_set(&folio->_nr_pages_mapped, 0);
> >> atomic_set(&folio->_pincount, 0);
> >> - if (order > 1)
> >> + if (order > 1) {
> >> INIT_LIST_HEAD(&folio->_deferred_list);
> >> + folio_clear_partially_mapped(folio);
> >
> > Can we use the non-atomic version here?
> >
>
> I believe we can use the non-atomic version in all places where set/clear is done as all set/clear are protected by ds_queue->split_queue_lock. So basically could replace all folio_set/clear_partially_mapped with __folio_set/clear_partially_mapped.
>
Right. That is why I thought the below is actually safe,
but I appreciate a test_set of course (and non-atomic):
+	if (!folio_test_partially_mapped(folio)) {
+		folio_set_partially_mapped(folio);
+		if (folio_test_pmd_mappable(folio))
+			count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+		count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+	}
> But I guess its likely not going to make much difference? I will do it anyways in the next revision, rather than sending a fix patch. There haven't been any reviews for patch 5 so will wait a few days for any comments on that.
>
> Thanks
Thanks
Barry
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-15 17:10 ` Usama Arif
2024-08-15 21:06 ` Barry Song
@ 2024-08-15 21:08 ` David Hildenbrand
1 sibling, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2024-08-15 21:08 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
>
> Was there some comment here? I just see ">" remove from the start of /* We lost race with folio_put() */
>
Likely I wanted to comment something but decided otherwise, sorry :)
>>> + folio_clear_partially_mapped(folio);
>>> ds_queue->split_queue_len--;
>>> }
>>> if (!--sc->nr_to_scan)
>>> @@ -3558,7 +3564,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>> next:
>>> folio_put(folio);
>>> }
>>> -
>>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>> list_splice_tail(&list, &ds_queue->split_queue);
>>> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 1fdd9eab240c..2ae2d9a18e40 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>>> free_gigantic_folio(folio, huge_page_order(h));
>>> } else {
>>> INIT_LIST_HEAD(&folio->_deferred_list);
>>> + folio_clear_partially_mapped(folio);
>>> folio_put(folio);
>>> }
>>> }
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 52f7fc4e8ac3..d64546b8d377 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -662,8 +662,10 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>>> atomic_set(&folio->_entire_mapcount, -1);
>>> atomic_set(&folio->_nr_pages_mapped, 0);
>>> atomic_set(&folio->_pincount, 0);
>>> - if (order > 1)
>>> + if (order > 1) {
>>> INIT_LIST_HEAD(&folio->_deferred_list);
>>> + folio_clear_partially_mapped(folio);
>>
>> Can we use the non-atomic version here?
>>
>
> I believe we can use the non-atomic version in all places where set/clear is done as all set/clear are protected by ds_queue->split_queue_lock. So basically could replace all folio_set/clear_partially_mapped with __folio_set/clear_partially_mapped.
>
> But I guess its likely not going to make much difference? I will do it anyways in the next revision, rather than sending a fix patch. There haven't been any reviews for patch 5 so will wait a few days for any comments on that.
If we can avoid atomics, please do! :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-15 15:25 ` Usama Arif
@ 2024-08-15 23:30 ` Andrew Morton
2024-08-16 2:50 ` Yu Zhao
0 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2024-08-15 23:30 UTC (permalink / raw)
To: Usama Arif
Cc: Barry Song, baohua, cerasuolodomenico, corbet, david, hannes,
kernel-team, linux-doc, linux-kernel, linux-mm, riel,
roman.gushchin, rppt, ryan.roberts, shakeel.butt, willy, yuzhao
On Thu, 15 Aug 2024 16:25:09 +0100 Usama Arif <usamaarif642@gmail.com> wrote:
>
>
> commit c627655548fa09b59849e942da4decc84fa0b0f2
> Author: Usama Arif <usamaarif642@gmail.com>
> Date: Thu Aug 15 16:07:20 2024 +0100
>
> mm: Introduce a pageflag for partially mapped folios fix
>
> Fixes the original commit by not clearing partially mapped bit
> in hugeTLB folios and fixing deferred split THP stats.
>
> ...
>
Life is getting complicated.
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1758,7 +1758,6 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> free_gigantic_folio(folio, huge_page_order(h));
> } else {
> INIT_LIST_HEAD(&folio->_deferred_list);
> - folio_clear_partially_mapped(folio);
> folio_put(folio);
> }
> }
Yu Zhao's "mm/hugetlb: use __GFP_COMP for gigantic folios" was
expecting that folio_clear_partially_mapped() call to be there.
I resolved this within
mm-hugetlb-use-__gfp_comp-for-gigantic-folios.patch thusly:
@@ -1748,18 +1704,8 @@ static void __update_and_free_hugetlb_fo
folio_ref_unfreeze(folio, 1);
- /*
- * Non-gigantic pages demoted from CMA allocated gigantic pages
- * need to be given back to CMA in free_gigantic_folio.
- */
- if (hstate_is_gigantic(h) ||
- hugetlb_cma_folio(folio, huge_page_order(h))) {
- destroy_compound_gigantic_folio(folio, huge_page_order(h));
- free_gigantic_folio(folio, huge_page_order(h));
- } else {
- INIT_LIST_HEAD(&folio->_deferred_list);
- folio_put(folio);
- }
+ INIT_LIST_HEAD(&folio->_deferred_list);
+ hugetlb_free_folio(folio);
}
/*
Please check.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-15 23:30 ` Andrew Morton
@ 2024-08-16 2:50 ` Yu Zhao
0 siblings, 0 replies; 42+ messages in thread
From: Yu Zhao @ 2024-08-16 2:50 UTC (permalink / raw)
To: Andrew Morton
Cc: Usama Arif, Barry Song, baohua, cerasuolodomenico, corbet, david,
hannes, kernel-team, linux-doc, linux-kernel, linux-mm, riel,
roman.gushchin, rppt, ryan.roberts, shakeel.butt, willy
On Thu, Aug 15, 2024 at 5:30 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 15 Aug 2024 16:25:09 +0100 Usama Arif <usamaarif642@gmail.com> wrote:
>
> >
> >
> > commit c627655548fa09b59849e942da4decc84fa0b0f2
> > Author: Usama Arif <usamaarif642@gmail.com>
> > Date: Thu Aug 15 16:07:20 2024 +0100
> >
> > mm: Introduce a pageflag for partially mapped folios fix
> >
> > Fixes the original commit by not clearing partially mapped bit
> > in hugeTLB folios and fixing deferred split THP stats.
> >
> > ...
> >
>
> Life is getting complicated.
>
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1758,7 +1758,6 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> > free_gigantic_folio(folio, huge_page_order(h));
> > } else {
> > INIT_LIST_HEAD(&folio->_deferred_list);
> > - folio_clear_partially_mapped(folio);
> > folio_put(folio);
> > }
> > }
>
> Yu Zhao's "mm/hugetlb: use __GFP_COMP for gigantic folios" was
> expecting that folio_clear_partially_mapped() to be there.
>
> I resolved this within
> mm-hugetlb-use-__gfp_comp-for-gigantic-folios.patch thusly:
>
> @@ -1748,18 +1704,8 @@ static void __update_and_free_hugetlb_fo
>
> folio_ref_unfreeze(folio, 1);
>
> - /*
> - * Non-gigantic pages demoted from CMA allocated gigantic pages
> - * need to be given back to CMA in free_gigantic_folio.
> - */
> - if (hstate_is_gigantic(h) ||
> - hugetlb_cma_folio(folio, huge_page_order(h))) {
> - destroy_compound_gigantic_folio(folio, huge_page_order(h));
> - free_gigantic_folio(folio, huge_page_order(h));
> - } else {
> - INIT_LIST_HEAD(&folio->_deferred_list);
> - folio_put(folio);
> - }
> + INIT_LIST_HEAD(&folio->_deferred_list);
> + hugetlb_free_folio(folio);
> }
>
> /*
>
> Please check.
Confirmed, thanks.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-13 12:02 ` [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
` (3 preceding siblings ...)
2024-08-15 16:33 ` David Hildenbrand
@ 2024-08-16 15:44 ` Matthew Wilcox
2024-08-16 16:08 ` Usama Arif
4 siblings, 1 reply; 42+ messages in thread
From: Matthew Wilcox @ 2024-08-16 15:44 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On Tue, Aug 13, 2024 at 01:02:47PM +0100, Usama Arif wrote:
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index a0a29bd092f8..cecc1bad7910 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -182,6 +182,7 @@ enum pageflags {
> /* At least one page in this folio has the hwpoison flag set */
> PG_has_hwpoisoned = PG_active,
> PG_large_rmappable = PG_workingset, /* anon or file-backed */
> + PG_partially_mapped, /* was identified to be partially mapped */
No, you can't do this. You have to be really careful when reusing page
flags, you can't just take the next one. What made you think it would
be this easy?
I'd suggest using PG_reclaim. You also need to add PG_partially_mapped
to PAGE_FLAGS_SECOND. You might get away without that if you're
guaranteeing it'll always be clear when you free the folio; I don't
understand this series so I don't know if that's true or not.
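(For reference, a sketch of the PAGE_FLAGS_SECOND part; the pre-existing entries shown below are assumed from page-flags.h and may not match this tree exactly:)

#define PAGE_FLAGS_SECOND						\
	(0xffUL /* order */	| 1UL << PG_has_hwpoisoned |		\
	 1UL << PG_large_rmappable	| 1UL << PG_partially_mapped)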
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-16 15:44 ` Matthew Wilcox
@ 2024-08-16 16:08 ` Usama Arif
2024-08-16 16:28 ` Matthew Wilcox
0 siblings, 1 reply; 42+ messages in thread
From: Usama Arif @ 2024-08-16 16:08 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On 16/08/2024 16:44, Matthew Wilcox wrote:
> On Tue, Aug 13, 2024 at 01:02:47PM +0100, Usama Arif wrote:
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index a0a29bd092f8..cecc1bad7910 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -182,6 +182,7 @@ enum pageflags {
>> /* At least one page in this folio has the hwpoison flag set */
>> PG_has_hwpoisoned = PG_active,
>> PG_large_rmappable = PG_workingset, /* anon or file-backed */
>> + PG_partially_mapped, /* was identified to be partially mapped */
>
> No, you can't do this. You have to be really careful when reusing page
> flags, you can't just take the next one. What made you think it would
> be this easy?
>
> I'd suggest using PG_reclaim. You also need to add PG_partially_mapped
> to PAGE_FLAGS_SECOND. You might get away without that if you're
> guaranteeing it'll always be clear when you free the folio; I don't
> understand this series so I don't know if that's true or not.
I am really not sure what the issue is over here.
From what I see, bits 0-7 of folio->_flags_1 are used for storing folio order, bit 8 for PG_has_hwpoisoned and bit 9 for PG_large_rmappable.
Bits 10 and above of folio->_flags_1 are not used anywhere in the kernel. I am not reusing a page flag of folio->_flags_1, just taking an unused one.
Please have a look at the next few lines of the patch. I have defined the functions as FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE). I believe that's what you are saying in your second paragraph?
I am not sure what you meant by using PG_reclaim.
I have added the next few lines of the patch below:
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a0a29bd092f8..cecc1bad7910 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -182,6 +182,7 @@ enum pageflags {
/* At least one page in this folio has the hwpoison flag set */
PG_has_hwpoisoned = PG_active,
PG_large_rmappable = PG_workingset, /* anon or file-backed */
+ PG_partially_mapped, /* was identified to be partially mapped */
};
#define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
@@ -861,8 +862,10 @@ static inline void ClearPageCompound(struct page *page)
ClearPageHead(page);
}
FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
+FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
#else
FOLIO_FLAG_FALSE(large_rmappable)
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-16 16:08 ` Usama Arif
@ 2024-08-16 16:28 ` Matthew Wilcox
2024-08-16 16:41 ` Usama Arif
0 siblings, 1 reply; 42+ messages in thread
From: Matthew Wilcox @ 2024-08-16 16:28 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On Fri, Aug 16, 2024 at 05:08:35PM +0100, Usama Arif wrote:
> On 16/08/2024 16:44, Matthew Wilcox wrote:
> > On Tue, Aug 13, 2024 at 01:02:47PM +0100, Usama Arif wrote:
> >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >> index a0a29bd092f8..cecc1bad7910 100644
> >> --- a/include/linux/page-flags.h
> >> +++ b/include/linux/page-flags.h
> >> @@ -182,6 +182,7 @@ enum pageflags {
> >> /* At least one page in this folio has the hwpoison flag set */
> >> PG_has_hwpoisoned = PG_active,
> >> PG_large_rmappable = PG_workingset, /* anon or file-backed */
> >> + PG_partially_mapped, /* was identified to be partially mapped */
> >
> > No, you can't do this. You have to be really careful when reusing page
> > flags, you can't just take the next one. What made you think it would
> > be this easy?
> >
> > I'd suggest using PG_reclaim. You also need to add PG_partially_mapped
> > to PAGE_FLAGS_SECOND. You might get away without that if you're
> > guaranteeing it'll always be clear when you free the folio; I don't
> > understand this series so I don't know if that's true or not.
>
> I am really not sure what the issue is over here.
You've made the code more fragile. It might happen to work today, but
you've either done something which is subtly broken today, or might
break tomorrow when somebody else rearranges the flags without knowing
about your fragility.
> From what I see, bits 0-7 of folio->_flags_1 are used for storing folio order, bit 8 for PG_has_hwpoisoned and bit 9 for PG_large_rmappable.
> Bits 10 and above of folio->_flags_1 are not used any anywhere in the kernel. I am not reusing a page flag of folio->_flags_1, just taking an unused one.
No, wrong. PG_anon_exclusive is used on every page, including tail
pages, and that's above bit 10.
> Please have a look at the next few lines of the patch. I have defined the functions as FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE). I believe thats what you are saying in your second paragraph?
> I am not sure what you meant by using PG_reclaim.
I mean:
- PG_usama_new_thing,
+ PG_usama_new_thing = PG_reclaim,
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios
2024-08-16 16:28 ` Matthew Wilcox
@ 2024-08-16 16:41 ` Usama Arif
0 siblings, 0 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-16 16:41 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On 16/08/2024 17:28, Matthew Wilcox wrote:
> On Fri, Aug 16, 2024 at 05:08:35PM +0100, Usama Arif wrote:
>> On 16/08/2024 16:44, Matthew Wilcox wrote:
>>> On Tue, Aug 13, 2024 at 01:02:47PM +0100, Usama Arif wrote:
>>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>>> index a0a29bd092f8..cecc1bad7910 100644
>>>> --- a/include/linux/page-flags.h
>>>> +++ b/include/linux/page-flags.h
>>>> @@ -182,6 +182,7 @@ enum pageflags {
>>>> /* At least one page in this folio has the hwpoison flag set */
>>>> PG_has_hwpoisoned = PG_active,
>>>> PG_large_rmappable = PG_workingset, /* anon or file-backed */
>>>> + PG_partially_mapped, /* was identified to be partially mapped */
>>>
>>> No, you can't do this. You have to be really careful when reusing page
>>> flags, you can't just take the next one. What made you think it would
>>> be this easy?
>>>
>>> I'd suggest using PG_reclaim. You also need to add PG_partially_mapped
>>> to PAGE_FLAGS_SECOND. You might get away without that if you're
>>> guaranteeing it'll always be clear when you free the folio; I don't
>>> understand this series so I don't know if that's true or not.
>>
>> I am really not sure what the issue is over here.
>
> You've made the code more fragile. It might happen to work today, but
> you've either done something which is subtly broken today, or might
> break tomorrow when somebody else rearranges the flags without knowing
> about your fragility.
>
>> From what I see, bits 0-7 of folio->_flags_1 are used for storing folio order, bit 8 for PG_has_hwpoisoned and bit 9 for PG_large_rmappable.
>> Bits 10 and above of folio->_flags_1 are not used any anywhere in the kernel. I am not reusing a page flag of folio->_flags_1, just taking an unused one.
>
> No, wrong. PG_anon_exclusive is used on every page, including tail
> pages, and that's above bit 10.
>
>> Please have a look at the next few lines of the patch. I have defined the functions as FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE). I believe thats what you are saying in your second paragraph?
>> I am not sure what you meant by using PG_reclaim.
>
> I mean:
>
> - PG_usama_new_thing,
> + PG_usama_new_thing = PG_reclaim,
>
Ah OK, thanks. The flags below PG_reclaim are either marked as PF_ANY or are arch-dependent, so even though they might not currently be used in _flags_1, they could be in the future.
I will use PG_partially_mapped = PG_reclaim in the next revision.
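On top of the current patch that would presumably be just (sketch only):

-	PG_partially_mapped, /* was identified to be partially mapped */
+	PG_partially_mapped = PG_reclaim, /* was identified to be partially mapped */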
Thanks
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp
2024-08-15 19:16 ` Usama Arif
@ 2024-08-16 16:55 ` Kairui Song
2024-08-16 17:02 ` Usama Arif
0 siblings, 1 reply; 42+ messages in thread
From: Kairui Song @ 2024-08-16 16:55 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, willy,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team,
Shuang Zhai
On Fri, Aug 16, 2024 at 3:16 AM Usama Arif <usamaarif642@gmail.com> wrote:
> On 15/08/2024 19:47, Kairui Song wrote:
> > On Tue, Aug 13, 2024 at 8:03 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >> From: Yu Zhao <yuzhao@google.com>
> >>
> >> If a tail page has only two references left, one inherited from the
> >> isolation of its head and the other from lru_add_page_tail() which we
> >> are about to drop, it means this tail page was concurrently zapped.
> >> Then we can safely free it and save page reclaim or migration the
> >> trouble of trying it.
> >>
> >> Signed-off-by: Yu Zhao <yuzhao@google.com>
> >> Tested-by: Shuang Zhai <zhais@google.com>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >> ---
> >> mm/huge_memory.c | 27 +++++++++++++++++++++++++++
> >> 1 file changed, 27 insertions(+)
> >
> > Hi, Usama, Yu
> >
> > This commit is causing the kernel to panic very quickly with build
> > kernel test on top of tmpfs with all mTHP enabled, the panic comes
> > after:
> >
>
> Hi,
>
> Thanks for pointing this out. It is a very silly bug I have introduced going from v1 page version to the folio version of the patch in v3.
>
> Doing below over this patch will fix it:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 907813102430..a6ca454e1168 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3183,7 +3183,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>
> folio_clear_active(new_folio);
> folio_clear_unevictable(new_folio);
> - if (!folio_batch_add(&free_folios, folio)) {
> + if (!folio_batch_add(&free_folios, new_folio)) {
> mem_cgroup_uncharge_folios(&free_folios);
> free_unref_folios(&free_folios);
> }
>
>
> I will include it in the next revision.
>
Hi,
After the fix, I'm still seeing below panic:
[ 24.926629] list_del corruption. prev->next should be
ffffea000491cf88, but was ffffea0006207708. (prev=ffffea000491cfc8)
[ 24.930783] ------------[ cut here ]------------
[ 24.932519] kernel BUG at lib/list_debug.c:64!
[ 24.934325] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 24.936339] CPU: 32 UID: 0 PID: 2112 Comm: gzip Not tainted
6.11.0-rc3.ptch+ #147
[ 24.938575] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
[ 24.940680] RIP: 0010:__list_del_entry_valid_or_report+0xaa/0xc0
[ 24.942536] Code: 8c ff 0f 0b 48 89 fe 48 c7 c7 f8 9d 51 82 e8 9d
36 8c ff 0f 0b 48 89 d1 48 89 f2 48 89 fe 48 c7 c7 30 9e 51 82 e8 86
36 8c ff <0f> 0b 48 c7 c7 80 9e 51 82 e8 78 36 8c ff 0f 0b 66 0f 1f 44
00 00
[ 24.948418] RSP: 0018:ffffc90005c2b770 EFLAGS: 00010246
[ 24.949996] RAX: 000000000000006d RBX: ffffea000491cf88 RCX: 0000000000000000
[ 24.952293] RDX: 0000000000000000 RSI: ffff889ffee1c180 RDI: ffff889ffee1c180
[ 24.954616] RBP: ffffea000491cf80 R08: 0000000000000000 R09: c0000000ffff7fff
[ 24.956908] R10: 0000000000000001 R11: ffffc90005c2b5a8 R12: ffffc90005c2b954
[ 24.959253] R13: ffffc90005c2bbc0 R14: ffffc90005c2b7c0 R15: ffffc90005c2b940
[ 24.961410] FS: 00007fe5a235e740(0000) GS:ffff889ffee00000(0000)
knlGS:0000000000000000
[ 24.963587] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 24.965112] CR2: 00007fe5a24ddcd0 CR3: 000000010cb40001 CR4: 0000000000770eb0
[ 24.967037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 24.968933] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 24.970802] PKRU: 55555554
[ 24.971559] Call Trace:
[ 24.972241] <TASK>
[ 24.972805] ? __die_body+0x1e/0x60
[ 24.973756] ? die+0x3c/0x60
[ 24.974450] ? do_trap+0xe8/0x110
[ 24.975235] ? __list_del_entry_valid_or_report+0xaa/0xc0
[ 24.976543] ? do_error_trap+0x65/0x80
[ 24.977542] ? __list_del_entry_valid_or_report+0xaa/0xc0
[ 24.978891] ? exc_invalid_op+0x50/0x70
[ 24.979870] ? __list_del_entry_valid_or_report+0xaa/0xc0
[ 24.981295] ? asm_exc_invalid_op+0x1a/0x20
[ 24.982389] ? __list_del_entry_valid_or_report+0xaa/0xc0
[ 24.983781] shrink_folio_list+0x39a/0x1200
[ 24.984898] shrink_inactive_list+0x1c0/0x420
[ 24.986082] shrink_lruvec+0x5db/0x780
[ 24.987078] shrink_node+0x243/0xb00
[ 24.988063] ? get_pfnblock_flags_mask.constprop.117+0x1d/0x50
[ 24.989622] do_try_to_free_pages+0xbd/0x4e0
[ 24.990732] try_to_free_mem_cgroup_pages+0x107/0x230
[ 24.992034] try_charge_memcg+0x184/0x5d0
[ 24.993145] obj_cgroup_charge_pages+0x38/0x110
[ 24.994326] __memcg_kmem_charge_page+0x8d/0xf0
[ 24.995531] __alloc_pages_noprof+0x278/0x360
[ 24.996712] alloc_pages_mpol_noprof+0xf0/0x230
[ 24.997896] pipe_write+0x2ad/0x5f0
[ 24.998837] ? __pfx_tick_nohz_handler+0x10/0x10
[ 25.000234] ? update_process_times+0x8c/0xa0
[ 25.001377] ? timerqueue_add+0x77/0x90
[ 25.002257] vfs_write+0x39b/0x420
[ 25.003083] ksys_write+0xbd/0xd0
[ 25.003950] do_syscall_64+0x47/0x110
[ 25.004917] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 25.006210] RIP: 0033:0x7fe5a246f784
[ 25.007149] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f
84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 08 0e 00 00 74 13 b8 01 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20
48 89
[ 25.011961] RSP: 002b:00007ffdb0057b38 EFLAGS: 00000202 ORIG_RAX:
0000000000000001
[ 25.013946] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe5a246f784
[ 25.015817] RDX: 0000000000008000 RSI: 0000558c0d311420 RDI: 0000000000000001
[ 25.017717] RBP: 00007ffdb0057b60 R08: 0000558c0d258c40 R09: 0000558c0d311420
[ 25.019618] R10: 00007ffdb0057600 R11: 0000000000000202 R12: 0000000000008000
[ 25.021519] R13: 0000558c0d311420 R14: 0000000000000029 R15: 0000000000001f8d
[ 25.023412] </TASK>
[ 25.023998] Modules linked in:
[ 25.024900] ---[ end trace 0000000000000000 ]---
[ 25.026329] RIP: 0010:__list_del_entry_valid_or_report+0xaa/0xc0
[ 25.027885] Code: 8c ff 0f 0b 48 89 fe 48 c7 c7 f8 9d 51 82 e8 9d
36 8c ff 0f 0b 48 89 d1 48 89 f2 48 89 fe 48 c7 c7 30 9e 51 82 e8 86
36 8c ff <0f> 0b 48 c7 c7 80 9e 51 82 e8 78 36 8c ff 0f 0b 66 0f 1f 44
00 00
[ 25.032525] RSP: 0018:ffffc90005c2b770 EFLAGS: 00010246
[ 25.033892] RAX: 000000000000006d RBX: ffffea000491cf88 RCX: 0000000000000000
[ 25.035758] RDX: 0000000000000000 RSI: ffff889ffee1c180 RDI: ffff889ffee1c180
[ 25.037661] RBP: ffffea000491cf80 R08: 0000000000000000 R09: c0000000ffff7fff
[ 25.039543] R10: 0000000000000001 R11: ffffc90005c2b5a8 R12: ffffc90005c2b954
[ 25.041426] R13: ffffc90005c2bbc0 R14: ffffc90005c2b7c0 R15: ffffc90005c2b940
[ 25.043323] FS: 00007fe5a235e740(0000) GS:ffff889ffee00000(0000)
knlGS:0000000000000000
[ 25.045478] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 25.047013] CR2: 00007fe5a24ddcd0 CR3: 000000010cb40001 CR4: 0000000000770eb0
[ 25.048935] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 25.050858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 25.052881] PKRU: 55555554
[ 25.053634] Kernel panic - not syncing: Fatal exception
[ 25.056902] Kernel Offset: disabled
[ 25.057827] ---[ end Kernel panic - not syncing: Fatal exception ]---
If I revert the fix and this patch, the panic is gone. Let me know if
I can help debug it.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp
2024-08-16 16:55 ` Kairui Song
@ 2024-08-16 17:02 ` Usama Arif
2024-08-16 18:11 ` Kairui Song
0 siblings, 1 reply; 42+ messages in thread
From: Usama Arif @ 2024-08-16 17:02 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, willy,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team,
Shuang Zhai
On 16/08/2024 17:55, Kairui Song wrote:
> On Fri, Aug 16, 2024 at 3:16 AM Usama Arif <usamaarif642@gmail.com> wrote:
>> On 15/08/2024 19:47, Kairui Song wrote:
>>> On Tue, Aug 13, 2024 at 8:03 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>> From: Yu Zhao <yuzhao@google.com>
>>>>
>>>> If a tail page has only two references left, one inherited from the
>>>> isolation of its head and the other from lru_add_page_tail() which we
>>>> are about to drop, it means this tail page was concurrently zapped.
>>>> Then we can safely free it and save page reclaim or migration the
>>>> trouble of trying it.
>>>>
>>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>>> Tested-by: Shuang Zhai <zhais@google.com>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>>>> ---
>>>> mm/huge_memory.c | 27 +++++++++++++++++++++++++++
>>>> 1 file changed, 27 insertions(+)
>>>
>>> Hi, Usama, Yu
>>>
>>> This commit is causing the kernel to panic very quickly with build
>>> kernel test on top of tmpfs with all mTHP enabled, the panic comes
>>> after:
>>>
>>
>> Hi,
>>
>> Thanks for pointing this out. It is a very silly bug I have introduced going from v1 page version to the folio version of the patch in v3.
>>
>> Doing below over this patch will fix it:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 907813102430..a6ca454e1168 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3183,7 +3183,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>
>> folio_clear_active(new_folio);
>> folio_clear_unevictable(new_folio);
>> - if (!folio_batch_add(&free_folios, folio)) {
>> + if (!folio_batch_add(&free_folios, new_folio)) {
>> mem_cgroup_uncharge_folios(&free_folios);
>> free_unref_folios(&free_folios);
>> }
>>
>>
>> I will include it in the next revision.
>>
>
> Hi,
>
> After the fix, I'm still seeing below panic:
> [ 24.926629] list_del corruption. prev->next should be
> ffffea000491cf88, but was ffffea0006207708. (prev=ffffea000491cfc8)
> [ 24.930783] ------------[ cut here ]------------
> [ 24.932519] kernel BUG at lib/list_debug.c:64!
> [ 24.934325] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ 24.936339] CPU: 32 UID: 0 PID: 2112 Comm: gzip Not tainted
> 6.11.0-rc3.ptch+ #147
> [ 24.938575] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> [ 24.940680] RIP: 0010:__list_del_entry_valid_or_report+0xaa/0xc0
> [ 24.942536] Code: 8c ff 0f 0b 48 89 fe 48 c7 c7 f8 9d 51 82 e8 9d
> 36 8c ff 0f 0b 48 89 d1 48 89 f2 48 89 fe 48 c7 c7 30 9e 51 82 e8 86
> 36 8c ff <0f> 0b 48 c7 c7 80 9e 51 82 e8 78 36 8c ff 0f 0b 66 0f 1f 44
> 00 00
> [ 24.948418] RSP: 0018:ffffc90005c2b770 EFLAGS: 00010246
> [ 24.949996] RAX: 000000000000006d RBX: ffffea000491cf88 RCX: 0000000000000000
> [ 24.952293] RDX: 0000000000000000 RSI: ffff889ffee1c180 RDI: ffff889ffee1c180
> [ 24.954616] RBP: ffffea000491cf80 R08: 0000000000000000 R09: c0000000ffff7fff
> [ 24.956908] R10: 0000000000000001 R11: ffffc90005c2b5a8 R12: ffffc90005c2b954
> [ 24.959253] R13: ffffc90005c2bbc0 R14: ffffc90005c2b7c0 R15: ffffc90005c2b940
> [ 24.961410] FS: 00007fe5a235e740(0000) GS:ffff889ffee00000(0000)
> knlGS:0000000000000000
> [ 24.963587] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 24.965112] CR2: 00007fe5a24ddcd0 CR3: 000000010cb40001 CR4: 0000000000770eb0
> [ 24.967037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 24.968933] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 24.970802] PKRU: 55555554
> [ 24.971559] Call Trace:
> [ 24.972241] <TASK>
> [ 24.972805] ? __die_body+0x1e/0x60
> [ 24.973756] ? die+0x3c/0x60
> [ 24.974450] ? do_trap+0xe8/0x110
> [ 24.975235] ? __list_del_entry_valid_or_report+0xaa/0xc0
> [ 24.976543] ? do_error_trap+0x65/0x80
> [ 24.977542] ? __list_del_entry_valid_or_report+0xaa/0xc0
> [ 24.978891] ? exc_invalid_op+0x50/0x70
> [ 24.979870] ? __list_del_entry_valid_or_report+0xaa/0xc0
> [ 24.981295] ? asm_exc_invalid_op+0x1a/0x20
> [ 24.982389] ? __list_del_entry_valid_or_report+0xaa/0xc0
> [ 24.983781] shrink_folio_list+0x39a/0x1200
> [ 24.984898] shrink_inactive_list+0x1c0/0x420
> [ 24.986082] shrink_lruvec+0x5db/0x780
> [ 24.987078] shrink_node+0x243/0xb00
> [ 24.988063] ? get_pfnblock_flags_mask.constprop.117+0x1d/0x50
> [ 24.989622] do_try_to_free_pages+0xbd/0x4e0
> [ 24.990732] try_to_free_mem_cgroup_pages+0x107/0x230
> [ 24.992034] try_charge_memcg+0x184/0x5d0
> [ 24.993145] obj_cgroup_charge_pages+0x38/0x110
> [ 24.994326] __memcg_kmem_charge_page+0x8d/0xf0
> [ 24.995531] __alloc_pages_noprof+0x278/0x360
> [ 24.996712] alloc_pages_mpol_noprof+0xf0/0x230
> [ 24.997896] pipe_write+0x2ad/0x5f0
> [ 24.998837] ? __pfx_tick_nohz_handler+0x10/0x10
> [ 25.000234] ? update_process_times+0x8c/0xa0
> [ 25.001377] ? timerqueue_add+0x77/0x90
> [ 25.002257] vfs_write+0x39b/0x420
> [ 25.003083] ksys_write+0xbd/0xd0
> [ 25.003950] do_syscall_64+0x47/0x110
> [ 25.004917] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 25.006210] RIP: 0033:0x7fe5a246f784
> [ 25.007149] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f
> 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 08 0e 00 00 74 13 b8 01 00 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20
> 48 89
> [ 25.011961] RSP: 002b:00007ffdb0057b38 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000001
> [ 25.013946] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe5a246f784
> [ 25.015817] RDX: 0000000000008000 RSI: 0000558c0d311420 RDI: 0000000000000001
> [ 25.017717] RBP: 00007ffdb0057b60 R08: 0000558c0d258c40 R09: 0000558c0d311420
> [ 25.019618] R10: 00007ffdb0057600 R11: 0000000000000202 R12: 0000000000008000
> [ 25.021519] R13: 0000558c0d311420 R14: 0000000000000029 R15: 0000000000001f8d
> [ 25.023412] </TASK>
> [ 25.023998] Modules linked in:
> [ 25.024900] ---[ end trace 0000000000000000 ]---
> [ 25.026329] RIP: 0010:__list_del_entry_valid_or_report+0xaa/0xc0
> [ 25.027885] Code: 8c ff 0f 0b 48 89 fe 48 c7 c7 f8 9d 51 82 e8 9d
> 36 8c ff 0f 0b 48 89 d1 48 89 f2 48 89 fe 48 c7 c7 30 9e 51 82 e8 86
> 36 8c ff <0f> 0b 48 c7 c7 80 9e 51 82 e8 78 36 8c ff 0f 0b 66 0f 1f 44
> 00 00
> [ 25.032525] RSP: 0018:ffffc90005c2b770 EFLAGS: 00010246
> [ 25.033892] RAX: 000000000000006d RBX: ffffea000491cf88 RCX: 0000000000000000
> [ 25.035758] RDX: 0000000000000000 RSI: ffff889ffee1c180 RDI: ffff889ffee1c180
> [ 25.037661] RBP: ffffea000491cf80 R08: 0000000000000000 R09: c0000000ffff7fff
> [ 25.039543] R10: 0000000000000001 R11: ffffc90005c2b5a8 R12: ffffc90005c2b954
> [ 25.041426] R13: ffffc90005c2bbc0 R14: ffffc90005c2b7c0 R15: ffffc90005c2b940
> [ 25.043323] FS: 00007fe5a235e740(0000) GS:ffff889ffee00000(0000)
> knlGS:0000000000000000
> [ 25.045478] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 25.047013] CR2: 00007fe5a24ddcd0 CR3: 000000010cb40001 CR4: 0000000000770eb0
> [ 25.048935] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 25.050858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 25.052881] PKRU: 55555554
> [ 25.053634] Kernel panic - not syncing: Fatal exception
> [ 25.056902] Kernel Offset: disabled
> [ 25.057827] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> If I revert the fix and this patch, the panic is gone, let me know if
> I can help debug it.
Yes, this is also needed to prevent a race with shrink_folio_list():
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a6ca454e1168..75f5b059e804 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3183,6 +3183,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
folio_clear_active(new_folio);
folio_clear_unevictable(new_folio);
+ list_del(&new_folio->lru);
if (!folio_batch_add(&free_folios, new_folio)) {
mem_cgroup_uncharge_folios(&free_folios);
free_unref_folios(&free_folios);
I have tested this, so it should be ok, but let me know otherwise.
I will include this in the next revision, which I will send soon.
Thanks.
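
For reference, with both hunks applied (the folio -> new_folio fix earlier in this
thread plus the list_del() above), the freeing path in __split_huge_page() would
read roughly as follows -- a sketch assembled from the two diffs, not the exact
next revision:

        folio_clear_active(new_folio);
        folio_clear_unevictable(new_folio);
        /* unlink from the local isolation list before the folio is freed */
        list_del(&new_folio->lru);
        if (!folio_batch_add(&free_folios, new_folio)) {
                mem_cgroup_uncharge_folios(&free_folios);
                free_unref_folios(&free_folios);
        }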
* Re: [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp
2024-08-16 17:02 ` Usama Arif
@ 2024-08-16 18:11 ` Kairui Song
0 siblings, 0 replies; 42+ messages in thread
From: Kairui Song @ 2024-08-16 18:11 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, willy,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team,
Shuang Zhai
On Sat, Aug 17, 2024 at 1:03 AM Usama Arif <usamaarif642@gmail.com> wrote:
> On 16/08/2024 17:55, Kairui Song wrote:
> > On Fri, Aug 16, 2024 at 3:16 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >> On 15/08/2024 19:47, Kairui Song wrote:
> >>> On Tue, Aug 13, 2024 at 8:03 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>
> >>>> From: Yu Zhao <yuzhao@google.com>
> >>>>
> >>>> If a tail page has only two references left, one inherited from the
> >>>> isolation of its head and the other from lru_add_page_tail() which we
> >>>> are about to drop, it means this tail page was concurrently zapped.
> >>>> Then we can safely free it and save page reclaim or migration the
> >>>> trouble of trying it.
> >>>>
> >>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
> >>>> Tested-by: Shuang Zhai <zhais@google.com>
> >>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >>>> ---
> >>>> mm/huge_memory.c | 27 +++++++++++++++++++++++++++
> >>>> 1 file changed, 27 insertions(+)
> >>>
> >>> Hi, Usama, Yu
> >>>
> >>> This commit is causing the kernel to panic very quickly with build
> >>> kernel test on top of tmpfs with all mTHP enabled, the panic comes
> >>> after:
> >>>
> >>
> >> Hi,
> >>
> >> Thanks for pointing this out. It is a very silly bug I have introduced going from v1 page version to the folio version of the patch in v3.
> >>
> >> Doing below over this patch will fix it:
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 907813102430..a6ca454e1168 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -3183,7 +3183,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> >>
> >> folio_clear_active(new_folio);
> >> folio_clear_unevictable(new_folio);
> >> - if (!folio_batch_add(&free_folios, folio)) {
> >> + if (!folio_batch_add(&free_folios, new_folio)) {
> >> mem_cgroup_uncharge_folios(&free_folios);
> >> free_unref_folios(&free_folios);
> >> }
> >>
> >>
> >> I will include it in the next revision.
> >>
> >
> > Hi,
> >
> > After the fix, I'm still seeing below panic:
> > [ 24.926629] list_del corruption. prev->next should be
> > ffffea000491cf88, but was ffffea0006207708. (prev=ffffea000491cfc8)
> > [ 24.930783] ------------[ cut here ]------------
> > [ 24.932519] kernel BUG at lib/list_debug.c:64!
> > [ 24.934325] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> > [ 24.936339] CPU: 32 UID: 0 PID: 2112 Comm: gzip Not tainted
> > 6.11.0-rc3.ptch+ #147
> > [ 24.938575] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> > [ 24.940680] RIP: 0010:__list_del_entry_valid_or_report+0xaa/0xc0
> > [ 24.942536] Code: 8c ff 0f 0b 48 89 fe 48 c7 c7 f8 9d 51 82 e8 9d
> > 36 8c ff 0f 0b 48 89 d1 48 89 f2 48 89 fe 48 c7 c7 30 9e 51 82 e8 86
> > 36 8c ff <0f> 0b 48 c7 c7 80 9e 51 82 e8 78 36 8c ff 0f 0b 66 0f 1f 44
> > 00 00
> > [ 24.948418] RSP: 0018:ffffc90005c2b770 EFLAGS: 00010246
> > [ 24.949996] RAX: 000000000000006d RBX: ffffea000491cf88 RCX: 0000000000000000
> > [ 24.952293] RDX: 0000000000000000 RSI: ffff889ffee1c180 RDI: ffff889ffee1c180
> > [ 24.954616] RBP: ffffea000491cf80 R08: 0000000000000000 R09: c0000000ffff7fff
> > [ 24.956908] R10: 0000000000000001 R11: ffffc90005c2b5a8 R12: ffffc90005c2b954
> > [ 24.959253] R13: ffffc90005c2bbc0 R14: ffffc90005c2b7c0 R15: ffffc90005c2b940
> > [ 24.961410] FS: 00007fe5a235e740(0000) GS:ffff889ffee00000(0000)
> > knlGS:0000000000000000
> > [ 24.963587] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 24.965112] CR2: 00007fe5a24ddcd0 CR3: 000000010cb40001 CR4: 0000000000770eb0
> > [ 24.967037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 24.968933] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 24.970802] PKRU: 55555554
> > [ 24.971559] Call Trace:
> > [ 24.972241] <TASK>
> > [ 24.972805] ? __die_body+0x1e/0x60
> > [ 24.973756] ? die+0x3c/0x60
> > [ 24.974450] ? do_trap+0xe8/0x110
> > [ 24.975235] ? __list_del_entry_valid_or_report+0xaa/0xc0
> > [ 24.976543] ? do_error_trap+0x65/0x80
> > [ 24.977542] ? __list_del_entry_valid_or_report+0xaa/0xc0
> > [ 24.978891] ? exc_invalid_op+0x50/0x70
> > [ 24.979870] ? __list_del_entry_valid_or_report+0xaa/0xc0
> > [ 24.981295] ? asm_exc_invalid_op+0x1a/0x20
> > [ 24.982389] ? __list_del_entry_valid_or_report+0xaa/0xc0
> > [ 24.983781] shrink_folio_list+0x39a/0x1200
> > [ 24.984898] shrink_inactive_list+0x1c0/0x420
> > [ 24.986082] shrink_lruvec+0x5db/0x780
> > [ 24.987078] shrink_node+0x243/0xb00
> > [ 24.988063] ? get_pfnblock_flags_mask.constprop.117+0x1d/0x50
> > [ 24.989622] do_try_to_free_pages+0xbd/0x4e0
> > [ 24.990732] try_to_free_mem_cgroup_pages+0x107/0x230
> > [ 24.992034] try_charge_memcg+0x184/0x5d0
> > [ 24.993145] obj_cgroup_charge_pages+0x38/0x110
> > [ 24.994326] __memcg_kmem_charge_page+0x8d/0xf0
> > [ 24.995531] __alloc_pages_noprof+0x278/0x360
> > [ 24.996712] alloc_pages_mpol_noprof+0xf0/0x230
> > [ 24.997896] pipe_write+0x2ad/0x5f0
> > [ 24.998837] ? __pfx_tick_nohz_handler+0x10/0x10
> > [ 25.000234] ? update_process_times+0x8c/0xa0
> > [ 25.001377] ? timerqueue_add+0x77/0x90
> > [ 25.002257] vfs_write+0x39b/0x420
> > [ 25.003083] ksys_write+0xbd/0xd0
> > [ 25.003950] do_syscall_64+0x47/0x110
> > [ 25.004917] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 25.006210] RIP: 0033:0x7fe5a246f784
> > [ 25.007149] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f
> > 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 08 0e 00 00 74 13 b8 01 00 00
> > 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20
> > 48 89
> > [ 25.011961] RSP: 002b:00007ffdb0057b38 EFLAGS: 00000202 ORIG_RAX:
> > 0000000000000001
> > [ 25.013946] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe5a246f784
> > [ 25.015817] RDX: 0000000000008000 RSI: 0000558c0d311420 RDI: 0000000000000001
> > [ 25.017717] RBP: 00007ffdb0057b60 R08: 0000558c0d258c40 R09: 0000558c0d311420
> > [ 25.019618] R10: 00007ffdb0057600 R11: 0000000000000202 R12: 0000000000008000
> > [ 25.021519] R13: 0000558c0d311420 R14: 0000000000000029 R15: 0000000000001f8d
> > [ 25.023412] </TASK>
> > [ 25.023998] Modules linked in:
> > [ 25.024900] ---[ end trace 0000000000000000 ]---
> > [ 25.026329] RIP: 0010:__list_del_entry_valid_or_report+0xaa/0xc0
> > [ 25.027885] Code: 8c ff 0f 0b 48 89 fe 48 c7 c7 f8 9d 51 82 e8 9d
> > 36 8c ff 0f 0b 48 89 d1 48 89 f2 48 89 fe 48 c7 c7 30 9e 51 82 e8 86
> > 36 8c ff <0f> 0b 48 c7 c7 80 9e 51 82 e8 78 36 8c ff 0f 0b 66 0f 1f 44
> > 00 00
> > [ 25.032525] RSP: 0018:ffffc90005c2b770 EFLAGS: 00010246
> > [ 25.033892] RAX: 000000000000006d RBX: ffffea000491cf88 RCX: 0000000000000000
> > [ 25.035758] RDX: 0000000000000000 RSI: ffff889ffee1c180 RDI: ffff889ffee1c180
> > [ 25.037661] RBP: ffffea000491cf80 R08: 0000000000000000 R09: c0000000ffff7fff
> > [ 25.039543] R10: 0000000000000001 R11: ffffc90005c2b5a8 R12: ffffc90005c2b954
> > [ 25.041426] R13: ffffc90005c2bbc0 R14: ffffc90005c2b7c0 R15: ffffc90005c2b940
> > [ 25.043323] FS: 00007fe5a235e740(0000) GS:ffff889ffee00000(0000)
> > knlGS:0000000000000000
> > [ 25.045478] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 25.047013] CR2: 00007fe5a24ddcd0 CR3: 000000010cb40001 CR4: 0000000000770eb0
> > [ 25.048935] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 25.050858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 25.052881] PKRU: 55555554
> > [ 25.053634] Kernel panic - not syncing: Fatal exception
> > [ 25.056902] Kernel Offset: disabled
> > [ 25.057827] ---[ end Kernel panic - not syncing: Fatal exception ]---
> >
> > If I revert the fix and this patch, the panic is gone, let me know if
> > I can help debug it.
>
> Yes, this is also needed to prevent a race with shrink_folio_list():
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a6ca454e1168..75f5b059e804 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3183,6 +3183,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>
> folio_clear_active(new_folio);
> folio_clear_unevictable(new_folio);
> + list_del(&new_folio->lru);
> if (!folio_batch_add(&free_folios, new_folio)) {
> mem_cgroup_uncharge_folios(&free_folios);
> free_unref_folios(&free_folios);
>
>
> I have tested this, so it should be ok, but let me know otherwise.
>
> I will include this in the next revision, which I will send soon.
>
> Thanks.
Thanks for the update, the panic problem is gone.
* Re: [PATCH v3 0/6] mm: split underutilized THPs
2024-08-13 12:02 [PATCH v3 0/6] mm: split underutilized THPs Usama Arif
` (6 preceding siblings ...)
2024-08-13 17:22 ` [PATCH v3 0/6] mm: split " Andi Kleen
@ 2024-08-18 5:13 ` Hugh Dickins
2024-08-18 7:45 ` David Hildenbrand
2024-08-19 2:36 ` Usama Arif
7 siblings, 2 replies; 42+ messages in thread
From: Hugh Dickins @ 2024-08-18 5:13 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, willy, ryncsn, ak,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team
On Tue, 13 Aug 2024, Usama Arif wrote:
> The current upstream default policy for THP is always. However, Meta
> uses madvise in production as the current THP=always policy vastly
> overprovisions THPs in sparsely accessed memory areas, resulting in
> excessive memory pressure and premature OOM killing.
> Using madvise + relying on khugepaged has certain drawbacks over
> THP=always. Using madvise hints mean THPs aren't "transparent" and
> require userspace changes. Waiting for khugepaged to scan memory and
> collapse pages into THP can be slow and unpredictable in terms of performance
> (i.e. you dont know when the collapse will happen), while production
> environments require predictable performance. If there is enough memory
> available, its better for both performance and predictability to have
> a THP from fault time, i.e. THP=always rather than wait for khugepaged
> to collapse it, and deal with sparsely populated THPs when the system is
> running out of memory.
>
> This patch-series is an attempt to mitigate the issue of running out of
> memory when THP is always enabled. During runtime whenever a THP is being
> faulted in or collapsed by khugepaged, the THP is added to a list.
> Whenever memory reclaim happens, the kernel runs the deferred_split
> shrinker which goes through the list and checks if the THP was underutilized,
> i.e. how many of the base 4K pages of the entire THP were zero-filled.
> If this number goes above a certain threshold, the shrinker will attempt
> to split that THP. Then at remap time, the pages that were zero-filled are
> mapped to the shared zeropage, hence saving memory. This method avoids the
> downside of wasting memory in areas where THP is sparsely filled when THP
> is always enabled, while still providing the upside THPs like reduced TLB
> misses without having to use madvise.
>
> Meta production workloads that were CPU bound (>99% CPU utilzation) were
> tested with THP shrinker. The results after 2 hours are as follows:
>
> | THP=madvise | THP=always | THP=always
> | | | + shrinker series
> | | | + max_ptes_none=409
> -----------------------------------------------------------------------------
> Performance improvement | - | +1.8% | +1.7%
> (over THP=madvise) | | |
> -----------------------------------------------------------------------------
> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
> -----------------------------------------------------------------------------
> max_ptes_none=409 means that any THP that has more than 409 out of 512
> (80%) zero filled filled pages will be split.
>
> To test out the patches, the below commands without the shrinker will
> invoke OOM killer immediately and kill stress, but will not fail with
> the shrinker:
>
> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> mkdir /sys/fs/cgroup/test
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> echo 20M > /sys/fs/cgroup/test/memory.max
> echo 0 > /sys/fs/cgroup/test/memory.swap.max
> # allocate twice memory.max for each stress worker and touch 40/512 of
> # each THP, i.e. vm-stride 50K.
> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
> # killer.
> # Without the shrinker, OOM killer is invoked immediately irrespective
> # of max_ptes_none value and kills stress.
> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>
> v2 -> v3:
> - Use my_zero_pfn instead of page_to_pfn(ZERO_PAGE(..)) (Johannes)
> - Use flags argument instead of bools in remove_migration_ptes (Johannes)
> - Use a new flag in folio->_flags_1 instead of folio->_partially_mapped
> (David Hildenbrand).
> - Split out the last patch of v2 into 3, one for introducing the flag,
> one for splitting underutilized THPs on _deferred_list and one for adding
> sysfs entry to disable splitting (David Hildenbrand).
>
> v1 -> v2:
> - Turn page checks and operations to folio versions in __split_huge_page.
> This means patches 1 and 2 from v1 are no longer needed.
> (David Hildenbrand)
> - Map to shared zeropage in all cases if the base page is zero-filled.
> The uffd selftest was removed.
> (David Hildenbrand).
> - rename 'dirty' to 'contains_data' in try_to_map_unused_to_zeropage
> (Rik van Riel).
> - Use unsigned long instead of uint64_t (kernel test robot).
>
> Alexander Zhu (1):
> mm: selftest to verify zero-filled pages are mapped to zeropage
>
> Usama Arif (3):
> mm: Introduce a pageflag for partially mapped folios
> mm: split underutilized THPs
> mm: add sysfs entry to disable splitting underutilized THPs
>
> Yu Zhao (2):
> mm: free zapped tail pages when splitting isolated thp
> mm: remap unused subpages to shared zeropage when splitting isolated
> thp
Sorry, I don't have time to review this, but notice you're intending
a v4 of the series, so want to bring up four points quickly before that.
1. Even with the two fixes to 1/6 in __split_huge_page(), under load
this series causes system lockup, with interrupts disabled on most CPUs.
The error is in deferred_split_scan(), where the old code just did
a list_splice_tail() under split_queue_lock, but this series ends up
doing more there, including a folio_put(): deadlock when racing, and
that is the final folio_put() which brings refcount down to 0, which
then wants to take split_queue_lock.
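
Schematically, the recursion described above is (intermediate steps paraphrased
from this explanation, not taken from an actual lockdep report):

        deferred_split_scan()
          spin_lock_irqsave(&ds_queue->split_queue_lock, flags)
            ...
            folio_put(folio)            /* drops the final reference */
              -> large-folio freeing path, which unqueues the folio from the
                 deferred split list and therefore wants to take
                 ds_queue->split_queue_lock again -> self-deadlock, with
                 interrupts still disabled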
The patch I've been using successfully on 6.11-rc3-next-20240816 below:
I do have other problems with current mm commits, so have not been able
to sustain a load for very long, but suspect those problems unrelated
to this series. Please fold this fix, or your own equivalent, into
your next version.
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3270,7 +3270,8 @@ static void __split_huge_page(struct pag
folio_clear_active(new_folio);
folio_clear_unevictable(new_folio);
- if (!folio_batch_add(&free_folios, folio)) {
+ list_del(&new_folio->lru);
+ if (!folio_batch_add(&free_folios, new_folio)) {
mem_cgroup_uncharge_folios(&free_folios);
free_unref_folios(&free_folios);
}
@@ -3706,42 +3707,37 @@ static unsigned long deferred_split_scan
bool did_split = false;
bool underutilized = false;
- if (folio_test_partially_mapped(folio))
- goto split;
- underutilized = thp_underutilized(folio);
- if (underutilized)
- goto split;
- continue;
-split:
+ if (!folio_test_partially_mapped(folio)) {
+ underutilized = thp_underutilized(folio);
+ if (!underutilized)
+ goto next;
+ }
if (!folio_trylock(folio))
- continue;
- did_split = !split_folio(folio);
- folio_unlock(folio);
- if (did_split) {
- /* Splitting removed folio from the list, drop reference here */
- folio_put(folio);
+ goto next;
+ if (!split_folio(folio)) {
+ did_split = true;
if (underutilized)
count_vm_event(THP_UNDERUTILIZED_SPLIT_PAGE);
split++;
}
- }
-
- spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
- /*
- * Only add back to the queue if folio is partially mapped.
- * If thp_underutilized returns false, or if split_folio fails in
- * the case it was underutilized, then consider it used and don't
- * add it back to split_queue.
- */
- list_for_each_entry_safe(folio, next, &list, _deferred_list) {
- if (folio_test_partially_mapped(folio))
- list_move(&folio->_deferred_list, &ds_queue->split_queue);
- else {
+ folio_unlock(folio);
+next:
+ /*
+ * split_folio() removes folio from list on success.
+ * Only add back to the queue if folio is partially mapped.
+ * If thp_underutilized returns false, or if split_folio fails
+ * in the case it was underutilized, then consider it used and
+ * don't add it back to split_queue.
+ */
+ if (!did_split && !folio_test_partially_mapped(folio)) {
list_del_init(&folio->_deferred_list);
ds_queue->split_queue_len--;
}
folio_put(folio);
}
+
+ spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+ list_splice_tail(&list, &ds_queue->split_queue);
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
/*
2. I don't understand why there needs to be a new PG_partially_mapped
flag, with all its attendant sets and tests and clears all over. Why
can't deferred_split_scan() detect that case for itself, using the
criteria from __folio_remove_rmap()? I see folio->_nr_pages_mapped
is commented "Do not use outside of rmap and debug code", and
folio_nr_pages_mapped() is currently only used from mm/debug.c; but
using the info already maintained is preferable to adding a PG_flag
(and perhaps more efficient - skips splitting when _nr_pages_mapped
already fell to 0 and folio will soon be freed).
3. Everything in /sys/kernel/mm/transparent_hugepage/ is about THPs,
so please remove the "thp_" from "thp_low_util_shrinker" -
"shrink_underused" perhaps. And it needs a brief description in
Documentation/admin-guide/mm/transhuge.rst.
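
If renamed along those lines, toggling the shrinker would look something like the
below (the knob name is the suggestion above rather than a settled interface, and
the 0/1 semantics are assumed from the "disable splitting" purpose of patch 6/6):

        # disable splitting of underused THPs
        echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
        # re-enable it
        echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused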
4. Wouldn't "underused" be better than "underutilized" throughout?
Hugh
* Re: [PATCH v3 0/6] mm: split underutilized THPs
2024-08-18 5:13 ` Hugh Dickins
@ 2024-08-18 7:45 ` David Hildenbrand
2024-08-19 2:38 ` Usama Arif
2024-08-19 2:36 ` Usama Arif
1 sibling, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2024-08-18 7:45 UTC (permalink / raw)
To: Hugh Dickins, Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, baohua, ryan.roberts, rppt, willy, ryncsn, ak,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team
Hi Hugh,
> 2. I don't understand why there needs to be a new PG_partially_mapped
> flag, with all its attendant sets and tests and clears all over. Why
> can't deferred_split_scan() detect that case for itself, using the
> criteria from __folio_remove_rmap()? I see folio->_nr_pages_mapped
> is commented "Do not use outside of rmap and debug code", and
> folio_nr_pages_mapped() is currently only used from mm/debug.c; but
> using the info already maintained is preferable to adding a PG_flag
> (and perhaps more efficient - skips splitting when _nr_pages_mapped
> already fell to 0 and folio will soon be freed).
No new users of _nr_pages_mapped if easily/cleanly avoidable, please.
I'm currently cleaning up the final patches that introduce a new kernel
config where we will stop maintaining the page->_mapcount for large
folios (and consequently have to stop maintaining folio->_nr_pages_mapped).
That's the main reason for the comment -- at one point in my life I
want to be done with that project ;) .
folio->_nr_pages_mapped will still exist and be maintained without the
new kernel config enabled. But in the new one, once we detect a
partial mapping we'll have to flag the folio -- for example as done in
this series.
Having two ways of handling that, depending on the kernel config, will
not make the code any better.
But I agree that we should look into minimizing the usage of any new
such flag: I'd have thought we only have to set the flag once, once we
detect a partial mapping ... still have to review that patch more
thoroughly.
>
> 3. Everything in /sys/kernel/mm/transparent_hugepage/ is about THPs,
> so please remove the "thp_" from "thp_low_util_shrinker" -
> "shrink_underused" perhaps. And it needs a brief description in
> Documentation/admin-guide/mm/transhuge.rst.
agreed.
--
Cheers,
David / dhildenb
* Re: [PATCH v3 0/6] mm: split underutilized THPs
2024-08-18 5:13 ` Hugh Dickins
2024-08-18 7:45 ` David Hildenbrand
@ 2024-08-19 2:36 ` Usama Arif
1 sibling, 0 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-19 2:36 UTC (permalink / raw)
To: Hugh Dickins
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, david, baohua, ryan.roberts, rppt, willy, ryncsn, ak,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team
On 18/08/2024 06:13, Hugh Dickins wrote:
> On Tue, 13 Aug 2024, Usama Arif wrote:
>
>> The current upstream default policy for THP is always. However, Meta
>> uses madvise in production as the current THP=always policy vastly
>> overprovisions THPs in sparsely accessed memory areas, resulting in
>> excessive memory pressure and premature OOM killing.
>> Using madvise + relying on khugepaged has certain drawbacks over
>> THP=always. Using madvise hints mean THPs aren't "transparent" and
>> require userspace changes. Waiting for khugepaged to scan memory and
>> collapse pages into THP can be slow and unpredictable in terms of performance
>> (i.e. you dont know when the collapse will happen), while production
>> environments require predictable performance. If there is enough memory
>> available, its better for both performance and predictability to have
>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>> to collapse it, and deal with sparsely populated THPs when the system is
>> running out of memory.
>>
>> This patch-series is an attempt to mitigate the issue of running out of
>> memory when THP is always enabled. During runtime whenever a THP is being
>> faulted in or collapsed by khugepaged, the THP is added to a list.
>> Whenever memory reclaim happens, the kernel runs the deferred_split
>> shrinker which goes through the list and checks if the THP was underutilized,
>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>> If this number goes above a certain threshold, the shrinker will attempt
>> to split that THP. Then at remap time, the pages that were zero-filled are
>> mapped to the shared zeropage, hence saving memory. This method avoids the
>> downside of wasting memory in areas where THP is sparsely filled when THP
>> is always enabled, while still providing the upside THPs like reduced TLB
>> misses without having to use madvise.
>>
>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
>> tested with THP shrinker. The results after 2 hours are as follows:
>>
>> | THP=madvise | THP=always | THP=always
>> | | | + shrinker series
>> | | | + max_ptes_none=409
>> -----------------------------------------------------------------------------
>> Performance improvement | - | +1.8% | +1.7%
>> (over THP=madvise) | | |
>> -----------------------------------------------------------------------------
>> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
>> -----------------------------------------------------------------------------
>> max_ptes_none=409 means that any THP that has more than 409 out of 512
>> (80%) zero filled filled pages will be split.
>>
>> To test out the patches, the below commands without the shrinker will
>> invoke OOM killer immediately and kill stress, but will not fail with
>> the shrinker:
>>
>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>> mkdir /sys/fs/cgroup/test
>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>> echo 20M > /sys/fs/cgroup/test/memory.max
>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>> # allocate twice memory.max for each stress worker and touch 40/512 of
>> # each THP, i.e. vm-stride 50K.
>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
>> # killer.
>> # Without the shrinker, OOM killer is invoked immediately irrespective
>> # of max_ptes_none value and kills stress.
>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>
>> v2 -> v3:
>> - Use my_zero_pfn instead of page_to_pfn(ZERO_PAGE(..)) (Johannes)
>> - Use flags argument instead of bools in remove_migration_ptes (Johannes)
>> - Use a new flag in folio->_flags_1 instead of folio->_partially_mapped
>> (David Hildenbrand).
>> - Split out the last patch of v2 into 3, one for introducing the flag,
>> one for splitting underutilized THPs on _deferred_list and one for adding
>> sysfs entry to disable splitting (David Hildenbrand).
>>
>> v1 -> v2:
>> - Turn page checks and operations to folio versions in __split_huge_page.
>> This means patches 1 and 2 from v1 are no longer needed.
>> (David Hildenbrand)
>> - Map to shared zeropage in all cases if the base page is zero-filled.
>> The uffd selftest was removed.
>> (David Hildenbrand).
>> - rename 'dirty' to 'contains_data' in try_to_map_unused_to_zeropage
>> (Rik van Riel).
>> - Use unsigned long instead of uint64_t (kernel test robot).
>>
>> Alexander Zhu (1):
>> mm: selftest to verify zero-filled pages are mapped to zeropage
>>
>> Usama Arif (3):
>> mm: Introduce a pageflag for partially mapped folios
>> mm: split underutilized THPs
>> mm: add sysfs entry to disable splitting underutilized THPs
>>
>> Yu Zhao (2):
>> mm: free zapped tail pages when splitting isolated thp
>> mm: remap unused subpages to shared zeropage when splitting isolated
>> thp
>
> Sorry, I don't have time to review this, but notice you're intending
> a v4 of the series, so want to bring up four points quickly before that.
>
> 1. Even with the two fixes to 1/6 in __split_huge_page(), under load
> this series causes system lockup, with interrupts disabled on most CPUs.
>
> The error is in deferred_split_scan(), where the old code just did
> a list_splice_tail() under split_queue_lock, but this series ends up
> doing more there, including a folio_put(): deadlock when racing, and
> that is the final folio_put() which brings refcount down to 0, which
> then wants to take split_queue_lock.
>
> The patch I've been using successfully on 6.11-rc3-next-20240816 below:
> I do have other problems with current mm commits, so have not been able
> to sustain a load for very long, but suspect those problems unrelated
> to this series. Please fold this fix, or your own equivalent, into
> your next version.
>
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3270,7 +3270,8 @@ static void __split_huge_page(struct pag
>
> folio_clear_active(new_folio);
> folio_clear_unevictable(new_folio);
> - if (!folio_batch_add(&free_folios, folio)) {
> + list_del(&new_folio->lru);
> + if (!folio_batch_add(&free_folios, new_folio)) {
> mem_cgroup_uncharge_folios(&free_folios);
> free_unref_folios(&free_folios);
> }
> @@ -3706,42 +3707,37 @@ static unsigned long deferred_split_scan
> bool did_split = false;
> bool underutilized = false;
>
> - if (folio_test_partially_mapped(folio))
> - goto split;
> - underutilized = thp_underutilized(folio);
> - if (underutilized)
> - goto split;
> - continue;
> -split:
> + if (!folio_test_partially_mapped(folio)) {
> + underutilized = thp_underutilized(folio);
> + if (!underutilized)
> + goto next;
> + }
> if (!folio_trylock(folio))
> - continue;
> - did_split = !split_folio(folio);
> - folio_unlock(folio);
> - if (did_split) {
> - /* Splitting removed folio from the list, drop reference here */
> - folio_put(folio);
> + goto next;
> + if (!split_folio(folio)) {
> + did_split = true;
> if (underutilized)
> count_vm_event(THP_UNDERUTILIZED_SPLIT_PAGE);
> split++;
> }
> - }
> -
> - spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> - /*
> - * Only add back to the queue if folio is partially mapped.
> - * If thp_underutilized returns false, or if split_folio fails in
> - * the case it was underutilized, then consider it used and don't
> - * add it back to split_queue.
> - */
> - list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> - if (folio_test_partially_mapped(folio))
> - list_move(&folio->_deferred_list, &ds_queue->split_queue);
> - else {
> + folio_unlock(folio);
> +next:
> + /*
> + * split_folio() removes folio from list on success.
> + * Only add back to the queue if folio is partially mapped.
> + * If thp_underutilized returns false, or if split_folio fails
> + * in the case it was underutilized, then consider it used and
> + * don't add it back to split_queue.
> + */
> + if (!did_split && !folio_test_partially_mapped(folio)) {
> list_del_init(&folio->_deferred_list);
> ds_queue->split_queue_len--;
> }
> folio_put(folio);
> }
> +
> + spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> + list_splice_tail(&list, &ds_queue->split_queue);
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>
> /*
>
> 2. I don't understand why there needs to be a new PG_partially_mapped
> flag, with all its attendant sets and tests and clears all over. Why
> can't deferred_split_scan() detect that case for itself, using the
> criteria from __folio_remove_rmap()? I see folio->_nr_pages_mapped
> is commented "Do not use outside of rmap and debug code", and
> folio_nr_pages_mapped() is currently only used from mm/debug.c; but
> using the info already maintained is preferable to adding a PG_flag
> (and perhaps more efficient - skips splitting when _nr_pages_mapped
> already fell to 0 and folio will soon be freed).
>
> 3. Everything in /sys/kernel/mm/transparent_hugepage/ is about THPs,
> so please remove the "thp_" from "thp_low_util_shrinker" -
> "shrink_underused" perhaps. And it needs a brief description in
> Documentation/admin-guide/mm/transhuge.rst.
>
> 4. Wouldn't "underused" be better than "underutilized" throughout?
>
> Hugh
Thanks for the review, especially for pointing out the deadlock. I have addressed points 1, 3 and 4 in v4; point 2 was addressed by David.
* Re: [PATCH v3 0/6] mm: split underutilized THPs
2024-08-18 7:45 ` David Hildenbrand
@ 2024-08-19 2:38 ` Usama Arif
0 siblings, 0 replies; 42+ messages in thread
From: Usama Arif @ 2024-08-19 2:38 UTC (permalink / raw)
To: David Hildenbrand, Hugh Dickins
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
yuzhao, baohua, ryan.roberts, rppt, willy, ryncsn, ak,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team
On 18/08/2024 08:45, David Hildenbrand wrote:
> Hi Hugh,
>
>> 2. I don't understand why there needs to be a new PG_partially_mapped
>> flag, with all its attendant sets and tests and clears all over. Why
>> can't deferred_split_scan() detect that case for itself, using the
>> criteria from __folio_remove_rmap()? I see folio->_nr_pages_mapped
>> is commented "Do not use outside of rmap and debug code", and
>> folio_nr_pages_mapped() is currently only used from mm/debug.c; but
>> using the info already maintained is preferable to adding a PG_flag
>> (and perhaps more efficient - skips splitting when _nr_pages_mapped
>> already fell to 0 and folio will soon be freed).
>
> No new users of _nr_pages_mapped if easily/cleanly avoidable, please.
>
> I'm currently cleaning up the final patches that introduce a new kernel config where we will stop maintaining the page->_mapcount for large folios (and consequently have to stop maintaining folio->_nr_pages_mapped).
>
> That's the main reason for the comment -- at one point in my life I want to be done with that project ;) .
>
> folio->_nr_pages_mapped will still exist and be maintained without the new kernel config enabled. But in the new one, once we detect a partial mapping we'll have to flag the folio -- for example as done in this series.
>
> Having two ways of handling that, depending on the kernel config, will not make the code any better.
>
> But I agree that we should look into minimizing the usage of any new such flag: I'd have thought we only have to set the flag once, once we detect a partial mapping ... still have to review that patch more thoroughly.
Yes, the flag is set only once, in deferred_split_folio(), once we detect a partial mapping.
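
For context, a minimal sketch of where that happens (the pageflag helper follows
the naming from patch 4/6; the exact deferred_split_folio() signature and
surrounding code in the series may differ):

        void deferred_split_folio(struct folio *folio, bool partially_mapped)
        {
                ...
                if (partially_mapped)
                        /* set exactly once, when a partial mapping is first seen */
                        folio_set_partially_mapped(folio);
                /* folio is (re)queued on ds_queue->split_queue as before */
                ...
        }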
End of thread.
Thread overview: 42+ messages:
2024-08-13 12:02 [PATCH v3 0/6] mm: split underutilized THPs Usama Arif
2024-08-13 12:02 ` [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
2024-08-15 18:47 ` Kairui Song
2024-08-15 19:16 ` Usama Arif
2024-08-16 16:55 ` Kairui Song
2024-08-16 17:02 ` Usama Arif
2024-08-16 18:11 ` Kairui Song
2024-08-13 12:02 ` [PATCH v3 2/6] mm: remap unused subpages to shared zeropage " Usama Arif
2024-08-13 12:02 ` [PATCH v3 3/6] mm: selftest to verify zero-filled pages are mapped to zeropage Usama Arif
2024-08-13 12:02 ` [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
2024-08-14 3:30 ` Yu Zhao
2024-08-14 10:20 ` Usama Arif
2024-08-14 10:44 ` Barry Song
2024-08-14 10:52 ` Barry Song
2024-08-14 11:11 ` Usama Arif
2024-08-14 11:20 ` Barry Song
2024-08-14 11:26 ` Barry Song
2024-08-14 11:30 ` Usama Arif
2024-08-14 11:10 ` Barry Song
2024-08-14 11:20 ` Usama Arif
2024-08-14 11:23 ` Barry Song
2024-08-14 12:36 ` Usama Arif
2024-08-14 23:05 ` Barry Song
2024-08-15 15:25 ` Usama Arif
2024-08-15 23:30 ` Andrew Morton
2024-08-16 2:50 ` Yu Zhao
2024-08-15 16:33 ` David Hildenbrand
2024-08-15 17:10 ` Usama Arif
2024-08-15 21:06 ` Barry Song
2024-08-15 21:08 ` David Hildenbrand
2024-08-16 15:44 ` Matthew Wilcox
2024-08-16 16:08 ` Usama Arif
2024-08-16 16:28 ` Matthew Wilcox
2024-08-16 16:41 ` Usama Arif
2024-08-13 12:02 ` [PATCH v3 5/6] mm: split underutilized THPs Usama Arif
2024-08-13 12:02 ` [PATCH v3 6/6] mm: add sysfs entry to disable splitting " Usama Arif
2024-08-13 17:22 ` [PATCH v3 0/6] mm: split " Andi Kleen
2024-08-14 10:13 ` Usama Arif
2024-08-18 5:13 ` Hugh Dickins
2024-08-18 7:45 ` David Hildenbrand
2024-08-19 2:38 ` Usama Arif
2024-08-19 2:36 ` Usama Arif