From: Zach O'Keefe @ 2022-08-26 22:03 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
Zach O'Keefe
Subject: [PATCH mm-unstable v2 0/9] mm: add file/shmem support to MADV_COLLAPSE
v2 Foreword
Mostly a RESEND: rebase on latest mm-unstable + minor bug fixes from
kernel test robot.
--------------------------------
This series builds on the previous "mm: userspace hugepage collapse"
series, which introduced the MADV_COLLAPSE madvise mode and added support
for private, anonymous mappings[1], by adding support for file- and
shmem-backed memory to CONFIG_READ_ONLY_THP_FOR_FS=y kernels.
File and shmem support has been added with an effort to align with
existing MADV_COLLAPSE semantics and policy decisions[2]. Collapse of
shmem-backed memory ignores kernel-guiding directives and heuristics,
including all sysfs settings (transparent_hugepage/shmem_enabled) and
tmpfs huge= mount options (shmem always supports large folios). As with
anonymous mappings, on successful return of MADV_COLLAPSE on file/shmem
memory, the memory backing the addresses provided will be synchronously
pmd-mapped THPs.
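
Purely as an illustration (not part of this series), a minimal userspace
sketch of these semantics might look as follows. It assumes the
MADV_COLLAPSE uapi value proposed by the series (25), a 2M PMD-sized
hugepage, and shmem memory obtained via memfd_create(2):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* assumed value from this series' uapi */
	#endif

	int main(void)
	{
		size_t len = 2UL << 20;	/* one PMD-sized hugepage on x86_64 */
		int fd = memfd_create("collapse-demo", 0);	/* shmem-backed */
		char *p;

		if (fd < 0)
			return 1;
		ftruncate(fd, len);
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		memset(p, 1, len);	/* fault in small pages first */

		/*
		 * Synchronously back the range with a pmd-mapped THP; sysfs
		 * shmem_enabled and tmpfs huge= settings are ignored, as
		 * described above.
		 */
		if (madvise(p, len, MADV_COLLAPSE))
			perror("madvise(MADV_COLLAPSE)");
		return 0;
	}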
This functionality unlocks two important uses:
(1) Immediately back executable text by THPs. Current support provided
    by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
    system, which can prevent services from serving at their full rated
    load after (re)starting. Tricks like mremap(2)'ing text onto
    anonymous memory to immediately realize iTLB performance prevent
    page sharing and demand paging, both of which increase steady-state
    memory footprint. Now, we can have the best of both worlds: peak
    upfront performance and lower RAM footprints.
(2) userfaultfd-based live migration of virtual machines satisfies UFFD
    faults by fetching native-sized pages over the network (to avoid the
    latency of transferring an entire hugepage). However, after guest
    memory has been fully copied to the new host, MADV_COLLAPSE can be
    used to immediately increase guest performance; a sketch of this
    post-copy step follows below.
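
As another hedged sketch (again, not from the series itself), the
post-copy step of use case (2) might walk guest memory one hugepage at a
time once userfaultfd registration has been torn down, leaving any failed
collapse for khugepaged to retry later. HPAGE_SIZE and the region
bookkeeping are assumptions for illustration:

	#include <stdio.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* assumed value from this series' uapi */
	#endif

	#define HPAGE_SIZE (2UL << 20)	/* assumes 2M PMD-sized hugepages */

	/* Call after all guest pages are received and UFFD is unregistered */
	static void collapse_guest_memory(char *base, size_t size)
	{
		size_t off;

		for (off = 0; off + HPAGE_SIZE <= size; off += HPAGE_SIZE) {
			/* Best effort: khugepaged can pick up any failures */
			if (madvise(base + off, HPAGE_SIZE, MADV_COLLAPSE))
				fprintf(stderr, "collapse at +%#zx skipped\n",
					off);
		}
	}

A single madvise() call over the whole range would work too; the
per-hugepage loop just makes partial failures easier to observe and
retry.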
khugepaged has received a small improvement by association and can now
detect and collapse pte-mapped THPs. However, there is still work to be
done along the file collapse path. Compound pages of arbitrary order still
need to be supported, and THP collapse needs to be converted to use
folios in general. Eventually, we'd like to move away from the read-only
and executable-mapped constraints currently imposed on eligible files and
support any inode claiming huge folio support. That said, I think the
series as-is covers enough to claim that MADV_COLLAPSE supports file/shmem
memory.
Patches 1-3	Implement the guts of the series.
Patch 4		Is a tracepoint for debugging.
Patches 5-8	Refactor existing khugepaged selftests to work with new
		memory types.
Patch 9		Adds a userfaultfd selftest mode to mimic a functional test
		of UFFDIO_REGISTER_MODE_MINOR+MADV_COLLAPSE live migration.
Applies against mm-unstable.
[1] https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/
[2] https://lore.kernel.org/linux-mm/YtBmhaiPHUTkJml8@google.com/
v1 -> v2:
- Add missing definition for khugepaged_add_pte_mapped_thp() in
  !CONFIG_SHMEM builds, in "mm/khugepaged: attempt to map
  file/shmem-backed pte-mapped THPs by pmds"
- Minor bugfixes in "mm/madvise: add file and shmem support to
MADV_COLLAPSE" for !CONFIG_SHMEM, !CONFIG_TRANSPARENT_HUGEPAGE and some
compiler settings.
- Rebased on latest mm-unstable
Zach O'Keefe (9):
mm/shmem: add flag to enforce shmem THP in hugepage_vma_check()
mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by
pmds
mm/madvise: add file and shmem support to MADV_COLLAPSE
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
selftests/vm: dedup THP helpers
selftests/vm: modularize thp collapse memory operations
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: add thp collapse shmem testing
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
include/linux/khugepaged.h | 13 +-
include/linux/shmem_fs.h | 10 +-
include/trace/events/huge_memory.h | 36 +
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 289 ++++--
mm/shmem.c | 18 +-
tools/testing/selftests/vm/Makefile | 2 +
tools/testing/selftests/vm/khugepaged.c | 828 ++++++++++++------
tools/testing/selftests/vm/soft-dirty.c | 2 +-
.../selftests/vm/split_huge_page_test.c | 12 +-
tools/testing/selftests/vm/userfaultfd.c | 171 +++-
tools/testing/selftests/vm/vm_util.c | 36 +-
tools/testing/selftests/vm/vm_util.h | 5 +-
14 files changed, 1040 insertions(+), 386 deletions(-)
--
2.37.2.672.g94769d06f0-goog
^ permalink raw reply [flat|nested] 35+ messages in thread* [PATCH mm-unstable v2 1/9] mm/shmem: add flag to enforce shmem THP in hugepage_vma_check() 2022-08-26 22:03 Zach O'Keefe @ 2022-08-26 22:03 ` Zach O'Keefe 2022-08-26 22:03 ` [PATCH mm-unstable v2 2/9] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds Zach O'Keefe ` (8 subsequent siblings) 9 siblings, 0 replies; 35+ messages in thread From: Zach O'Keefe @ 2022-08-26 22:03 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton, Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand, David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia, Zach O'Keefe Extend 'mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()' to shmem, allowing callers to ignore /sys/kernel/transparent_hugepage/shmem_enabled and tmpfs huge= mount. This is intended to be used by MADV_COLLAPSE, and the rationale is analogous to the anon/file case: MADV_COLLAPSE is not coupled to directives that advise the kernel's decisions on when THPs should be considered eligible. shmem/tmpfs always claims large folio support, regardless of sysfs or mount options. Signed-off-by: Zach O'Keefe <zokeefe@google.com> --- include/linux/shmem_fs.h | 10 ++++++---- mm/huge_memory.c | 2 +- mm/shmem.c | 18 +++++++++--------- 3 files changed, 16 insertions(+), 14 deletions(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index ff0b990de83d..f5e9b01dbf4c 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -92,11 +92,13 @@ extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping, extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end); int shmem_unuse(unsigned int type); -extern bool shmem_is_huge(struct vm_area_struct *vma, - struct inode *inode, pgoff_t index); -static inline bool shmem_huge_enabled(struct vm_area_struct *vma) +extern bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode, + pgoff_t index, bool shmem_huge_force); +static inline bool shmem_huge_enabled(struct vm_area_struct *vma, + bool shmem_huge_force) { - return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff); + return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff, + shmem_huge_force); } extern unsigned long shmem_swap_usage(struct vm_area_struct *vma); extern unsigned long shmem_partial_swap_usage(struct address_space *mapping, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 88d98241a635..b3acc8e3046d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -119,7 +119,7 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, * own flags. 
*/ if (!in_pf && shmem_file(vma->vm_file)) - return shmem_huge_enabled(vma); + return shmem_huge_enabled(vma, !enforce_sysfs); /* Enforce sysfs THP requirements as necessary */ if (enforce_sysfs && diff --git a/mm/shmem.c b/mm/shmem.c index 42e5888bf84d..b9bab1abf142 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -472,20 +472,20 @@ static bool shmem_confirm_swap(struct address_space *mapping, static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER; -bool shmem_is_huge(struct vm_area_struct *vma, - struct inode *inode, pgoff_t index) +bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode, + pgoff_t index, bool shmem_huge_force) { loff_t i_size; if (!S_ISREG(inode->i_mode)) return false; - if (shmem_huge == SHMEM_HUGE_DENY) - return false; if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) || test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))) return false; - if (shmem_huge == SHMEM_HUGE_FORCE) + if (shmem_huge == SHMEM_HUGE_FORCE || shmem_huge_force) return true; + if (shmem_huge == SHMEM_HUGE_DENY) + return false; switch (SHMEM_SB(inode->i_sb)->huge) { case SHMEM_HUGE_ALWAYS: @@ -680,8 +680,8 @@ static long shmem_unused_huge_count(struct super_block *sb, #define shmem_huge SHMEM_HUGE_DENY -bool shmem_is_huge(struct vm_area_struct *vma, - struct inode *inode, pgoff_t index) +bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode, + pgoff_t index, bool shmem_huge_force) { return false; } @@ -1069,7 +1069,7 @@ static int shmem_getattr(struct user_namespace *mnt_userns, STATX_ATTR_NODUMP); generic_fillattr(&init_user_ns, inode, stat); - if (shmem_is_huge(NULL, inode, 0)) + if (shmem_is_huge(NULL, inode, 0, false)) stat->blksize = HPAGE_PMD_SIZE; if (request_mask & STATX_BTIME) { @@ -1910,7 +1910,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, return 0; } - if (!shmem_is_huge(vma, inode, index)) + if (!shmem_is_huge(vma, inode, index, false)) goto alloc_nohuge; huge_gfp = vma_thp_gfp_mask(vma); -- 2.37.2.672.g94769d06f0-goog ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH mm-unstable v2 2/9] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds
From: Zach O'Keefe @ 2022-08-26 22:03 UTC (permalink / raw)
To: linux-mm

The main benefit of THPs is that they can be mapped at the pmd level,
increasing the likelihood of TLB hits and spending fewer cycles in page
table walks. pte-mapped hugepages - that is, hugepage-aligned compound
pages of order HPAGE_PMD_ORDER - although contiguous in physical memory,
don't have this advantage. In fact, one could argue they are detrimental
to system performance overall since they occupy a precious
hugepage-aligned/sized region of physical memory that could otherwise be
used more effectively.

Additionally, pte-mapped hugepages can be the cheapest memory to collapse
for khugepaged since no new hugepage allocation or copying of memory
contents is necessary - we only need to update the mapping page tables.

In the anonymous collapse path, we are able to collapse pte-mapped
hugepages (albeit, perhaps, suboptimally), but the file/shmem path makes
no effort when compound pages (of any order) are encountered.

Identify pte-mapped hugepages in the file/shmem collapse path. In
khugepaged context, attempt to update the page tables mapping this
hugepage.

Note that these collapses still count towards the
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter,
and if the pte-mapped hugepage was also mapped into multiple processes'
address spaces, the counter could be incremented for each page table
update. Since we increment the counter when a pte-mapped hugepage is
successfully added to the list of to-collapse pte-mapped THPs, it's
possible that we never actually update the page table either. This is
different from how file/shmem pages_collapsed accounting works today,
where only a successful page cache update is counted (it's also possible
here that no page tables are actually changed). Though it incurs some
slop, this is preferred to either not accounting for the event at all, or
plumbing through data in struct mm_slot on whether to account for the
collapse or not.

Note that work still needs to be done to support arbitrary compound
pages, and that this should all be converted to using folios.
Signed-off-by: Zach O'Keefe <zokeefe@google.com> --- include/trace/events/huge_memory.h | 1 + mm/khugepaged.c | 49 ++++++++++++++++++++++++++---- 2 files changed, 44 insertions(+), 6 deletions(-) diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index 55392bf30a03..fbbb25494d60 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -17,6 +17,7 @@ EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \ EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \ EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \ + EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \ EM( SCAN_PAGE_RO, "no_writable_page") \ EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \ EM( SCAN_PAGE_NULL, "page_null") \ diff --git a/mm/khugepaged.c b/mm/khugepaged.c index d8e388106322..6022a08db1cd 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -34,6 +34,7 @@ enum scan_result { SCAN_EXCEED_SHARED_PTE, SCAN_PTE_NON_PRESENT, SCAN_PTE_UFFD_WP, + SCAN_PTE_MAPPED_HUGEPAGE, SCAN_PAGE_RO, SCAN_LACK_REFERENCED_PAGE, SCAN_PAGE_NULL, @@ -1349,18 +1350,22 @@ static void collect_mm_slot(struct mm_slot *mm_slot) * Notify khugepaged that given addr of the mm is pte-mapped THP. Then * khugepaged should try to collapse the page table. */ -static void khugepaged_add_pte_mapped_thp(struct mm_struct *mm, +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) { struct mm_slot *mm_slot; + bool ret = false; VM_BUG_ON(addr & ~HPAGE_PMD_MASK); spin_lock(&khugepaged_mm_lock); mm_slot = get_mm_slot(mm); - if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) + if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) { mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr; + ret = true; + } spin_unlock(&khugepaged_mm_lock); + return ret; } static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma, @@ -1397,9 +1402,16 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) pte_t *start_pte, *pte; pmd_t *pmd; spinlock_t *ptl; - int count = 0; + int count = 0, result = SCAN_FAIL; int i; + mmap_assert_write_locked(mm); + + /* Fast check before locking page if already PMD-mapped */ + result = find_pmd_or_thp_or_none(mm, haddr, &pmd); + if (result != SCAN_SUCCEED) + return; + if (!vma || !vma->vm_file || !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE)) return; @@ -1748,7 +1760,11 @@ static int collapse_file(struct mm_struct *mm, struct file *file, * we locked the first page, then a THP might be there already. */ if (PageTransCompound(page)) { - result = SCAN_PAGE_COMPOUND; + result = compound_order(page) == HPAGE_PMD_ORDER && + index == start + /* Maybe PMD-mapped */ + ? SCAN_PTE_MAPPED_HUGEPAGE + : SCAN_PAGE_COMPOUND; goto out_unlock; } @@ -1986,7 +2002,11 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, * into a PMD sized page */ if (PageTransCompound(page)) { - result = SCAN_PAGE_COMPOUND; + result = compound_order(page) == HPAGE_PMD_ORDER && + xas.xa_index == start + /* Maybe PMD-mapped */ + ? 
SCAN_PTE_MAPPED_HUGEPAGE + : SCAN_PAGE_COMPOUND; break; } @@ -2046,6 +2066,12 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot) { } + +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm, + unsigned long addr) +{ + return false; +} #endif static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, @@ -2137,8 +2163,19 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, &mmap_locked, cc); } - if (*result == SCAN_SUCCEED) + switch (*result) { + case SCAN_PTE_MAPPED_HUGEPAGE: + if (!khugepaged_add_pte_mapped_thp(mm, + khugepaged_scan.address)) + break; + fallthrough; + case SCAN_SUCCEED: ++khugepaged_pages_collapsed; + break; + default: + break; + } + /* move to next address */ khugepaged_scan.address += HPAGE_PMD_SIZE; progress += HPAGE_PMD_NR; -- 2.37.2.672.g94769d06f0-goog ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH mm-unstable v2 3/9] mm/madvise: add file and shmem support to MADV_COLLAPSE
From: Zach O'Keefe @ 2022-08-26 22:03 UTC (permalink / raw)
To: linux-mm

Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).

On success, the backing memory will be a hugepage. For the memory range
and process provided, the page tables will synchronously have a huge pmd
installed, mapping the THP. Other mappings of the file extent mapped by
the memory range may be added to a set of entries that khugepaged will
later process and attempt to update their page tables to map the THP by
a pmd.

This functionality unlocks two important uses:

(1) Immediately back executable text by THPs. Current support provided
    by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
    system, which can prevent services from serving at their full rated
    load after (re)starting. Tricks like mremap(2)'ing text onto
    anonymous memory to immediately realize iTLB performance prevent
    page sharing and demand paging, both of which increase steady-state
    memory footprint. Now, we can have the best of both worlds: peak
    upfront performance and lower RAM footprints.

(2) userfaultfd-based live migration of virtual machines satisfies UFFD
    faults by fetching native-sized pages over the network (to avoid the
    latency of transferring an entire hugepage). However, after guest
    memory has been fully copied to the new host, MADV_COLLAPSE can be
    used to immediately increase guest performance.
Signed-off-by: Zach O'Keefe <zokeefe@google.com> --- include/linux/khugepaged.h | 13 +- include/trace/events/huge_memory.h | 1 + kernel/events/uprobes.c | 2 +- mm/khugepaged.c | 241 ++++++++++++++++++++++------- 4 files changed, 198 insertions(+), 59 deletions(-) diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h index 384f034ae947..70162d707caf 100644 --- a/include/linux/khugepaged.h +++ b/include/linux/khugepaged.h @@ -16,11 +16,13 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma, unsigned long vm_flags); extern void khugepaged_min_free_kbytes_update(void); #ifdef CONFIG_SHMEM -extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr); +extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, + bool install_pmd); #else -static inline void collapse_pte_mapped_thp(struct mm_struct *mm, - unsigned long addr) +static inline int collapse_pte_mapped_thp(struct mm_struct *mm, + unsigned long addr, bool install_pmd) { + return 0; } #endif @@ -46,9 +48,10 @@ static inline void khugepaged_enter_vma(struct vm_area_struct *vma, unsigned long vm_flags) { } -static inline void collapse_pte_mapped_thp(struct mm_struct *mm, - unsigned long addr) +static inline int collapse_pte_mapped_thp(struct mm_struct *mm, + unsigned long addr, bool install_pmd) { + return 0; } static inline void khugepaged_min_free_kbytes_update(void) diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index fbbb25494d60..a8db658e99e9 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -11,6 +11,7 @@ EM( SCAN_FAIL, "failed") \ EM( SCAN_SUCCEED, "succeeded") \ EM( SCAN_PMD_NULL, "pmd_null") \ + EM( SCAN_PMD_NON_PRESENT, "pmd_non_present") \ EM( SCAN_PMD_MAPPED, "page_pmd_mapped") \ EM( SCAN_EXCEED_NONE_PTE, "exceed_none_pte") \ EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \ diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 401bc2d24ce0..d48c47811e45 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -553,7 +553,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, /* try collapse pmd for compound page */ if (!ret && orig_page_huge) - collapse_pte_mapped_thp(mm, vaddr); + collapse_pte_mapped_thp(mm, vaddr, false); return ret; } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 6022a08db1cd..34c0c74b3839 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -28,6 +28,7 @@ enum scan_result { SCAN_FAIL, SCAN_SUCCEED, SCAN_PMD_NULL, + SCAN_PMD_NON_PRESENT, SCAN_PMD_MAPPED, SCAN_EXCEED_NONE_PTE, SCAN_EXCEED_SWAP_PTE, @@ -870,6 +871,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, if (!hugepage_vma_check(vma, vma->vm_flags, false, false, cc->is_khugepaged)) return SCAN_VMA_CHECK; + return SCAN_SUCCEED; +} + +static int hugepage_vma_revalidate_anon(struct mm_struct *mm, + unsigned long address, + struct vm_area_struct **vmap, + struct collapse_control *cc) +{ + int ret = hugepage_vma_revalidate(mm, address, vmap, cc); + + if (ret != SCAN_SUCCEED) + return ret; /* * Anon VMA expected, the address may be unmapped then * remapped to file after khugepaged reaquired the mmap_lock. @@ -877,8 +890,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, * hugepage_vma_check may return true for qualified file * vmas. 
*/ - if (!vma->anon_vma || !vma_is_anonymous(vma)) - return SCAN_VMA_CHECK; + if (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)) + return SCAN_PAGE_ANON; return SCAN_SUCCEED; } @@ -899,7 +912,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm, barrier(); #endif if (!pmd_present(pmde)) - return SCAN_PMD_NULL; + return SCAN_PMD_NON_PRESENT; if (pmd_trans_huge(pmde)) return SCAN_PMD_MAPPED; if (pmd_bad(pmde)) @@ -1027,7 +1040,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, goto out_nolock; mmap_read_lock(mm); - result = hugepage_vma_revalidate(mm, address, &vma, cc); + result = hugepage_vma_revalidate_anon(mm, address, &vma, cc); if (result != SCAN_SUCCEED) { mmap_read_unlock(mm); goto out_nolock; @@ -1058,7 +1071,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, * handled by the anon_vma lock + PG_lock. */ mmap_write_lock(mm); - result = hugepage_vma_revalidate(mm, address, &vma, cc); + result = hugepage_vma_revalidate_anon(mm, address, &vma, cc); if (result != SCAN_SUCCEED) goto out_up_write; /* check if the pmd is still valid */ @@ -1361,13 +1374,44 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm, spin_lock(&khugepaged_mm_lock); mm_slot = get_mm_slot(mm); if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) { + int i; + /* + * Multiple callers may be adding entries here. Do a quick + * check to see the entry hasn't already been added by someone + * else. + */ + for (i = 0; i < mm_slot->nr_pte_mapped_thp; ++i) + if (mm_slot->pte_mapped_thp[i] == addr) + goto out; mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr; ret = true; } +out: spin_unlock(&khugepaged_mm_lock); return ret; } +/* hpage must be locked, and mmap_lock must be held in write */ +static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t *pmdp, struct page *hpage) +{ + struct vm_fault vmf = { + .vma = vma, + .address = addr, + .flags = 0, + .pmd = pmdp, + }; + + VM_BUG_ON(!PageTransHuge(hpage)); + mmap_assert_write_locked(vma->vm_mm); + + if (do_set_pmd(&vmf, hpage)) + return SCAN_FAIL; + + get_page(hpage); + return SCAN_SUCCEED; +} + static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp) { @@ -1389,12 +1433,14 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v * * @mm: process address space where collapse happens * @addr: THP collapse address + * @install_pmd: If a huge PMD should be installed * * This function checks whether all the PTEs in the PMD are pointing to the * right THP. If so, retract the page table so the THP can refault in with - * as pmd-mapped. + * as pmd-mapped. Possibly install a huge PMD mapping the THP. 
*/ -void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) +int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, + bool install_pmd) { unsigned long haddr = addr & HPAGE_PMD_MASK; struct vm_area_struct *vma = vma_lookup(mm, haddr); @@ -1409,12 +1455,12 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) /* Fast check before locking page if already PMD-mapped */ result = find_pmd_or_thp_or_none(mm, haddr, &pmd); - if (result != SCAN_SUCCEED) - return; + if (result == SCAN_PMD_MAPPED) + return result; if (!vma || !vma->vm_file || !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE)) - return; + return SCAN_VMA_CHECK; /* * If we are here, we've succeeded in replacing all the native pages @@ -1424,24 +1470,44 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) * analogously elide sysfs THP settings here. */ if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false)) - return; + return SCAN_VMA_CHECK; /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */ if (userfaultfd_wp(vma)) - return; + return SCAN_PTE_UFFD_WP; hpage = find_lock_page(vma->vm_file->f_mapping, linear_page_index(vma, haddr)); if (!hpage) - return; + return SCAN_PAGE_NULL; - if (!PageHead(hpage)) + if (!PageHead(hpage)) { + result = SCAN_FAIL; goto drop_hpage; + } - if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED) + if (!PageTransCompound(hpage)) { + result = SCAN_FAIL; goto drop_hpage; + } + + result = find_pmd_or_thp_or_none(mm, haddr, &pmd); + switch (result) { + case SCAN_SUCCEED: + break; + case SCAN_PMD_NON_PRESENT: + /* + * In MADV_COLLAPSE path, possible race with khugepaged where + * all pte entries have been removed and pmd cleared. If so, + * skip all the pte checks and just update the pmd mapping. + */ + goto install_pmd; + default: + goto drop_hpage; + } start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl); + result = SCAN_FAIL; /* step 1: check all mapped PTEs are to the right huge page */ for (i = 0, addr = haddr, pte = start_pte; @@ -1453,8 +1519,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) continue; /* page swapped out, abort */ - if (!pte_present(*pte)) + if (!pte_present(*pte)) { + result = SCAN_PTE_NON_PRESENT; goto abort; + } page = vm_normal_page(vma, addr, *pte); if (WARN_ON_ONCE(page && is_zone_device_page(page))) @@ -1489,12 +1557,19 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count); } - /* step 4: collapse pmd */ + /* step 4: remove pte entries */ collapse_and_free_pmd(mm, vma, haddr, pmd); + +install_pmd: + /* step 5: install pmd entry */ + result = install_pmd + ? 
set_huge_pmd(vma, haddr, pmd, hpage) + : SCAN_SUCCEED; + drop_hpage: unlock_page(hpage); put_page(hpage); - return; + return result; abort: pte_unmap_unlock(start_pte, ptl); @@ -1516,22 +1591,29 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot) goto out; for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++) - collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]); + collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false); out: mm_slot->nr_pte_mapped_thp = 0; mmap_write_unlock(mm); } -static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) +static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, + struct mm_struct *target_mm, + unsigned long target_addr, struct page *hpage, + struct collapse_control *cc) { struct vm_area_struct *vma; - struct mm_struct *mm; - unsigned long addr; - pmd_t *pmd; + int target_result = SCAN_FAIL; i_mmap_lock_write(mapping); vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { + int result = SCAN_FAIL; + struct mm_struct *mm = NULL; + unsigned long addr = 0; + pmd_t *pmd; + bool is_target = false; + /* * Check vma->anon_vma to exclude MAP_PRIVATE mappings that * got written to. These VMAs are likely not worth investing @@ -1548,24 +1630,34 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) * ptl. It has higher chance to recover THP for the VMA, but * has higher cost too. */ - if (vma->anon_vma) - continue; + if (vma->anon_vma) { + result = SCAN_PAGE_ANON; + goto next; + } addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); - if (addr & ~HPAGE_PMD_MASK) - continue; - if (vma->vm_end < addr + HPAGE_PMD_SIZE) - continue; + if (addr & ~HPAGE_PMD_MASK || + vma->vm_end < addr + HPAGE_PMD_SIZE) { + result = SCAN_VMA_CHECK; + goto next; + } mm = vma->vm_mm; - if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED) - continue; + is_target = mm == target_mm && addr == target_addr; + result = find_pmd_or_thp_or_none(mm, addr, &pmd); + if (result != SCAN_SUCCEED) + goto next; /* * We need exclusive mmap_lock to retract page table. * * We use trylock due to lock inversion: we need to acquire * mmap_lock while holding page lock. Fault path does it in * reverse order. Trylock is a way to avoid deadlock. + * + * Also, it's not MADV_COLLAPSE's job to collapse other + * mappings - let khugepaged take care of them later. */ - if (mmap_write_trylock(mm)) { + result = SCAN_PTE_MAPPED_HUGEPAGE; + if ((cc->is_khugepaged || is_target) && + mmap_write_trylock(mm)) { /* * When a vma is registered with uffd-wp, we can't * recycle the pmd pgtable because there can be pte @@ -1574,22 +1666,45 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) * it'll always mapped in small page size for uffd-wp * registered ranges. */ - if (!hpage_collapse_test_exit(mm) && - !userfaultfd_wp(vma)) - collapse_and_free_pmd(mm, vma, addr, pmd); + if (hpage_collapse_test_exit(mm)) { + result = SCAN_ANY_PROCESS; + goto unlock_next; + } + if (userfaultfd_wp(vma)) { + result = SCAN_PTE_UFFD_WP; + goto unlock_next; + } + collapse_and_free_pmd(mm, vma, addr, pmd); + if (!cc->is_khugepaged && is_target) + result = set_huge_pmd(vma, addr, pmd, hpage); + else + result = SCAN_SUCCEED; + +unlock_next: mmap_write_unlock(mm); - } else { - /* Try again later */ + goto next; + } + /* + * Calling context will handle target mm/addr. Otherwise, let + * khugepaged try again later. 
+ */ + if (!is_target) { khugepaged_add_pte_mapped_thp(mm, addr); + continue; } +next: + if (is_target) + target_result = result; } i_mmap_unlock_write(mapping); + return target_result; } /** * collapse_file - collapse filemap/tmpfs/shmem pages into huge one. * * @mm: process address space where collapse happens + * @addr: virtual collapse start address * @file: file that collapse on * @start: collapse start address * @cc: collapse context and scratchpad @@ -1609,8 +1724,9 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) * + restore gaps in the page cache; * + unlock and free huge page; */ -static int collapse_file(struct mm_struct *mm, struct file *file, - pgoff_t start, struct collapse_control *cc) +static int collapse_file(struct mm_struct *mm, unsigned long addr, + struct file *file, pgoff_t start, + struct collapse_control *cc) { struct address_space *mapping = file->f_mapping; struct page *hpage; @@ -1912,7 +2028,8 @@ static int collapse_file(struct mm_struct *mm, struct file *file, /* * Remove pte page tables, so we can re-fault the page as huge. */ - retract_page_tables(mapping, start); + result = retract_page_tables(mapping, start, mm, addr, hpage, + cc); unlock_page(hpage); hpage = NULL; } else { @@ -1968,8 +2085,9 @@ static int collapse_file(struct mm_struct *mm, struct file *file, return result; } -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, - pgoff_t start, struct collapse_control *cc) +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, + struct file *file, pgoff_t start, + struct collapse_control *cc) { struct page *page = NULL; struct address_space *mapping = file->f_mapping; @@ -2049,7 +2167,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, result = SCAN_EXCEED_NONE_PTE; count_vm_event(THP_SCAN_EXCEED_NONE_PTE); } else { - result = collapse_file(mm, file, start, cc); + result = collapse_file(mm, addr, file, start, cc); } } @@ -2057,8 +2175,9 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, return result; } #else -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, - pgoff_t start, struct collapse_control *cc) +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, + struct file *file, pgoff_t start, + struct collapse_control *cc) { BUILD_BUG(); } @@ -2153,8 +2272,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, khugepaged_scan.address); mmap_read_unlock(mm); - *result = khugepaged_scan_file(mm, file, pgoff, - cc); + *result = hpage_collapse_scan_file(mm, + khugepaged_scan.address, + file, pgoff, cc); mmap_locked = false; fput(file); } else { @@ -2452,10 +2572,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, *prev = vma; - /* TODO: Support file/shmem */ - if (!vma->anon_vma || !vma_is_anonymous(vma)) - return -EINVAL; - if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false)) return -EINVAL; @@ -2486,16 +2602,35 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, } mmap_assert_locked(mm); memset(cc->node_load, 0, sizeof(cc->node_load)); - result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked, - cc); + if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) { + struct file *file = get_file(vma->vm_file); + pgoff_t pgoff = linear_page_index(vma, addr); + + mmap_read_unlock(mm); + mmap_locked = false; + result = hpage_collapse_scan_file(mm, addr, file, pgoff, + cc); + fput(file); + } else { + 
result = hpage_collapse_scan_pmd(mm, vma, addr, + &mmap_locked, cc); + } if (!mmap_locked) *prev = NULL; /* Tell caller we dropped mmap_lock */ +handle_result: switch (result) { case SCAN_SUCCEED: case SCAN_PMD_MAPPED: ++thps; break; + case SCAN_PTE_MAPPED_HUGEPAGE: + BUG_ON(mmap_locked); + BUG_ON(*prev); + mmap_write_lock(mm); + result = collapse_pte_mapped_thp(mm, addr, true); + mmap_write_unlock(mm); + goto handle_result; /* Whitelisted set of results where continuing OK */ case SCAN_PMD_NULL: case SCAN_PTE_NON_PRESENT: -- 2.37.2.672.g94769d06f0-goog ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH mm-unstable v2 4/9] mm/khugepaged: add tracepoint to hpage_collapse_scan_file() 2022-08-26 22:03 Zach O'Keefe ` (2 preceding siblings ...) 2022-08-26 22:03 ` [PATCH mm-unstable v2 3/9] mm/madvise: add file and shmem support to MADV_COLLAPSE Zach O'Keefe @ 2022-08-26 22:03 ` Zach O'Keefe 2022-08-26 22:03 ` [PATCH mm-unstable v2 5/9] selftests/vm: dedup THP helpers Zach O'Keefe ` (5 subsequent siblings) 9 siblings, 0 replies; 35+ messages in thread From: Zach O'Keefe @ 2022-08-26 22:03 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton, Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand, David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia, Zach O'Keefe Add huge_memory:trace_mm_khugepaged_scan_file tracepoint to hpage_collapse_scan_file() analogously to hpage_collapse_scan_pmd(). While this change is targeted at debugging MADV_COLLAPSE pathway, the "mm_khugepaged" prefix is retained for symmetry with huge_memory:trace_mm_khugepaged_scan_pmd, which retains it's legacy name to prevent changing kernel ABI as much as possible. Signed-off-by: Zach O'Keefe <zokeefe@google.com> --- include/trace/events/huge_memory.h | 34 ++++++++++++++++++++++++++++++ mm/khugepaged.c | 3 ++- 2 files changed, 36 insertions(+), 1 deletion(-) diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index a8db658e99e9..1d2dd88dc3c4 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -169,5 +169,39 @@ TRACE_EVENT(mm_collapse_huge_page_swapin, __entry->ret) ); +TRACE_EVENT(mm_khugepaged_scan_file, + + TP_PROTO(struct mm_struct *mm, struct page *page, const char *filename, + int present, int swap, int result), + + TP_ARGS(mm, page, filename, present, swap, result), + + TP_STRUCT__entry( + __field(struct mm_struct *, mm) + __field(unsigned long, pfn) + __string(filename, filename) + __field(int, present) + __field(int, swap) + __field(int, result) + ), + + TP_fast_assign( + __entry->mm = mm; + __entry->pfn = page ? page_to_pfn(page) : -1; + __assign_str(filename, filename); + __entry->present = present; + __entry->swap = swap; + __entry->result = result; + ), + + TP_printk("mm=%p, scan_pfn=0x%lx, filename=%s, present=%d, swap=%d, result=%s", + __entry->mm, + __entry->pfn, + __get_str(filename), + __entry->present, + __entry->swap, + __print_symbolic(__entry->result, SCAN_STATUS)) +); + #endif /* __HUGE_MEMORY_H */ #include <trace/define_trace.h> diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 34c0c74b3839..55144f33ba09 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -2171,7 +2171,8 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, } } - /* TODO: tracepoints */ + trace_mm_khugepaged_scan_file(mm, page, file->f_path.dentry->d_iname, + present, swap, result); return result; } #else -- 2.37.2.672.g94769d06f0-goog ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH mm-unstable v2 5/9] selftests/vm: dedup THP helpers 2022-08-26 22:03 Zach O'Keefe ` (3 preceding siblings ...) 2022-08-26 22:03 ` [PATCH mm-unstable v2 4/9] mm/khugepaged: add tracepoint to hpage_collapse_scan_file() Zach O'Keefe @ 2022-08-26 22:03 ` Zach O'Keefe 2022-08-26 22:03 ` [PATCH mm-unstable v2 6/9] selftests/vm: modularize thp collapse memory operations Zach O'Keefe ` (4 subsequent siblings) 9 siblings, 0 replies; 35+ messages in thread From: Zach O'Keefe @ 2022-08-26 22:03 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton, Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand, David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia, Zach O'Keefe These files: tools/testing/selftests/vm/vm_util.c tools/testing/selftests/vm/khugepaged.c Both contain logic to: 1) Determine hugepage size on current system 2) Read /proc/self/smaps to determine number of THPs at an address Refactor selftests/vm/khugepaged.c to use the vm_util common helpers and add it as a build dependency. Since selftests/vm/khugepaged.c is the largest user of check_huge(), change the signature of check_huge() to match selftests/vm/khugepaged.c's useage: take an expected number of hugepages, and return a bool indicating if the correct number of hugepages were found. Add a wrapper, check_huge_anon(), in anticipation of checking smaps for file and shmem hugepages. Update existing callsites to use the new pattern / function. Likewise, check_for_pattern() was duplicated, and it's a general enough helper to include in vm_util helpers as well. Signed-off-by: Zach O'Keefe <zokeefe@google.com> --- tools/testing/selftests/vm/Makefile | 1 + tools/testing/selftests/vm/khugepaged.c | 64 ++----------------- tools/testing/selftests/vm/soft-dirty.c | 2 +- .../selftests/vm/split_huge_page_test.c | 12 ++-- tools/testing/selftests/vm/vm_util.c | 26 +++++--- tools/testing/selftests/vm/vm_util.h | 3 +- 6 files changed, 32 insertions(+), 76 deletions(-) diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index b52f2cc51482..df4fa77febca 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -97,6 +97,7 @@ TEST_FILES += va_128TBswitch.sh include ../lib.mk +$(OUTPUT)/khugepaged: vm_util.c $(OUTPUT)/madv_populate: vm_util.c $(OUTPUT)/soft-dirty: vm_util.c $(OUTPUT)/split_huge_page_test: vm_util.c diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c index b77b1e28cdb3..e5c602f7a18b 100644 --- a/tools/testing/selftests/vm/khugepaged.c +++ b/tools/testing/selftests/vm/khugepaged.c @@ -11,6 +11,8 @@ #include <sys/mman.h> #include <sys/wait.h> +#include "vm_util.h" + #ifndef MADV_PAGEOUT #define MADV_PAGEOUT 21 #endif @@ -351,64 +353,12 @@ static void save_settings(void) signal(SIGQUIT, restore_settings); } -#define MAX_LINE_LENGTH 500 - -static bool check_for_pattern(FILE *fp, char *pattern, char *buf) -{ - while (fgets(buf, MAX_LINE_LENGTH, fp) != NULL) { - if (!strncmp(buf, pattern, strlen(pattern))) - return true; - } - return false; -} - static bool check_huge(void *addr, int nr_hpages) { - bool thp = false; - int ret; - FILE *fp; - char buffer[MAX_LINE_LENGTH]; - char addr_pattern[MAX_LINE_LENGTH]; - - ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "%08lx-", - (unsigned long) addr); - if (ret >= MAX_LINE_LENGTH) { - printf("%s: Pattern is too long\n", 
__func__); - exit(EXIT_FAILURE); - } - - - fp = fopen(PID_SMAPS, "r"); - if (!fp) { - printf("%s: Failed to open file %s\n", __func__, PID_SMAPS); - exit(EXIT_FAILURE); - } - if (!check_for_pattern(fp, addr_pattern, buffer)) - goto err_out; - - ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "AnonHugePages:%10ld kB", - nr_hpages * (hpage_pmd_size >> 10)); - if (ret >= MAX_LINE_LENGTH) { - printf("%s: Pattern is too long\n", __func__); - exit(EXIT_FAILURE); - } - /* - * Fetch the AnonHugePages: in the same block and check whether it got - * the expected number of hugeepages next. - */ - if (!check_for_pattern(fp, "AnonHugePages:", buffer)) - goto err_out; - - if (strncmp(buffer, addr_pattern, strlen(addr_pattern))) - goto err_out; - - thp = true; -err_out: - fclose(fp); - return thp; + return check_huge_anon(addr, nr_hpages, hpage_pmd_size); } - +#define MAX_LINE_LENGTH 500 static bool check_swap(void *addr, unsigned long size) { bool swap = false; @@ -430,7 +380,7 @@ static bool check_swap(void *addr, unsigned long size) printf("%s: Failed to open file %s\n", __func__, PID_SMAPS); exit(EXIT_FAILURE); } - if (!check_for_pattern(fp, addr_pattern, buffer)) + if (!check_for_pattern(fp, addr_pattern, buffer, sizeof(buffer))) goto err_out; ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "Swap:%19ld kB", @@ -443,7 +393,7 @@ static bool check_swap(void *addr, unsigned long size) * Fetch the Swap: in the same block and check whether it got * the expected number of hugeepages next. */ - if (!check_for_pattern(fp, "Swap:", buffer)) + if (!check_for_pattern(fp, "Swap:", buffer, sizeof(buffer))) goto err_out; if (strncmp(buffer, addr_pattern, strlen(addr_pattern))) @@ -1045,7 +995,7 @@ int main(int argc, const char **argv) setbuf(stdout, NULL); page_size = getpagesize(); - hpage_pmd_size = read_num("hpage_pmd_size"); + hpage_pmd_size = read_pmd_pagesize(); hpage_pmd_nr = hpage_pmd_size / page_size; default_settings.khugepaged.max_ptes_none = hpage_pmd_nr - 1; diff --git a/tools/testing/selftests/vm/soft-dirty.c b/tools/testing/selftests/vm/soft-dirty.c index e3a43f5d4fa2..21d8830c5f24 100644 --- a/tools/testing/selftests/vm/soft-dirty.c +++ b/tools/testing/selftests/vm/soft-dirty.c @@ -91,7 +91,7 @@ static void test_hugepage(int pagemap_fd, int pagesize) for (i = 0; i < hpage_len; i++) map[i] = (char)i; - if (check_huge(map)) { + if (check_huge_anon(map, 1, hpage_len)) { ksft_test_result_pass("Test %s huge page allocation\n", __func__); clear_softdirty(); diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c index 6aa2b8253aed..76e1c36dd9e5 100644 --- a/tools/testing/selftests/vm/split_huge_page_test.c +++ b/tools/testing/selftests/vm/split_huge_page_test.c @@ -92,7 +92,6 @@ void split_pmd_thp(void) { char *one_page; size_t len = 4 * pmd_pagesize; - uint64_t thp_size; size_t i; one_page = memalign(pmd_pagesize, len); @@ -107,8 +106,7 @@ void split_pmd_thp(void) for (i = 0; i < len; i++) one_page[i] = (char)i; - thp_size = check_huge(one_page); - if (!thp_size) { + if (!check_huge_anon(one_page, 1, pmd_pagesize)) { printf("No THP is allocated\n"); exit(EXIT_FAILURE); } @@ -124,9 +122,8 @@ void split_pmd_thp(void) } - thp_size = check_huge(one_page); - if (thp_size) { - printf("Still %ld kB AnonHugePages not split\n", thp_size); + if (check_huge_anon(one_page, 0, pmd_pagesize)) { + printf("Still AnonHugePages not split\n"); exit(EXIT_FAILURE); } @@ -172,8 +169,7 @@ void split_pte_mapped_thp(void) for (i = 0; i < len; i++) one_page[i] = (char)i; - 
thp_size = check_huge(one_page); - if (!thp_size) { + if (!check_huge_anon(one_page, 1, pmd_pagesize)) { printf("No THP is allocated\n"); exit(EXIT_FAILURE); } diff --git a/tools/testing/selftests/vm/vm_util.c b/tools/testing/selftests/vm/vm_util.c index b58ab11a7a30..9dae51b8219f 100644 --- a/tools/testing/selftests/vm/vm_util.c +++ b/tools/testing/selftests/vm/vm_util.c @@ -42,9 +42,9 @@ void clear_softdirty(void) ksft_exit_fail_msg("writing clear_refs failed\n"); } -static bool check_for_pattern(FILE *fp, const char *pattern, char *buf) +bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len) { - while (fgets(buf, MAX_LINE_LENGTH, fp) != NULL) { + while (fgets(buf, len, fp)) { if (!strncmp(buf, pattern, strlen(pattern))) return true; } @@ -72,9 +72,10 @@ uint64_t read_pmd_pagesize(void) return strtoul(buf, NULL, 10); } -uint64_t check_huge(void *addr) +bool __check_huge(void *addr, char *pattern, int nr_hpages, + uint64_t hpage_size) { - uint64_t thp = 0; + uint64_t thp = -1; int ret; FILE *fp; char buffer[MAX_LINE_LENGTH]; @@ -89,20 +90,27 @@ uint64_t check_huge(void *addr) if (!fp) ksft_exit_fail_msg("%s: Failed to open file %s\n", __func__, SMAP_FILE_PATH); - if (!check_for_pattern(fp, addr_pattern, buffer)) + if (!check_for_pattern(fp, addr_pattern, buffer, sizeof(buffer))) goto err_out; /* - * Fetch the AnonHugePages: in the same block and check the number of + * Fetch the pattern in the same block and check the number of * hugepages. */ - if (!check_for_pattern(fp, "AnonHugePages:", buffer)) + if (!check_for_pattern(fp, pattern, buffer, sizeof(buffer))) goto err_out; - if (sscanf(buffer, "AnonHugePages:%10ld kB", &thp) != 1) + snprintf(addr_pattern, MAX_LINE_LENGTH, "%s%%9ld kB", pattern); + + if (sscanf(buffer, addr_pattern, &thp) != 1) ksft_exit_fail_msg("Reading smap error\n"); err_out: fclose(fp); - return thp; + return thp == (nr_hpages * (hpage_size >> 10)); +} + +bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size) +{ + return __check_huge(addr, "AnonHugePages: ", nr_hpages, hpage_size); } diff --git a/tools/testing/selftests/vm/vm_util.h b/tools/testing/selftests/vm/vm_util.h index 2e512bd57ae1..8434ea0c95cd 100644 --- a/tools/testing/selftests/vm/vm_util.h +++ b/tools/testing/selftests/vm/vm_util.h @@ -5,5 +5,6 @@ uint64_t pagemap_get_entry(int fd, char *start); bool pagemap_is_softdirty(int fd, char *start); void clear_softdirty(void); +bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len); uint64_t read_pmd_pagesize(void); -uint64_t check_huge(void *addr); +bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size); -- 2.37.2.672.g94769d06f0-goog ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH mm-unstable v2 6/9] selftests/vm: modularize thp collapse memory operations 2022-08-26 22:03 Zach O'Keefe ` (4 preceding siblings ...) 2022-08-26 22:03 ` [PATCH mm-unstable v2 5/9] selftests/vm: dedup THP helpers Zach O'Keefe @ 2022-08-26 22:03 ` Zach O'Keefe 2022-08-26 22:03 ` [PATCH mm-unstable v2 7/9] selftests/vm: add thp collapse file and tmpfs testing Zach O'Keefe ` (3 subsequent siblings) 9 siblings, 0 replies; 35+ messages in thread From: Zach O'Keefe @ 2022-08-26 22:03 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton, Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand, David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia, Zach O'Keefe Modularize operations to setup, cleanup, fault, and check for huge pages, for a given memory type. This allows reusing existing tests with additional memory types by defining new memory operations. Following patches will add file and shmem memory types. Signed-off-by: Zach O'Keefe <zokeefe@google.com> --- tools/testing/selftests/vm/khugepaged.c | 366 +++++++++++++----------- 1 file changed, 200 insertions(+), 166 deletions(-) diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c index e5c602f7a18b..b4b1709507a5 100644 --- a/tools/testing/selftests/vm/khugepaged.c +++ b/tools/testing/selftests/vm/khugepaged.c @@ -28,8 +28,16 @@ static int hpage_pmd_nr; #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/" #define PID_SMAPS "/proc/self/smaps" +struct mem_ops { + void *(*setup_area)(int nr_hpages); + void (*cleanup_area)(void *p, unsigned long size); + void (*fault)(void *p, unsigned long start, unsigned long end); + bool (*check_huge)(void *addr, int nr_hpages); +}; + struct collapse_context { - void (*collapse)(const char *msg, char *p, int nr_hpages, bool expect); + void (*collapse)(const char *msg, char *p, int nr_hpages, + struct mem_ops *ops, bool expect); bool enforce_pte_scan_limits; }; @@ -353,11 +361,6 @@ static void save_settings(void) signal(SIGQUIT, restore_settings); } -static bool check_huge(void *addr, int nr_hpages) -{ - return check_huge_anon(addr, nr_hpages, hpage_pmd_size); -} - #define MAX_LINE_LENGTH 500 static bool check_swap(void *addr, unsigned long size) { @@ -431,18 +434,25 @@ static void fill_memory(int *p, unsigned long start, unsigned long end) * Returns pmd-mapped hugepage in VMA marked VM_HUGEPAGE, filled with * validate_memory()'able contents. 
*/ -static void *alloc_hpage(void) +static void *alloc_hpage(struct mem_ops *ops) { - void *p; + void *p = ops->setup_area(1); - p = alloc_mapping(1); + ops->fault(p, 0, hpage_pmd_size); + if (madvise(p, hpage_pmd_size, MADV_HUGEPAGE)) { + perror("madvise(MADV_HUGEPAGE)"); + exit(EXIT_FAILURE); + } printf("Allocate huge page..."); - madvise(p, hpage_pmd_size, MADV_HUGEPAGE); - fill_memory(p, 0, hpage_pmd_size); - if (check_huge(p, 1)) - success("OK"); - else - fail("Fail"); + if (madvise(p, hpage_pmd_size, MADV_COLLAPSE)) { + perror("madvise(MADV_COLLAPSE)"); + exit(EXIT_FAILURE); + } + if (!ops->check_huge(p, 1)) { + perror("madvise(MADV_COLLAPSE)"); + exit(EXIT_FAILURE); + } + success("OK"); return p; } @@ -459,18 +469,40 @@ static void validate_memory(int *p, unsigned long start, unsigned long end) } } -static void madvise_collapse(const char *msg, char *p, int nr_hpages, - bool expect) +static void *anon_setup_area(int nr_hpages) +{ + return alloc_mapping(nr_hpages); +} + +static void anon_cleanup_area(void *p, unsigned long size) +{ + munmap(p, size); +} + +static void anon_fault(void *p, unsigned long start, unsigned long end) +{ + fill_memory(p, start, end); +} + +static bool anon_check_huge(void *addr, int nr_hpages) +{ + return check_huge_anon(addr, nr_hpages, hpage_pmd_size); +} + +static struct mem_ops anon_ops = { + .setup_area = &anon_setup_area, + .cleanup_area = &anon_cleanup_area, + .fault = &anon_fault, + .check_huge = &anon_check_huge, +}; + +static void __madvise_collapse(const char *msg, char *p, int nr_hpages, + struct mem_ops *ops, bool expect) { int ret; struct settings settings = *current_settings(); printf("%s...", msg); - /* Sanity check */ - if (!check_huge(p, 0)) { - printf("Unexpected huge page\n"); - exit(EXIT_FAILURE); - } /* * Prevent khugepaged interference and tests that MADV_COLLAPSE @@ -482,9 +514,10 @@ static void madvise_collapse(const char *msg, char *p, int nr_hpages, /* Clear VM_NOHUGEPAGE */ madvise(p, nr_hpages * hpage_pmd_size, MADV_HUGEPAGE); ret = madvise(p, nr_hpages * hpage_pmd_size, MADV_COLLAPSE); + if (((bool)ret) == expect) fail("Fail: Bad return value"); - else if (check_huge(p, nr_hpages) != expect) + else if (!ops->check_huge(p, expect ? 
nr_hpages : 0)) fail("Fail: check_huge()"); else success("OK"); @@ -492,14 +525,26 @@ static void madvise_collapse(const char *msg, char *p, int nr_hpages, pop_settings(); } +static void madvise_collapse(const char *msg, char *p, int nr_hpages, + struct mem_ops *ops, bool expect) +{ + /* Sanity check */ + if (!ops->check_huge(p, 0)) { + printf("Unexpected huge page\n"); + exit(EXIT_FAILURE); + } + __madvise_collapse(msg, p, nr_hpages, ops, expect); +} + #define TICK 500000 -static bool wait_for_scan(const char *msg, char *p, int nr_hpages) +static bool wait_for_scan(const char *msg, char *p, int nr_hpages, + struct mem_ops *ops) { int full_scans; int timeout = 6; /* 3 seconds */ /* Sanity check */ - if (!check_huge(p, 0)) { + if (!ops->check_huge(p, 0)) { printf("Unexpected huge page\n"); exit(EXIT_FAILURE); } @@ -511,7 +556,7 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages) printf("%s...", msg); while (timeout--) { - if (check_huge(p, nr_hpages)) + if (ops->check_huge(p, nr_hpages)) break; if (read_num("khugepaged/full_scans") >= full_scans) break; @@ -525,19 +570,20 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages) } static void khugepaged_collapse(const char *msg, char *p, int nr_hpages, - bool expect) + struct mem_ops *ops, bool expect) { - if (wait_for_scan(msg, p, nr_hpages)) { + if (wait_for_scan(msg, p, nr_hpages, ops)) { if (expect) fail("Timeout"); else success("OK"); return; - } else if (check_huge(p, nr_hpages) == expect) { + } + + if (ops->check_huge(p, expect ? nr_hpages : 0)) success("OK"); - } else { + else fail("Fail"); - } } static void alloc_at_fault(void) @@ -551,7 +597,7 @@ static void alloc_at_fault(void) p = alloc_mapping(1); *p = 1; printf("Allocate huge page on fault..."); - if (check_huge(p, 1)) + if (check_huge_anon(p, 1, hpage_pmd_size)) success("OK"); else fail("Fail"); @@ -560,49 +606,48 @@ static void alloc_at_fault(void) madvise(p, page_size, MADV_DONTNEED); printf("Split huge PMD on MADV_DONTNEED..."); - if (check_huge(p, 0)) + if (check_huge_anon(p, 0, hpage_pmd_size)) success("OK"); else fail("Fail"); munmap(p, hpage_pmd_size); } -static void collapse_full(struct collapse_context *c) +static void collapse_full(struct collapse_context *c, struct mem_ops *ops) { void *p; int nr_hpages = 4; unsigned long size = nr_hpages * hpage_pmd_size; - p = alloc_mapping(nr_hpages); - fill_memory(p, 0, size); + p = ops->setup_area(nr_hpages); + ops->fault(p, 0, size); c->collapse("Collapse multiple fully populated PTE table", p, nr_hpages, - true); + ops, true); validate_memory(p, 0, size); - munmap(p, size); + ops->cleanup_area(p, size); } -static void collapse_empty(struct collapse_context *c) +static void collapse_empty(struct collapse_context *c, struct mem_ops *ops) { void *p; - p = alloc_mapping(1); - c->collapse("Do not collapse empty PTE table", p, 1, false); - munmap(p, hpage_pmd_size); + p = ops->setup_area(1); + c->collapse("Do not collapse empty PTE table", p, 1, ops, false); + ops->cleanup_area(p, hpage_pmd_size); } -static void collapse_single_pte_entry(struct collapse_context *c) +static void collapse_single_pte_entry(struct collapse_context *c, struct mem_ops *ops) { void *p; - p = alloc_mapping(1); - fill_memory(p, 0, page_size); + p = ops->setup_area(1); + ops->fault(p, 0, page_size); c->collapse("Collapse PTE table with single PTE entry present", p, - 1, true); - validate_memory(p, 0, page_size); - munmap(p, hpage_pmd_size); + 1, ops, true); + ops->cleanup_area(p, hpage_pmd_size); } -static void 
collapse_max_ptes_none(struct collapse_context *c) +static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *ops) { int max_ptes_none = hpage_pmd_nr / 2; struct settings settings = *current_settings(); @@ -611,30 +656,30 @@ static void collapse_max_ptes_none(struct collapse_context *c) settings.khugepaged.max_ptes_none = max_ptes_none; push_settings(&settings); - p = alloc_mapping(1); + p = ops->setup_area(1); - fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size); + ops->fault(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size); c->collapse("Maybe collapse with max_ptes_none exceeded", p, 1, - !c->enforce_pte_scan_limits); + ops, !c->enforce_pte_scan_limits); validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size); if (c->enforce_pte_scan_limits) { - fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size); - c->collapse("Collapse with max_ptes_none PTEs empty", p, 1, + ops->fault(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size); + c->collapse("Collapse with max_ptes_none PTEs empty", p, 1, ops, true); validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size); } - - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); pop_settings(); } -static void collapse_swapin_single_pte(struct collapse_context *c) +static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops) { void *p; - p = alloc_mapping(1); - fill_memory(p, 0, hpage_pmd_size); + + p = ops->setup_area(1); + ops->fault(p, 0, hpage_pmd_size); printf("Swapout one page..."); if (madvise(p, page_size, MADV_PAGEOUT)) { @@ -648,20 +693,21 @@ static void collapse_swapin_single_pte(struct collapse_context *c) goto out; } - c->collapse("Collapse with swapping in single PTE entry", p, 1, true); + c->collapse("Collapse with swapping in single PTE entry", p, 1, ops, + true); validate_memory(p, 0, hpage_pmd_size); out: - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); } -static void collapse_max_ptes_swap(struct collapse_context *c) +static void collapse_max_ptes_swap(struct collapse_context *c, struct mem_ops *ops) { int max_ptes_swap = read_num("khugepaged/max_ptes_swap"); void *p; - p = alloc_mapping(1); + p = ops->setup_area(1); + ops->fault(p, 0, hpage_pmd_size); - fill_memory(p, 0, hpage_pmd_size); printf("Swapout %d of %d pages...", max_ptes_swap + 1, hpage_pmd_nr); if (madvise(p, (max_ptes_swap + 1) * page_size, MADV_PAGEOUT)) { perror("madvise(MADV_PAGEOUT)"); @@ -674,12 +720,12 @@ static void collapse_max_ptes_swap(struct collapse_context *c) goto out; } - c->collapse("Maybe collapse with max_ptes_swap exceeded", p, 1, + c->collapse("Maybe collapse with max_ptes_swap exceeded", p, 1, ops, !c->enforce_pte_scan_limits); validate_memory(p, 0, hpage_pmd_size); if (c->enforce_pte_scan_limits) { - fill_memory(p, 0, hpage_pmd_size); + ops->fault(p, 0, hpage_pmd_size); printf("Swapout %d of %d pages...", max_ptes_swap, hpage_pmd_nr); if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) { @@ -694,63 +740,65 @@ static void collapse_max_ptes_swap(struct collapse_context *c) } c->collapse("Collapse with max_ptes_swap pages swapped out", p, - 1, true); + 1, ops, true); validate_memory(p, 0, hpage_pmd_size); } out: - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); } -static void collapse_single_pte_entry_compound(struct collapse_context *c) +static void collapse_single_pte_entry_compound(struct collapse_context *c, struct mem_ops *ops) { void *p; - p = alloc_hpage(); + p = alloc_hpage(ops); + madvise(p, 
hpage_pmd_size, MADV_NOHUGEPAGE); printf("Split huge page leaving single PTE mapping compound page..."); madvise(p + page_size, hpage_pmd_size - page_size, MADV_DONTNEED); - if (check_huge(p, 0)) + if (ops->check_huge(p, 0)) success("OK"); else fail("Fail"); c->collapse("Collapse PTE table with single PTE mapping compound page", - p, 1, true); + p, 1, ops, true); validate_memory(p, 0, page_size); - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); } -static void collapse_full_of_compound(struct collapse_context *c) +static void collapse_full_of_compound(struct collapse_context *c, struct mem_ops *ops) { void *p; - p = alloc_hpage(); + p = alloc_hpage(ops); printf("Split huge page leaving single PTE page table full of compound pages..."); madvise(p, page_size, MADV_NOHUGEPAGE); madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE); - if (check_huge(p, 0)) + if (ops->check_huge(p, 0)) success("OK"); else fail("Fail"); - c->collapse("Collapse PTE table full of compound pages", p, 1, true); + c->collapse("Collapse PTE table full of compound pages", p, 1, ops, + true); validate_memory(p, 0, hpage_pmd_size); - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); } -static void collapse_compound_extreme(struct collapse_context *c) +static void collapse_compound_extreme(struct collapse_context *c, struct mem_ops *ops) { void *p; int i; - p = alloc_mapping(1); + p = ops->setup_area(1); for (i = 0; i < hpage_pmd_nr; i++) { printf("\rConstruct PTE page table full of different PTE-mapped compound pages %3d/%d...", i + 1, hpage_pmd_nr); madvise(BASE_ADDR, hpage_pmd_size, MADV_HUGEPAGE); - fill_memory(BASE_ADDR, 0, hpage_pmd_size); - if (!check_huge(BASE_ADDR, 1)) { + ops->fault(BASE_ADDR, 0, hpage_pmd_size); + if (!ops->check_huge(BASE_ADDR, 1)) { printf("Failed to allocate huge page\n"); exit(EXIT_FAILURE); } @@ -777,30 +825,30 @@ static void collapse_compound_extreme(struct collapse_context *c) } } - munmap(BASE_ADDR, hpage_pmd_size); - fill_memory(p, 0, hpage_pmd_size); - if (check_huge(p, 0)) + ops->cleanup_area(BASE_ADDR, hpage_pmd_size); + ops->fault(p, 0, hpage_pmd_size); + if (!ops->check_huge(p, 1)) success("OK"); else fail("Fail"); c->collapse("Collapse PTE table full of different compound pages", p, 1, - true); + ops, true); validate_memory(p, 0, hpage_pmd_size); - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); } -static void collapse_fork(struct collapse_context *c) +static void collapse_fork(struct collapse_context *c, struct mem_ops *ops) { int wstatus; void *p; - p = alloc_mapping(1); + p = ops->setup_area(1); printf("Allocate small page..."); - fill_memory(p, 0, page_size); - if (check_huge(p, 0)) + ops->fault(p, 0, page_size); + if (ops->check_huge(p, 0)) success("OK"); else fail("Fail"); @@ -811,17 +859,17 @@ static void collapse_fork(struct collapse_context *c) skip_settings_restore = true; exit_status = 0; - if (check_huge(p, 0)) + if (ops->check_huge(p, 0)) success("OK"); else fail("Fail"); - fill_memory(p, page_size, 2 * page_size); + ops->fault(p, page_size, 2 * page_size); c->collapse("Collapse PTE table with single page shared with parent process", - p, 1, true); + p, 1, ops, true); validate_memory(p, 0, page_size); - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); exit(exit_status); } @@ -829,27 +877,27 @@ static void collapse_fork(struct collapse_context *c) exit_status += WEXITSTATUS(wstatus); printf("Check if parent still has small page..."); - if (check_huge(p, 0)) + if (ops->check_huge(p, 0)) success("OK"); else 
fail("Fail"); validate_memory(p, 0, page_size); - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); } -static void collapse_fork_compound(struct collapse_context *c) +static void collapse_fork_compound(struct collapse_context *c, struct mem_ops *ops) { int wstatus; void *p; - p = alloc_hpage(); + p = alloc_hpage(ops); printf("Share huge page over fork()..."); if (!fork()) { /* Do not touch settings on child exit */ skip_settings_restore = true; exit_status = 0; - if (check_huge(p, 1)) + if (ops->check_huge(p, 1)) success("OK"); else fail("Fail"); @@ -857,20 +905,20 @@ static void collapse_fork_compound(struct collapse_context *c) printf("Split huge page PMD in child process..."); madvise(p, page_size, MADV_NOHUGEPAGE); madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE); - if (check_huge(p, 0)) + if (ops->check_huge(p, 0)) success("OK"); else fail("Fail"); - fill_memory(p, 0, page_size); + ops->fault(p, 0, page_size); write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1); c->collapse("Collapse PTE table full of compound pages in child", - p, 1, true); + p, 1, ops, true); write_num("khugepaged/max_ptes_shared", current_settings()->khugepaged.max_ptes_shared); validate_memory(p, 0, hpage_pmd_size); - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); exit(exit_status); } @@ -878,59 +926,59 @@ static void collapse_fork_compound(struct collapse_context *c) exit_status += WEXITSTATUS(wstatus); printf("Check if parent still has huge page..."); - if (check_huge(p, 1)) + if (ops->check_huge(p, 1)) success("OK"); else fail("Fail"); validate_memory(p, 0, hpage_pmd_size); - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); } -static void collapse_max_ptes_shared(struct collapse_context *c) +static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops *ops) { int max_ptes_shared = read_num("khugepaged/max_ptes_shared"); int wstatus; void *p; - p = alloc_hpage(); + p = alloc_hpage(ops); printf("Share huge page over fork()..."); if (!fork()) { /* Do not touch settings on child exit */ skip_settings_restore = true; exit_status = 0; - if (check_huge(p, 1)) + if (ops->check_huge(p, 1)) success("OK"); else fail("Fail"); printf("Trigger CoW on page %d of %d...", hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr); - fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size); - if (check_huge(p, 0)) + ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size); + if (ops->check_huge(p, 0)) success("OK"); else fail("Fail"); c->collapse("Maybe collapse with max_ptes_shared exceeded", p, - 1, !c->enforce_pte_scan_limits); + 1, ops, !c->enforce_pte_scan_limits); if (c->enforce_pte_scan_limits) { printf("Trigger CoW on page %d of %d...", hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr); - fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) * + ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared) * page_size); - if (check_huge(p, 0)) + if (ops->check_huge(p, 0)) success("OK"); else fail("Fail"); c->collapse("Collapse with max_ptes_shared PTEs shared", - p, 1, true); + p, 1, ops, true); } validate_memory(p, 0, hpage_pmd_size); - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); exit(exit_status); } @@ -938,42 +986,28 @@ static void collapse_max_ptes_shared(struct collapse_context *c) exit_status += WEXITSTATUS(wstatus); printf("Check if parent still has huge page..."); - if (check_huge(p, 1)) + if (ops->check_huge(p, 1)) success("OK"); else fail("Fail"); validate_memory(p, 0, hpage_pmd_size); - munmap(p, hpage_pmd_size); + 
ops->cleanup_area(p, hpage_pmd_size); } -static void madvise_collapse_existing_thps(void) +static void madvise_collapse_existing_thps(struct collapse_context *c, + struct mem_ops *ops) { void *p; - int err; - p = alloc_mapping(1); - fill_memory(p, 0, hpage_pmd_size); + p = ops->setup_area(1); + ops->fault(p, 0, hpage_pmd_size); + c->collapse("Collapse fully populated PTE table...", p, 1, ops, true); + validate_memory(p, 0, hpage_pmd_size); - printf("Collapse fully populated PTE table..."); - /* - * Note that we don't set MADV_HUGEPAGE here, which - * also tests that VM_HUGEPAGE isn't required for - * MADV_COLLAPSE in "madvise" mode. - */ - err = madvise(p, hpage_pmd_size, MADV_COLLAPSE); - if (err == 0 && check_huge(p, 1)) { - success("OK"); - printf("Re-collapse PMD-mapped hugepage"); - err = madvise(p, hpage_pmd_size, MADV_COLLAPSE); - if (err == 0 && check_huge(p, 1)) - success("OK"); - else - fail("Fail"); - } else { - fail("Fail"); - } + /* c->collapse() will find a hugepage and complain - call directly. */ + __madvise_collapse("Re-collapse PMD-mapped hugepage", p, 1, ops, true); validate_memory(p, 0, hpage_pmd_size); - munmap(p, hpage_pmd_size); + ops->cleanup_area(p, hpage_pmd_size); } int main(int argc, const char **argv) @@ -1013,37 +1047,37 @@ int main(int argc, const char **argv) c.collapse = &khugepaged_collapse; c.enforce_pte_scan_limits = true; - collapse_full(&c); - collapse_empty(&c); - collapse_single_pte_entry(&c); - collapse_max_ptes_none(&c); - collapse_swapin_single_pte(&c); - collapse_max_ptes_swap(&c); - collapse_single_pte_entry_compound(&c); - collapse_full_of_compound(&c); - collapse_compound_extreme(&c); - collapse_fork(&c); - collapse_fork_compound(&c); - collapse_max_ptes_shared(&c); + collapse_full(&c, &anon_ops); + collapse_empty(&c, &anon_ops); + collapse_single_pte_entry(&c, &anon_ops); + collapse_max_ptes_none(&c, &anon_ops); + collapse_swapin_single_pte(&c, &anon_ops); + collapse_max_ptes_swap(&c, &anon_ops); + collapse_single_pte_entry_compound(&c, &anon_ops); + collapse_full_of_compound(&c, &anon_ops); + collapse_compound_extreme(&c, &anon_ops); + collapse_fork(&c, &anon_ops); + collapse_fork_compound(&c, &anon_ops); + collapse_max_ptes_shared(&c, &anon_ops); } if (!strcmp(tests, "madvise") || !strcmp(tests, "all")) { printf("\n*** Testing context: madvise ***\n"); c.collapse = &madvise_collapse; c.enforce_pte_scan_limits = false; - collapse_full(&c); - collapse_empty(&c); - collapse_single_pte_entry(&c); - collapse_max_ptes_none(&c); - collapse_swapin_single_pte(&c); - collapse_max_ptes_swap(&c); - collapse_single_pte_entry_compound(&c); - collapse_full_of_compound(&c); - collapse_compound_extreme(&c); - collapse_fork(&c); - collapse_fork_compound(&c); - collapse_max_ptes_shared(&c); - madvise_collapse_existing_thps(); + collapse_full(&c, &anon_ops); + collapse_empty(&c, &anon_ops); + collapse_single_pte_entry(&c, &anon_ops); + collapse_max_ptes_none(&c, &anon_ops); + collapse_swapin_single_pte(&c, &anon_ops); + collapse_max_ptes_swap(&c, &anon_ops); + collapse_single_pte_entry_compound(&c, &anon_ops); + collapse_full_of_compound(&c, &anon_ops); + collapse_compound_extreme(&c, &anon_ops); + collapse_fork(&c, &anon_ops); + collapse_fork_compound(&c, &anon_ops); + collapse_max_ptes_shared(&c, &anon_ops); + madvise_collapse_existing_thps(&c, &anon_ops); } restore_settings(0); -- 2.37.2.672.g94769d06f0-goog ^ permalink raw reply [flat|nested] 35+ messages in thread
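For readers working through the diff above, the shape of the refactor can be
summarized with a condensed sketch (not the literal selftest code; names follow
the diff, error handling and memory validation elided): each test is handed a
collapse context plus a memory-ops backend and never calls the anon-specific
helpers directly.

#include <stdbool.h>

static unsigned long hpage_pmd_size;	/* read from the THP sysfs files at startup */

struct mem_ops {
	void *(*setup_area)(int nr_hpages);
	void (*cleanup_area)(void *p, unsigned long size);
	void (*fault)(void *p, unsigned long start, unsigned long end);
	bool (*check_huge)(void *addr, int nr_hpages);
};

struct collapse_context {
	void (*collapse)(const char *msg, char *p, int nr_hpages,
			 struct mem_ops *ops, bool expect);
	bool enforce_pte_scan_limits;
};

/* A test body only talks to 'c' and 'ops'; anon (and later file/shmem)
 * backends plug in behind the same four callbacks. */
static void collapse_full(struct collapse_context *c, struct mem_ops *ops)
{
	int nr_hpages = 4;
	unsigned long size = nr_hpages * hpage_pmd_size;
	void *p = ops->setup_area(nr_hpages);

	ops->fault(p, 0, size);
	c->collapse("Collapse multiple fully populated PTE table", p,
		    nr_hpages, ops, true);
	ops->cleanup_area(p, size);
}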
* [PATCH mm-unstable v2 7/9] selftests/vm: add thp collapse file and tmpfs testing
  2022-08-26 22:03 Zach O'Keefe
                   ` (5 preceding siblings ...)
  2022-08-26 22:03 ` [PATCH mm-unstable v2 6/9] selftests/vm: modularize thp collapse memory operations Zach O'Keefe
@ 2022-08-26 22:03 ` Zach O'Keefe
  2022-08-26 22:03 ` [PATCH mm-unstable v2 8/9] selftests/vm: add thp collapse shmem testing Zach O'Keefe
  ` (2 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Zach O'Keefe @ 2022-08-26 22:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

Add memory operations for file-backed and tmpfs memory. Call existing
tests with these new memory operations to test collapse functionality
of khugepaged and MADV_COLLAPSE on file-backed and tmpfs memory. Not
all tests are reusable; for example, collapse_swapin_single_pte(),
which checks swap usage.

Refactor test arguments. Usage is now:

Usage: ./khugepaged <test type> [dir]

	<test type>	: <context>:<mem_type>
	<context>	: [all|khugepaged|madvise]
	<mem_type>	: [all|anon|file]

	"file,all" mem_type requires [dir] argument

	"file,all" mem_type requires kernel built with
	CONFIG_READ_ONLY_THP_FOR_FS=y

	if [dir] is a (sub)directory of a tmpfs mount, tmpfs must be
	mounted with huge=madvise option for khugepaged tests to work

Refactor calling tests to make it clear what collapse context / memory
operations they support, but only invoke tests requested by the user.
Also log which test is being run, and with what context / memory, to
make test logs more human readable.

A new test file is created and deleted for every test to ensure no
pages remain in the page cache between tests. Also, disable
read_ahead_kb so that pages don't find their way into the page cache
without the tests faulting them in.

Add file and shmem wrappers to vm_util to check for file and shmem
hugepages in smaps.
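For illustration, a few possible invocations under the new scheme (the
directory paths here are made up for the example; a real run needs a kernel
with CONFIG_READ_ONLY_THP_FOR_FS=y for the file-backed tests):

	# MADV_COLLAPSE tests against file-backed memory under an ext4 mount:
	./khugepaged madvise:file /mnt/ext4/thp-test

	# khugepaged tests against a tmpfs directory (mounted huge=madvise):
	./khugepaged khugepaged:file /mnt/tmpfs/thp-test

	# Everything, as before:
	./khugepaged all:all /mnt/ext4/thp-test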
Signed-off-by: Zach O'Keefe <zokeefe@google.com> --- tools/testing/selftests/vm/khugepaged.c | 401 ++++++++++++++++++++---- tools/testing/selftests/vm/vm_util.c | 10 + tools/testing/selftests/vm/vm_util.h | 2 + 3 files changed, 357 insertions(+), 56 deletions(-) diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c index b4b1709507a5..0ddfffb87411 100644 --- a/tools/testing/selftests/vm/khugepaged.c +++ b/tools/testing/selftests/vm/khugepaged.c @@ -1,6 +1,7 @@ #define _GNU_SOURCE #include <fcntl.h> #include <limits.h> +#include <dirent.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> @@ -10,12 +11,20 @@ #include <sys/mman.h> #include <sys/wait.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/vfs.h> + +#include "linux/magic.h" #include "vm_util.h" #ifndef MADV_PAGEOUT #define MADV_PAGEOUT 21 #endif +#ifndef MADV_POPULATE_READ +#define MADV_POPULATE_READ 22 +#endif #ifndef MADV_COLLAPSE #define MADV_COLLAPSE 25 #endif @@ -26,21 +35,48 @@ static unsigned long page_size; static int hpage_pmd_nr; #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/" +#define READ_AHEAD_SYSFS "/sys/block/sda/queue/read_ahead_kb" #define PID_SMAPS "/proc/self/smaps" +#define TEST_FILE "collapse_test_file" + +#define MAX_LINE_LENGTH 500 + +enum vma_type { + VMA_ANON, + VMA_FILE, + VMA_SHMEM, +}; struct mem_ops { void *(*setup_area)(int nr_hpages); void (*cleanup_area)(void *p, unsigned long size); void (*fault)(void *p, unsigned long start, unsigned long end); bool (*check_huge)(void *addr, int nr_hpages); + const char *name; }; +static struct mem_ops *file_ops; +static struct mem_ops *anon_ops; + struct collapse_context { void (*collapse)(const char *msg, char *p, int nr_hpages, struct mem_ops *ops, bool expect); bool enforce_pte_scan_limits; + const char *name; +}; + +static struct collapse_context *khugepaged_context; +static struct collapse_context *madvise_context; + +struct file_info { + const char *dir; + char path[MAX_LINE_LENGTH]; + enum vma_type type; + int fd; }; +static struct file_info finfo; + enum thp_enabled { THP_ALWAYS, THP_MADVISE, @@ -106,6 +142,7 @@ struct settings { enum shmem_enabled shmem_enabled; bool use_zero_page; struct khugepaged_settings khugepaged; + unsigned long read_ahead_kb; }; static struct settings saved_settings; @@ -124,6 +161,11 @@ static void fail(const char *msg) exit_status++; } +static void skip(const char *msg) +{ + printf(" \e[33m%s\e[0m\n", msg); +} + static int read_file(const char *path, char *buf, size_t buflen) { int fd; @@ -151,13 +193,19 @@ static int write_file(const char *path, const char *buf, size_t buflen) ssize_t numwritten; fd = open(path, O_WRONLY); - if (fd == -1) + if (fd == -1) { + printf("open(%s)\n", path); + exit(EXIT_FAILURE); return 0; + } numwritten = write(fd, buf, buflen - 1); close(fd); - if (numwritten < 1) + if (numwritten < 1) { + printf("write(%s)\n", buf); + exit(EXIT_FAILURE); return 0; + } return (unsigned int) numwritten; } @@ -224,20 +272,11 @@ static void write_string(const char *name, const char *val) } } -static const unsigned long read_num(const char *name) +static const unsigned long _read_num(const char *path) { - char path[PATH_MAX]; char buf[21]; - int ret; - - ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name); - if (ret >= PATH_MAX) { - printf("%s: Pathname is too long\n", __func__); - exit(EXIT_FAILURE); - } - ret = read_file(path, buf, sizeof(buf)); - if (ret < 0) { + if (read_file(path, buf, sizeof(buf)) < 0) { perror("read_file(read_num)"); 
exit(EXIT_FAILURE); } @@ -245,10 +284,9 @@ static const unsigned long read_num(const char *name) return strtoul(buf, NULL, 10); } -static void write_num(const char *name, unsigned long num) +static const unsigned long read_num(const char *name) { char path[PATH_MAX]; - char buf[21]; int ret; ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name); @@ -256,6 +294,12 @@ static void write_num(const char *name, unsigned long num) printf("%s: Pathname is too long\n", __func__); exit(EXIT_FAILURE); } + return _read_num(path); +} + +static void _write_num(const char *path, unsigned long num) +{ + char buf[21]; sprintf(buf, "%ld", num); if (!write_file(path, buf, strlen(buf) + 1)) { @@ -264,6 +308,19 @@ static void write_num(const char *name, unsigned long num) } } +static void write_num(const char *name, unsigned long num) +{ + char path[PATH_MAX]; + int ret; + + ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name); + if (ret >= PATH_MAX) { + printf("%s: Pathname is too long\n", __func__); + exit(EXIT_FAILURE); + } + _write_num(path, num); +} + static void write_settings(struct settings *settings) { struct khugepaged_settings *khugepaged = &settings->khugepaged; @@ -283,6 +340,8 @@ static void write_settings(struct settings *settings) write_num("khugepaged/max_ptes_swap", khugepaged->max_ptes_swap); write_num("khugepaged/max_ptes_shared", khugepaged->max_ptes_shared); write_num("khugepaged/pages_to_scan", khugepaged->pages_to_scan); + + _write_num(READ_AHEAD_SYSFS, settings->read_ahead_kb); } #define MAX_SETTINGS_DEPTH 4 @@ -341,6 +400,7 @@ static void save_settings(void) .shmem_enabled = read_string("shmem_enabled", shmem_enabled_strings), .use_zero_page = read_num("use_zero_page"), + .read_ahead_kb = _read_num(READ_AHEAD_SYSFS), }; saved_settings.khugepaged = (struct khugepaged_settings) { .defrag = read_num("khugepaged/defrag"), @@ -361,7 +421,6 @@ static void save_settings(void) signal(SIGQUIT, restore_settings); } -#define MAX_LINE_LENGTH 500 static bool check_swap(void *addr, unsigned long size) { bool swap = false; @@ -489,11 +548,109 @@ static bool anon_check_huge(void *addr, int nr_hpages) return check_huge_anon(addr, nr_hpages, hpage_pmd_size); } -static struct mem_ops anon_ops = { +static void *file_setup_area(int nr_hpages) +{ + struct stat path_stat; + struct statfs fs; + int fd; + void *p; + unsigned long size; + + stat(finfo.dir, &path_stat); + if (!S_ISDIR(path_stat.st_mode)) { + printf("%s: Not a directory (%s)\n", __func__, finfo.dir); + exit(EXIT_FAILURE); + } + if (snprintf(finfo.path, sizeof(finfo.path), "%s/" TEST_FILE, + finfo.dir) >= sizeof(finfo.path)) { + printf("%s: Pathname is too long\n", __func__); + exit(EXIT_FAILURE); + } + if (statfs(finfo.dir, &fs)) { + perror("statfs()"); + exit(EXIT_FAILURE); + } + finfo.type = fs.f_type == TMPFS_MAGIC ? VMA_SHMEM : VMA_FILE; + + unlink(finfo.path); /* Cleanup from previous failed tests */ + printf("Creating %s for collapse%s...", finfo.path, + finfo.type == VMA_SHMEM ? 
" (tmpfs)" : ""); + fd = open(finfo.path, O_DSYNC | O_CREAT | O_RDWR | O_TRUNC | O_EXCL, + 777); + if (fd < 0) { + perror("open()"); + exit(EXIT_FAILURE); + } + + size = nr_hpages * hpage_pmd_size; + p = alloc_mapping(nr_hpages); + fill_memory(p, 0, size); + write(fd, p, size); + close(fd); + munmap(p, size); + success("OK"); + + printf("Opening %s read only for collapse...", finfo.path); + finfo.fd = open(finfo.path, O_RDONLY, 777); + if (finfo.fd < 0) { + perror("open()"); + exit(EXIT_FAILURE); + } + p = mmap(BASE_ADDR, size, PROT_READ | PROT_EXEC, + MAP_PRIVATE, finfo.fd, 0); + if (p == MAP_FAILED || p != BASE_ADDR) { + perror("mmap()"); + exit(EXIT_FAILURE); + } + + /* Drop page cache */ + write_file("/proc/sys/vm/drop_caches", "3", 2); + success("OK"); + return p; +} + +static void file_cleanup_area(void *p, unsigned long size) +{ + munmap(p, size); + close(finfo.fd); + unlink(finfo.path); +} + +static void file_fault(void *p, unsigned long start, unsigned long end) +{ + if (madvise(((char *)p) + start, end - start, MADV_POPULATE_READ)) { + perror("madvise(MADV_POPULATE_READ"); + exit(EXIT_FAILURE); + } +} + +static bool file_check_huge(void *addr, int nr_hpages) +{ + switch (finfo.type) { + case VMA_FILE: + return check_huge_file(addr, nr_hpages, hpage_pmd_size); + case VMA_SHMEM: + return check_huge_shmem(addr, nr_hpages, hpage_pmd_size); + default: + exit(EXIT_FAILURE); + return false; + } +} + +static struct mem_ops __anon_ops = { .setup_area = &anon_setup_area, .cleanup_area = &anon_cleanup_area, .fault = &anon_fault, .check_huge = &anon_check_huge, + .name = "anon", +}; + +static struct mem_ops __file_ops = { + .setup_area = &file_setup_area, + .cleanup_area = &file_cleanup_area, + .fault = &file_fault, + .check_huge = &file_check_huge, + .name = "file", }; static void __madvise_collapse(const char *msg, char *p, int nr_hpages, @@ -509,6 +666,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages, * ignores /sys/kernel/mm/transparent_hugepage/enabled */ settings.thp_enabled = THP_NEVER; + settings.shmem_enabled = SHMEM_NEVER; push_settings(&settings); /* Clear VM_NOHUGEPAGE */ @@ -580,12 +738,37 @@ static void khugepaged_collapse(const char *msg, char *p, int nr_hpages, return; } + /* + * For file and shmem memory, khugepaged only retracts pte entries after + * putting the new hugepage in the page cache. The hugepage must be + * subsequently refaulted to install the pmd mapping for the mm. + */ + if (ops != &__anon_ops) + ops->fault(p, 0, nr_hpages * hpage_pmd_size); + if (ops->check_huge(p, expect ? 
nr_hpages : 0)) success("OK"); else fail("Fail"); } +static struct collapse_context __khugepaged_context = { + .collapse = &khugepaged_collapse, + .enforce_pte_scan_limits = true, + .name = "khugepaged", +}; + +static struct collapse_context __madvise_context = { + .collapse = &madvise_collapse, + .enforce_pte_scan_limits = false, + .name = "madvise", +}; + +static bool is_tmpfs(struct mem_ops *ops) +{ + return ops == &__file_ops && finfo.type == VMA_SHMEM; +} + static void alloc_at_fault(void) { struct settings settings = *current_settings(); @@ -658,6 +841,13 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o p = ops->setup_area(1); + if (is_tmpfs(ops)) { + /* shmem pages always in the page cache */ + printf("tmpfs..."); + skip("Skip"); + goto skip; + } + ops->fault(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size); c->collapse("Maybe collapse with max_ptes_none exceeded", p, 1, ops, !c->enforce_pte_scan_limits); @@ -670,6 +860,7 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size); } +skip: ops->cleanup_area(p, hpage_pmd_size); pop_settings(); } @@ -753,6 +944,13 @@ static void collapse_single_pte_entry_compound(struct collapse_context *c, struc p = alloc_hpage(ops); + if (is_tmpfs(ops)) { + /* MADV_DONTNEED won't evict tmpfs pages */ + printf("tmpfs..."); + skip("Skip"); + goto skip; + } + madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE); printf("Split huge page leaving single PTE mapping compound page..."); madvise(p + page_size, hpage_pmd_size - page_size, MADV_DONTNEED); @@ -764,6 +962,7 @@ static void collapse_single_pte_entry_compound(struct collapse_context *c, struc c->collapse("Collapse PTE table with single PTE mapping compound page", p, 1, ops, true); validate_memory(p, 0, page_size); +skip: ops->cleanup_area(p, hpage_pmd_size); } @@ -1010,9 +1209,72 @@ static void madvise_collapse_existing_thps(struct collapse_context *c, ops->cleanup_area(p, hpage_pmd_size); } +static void usage(void) +{ + fprintf(stderr, "\nUsage: ./khugepaged <test type> [dir]\n\n"); + fprintf(stderr, "\t<test type>\t: <context>:<mem_type>\n"); + fprintf(stderr, "\t<context>\t: [all|khugepaged|madvise]\n"); + fprintf(stderr, "\t<mem_type>\t: [all|anon|file]\n"); + fprintf(stderr, "\n\t\"file,all\" mem_type requires [dir] argument\n"); + fprintf(stderr, "\n\t\"file,all\" mem_type requires kernel built with\n"); + fprintf(stderr, "\tCONFIG_READ_ONLY_THP_FOR_FS=y\n"); + fprintf(stderr, "\n\tif [dir] is a (sub)directory of a tmpfs mount, tmpfs must be\n"); + fprintf(stderr, "\tmounted with huge=madvise option for khugepaged tests to work\n"); + exit(1); +} + +static void parse_test_type(int argc, const char **argv) +{ + char *buf; + const char *token; + + if (argc == 1) { + /* Backwards compatibility */ + khugepaged_context = &__khugepaged_context; + madvise_context = &__madvise_context; + anon_ops = &__anon_ops; + return; + } + + buf = strdup(argv[1]); + token = strsep(&buf, ":"); + + if (!strcmp(token, "all")) { + khugepaged_context = &__khugepaged_context; + madvise_context = &__madvise_context; + } else if (!strcmp(token, "khugepaged")) { + khugepaged_context = &__khugepaged_context; + } else if (!strcmp(token, "madvise")) { + madvise_context = &__madvise_context; + } else { + usage(); + } + + if (!buf) + usage(); + + if (!strcmp(buf, "all")) { + file_ops = &__file_ops; + anon_ops = &__anon_ops; + } else if (!strcmp(buf, "anon")) { + anon_ops = &__anon_ops; + } else if 
(!strcmp(buf, "file")) { + file_ops = &__file_ops; + } else { + usage(); + } + + if (!file_ops) + return; + + if (argc != 3) + usage(); + + finfo.dir = argv[2]; +} + int main(int argc, const char **argv) { - struct collapse_context c; struct settings default_settings = { .thp_enabled = THP_MADVISE, .thp_defrag = THP_DEFRAG_ALWAYS, @@ -1023,8 +1285,17 @@ int main(int argc, const char **argv) .alloc_sleep_millisecs = 10, .scan_sleep_millisecs = 10, }, + /* + * When testing file-backed memory, the collapse path + * looks at how many pages are found in the page cache, not + * what pages are mapped. Disable read ahead optimization so + * pages don't find their way into the page cache unless + * we mem_ops->fault() them in. + */ + .read_ahead_kb = 0, }; - const char *tests = argc == 1 ? "all" : argv[1]; + + parse_test_type(argc, argv); setbuf(stdout, NULL); @@ -1042,43 +1313,61 @@ int main(int argc, const char **argv) alloc_at_fault(); - if (!strcmp(tests, "khugepaged") || !strcmp(tests, "all")) { - printf("\n*** Testing context: khugepaged ***\n"); - c.collapse = &khugepaged_collapse; - c.enforce_pte_scan_limits = true; - - collapse_full(&c, &anon_ops); - collapse_empty(&c, &anon_ops); - collapse_single_pte_entry(&c, &anon_ops); - collapse_max_ptes_none(&c, &anon_ops); - collapse_swapin_single_pte(&c, &anon_ops); - collapse_max_ptes_swap(&c, &anon_ops); - collapse_single_pte_entry_compound(&c, &anon_ops); - collapse_full_of_compound(&c, &anon_ops); - collapse_compound_extreme(&c, &anon_ops); - collapse_fork(&c, &anon_ops); - collapse_fork_compound(&c, &anon_ops); - collapse_max_ptes_shared(&c, &anon_ops); - } - if (!strcmp(tests, "madvise") || !strcmp(tests, "all")) { - printf("\n*** Testing context: madvise ***\n"); - c.collapse = &madvise_collapse; - c.enforce_pte_scan_limits = false; - - collapse_full(&c, &anon_ops); - collapse_empty(&c, &anon_ops); - collapse_single_pte_entry(&c, &anon_ops); - collapse_max_ptes_none(&c, &anon_ops); - collapse_swapin_single_pte(&c, &anon_ops); - collapse_max_ptes_swap(&c, &anon_ops); - collapse_single_pte_entry_compound(&c, &anon_ops); - collapse_full_of_compound(&c, &anon_ops); - collapse_compound_extreme(&c, &anon_ops); - collapse_fork(&c, &anon_ops); - collapse_fork_compound(&c, &anon_ops); - collapse_max_ptes_shared(&c, &anon_ops); - madvise_collapse_existing_thps(&c, &anon_ops); - } +#define TEST(t, c, o) do { \ + if (c && o) { \ + printf("\nRun test: " #t " (%s:%s)\n", c->name, o->name); \ + t(c, o); \ + } \ + } while (0) + + TEST(collapse_full, khugepaged_context, anon_ops); + TEST(collapse_full, khugepaged_context, file_ops); + TEST(collapse_full, madvise_context, anon_ops); + TEST(collapse_full, madvise_context, file_ops); + + TEST(collapse_empty, khugepaged_context, anon_ops); + TEST(collapse_empty, madvise_context, anon_ops); + + TEST(collapse_single_pte_entry, khugepaged_context, anon_ops); + TEST(collapse_single_pte_entry, khugepaged_context, file_ops); + TEST(collapse_single_pte_entry, madvise_context, anon_ops); + TEST(collapse_single_pte_entry, madvise_context, file_ops); + + TEST(collapse_max_ptes_none, khugepaged_context, anon_ops); + TEST(collapse_max_ptes_none, khugepaged_context, file_ops); + TEST(collapse_max_ptes_none, madvise_context, anon_ops); + TEST(collapse_max_ptes_none, madvise_context, file_ops); + + TEST(collapse_single_pte_entry_compound, khugepaged_context, anon_ops); + TEST(collapse_single_pte_entry_compound, khugepaged_context, file_ops); + TEST(collapse_single_pte_entry_compound, madvise_context, anon_ops); + 
TEST(collapse_single_pte_entry_compound, madvise_context, file_ops); + + TEST(collapse_full_of_compound, khugepaged_context, anon_ops); + TEST(collapse_full_of_compound, khugepaged_context, file_ops); + TEST(collapse_full_of_compound, madvise_context, anon_ops); + TEST(collapse_full_of_compound, madvise_context, file_ops); + + TEST(collapse_compound_extreme, khugepaged_context, anon_ops); + TEST(collapse_compound_extreme, madvise_context, anon_ops); + + TEST(collapse_swapin_single_pte, khugepaged_context, anon_ops); + TEST(collapse_swapin_single_pte, madvise_context, anon_ops); + + TEST(collapse_max_ptes_swap, khugepaged_context, anon_ops); + TEST(collapse_max_ptes_swap, madvise_context, anon_ops); + + TEST(collapse_fork, khugepaged_context, anon_ops); + TEST(collapse_fork, madvise_context, anon_ops); + + TEST(collapse_fork_compound, khugepaged_context, anon_ops); + TEST(collapse_fork_compound, madvise_context, anon_ops); + + TEST(collapse_max_ptes_shared, khugepaged_context, anon_ops); + TEST(collapse_max_ptes_shared, madvise_context, anon_ops); + + TEST(madvise_collapse_existing_thps, madvise_context, anon_ops); + TEST(madvise_collapse_existing_thps, madvise_context, file_ops); restore_settings(0); } diff --git a/tools/testing/selftests/vm/vm_util.c b/tools/testing/selftests/vm/vm_util.c index 9dae51b8219f..f11f8adda521 100644 --- a/tools/testing/selftests/vm/vm_util.c +++ b/tools/testing/selftests/vm/vm_util.c @@ -114,3 +114,13 @@ bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size) { return __check_huge(addr, "AnonHugePages: ", nr_hpages, hpage_size); } + +bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size) +{ + return __check_huge(addr, "FilePmdMapped:", nr_hpages, hpage_size); +} + +bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size) +{ + return __check_huge(addr, "ShmemPmdMapped:", nr_hpages, hpage_size); +} diff --git a/tools/testing/selftests/vm/vm_util.h b/tools/testing/selftests/vm/vm_util.h index 8434ea0c95cd..5c35de454e08 100644 --- a/tools/testing/selftests/vm/vm_util.h +++ b/tools/testing/selftests/vm/vm_util.h @@ -8,3 +8,5 @@ void clear_softdirty(void); bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len); uint64_t read_pmd_pagesize(void); bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size); +bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size); +bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size); -- 2.37.2.672.g94769d06f0-goog ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH mm-unstable v2 8/9] selftests/vm: add thp collapse shmem testing 2022-08-26 22:03 Zach O'Keefe ` (6 preceding siblings ...) 2022-08-26 22:03 ` [PATCH mm-unstable v2 7/9] selftests/vm: add thp collapse file and tmpfs testing Zach O'Keefe @ 2022-08-26 22:03 ` Zach O'Keefe 2022-08-26 22:03 ` [PATCH mm-unstable v2 9/9] selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory Zach O'Keefe 2022-08-31 21:47 ` Yang Shi 9 siblings, 0 replies; 35+ messages in thread From: Zach O'Keefe @ 2022-08-26 22:03 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton, Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand, David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia, Zach O'Keefe Add memory operations for shmem (memfd) memory, and reuse existing tests with the new memory operations. Shmem tests can be called with "shmem" mem_type, and shmem tests are ran with "all" mem_type as well. Signed-off-by: Zach O'Keefe <zokeefe@google.com> --- tools/testing/selftests/vm/khugepaged.c | 57 ++++++++++++++++++++++++- 1 file changed, 55 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c index 0ddfffb87411..9a136f65d43a 100644 --- a/tools/testing/selftests/vm/khugepaged.c +++ b/tools/testing/selftests/vm/khugepaged.c @@ -57,6 +57,7 @@ struct mem_ops { static struct mem_ops *file_ops; static struct mem_ops *anon_ops; +static struct mem_ops *shmem_ops; struct collapse_context { void (*collapse)(const char *msg, char *p, int nr_hpages, @@ -637,6 +638,40 @@ static bool file_check_huge(void *addr, int nr_hpages) } } +static void *shmem_setup_area(int nr_hpages) +{ + void *p; + unsigned long size = nr_hpages * hpage_pmd_size; + + finfo.fd = memfd_create("khugepaged-selftest-collapse-shmem", 0); + if (finfo.fd < 0) { + perror("memfd_create()"); + exit(EXIT_FAILURE); + } + if (ftruncate(finfo.fd, size)) { + perror("ftruncate()"); + exit(EXIT_FAILURE); + } + p = mmap(BASE_ADDR, size, PROT_READ | PROT_WRITE, MAP_SHARED, finfo.fd, + 0); + if (p != BASE_ADDR) { + perror("mmap()"); + exit(EXIT_FAILURE); + } + return p; +} + +static void shmem_cleanup_area(void *p, unsigned long size) +{ + munmap(p, size); + close(finfo.fd); +} + +static bool shmem_check_huge(void *addr, int nr_hpages) +{ + return check_huge_shmem(addr, nr_hpages, hpage_pmd_size); +} + static struct mem_ops __anon_ops = { .setup_area = &anon_setup_area, .cleanup_area = &anon_cleanup_area, @@ -653,6 +688,14 @@ static struct mem_ops __file_ops = { .name = "file", }; +static struct mem_ops __shmem_ops = { + .setup_area = &shmem_setup_area, + .cleanup_area = &shmem_cleanup_area, + .fault = &anon_fault, + .check_huge = &shmem_check_huge, + .name = "shmem", +}; + static void __madvise_collapse(const char *msg, char *p, int nr_hpages, struct mem_ops *ops, bool expect) { @@ -1214,7 +1257,7 @@ static void usage(void) fprintf(stderr, "\nUsage: ./khugepaged <test type> [dir]\n\n"); fprintf(stderr, "\t<test type>\t: <context>:<mem_type>\n"); fprintf(stderr, "\t<context>\t: [all|khugepaged|madvise]\n"); - fprintf(stderr, "\t<mem_type>\t: [all|anon|file]\n"); + fprintf(stderr, "\t<mem_type>\t: [all|anon|file|shmem]\n"); fprintf(stderr, "\n\t\"file,all\" mem_type requires [dir] argument\n"); fprintf(stderr, "\n\t\"file,all\" mem_type requires kernel built with\n"); fprintf(stderr, "\tCONFIG_READ_ONLY_THP_FOR_FS=y\n"); @@ 
-1256,10 +1299,13 @@ static void parse_test_type(int argc, const char **argv) if (!strcmp(buf, "all")) { file_ops = &__file_ops; anon_ops = &__anon_ops; + shmem_ops = &__shmem_ops; } else if (!strcmp(buf, "anon")) { anon_ops = &__anon_ops; } else if (!strcmp(buf, "file")) { file_ops = &__file_ops; + } else if (!strcmp(buf, "shmem")) { + shmem_ops = &__shmem_ops; } else { usage(); } @@ -1278,7 +1324,7 @@ int main(int argc, const char **argv) struct settings default_settings = { .thp_enabled = THP_MADVISE, .thp_defrag = THP_DEFRAG_ALWAYS, - .shmem_enabled = SHMEM_NEVER, + .shmem_enabled = SHMEM_ADVISE, .use_zero_page = 0, .khugepaged = { .defrag = 1, @@ -1322,16 +1368,20 @@ int main(int argc, const char **argv) TEST(collapse_full, khugepaged_context, anon_ops); TEST(collapse_full, khugepaged_context, file_ops); + TEST(collapse_full, khugepaged_context, shmem_ops); TEST(collapse_full, madvise_context, anon_ops); TEST(collapse_full, madvise_context, file_ops); + TEST(collapse_full, madvise_context, shmem_ops); TEST(collapse_empty, khugepaged_context, anon_ops); TEST(collapse_empty, madvise_context, anon_ops); TEST(collapse_single_pte_entry, khugepaged_context, anon_ops); TEST(collapse_single_pte_entry, khugepaged_context, file_ops); + TEST(collapse_single_pte_entry, khugepaged_context, shmem_ops); TEST(collapse_single_pte_entry, madvise_context, anon_ops); TEST(collapse_single_pte_entry, madvise_context, file_ops); + TEST(collapse_single_pte_entry, madvise_context, shmem_ops); TEST(collapse_max_ptes_none, khugepaged_context, anon_ops); TEST(collapse_max_ptes_none, khugepaged_context, file_ops); @@ -1345,8 +1395,10 @@ int main(int argc, const char **argv) TEST(collapse_full_of_compound, khugepaged_context, anon_ops); TEST(collapse_full_of_compound, khugepaged_context, file_ops); + TEST(collapse_full_of_compound, khugepaged_context, shmem_ops); TEST(collapse_full_of_compound, madvise_context, anon_ops); TEST(collapse_full_of_compound, madvise_context, file_ops); + TEST(collapse_full_of_compound, madvise_context, shmem_ops); TEST(collapse_compound_extreme, khugepaged_context, anon_ops); TEST(collapse_compound_extreme, madvise_context, anon_ops); @@ -1368,6 +1420,7 @@ int main(int argc, const char **argv) TEST(madvise_collapse_existing_thps, madvise_context, anon_ops); TEST(madvise_collapse_existing_thps, madvise_context, file_ops); + TEST(madvise_collapse_existing_thps, madvise_context, shmem_ops); restore_settings(0); } -- 2.37.2.672.g94769d06f0-goog ^ permalink raw reply [flat|nested] 35+ messages in thread
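For example, the shmem (memfd) tests added here can be invoked without a [dir]
argument; these invocations follow the usage text above:

	# MADV_COLLAPSE tests against memfd/shmem-backed memory:
	./khugepaged madvise:shmem

	# khugepaged tests against memfd/shmem-backed memory:
	./khugepaged khugepaged:shmem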
* [PATCH mm-unstable v2 9/9] selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  2022-08-26 22:03 Zach O'Keefe
                   ` (7 preceding siblings ...)
  2022-08-26 22:03 ` [PATCH mm-unstable v2 8/9] selftests/vm: add thp collapse shmem testing Zach O'Keefe
@ 2022-08-26 22:03 ` Zach O'Keefe
  2022-08-31 21:47 ` Yang Shi
  9 siblings, 0 replies; 35+ messages in thread
From: Zach O'Keefe @ 2022-08-26 22:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

Add a :collapse mod to the userfaultfd selftest. Currently this mod is
only valid for the "shmem" test type, but could be used for other test
types.

When provided, memory allocated by ->allocate_area() will be
hugepage-aligned and enforced to be hugepage-sized.

userfaultfd_minor_test, after the UFFD-registered mapping has been
populated by the UFFD minor fault handler, attempts to MADV_COLLAPSE
the UFFD-registered mapping to collapse the memory into a pmd-mapped
THP.

This test is meant to be a functional test of what occurs during
UFFD-driven live migration of VMs backed by huge tmpfs where, after a
hugepage-sized region has been successfully migrated (in native
page-sized chunks, to avoid the latency of fetching a hugepage over
the network), we want to reclaim previous VM performance by remapping
it at the PMD level.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/Makefile      |   1 +
 tools/testing/selftests/vm/userfaultfd.c | 171 ++++++++++++++++++-----
 2 files changed, 134 insertions(+), 38 deletions(-)

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index df4fa77febca..c22b5b613296 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -101,6 +101,7 @@ $(OUTPUT)/khugepaged: vm_util.c
 $(OUTPUT)/madv_populate: vm_util.c
 $(OUTPUT)/soft-dirty: vm_util.c
 $(OUTPUT)/split_huge_page_test: vm_util.c
+$(OUTPUT)/userfaultfd: vm_util.c
 
 ifeq ($(MACHINE),x86_64)
 BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 7be709d9eed0..74babdbc02e5 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -61,10 +61,11 @@
 #include <sys/random.h>
 
 #include "../kselftest.h"
+#include "vm_util.h"
 
 #ifdef __NR_userfaultfd
 
-static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size;
+static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size, hpage_size;
 
 #define BOUNCE_RANDOM (1<<0)
 #define BOUNCE_RACINGFAULTS (1<<1)
@@ -79,6 +80,8 @@ static int test_type;
 
 #define UFFD_FLAGS (O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY)
 
+#define BASE_PMD_ADDR ((void *)(1UL << 30))
+
 /* test using /dev/userfaultfd, instead of userfaultfd(2) */
 static bool test_dev_userfaultfd;
 
@@ -97,9 +100,10 @@ static int huge_fd;
 static unsigned long long *count_verify;
 static int uffd = -1;
 static int uffd_flags, finished, *pipefd;
-static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
+static char *area_src, *area_src_alias, *area_dst, *area_dst_alias, *area_remap;
 static char *zeropage;
 pthread_attr_t attr;
+static bool test_collapse;
 
 /* Userfaultfd test statistics */
 struct uffd_stats {
@@ -127,6 +131,8 @@ struct uffd_stats {
 #define swap(a, b) \
	do { 
typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0) +#define factor_of_2(x) ((x) ^ ((x) & ((x) - 1))) + const char *examples = "# Run anonymous memory test on 100MiB region with 99999 bounces:\n" "./userfaultfd anon 100 99999\n\n" @@ -152,6 +158,8 @@ static void usage(void) "Supported mods:\n"); fprintf(stderr, "\tsyscall - Use userfaultfd(2) (default)\n"); fprintf(stderr, "\tdev - Use /dev/userfaultfd instead of userfaultfd(2)\n"); + fprintf(stderr, "\tcollapse - Test MADV_COLLAPSE of UFFDIO_REGISTER_MODE_MINOR\n" + "memory\n"); fprintf(stderr, "\nExample test mod usage:\n"); fprintf(stderr, "# Run anonymous memory test with /dev/userfaultfd:\n"); fprintf(stderr, "./userfaultfd anon:dev 100 99999\n\n"); @@ -229,12 +237,10 @@ static void anon_release_pages(char *rel_area) err("madvise(MADV_DONTNEED) failed"); } -static void anon_allocate_area(void **alloc_area) +static void anon_allocate_area(void **alloc_area, bool is_src) { *alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); - if (*alloc_area == MAP_FAILED) - err("mmap of anonymous memory failed"); } static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset) @@ -252,7 +258,7 @@ static void hugetlb_release_pages(char *rel_area) } } -static void hugetlb_allocate_area(void **alloc_area) +static void hugetlb_allocate_area(void **alloc_area, bool is_src) { void *area_alias = NULL; char **alloc_area_alias; @@ -262,7 +268,7 @@ static void hugetlb_allocate_area(void **alloc_area) nr_pages * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | - (*alloc_area == area_src ? 0 : MAP_NORESERVE), + (is_src ? 0 : MAP_NORESERVE), -1, 0); else @@ -270,9 +276,9 @@ static void hugetlb_allocate_area(void **alloc_area) nr_pages * page_size, PROT_READ | PROT_WRITE, MAP_SHARED | - (*alloc_area == area_src ? 0 : MAP_NORESERVE), + (is_src ? 0 : MAP_NORESERVE), huge_fd, - *alloc_area == area_src ? 0 : nr_pages * page_size); + is_src ? 0 : nr_pages * page_size); if (*alloc_area == MAP_FAILED) err("mmap of hugetlbfs file failed"); @@ -282,12 +288,12 @@ static void hugetlb_allocate_area(void **alloc_area) PROT_READ | PROT_WRITE, MAP_SHARED, huge_fd, - *alloc_area == area_src ? 0 : nr_pages * page_size); + is_src ? 0 : nr_pages * page_size); if (area_alias == MAP_FAILED) err("mmap of hugetlb file alias failed"); } - if (*alloc_area == area_src) { + if (is_src) { alloc_area_alias = &area_src_alias; } else { alloc_area_alias = &area_dst_alias; @@ -310,21 +316,36 @@ static void shmem_release_pages(char *rel_area) err("madvise(MADV_REMOVE) failed"); } -static void shmem_allocate_area(void **alloc_area) +static void shmem_allocate_area(void **alloc_area, bool is_src) { void *area_alias = NULL; - bool is_src = alloc_area == (void **)&area_src; - unsigned long offset = is_src ? 0 : nr_pages * page_size; + size_t bytes = nr_pages * page_size; + unsigned long offset = is_src ? 
0 : bytes; + char *p = NULL, *p_alias = NULL; + + if (test_collapse) { + p = BASE_PMD_ADDR; + if (!is_src) + /* src map + alias + interleaved hpages */ + p += 2 * (bytes + hpage_size); + p_alias = p; + p_alias += bytes; + p_alias += hpage_size; /* Prevent src/dst VMA merge */ + } - *alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, - MAP_SHARED, shm_fd, offset); + *alloc_area = mmap(p, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, + shm_fd, offset); if (*alloc_area == MAP_FAILED) err("mmap of memfd failed"); + if (test_collapse && *alloc_area != p) + err("mmap of memfd failed at %p", p); - area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, - MAP_SHARED, shm_fd, offset); + area_alias = mmap(p_alias, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, + shm_fd, offset); if (area_alias == MAP_FAILED) err("mmap of memfd alias failed"); + if (test_collapse && area_alias != p_alias) + err("mmap of anonymous memory failed at %p", p_alias); if (is_src) area_src_alias = area_alias; @@ -337,28 +358,39 @@ static void shmem_alias_mapping(__u64 *start, size_t len, unsigned long offset) *start = (unsigned long)area_dst_alias + offset; } +static void shmem_check_pmd_mapping(void *p, int expect_nr_hpages) +{ + if (!check_huge_shmem(area_dst_alias, expect_nr_hpages, hpage_size)) + err("Did not find expected %d number of hugepages", + expect_nr_hpages); +} + struct uffd_test_ops { - void (*allocate_area)(void **alloc_area); + void (*allocate_area)(void **alloc_area, bool is_src); void (*release_pages)(char *rel_area); void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset); + void (*check_pmd_mapping)(void *p, int expect_nr_hpages); }; static struct uffd_test_ops anon_uffd_test_ops = { .allocate_area = anon_allocate_area, .release_pages = anon_release_pages, .alias_mapping = noop_alias_mapping, + .check_pmd_mapping = NULL, }; static struct uffd_test_ops shmem_uffd_test_ops = { .allocate_area = shmem_allocate_area, .release_pages = shmem_release_pages, .alias_mapping = shmem_alias_mapping, + .check_pmd_mapping = shmem_check_pmd_mapping, }; static struct uffd_test_ops hugetlb_uffd_test_ops = { .allocate_area = hugetlb_allocate_area, .release_pages = hugetlb_release_pages, .alias_mapping = hugetlb_alias_mapping, + .check_pmd_mapping = NULL, }; static struct uffd_test_ops *uffd_test_ops; @@ -478,6 +510,7 @@ static void uffd_test_ctx_clear(void) munmap_area((void **)&area_src_alias); munmap_area((void **)&area_dst); munmap_area((void **)&area_dst_alias); + munmap_area((void **)&area_remap); } static void uffd_test_ctx_init(uint64_t features) @@ -486,8 +519,8 @@ static void uffd_test_ctx_init(uint64_t features) uffd_test_ctx_clear(); - uffd_test_ops->allocate_area((void **)&area_src); - uffd_test_ops->allocate_area((void **)&area_dst); + uffd_test_ops->allocate_area((void **)&area_src, true); + uffd_test_ops->allocate_area((void **)&area_dst, false); userfaultfd_open(&features); @@ -804,6 +837,7 @@ static void *uffd_poll_thread(void *arg) err("remove failure"); break; case UFFD_EVENT_REMAP: + area_remap = area_dst; /* save for later unmap */ area_dst = (char *)(unsigned long)msg.arg.remap.to; break; } @@ -1256,13 +1290,30 @@ static int userfaultfd_sig_test(void) return userfaults != 0; } +void check_memory_contents(char *p) +{ + unsigned long i; + uint8_t expected_byte; + void *expected_page; + + if (posix_memalign(&expected_page, page_size, page_size)) + err("out of memory"); + + for (i = 0; i < nr_pages; ++i) { + expected_byte = ~((uint8_t)(i % ((uint8_t)-1))); + 
memset(expected_page, expected_byte, page_size); + if (my_bcmp(expected_page, p + (i * page_size), page_size)) + err("unexpected page contents after minor fault"); + } + + free(expected_page); +} + static int userfaultfd_minor_test(void) { - struct uffdio_register uffdio_register; unsigned long p; + struct uffdio_register uffdio_register; pthread_t uffd_mon; - uint8_t expected_byte; - void *expected_page; char c; struct uffd_stats stats = { 0 }; @@ -1301,17 +1352,7 @@ static int userfaultfd_minor_test(void) * fault. uffd_poll_thread will resolve the fault by bit-flipping the * page's contents, and then issuing a CONTINUE ioctl. */ - - if (posix_memalign(&expected_page, page_size, page_size)) - err("out of memory"); - - for (p = 0; p < nr_pages; ++p) { - expected_byte = ~((uint8_t)(p % ((uint8_t)-1))); - memset(expected_page, expected_byte, page_size); - if (my_bcmp(expected_page, area_dst_alias + (p * page_size), - page_size)) - err("unexpected page contents after minor fault"); - } + check_memory_contents(area_dst_alias); if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) err("pipe write"); @@ -1320,6 +1361,23 @@ static int userfaultfd_minor_test(void) uffd_stats_report(&stats, 1); + if (test_collapse) { + printf("testing collapse of uffd memory into PMD-mapped THPs:"); + if (madvise(area_dst_alias, nr_pages * page_size, + MADV_COLLAPSE)) + err("madvise(MADV_COLLAPSE)"); + + uffd_test_ops->check_pmd_mapping(area_dst, + nr_pages * page_size / + hpage_size); + /* + * This won't cause uffd-fault - it purely just makes sure there + * was no corruption. + */ + check_memory_contents(area_dst_alias); + printf(" done.\n"); + } + return stats.missing_faults != 0 || stats.minor_faults != nr_pages; } @@ -1656,6 +1714,8 @@ static void parse_test_type_arg(const char *raw_type) test_dev_userfaultfd = true; else if (!strcmp(token, "syscall")) test_dev_userfaultfd = false; + else if (!strcmp(token, "collapse")) + test_collapse = true; else err("unrecognized test mod '%s'", token); } @@ -1663,8 +1723,11 @@ static void parse_test_type_arg(const char *raw_type) if (!test_type) err("failed to parse test type argument: '%s'", raw_type); + if (test_collapse && test_type != TEST_SHMEM) + err("Unsupported test: %s", raw_type); + if (test_type == TEST_HUGETLB) - page_size = default_huge_page_size(); + page_size = hpage_size; else page_size = sysconf(_SC_PAGE_SIZE); @@ -1702,6 +1765,8 @@ static void sigalrm(int sig) int main(int argc, char **argv) { + size_t bytes; + if (argc < 4) usage(); @@ -1709,11 +1774,41 @@ int main(int argc, char **argv) err("failed to arm SIGALRM"); alarm(ALARM_INTERVAL_SECS); + hpage_size = default_huge_page_size(); parse_test_type_arg(argv[1]); + bytes = atol(argv[2]) * 1024 * 1024; + + if (test_collapse && bytes & (hpage_size - 1)) + err("MiB must be multiple of %lu if :collapse mod set", + hpage_size >> 20); nr_cpus = sysconf(_SC_NPROCESSORS_ONLN); - nr_pages_per_cpu = atol(argv[2]) * 1024*1024 / page_size / - nr_cpus; + + if (test_collapse) { + /* nr_cpus must divide (bytes / page_size), otherwise, + * area allocations of (nr_pages * paze_size) won't be a + * multiple of hpage_size, even if bytes is a multiple of + * hpage_size. + * + * This means that nr_cpus must divide (N * (2 << (H-P)) + * where: + * bytes = hpage_size * N + * hpage_size = 2 << H + * page_size = 2 << P + * + * And we want to chose nr_cpus to be the largest value + * satisfying this constraint, not larger than the number + * of online CPUs. 
Unfortunately, prime factorization of + * N and nr_cpus may be arbitrary, so have to search for it. + * Instead, just use the highest power of 2 dividing both + * nr_cpus and (bytes / page_size). + */ + int x = factor_of_2(nr_cpus); + int y = factor_of_2(bytes / page_size); + + nr_cpus = x < y ? x : y; + } + nr_pages_per_cpu = bytes / page_size / nr_cpus; if (!nr_pages_per_cpu) { _err("invalid MiB"); usage(); -- 2.37.2.672.g94769d06f0-goog ^ permalink raw reply [flat|nested] 35+ messages in thread
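Following the usage examples already in the test, a possible invocation of the
new mod looks like this (the region size is arbitrary for the example, but must
be a multiple of the huge page size when :collapse is set):

	# shmem test, resolving minor faults and then MADV_COLLAPSE'ing the
	# UFFD-registered mapping; 100MiB region, 99999 bounces:
	./userfaultfd shmem:collapse 100 99999

	# the mod can be combined with the /dev/userfaultfd mod:
	./userfaultfd shmem:dev:collapse 100 99999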
* Re:
  2022-08-26 22:03 Zach O'Keefe
                   ` (8 preceding siblings ...)
  2022-08-26 22:03 ` [PATCH mm-unstable v2 9/9] selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory Zach O'Keefe
@ 2022-08-31 21:47 ` Yang Shi
  2022-09-01  0:24 ` Re: Zach O'Keefe
  9 siblings, 1 reply; 35+ messages in thread
From: Yang Shi @ 2022-08-31 21:47 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

Hi Zach,

I did a quick look at the series, basically no show stopper to me. But
I didn't find time to review them thoroughly yet, quite busy on
something else. Just a heads up, I didn't mean to ignore you. I will
review them when I find some time.

Thanks,
Yang

On Fri, Aug 26, 2022 at 3:03 PM Zach O'Keefe <zokeefe@google.com> wrote:
> [...]
That said, I think the > series as-is covers enough to claim that MADV_COLLAPSE supports file/shmem > memory. > > Patches 1-3 Implement the guts of the series. > Patch 4 Is a tracepoint for debugging. > Patches 5-8 Refactor existing khugepaged selftests to work with new > memory types. > Patch 9 Adds a userfaultfd selftest mode to mimic a functional test > of UFFDIO_REGISTER_MODE_MINOR+MADV_COLLAPSE live migration. > > Applies against mm-unstable. > > [1] https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/ > [2] https://lore.kernel.org/linux-mm/YtBmhaiPHUTkJml8@google.com/ > > v1 -> v2: > - Add missing definition for khugepaged_add_pte_mapped_thp() in > !CONFIG_SHEM builds, in "mm/khugepaged: attempt to map > file/shmem-backed pte-mapped THPs by pmds" > - Minor bugfixes in "mm/madvise: add file and shmem support to > MADV_COLLAPSE" for !CONFIG_SHMEM, !CONFIG_TRANSPARENT_HUGEPAGE and some > compiler settings. > - Rebased on latest mm-unstable > > Zach O'Keefe (9): > mm/shmem: add flag to enforce shmem THP in hugepage_vma_check() > mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by > pmds > mm/madvise: add file and shmem support to MADV_COLLAPSE > mm/khugepaged: add tracepoint to hpage_collapse_scan_file() > selftests/vm: dedup THP helpers > selftests/vm: modularize thp collapse memory operations > selftests/vm: add thp collapse file and tmpfs testing > selftests/vm: add thp collapse shmem testing > selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory > > include/linux/khugepaged.h | 13 +- > include/linux/shmem_fs.h | 10 +- > include/trace/events/huge_memory.h | 36 + > kernel/events/uprobes.c | 2 +- > mm/huge_memory.c | 2 +- > mm/khugepaged.c | 289 ++++-- > mm/shmem.c | 18 +- > tools/testing/selftests/vm/Makefile | 2 + > tools/testing/selftests/vm/khugepaged.c | 828 ++++++++++++------ > tools/testing/selftests/vm/soft-dirty.c | 2 +- > .../selftests/vm/split_huge_page_test.c | 12 +- > tools/testing/selftests/vm/userfaultfd.c | 171 +++- > tools/testing/selftests/vm/vm_util.c | 36 +- > tools/testing/selftests/vm/vm_util.h | 5 +- > 14 files changed, 1040 insertions(+), 386 deletions(-) > > -- > 2.37.2.672.g94769d06f0-goog > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2022-08-31 21:47 ` Yang Shi @ 2022-09-01 0:24 ` Zach O'Keefe 0 siblings, 0 replies; 35+ messages in thread From: Zach O'Keefe @ 2022-09-01 0:24 UTC (permalink / raw) To: Yang Shi Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen, James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand, David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia On Wed, Aug 31, 2022 at 2:47 PM Yang Shi <shy828301@gmail.com> wrote: > > Hi Zach, > > I did a quick look at the series, basically no show stopper to me. But > I didn't find time to review them thoroughly yet, quite busy on > something else. Just a heads up, I didn't mean to ignore you. I will > review them when I find some time. > > Thanks, > Yang Hey Yang, Thanks for taking the time to look through, and thanks for giving me a heads up, and no rush! In the last day or so, while porting this series around, I encountered some subtle edge cases I wanted to clean up / address - so it's good you didn't do a thorough review yet. I was *hoping* to have a v3 out last night (which evidently did not happen) and it does not seem like it will happen today, so I'll leave this message as a request for reviewers to hold off on a thorough review until v3. Thanks for your time as always, Zach ^ permalink raw reply [flat|nested] 35+ messages in thread
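For context on the interface being extended in this thread: MADV_COLLAPSE is requested through an ordinary madvise(2) call on an existing mapping. A minimal userspace sketch follows; the wrapper name and the numeric fallback are illustrative assumptions, not part of the series.

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* fallback: value from <asm-generic/mman-common.h> */
#endif

/*
 * Ask the kernel to synchronously back [addr, addr + len) with
 * pmd-mapped THPs, e.g. a service's text segment after startup, or
 * guest memory once userfaultfd-based live migration has completed.
 */
static int collapse_range(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_COLLAPSE)) {
		perror("madvise(MADV_COLLAPSE)");
		return -1;
	}
	return 0;
}

A failure does not unmap or otherwise harm the range, so callers can treat the hint as best-effort.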
* (no subject)
@ 2025-08-12 13:34 Baoquan He
2025-08-12 13:49 ` Baoquan He
0 siblings, 1 reply; 35+ messages in thread
From: Baoquan He @ 2025-08-12 13:34 UTC (permalink / raw)
To: linux-mm, christophe.leroy
alexghiti@rivosinc.com, agordeev@linux.ibm.com, linux@armlinux.org.uk,
linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
x86@kernel.org, chris@zankel.net, jcmvbkbc@gmail.com, linux-um@lists.infradead.org
Cc: ryabinin.a.a@gmail.com, glider@google.com, andreyknvl@gmail.com,
dvyukov@google.com, vincenzo.frascino@arm.com,
akpm@linux-foundation.org, kasan-dev@googlegroups.com,
linux-kernel@vger.kernel.org, kexec@lists.infradead.org,
sj@kernel.org, lorenzo.stoakes@oracle.com, elver@google.com,
snovitoll@gmail.com
Bcc: bhe@redhat.com
Subject: Re: [PATCH v2 00/12] mm/kasan: make kasan=on|off work for all three
modes
Reply-To:
In-Reply-To: <20250812124941.69508-1-bhe@redhat.com>
Forgot to add the related ARCH mailing lists and people to CC; adding them now.
On 08/12/25 at 08:49pm, Baoquan He wrote:
> Currently only the hw_tags mode of kasan can be enabled or disabled with
> the kernel parameter kasan=on|off in a built kernel. For the kasan generic
> and sw_tags modes, there's no way to disable them once the kernel is built.
> This is not always convenient, e.g. when kdump is configured on a system.
> When the 1st kernel has KASAN enabled and a crash triggers the switch to
> the kdump kernel, the generic or sw_tags mode will cost much extra memory
> for the kasan shadow, while in fact it's meaningless to have kasan in the
> kdump kernel.
>
> So this patchset moves the kasan=on|off handling out of the hw_tags scope
> and into common code to make it available in the generic and sw_tags modes
> too. Then we can add kasan=off in the kdump kernel to reduce the unneeded
> memory cost of kasan.
>
> Changelog:
> ====
> v1->v2:
> - Add missing __ro_after_init annotations, and remove redundant blank
> lines in mm/kasan/common.c. Thanks to Marco.
> - Fix a code bug in <linux/kasan-enabled.h> when CONFIG_KASAN is unset,
> this is found out by SeongJae and Lorenzo, and also reported by LKP
> report, thanks to them.
> - Add a missing kasan_enabled() checking in kasan_report(). This will
> cause below KASAN report info even though kasan=off is set:
> ==================================================================
> BUG: KASAN: stack-out-of-bounds in tick_program_event+0x130/0x150
> Read of size 4 at addr ffff00005f747778 by task swapper/0/1
>
> CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0+ #8 PREEMPT(voluntary)
> Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG, BIOS F31n (SCP: 2.10.20220810) 09/30/2022
> Call trace:
> show_stack+0x30/0x90 (C)
> dump_stack_lvl+0x7c/0xa0
> print_address_description.constprop.0+0x90/0x310
> print_report+0x104/0x1f0
> kasan_report+0xc8/0x110
> __asan_report_load4_noabort+0x20/0x30
> tick_program_event+0x130/0x150
> ......snip...
> ==================================================================
>
> - Add jump_label_init() calling before kasan_init() in setup_arch() in these
> architectures: xtensa, arm. Because they currently rely on the
> jump_label_init() in main(), which is a little late. Then the early static
> key kasan_flag_enabled in kasan_init() won't work.
>
> - On the UML architecture, change to enable kasan_flag_enabled in arch_mm_preinit(),
>   because kasan_init() runs before main() and there's no chance to operate
>   on the static key in kasan_init().
>
> Test:
> =====
> In v1, I tested on x86_64 for the generic mode, and on arm64 for the
> generic, sw_tags and hw_tags modes. All of them work well.
>
> In v2, I only tested on arm64 for the generic, sw_tags and hw_tags modes;
> it works. For powerpc, I got a BOOK3S/64 machine, but it says
> 'KASAN not enabled as it requires radix' and KASAN is disabled. Will
> look for another POWER machine to test this.
> ====
>
> Baoquan He (12):
> mm/kasan: add conditional checks in functions to return directly if
> kasan is disabled
> mm/kasan: move kasan= code to common place
> mm/kasan/sw_tags: don't initialize kasan if it's disabled
> arch/arm: don't initialize kasan if it's disabled
> arch/arm64: don't initialize kasan if it's disabled
> arch/loongarch: don't initialize kasan if it's disabled
> arch/powerpc: don't initialize kasan if it's disabled
> arch/riscv: don't initialize kasan if it's disabled
> arch/x86: don't initialize kasan if it's disabled
> arch/xtensa: don't initialize kasan if it's disabled
> arch/um: don't initialize kasan if it's disabled
> mm/kasan: make kasan=on|off take effect for all three modes
>
> arch/arm/kernel/setup.c | 6 +++++
> arch/arm/mm/kasan_init.c | 6 +++++
> arch/arm64/mm/kasan_init.c | 7 ++++++
> arch/loongarch/mm/kasan_init.c | 5 ++++
> arch/powerpc/mm/kasan/init_32.c | 8 +++++-
> arch/powerpc/mm/kasan/init_book3e_64.c | 6 +++++
> arch/powerpc/mm/kasan/init_book3s_64.c | 6 +++++
> arch/riscv/mm/kasan_init.c | 6 +++++
> arch/um/kernel/mem.c | 6 +++++
> arch/x86/mm/kasan_init_64.c | 6 +++++
> arch/xtensa/kernel/setup.c | 1 +
> arch/xtensa/mm/kasan_init.c | 6 +++++
> include/linux/kasan-enabled.h | 18 ++++++-------
> mm/kasan/common.c | 25 ++++++++++++++++++
> mm/kasan/generic.c | 20 +++++++++++++--
> mm/kasan/hw_tags.c | 35 ++------------------------
> mm/kasan/init.c | 6 +++++
> mm/kasan/quarantine.c | 3 +++
> mm/kasan/report.c | 4 ++-
> mm/kasan/shadow.c | 23 ++++++++++++++++-
> mm/kasan/sw_tags.c | 9 +++++++
> 21 files changed, 165 insertions(+), 47 deletions(-)
>
> --
> 2.41.0
>
^ permalink raw reply [flat|nested] 35+ messages in thread* Re: 2025-08-12 13:34 Baoquan He @ 2025-08-12 13:49 ` Baoquan He 0 siblings, 0 replies; 35+ messages in thread From: Baoquan He @ 2025-08-12 13:49 UTC (permalink / raw) To: linux-mm, christophe.leroy On 08/12/25 at 09:34pm, Baoquan He wrote: > alexghiti@rivosinc.com, agordeev@linux.ibm.com, linux@armlinux.org.uk, > linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev, > linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org, > x86@kernel.org, chris@zankel.net, jcmvbkbc@gmail.com, linux-um@lists.infradead.org > Cc: ryabinin.a.a@gmail.com, glider@google.com, andreyknvl@gmail.com, > dvyukov@google.com, vincenzo.frascino@arm.com, > akpm@linux-foundation.org, kasan-dev@googlegroups.com, > linux-kernel@vger.kernel.org, kexec@lists.infradead.org, > sj@kernel.org, lorenzo.stoakes@oracle.com, elver@google.com, > snovitoll@gmail.com > Bcc: bhe@redhat.com > Subject: Re: [PATCH v2 00/12] mm/kasan: make kasan=on|off work for all three > modes > Reply-To: > In-Reply-To: <20250812124941.69508-1-bhe@redhat.com> > > Forgot adding related ARCH mailing list or people to CC, add them. Sorry for the noise, I made mistake on mail format when adding people to CC. > > On 08/12/25 at 08:49pm, Baoquan He wrote: > > Currently only hw_tags mode of kasan can be enabled or disabled with > > kernel parameter kasan=on|off for built kernel. For kasan generic and > > sw_tags mode, there's no way to disable them once kernel is built. > > This is not convenient sometime, e.g in system kdump is configured. > > When the 1st kernel has KASAN enabled and crash triggered to switch to > > kdump kernel, the generic or sw_tags mode will cost much extra memory > > for kasan shadow while in fact it's meaningless to have kasan in kdump > > kernel. > > > > So this patchset moves the kasan=on|off out of hw_tags scope and into > > common code to make it visible in generic and sw_tags mode too. Then we > > can add kasan=off in kdump kernel to reduce the unneeded meomry cost for > > kasan. > > > > Changelog: > > ==== > > v1->v2: > > - Add __ro_after_init for __ro_after_init, and remove redundant blank > > lines in mm/kasan/common.c. Thanks to Marco. > > - Fix a code bug in <linux/kasan-enabled.h> when CONFIG_KASAN is unset, > > this is found out by SeongJae and Lorenzo, and also reported by LKP > > report, thanks to them. > > - Add a missing kasan_enabled() checking in kasan_report(). This will > > cause below KASAN report info even though kasan=off is set: > > ================================================================== > > BUG: KASAN: stack-out-of-bounds in tick_program_event+0x130/0x150 > > Read of size 4 at addr ffff00005f747778 by task swapper/0/1 > > > > CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0+ #8 PREEMPT(voluntary) > > Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG, BIOS F31n (SCP: 2.10.20220810) 09/30/2022 > > Call trace: > > show_stack+0x30/0x90 (C) > > dump_stack_lvl+0x7c/0xa0 > > print_address_description.constprop.0+0x90/0x310 > > print_report+0x104/0x1f0 > > kasan_report+0xc8/0x110 > > __asan_report_load4_noabort+0x20/0x30 > > tick_program_event+0x130/0x150 > > ......snip... > > ================================================================== > > > > - Add jump_label_init() calling before kasan_init() in setup_arch() in these > > architectures: xtensa, arm. Because they currenly rely on > > jump_label_init() in main() which is a little late. Then the early static > > key kasan_flag_enabled in kasan_init() won't work. 
> > > > - In UML architecture, change to enable kasan_flag_enabled in arch_mm_preinit() > > because kasan_init() is enabled before main(), there's no chance to operate > > on static key in kasan_init(). > > > > Test: > > ===== > > In v1, I took test on x86_64 for generic mode, and on arm64 for > > generic, sw_tags and hw_tags mode. All of them works well. > > > > In v2, I only tested on arm64 for generic, sw_tags and hw_tags mode, it > > works. For powerpc, I got a BOOK3S/64 machine, while it says > > 'KASAN not enabled as it requires radix' and KASAN is disabled. Will > > look for other POWER machine to test this. > > ==== > > > > Baoquan He (12): > > mm/kasan: add conditional checks in functions to return directly if > > kasan is disabled > > mm/kasan: move kasan= code to common place > > mm/kasan/sw_tags: don't initialize kasan if it's disabled > > arch/arm: don't initialize kasan if it's disabled > > arch/arm64: don't initialize kasan if it's disabled > > arch/loongarch: don't initialize kasan if it's disabled > > arch/powerpc: don't initialize kasan if it's disabled > > arch/riscv: don't initialize kasan if it's disabled > > arch/x86: don't initialize kasan if it's disabled > > arch/xtensa: don't initialize kasan if it's disabled > > arch/um: don't initialize kasan if it's disabled > > mm/kasan: make kasan=on|off take effect for all three modes > > > > arch/arm/kernel/setup.c | 6 +++++ > > arch/arm/mm/kasan_init.c | 6 +++++ > > arch/arm64/mm/kasan_init.c | 7 ++++++ > > arch/loongarch/mm/kasan_init.c | 5 ++++ > > arch/powerpc/mm/kasan/init_32.c | 8 +++++- > > arch/powerpc/mm/kasan/init_book3e_64.c | 6 +++++ > > arch/powerpc/mm/kasan/init_book3s_64.c | 6 +++++ > > arch/riscv/mm/kasan_init.c | 6 +++++ > > arch/um/kernel/mem.c | 6 +++++ > > arch/x86/mm/kasan_init_64.c | 6 +++++ > > arch/xtensa/kernel/setup.c | 1 + > > arch/xtensa/mm/kasan_init.c | 6 +++++ > > include/linux/kasan-enabled.h | 18 ++++++------- > > mm/kasan/common.c | 25 ++++++++++++++++++ > > mm/kasan/generic.c | 20 +++++++++++++-- > > mm/kasan/hw_tags.c | 35 ++------------------------ > > mm/kasan/init.c | 6 +++++ > > mm/kasan/quarantine.c | 3 +++ > > mm/kasan/report.c | 4 ++- > > mm/kasan/shadow.c | 23 ++++++++++++++++- > > mm/kasan/sw_tags.c | 9 +++++++ > > 21 files changed, 165 insertions(+), 47 deletions(-) > > > > -- > > 2.41.0 > > > > ^ permalink raw reply [flat|nested] 35+ messages in thread
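For context on what "move kasan= code to common place" refers to: the hw_tags mode already parses the parameter with early_param() and gates itself behind the kasan_flag_enabled static key. Below is a rough sketch of that existing shape, which the series generalises; the names follow the current mm/kasan/hw_tags.c, while the final placement and details after this series are assumptions.

/* Whether KASAN is enabled at all; kasan_enabled() tests this key. */
DEFINE_STATIC_KEY_FALSE(kasan_flag_enabled);

enum kasan_arg {
	KASAN_ARG_DEFAULT,
	KASAN_ARG_OFF,
	KASAN_ARG_ON,
};

static enum kasan_arg kasan_arg __ro_after_init;

/* kasan=off/on */
static int __init early_kasan_flag(char *arg)
{
	if (!arg)
		return -EINVAL;

	if (!strcmp(arg, "off"))
		kasan_arg = KASAN_ARG_OFF;
	else if (!strcmp(arg, "on"))
		kasan_arg = KASAN_ARG_ON;
	else
		return -EINVAL;

	return 0;
}
early_param("kasan", early_kasan_flag);

/*
 * Each mode's init then bails out early and only flips the static key
 * when KASAN is really enabled (hypothetical example, not the patch):
 */
void __init kasan_init_mode(void)
{
	if (kasan_arg == KASAN_ARG_OFF)
		return;

	/* ... set up shadow memory / tags for this mode ... */
	static_branch_enable(&kasan_flag_enabled);
}

Since the static key defaults to off, a kdump kernel booted with kasan=off skips shadow setup entirely, which is the memory saving the cover letter is after.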
* [PATCH v3 00/18] Rearrange batched folio freeing @ 2024-02-27 17:42 Matthew Wilcox (Oracle) 2024-02-27 17:42 ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox (Oracle) 0 siblings, 1 reply; 35+ messages in thread From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw) To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm Other than the obvious "remove calls to compound_head" changes, the fundamental belief here is that iterating a linked list is much slower than iterating an array (5-15x slower in my testing). There's also an associated belief that since we iterate the batch of folios three times, we do better when the array is small (ie 15 entries) than we do with a batch that is hundreds of entries long, which only gives us the opportunity for the first pages to fall out of cache by the time we get to the end. It is possible we should increase the size of folio_batch. Hopefully the bots let us know if this introduces any performance regressions. v3: - Rebased on next-20240227 - Add folios_put_refs() to support unmapping large PTE-mapped folios - Used folio_batch_reinit() instead of assigning 0 to fbatch->nr. This makes sure the iterator is correctly reset. v2: - Redo the shrink_folio_list() patch to free the mapped folios at the end instead of calling try_to_unmap_flush() more often. - Improve a number of commit messages - Use pcp_allowed_order() instead of PAGE_ALLOC_COSTLY_ORDER (Ryan) - Fix move_folios_to_lru() comment (Ryan) - Add patches 15-18 - Collect R-b tags from Ryan Matthew Wilcox (Oracle) (18): mm: Make folios_put() the basis of release_pages() mm: Convert free_unref_page_list() to use folios mm: Add free_unref_folios() mm: Use folios_put() in __folio_batch_release() memcg: Add mem_cgroup_uncharge_folios() mm: Remove use of folio list from folios_put() mm: Use free_unref_folios() in put_pages_list() mm: use __page_cache_release() in folios_put() mm: Handle large folios in free_unref_folios() mm: Allow non-hugetlb large folios to be batch processed mm: Free folios in a batch in shrink_folio_list() mm: Free folios directly in move_folios_to_lru() memcg: Remove mem_cgroup_uncharge_list() mm: Remove free_unref_page_list() mm: Remove lru_to_page() mm: Convert free_pages_and_swap_cache() to use folios_put() mm: Use a folio in __collapse_huge_page_copy_succeeded() mm: Convert free_swap_cache() to take a folio include/linux/memcontrol.h | 26 +++-- include/linux/mm.h | 17 ++-- include/linux/swap.h | 8 +- mm/internal.h | 4 +- mm/khugepaged.c | 30 +++--- mm/memcontrol.c | 16 +-- mm/memory.c | 2 +- mm/mlock.c | 3 +- mm/page_alloc.c | 76 +++++++------- mm/swap.c | 198 ++++++++++++++++++++----------------- mm/swap_state.c | 33 ++++--- mm/vmscan.c | 52 ++++------ 12 files changed, 240 insertions(+), 225 deletions(-) -- 2.43.0 ^ permalink raw reply [flat|nested] 35+ messages in thread
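Since the cover letter's argument is about replacing list walks with array walks, it may help to show the calling pattern the series enables: callers accumulate folios into a small on-stack folio_batch and hand the whole batch to folios_put(). A minimal sketch, assuming the post-series folios_put(struct folio_batch *) signature; the helper name is made up for illustration.

#include <linux/mm.h>
#include <linux/pagevec.h>

/* Drop one reference on each folio in an array, freeing in batches. */
static void release_folio_array(struct folio **folios, unsigned int nr)
{
	struct folio_batch fbatch;
	unsigned int i;

	folio_batch_init(&fbatch);
	for (i = 0; i < nr; i++) {
		/* folio_batch_add() returns the slots left; 0 means full. */
		if (folio_batch_add(&fbatch, folios[i]) == 0)
			folios_put(&fbatch);	/* flush; batch is reinitialised */
	}
	if (folio_batch_count(&fbatch))
		folios_put(&fbatch);
}

The batch is a plain array of (currently) 15 folio pointers, so the three passes folios_put() makes over it stay in cache, which is where the 5-15x list-versus-array difference quoted above comes from.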
* [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed 2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle) @ 2024-02-27 17:42 ` Matthew Wilcox (Oracle) 2024-03-06 13:42 ` Ryan Roberts 0 siblings, 1 reply; 35+ messages in thread From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw) To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm, Ryan Roberts Hugetlb folios still get special treatment, but normal large folios can now be freed by free_unref_folios(). This should have a reasonable performance impact, TBD. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> --- mm/swap.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index dce5ea67ae05..6b697d33fa5b 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1003,12 +1003,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs) if (!folio_ref_sub_and_test(folio, nr_refs)) continue; - if (folio_test_large(folio)) { + /* hugetlb has its own memcg */ + if (folio_test_hugetlb(folio)) { if (lruvec) { unlock_page_lruvec_irqrestore(lruvec, flags); lruvec = NULL; } - __folio_put_large(folio); + free_huge_folio(folio); continue; } -- 2.43.0 ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed 2024-02-27 17:42 ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox (Oracle) @ 2024-03-06 13:42 ` Ryan Roberts 2024-03-06 16:09 ` Matthew Wilcox 0 siblings, 1 reply; 35+ messages in thread From: Ryan Roberts @ 2024-03-06 13:42 UTC (permalink / raw) To: Matthew Wilcox (Oracle), Andrew Morton; +Cc: linux-mm Hi Matthew, Afraid I have another bug for you... On 27/02/2024 17:42, Matthew Wilcox (Oracle) wrote: > Hugetlb folios still get special treatment, but normal large folios > can now be freed by free_unref_folios(). This should have a reasonable > performance impact, TBD. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> When running some swap tests with this change (which is in mm-stable) present, I see BadThings(TM). Usually I see a "bad page state" followed by a delay of a few seconds, followed by an oops or NULL pointer deref. Bisect points to this change, and if I revert it, the problem goes away. Here is one example, running against mm-unstable (a7f399ae964e): [ 76.239466] BUG: Bad page state in process usemem pfn:2554a0 [ 76.240196] kernel BUG at include/linux/mm.h:1120! [ 76.240198] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [ 76.240724] dump_backtrace+0x98/0xf8 [ 76.241523] Modules linked in: [ 76.241943] show_stack+0x20/0x38 [ 76.242282] [ 76.242680] dump_stack_lvl+0x48/0x60 [ 76.242855] CPU: 2 PID: 62 Comm: kcompactd0 Not tainted 6.8.0-rc5-00456-ga7f399ae964e #16 [ 76.243278] dump_stack+0x18/0x28 [ 76.244138] Hardware name: linux,dummy-virt (DT) [ 76.244510] bad_page+0x88/0x128 [ 76.244995] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 76.245370] free_page_is_bad_report+0xa4/0xb8 [ 76.246101] pc : migrate_folio_done+0x140/0x150 [ 76.246572] __free_pages_ok+0x370/0x4b0 [ 76.247048] lr : migrate_folio_done+0x140/0x150 [ 76.247489] destroy_large_folio+0x94/0x108 [ 76.247971] sp : ffff800083f5b8d0 [ 76.248451] __folio_put_large+0x70/0xc0 [ 76.248807] x29: ffff800083f5b8d0 [ 76.249256] __folio_put+0xac/0xc0 [ 76.249260] deferred_split_scan+0x234/0x340 [ 76.249607] x28: 0000000000000000 [ 76.249997] do_shrink_slab+0x144/0x460 [ 76.250444] x27: ffff800083f5bb30 [ 76.250829] shrink_slab+0x2e0/0x4e0 [ 76.251234] [ 76.251604] shrink_node+0x204/0x8a0 [ 76.251979] x26: 0000000000000001 [ 76.252147] do_try_to_free_pages+0xd0/0x568 [ 76.252527] x25: 0000000000000010 [ 76.252881] try_to_free_mem_cgroup_pages+0x128/0x2d0 [ 76.253337] x24: fffffc0008552800 [ 76.253687] try_charge_memcg+0x12c/0x650 [ 76.254219] [ 76.254583] __mem_cgroup_charge+0x6c/0xd0 [ 76.255013] x23: ffff0000e6f353a8 [ 76.255181] __handle_mm_fault+0xe90/0x16a8 [ 76.255624] x22: ffff0013f5fa59c0 [ 76.255977] handle_mm_fault+0x70/0x2b0 [ 76.256413] x21: 0000000000000000 [ 76.256756] do_page_fault+0x100/0x4c0 [ 76.257177] [ 76.257540] do_translation_fault+0xb4/0xd0 [ 76.257932] x20: 0000000000000007 [ 76.258095] do_mem_abort+0x4c/0xa8 [ 76.258532] x19: fffffc0008552800 [ 76.258883] el0_da+0x2c/0x78 [ 76.259263] x18: 0000000000000010 [ 76.259616] el0t_64_sync_handler+0xe4/0x158 [ 76.259933] [ 76.260286] el0t_64_sync+0x190/0x198 [ 76.260729] x17: 3030303030303020 x16: 6666666666666666 x15: 3030303030303030 [ 76.262010] x14: 0000000000000000 x13: 7465732029732867 x12: 616c662045455246 [ 76.262746] x11: 5f54415f4b434548 x10: ffff800082e8bff8 x9 : ffff8000801276ac [ 76.263462] x8 : 00000000ffffefff x7 : 
ffff800082e8bff8 x6 : 0000000000000000 [ 76.264182] x5 : ffff0013f5eb9d08 x4 : 0000000000000000 x3 : 0000000000000000 [ 76.264903] x2 : 0000000000000000 x1 : ffff0000c105d640 x0 : 000000000000003e [ 76.265604] Call trace: [ 76.265865] migrate_folio_done+0x140/0x150 [ 76.266278] migrate_pages_batch+0x9ec/0xff0 [ 76.266716] migrate_pages+0xd20/0xe20 [ 76.267103] compact_zone+0x7b4/0x1000 [ 76.267460] kcompactd_do_work+0x174/0x4d8 [ 76.267869] kcompactd+0x26c/0x418 [ 76.268175] kthread+0x120/0x130 [ 76.268517] ret_from_fork+0x10/0x20 [ 76.268892] Code: aa1303e0 b000d161 9100c021 97fe0465 (d4210000) [ 76.269447] ---[ end trace 0000000000000000 ]--- [ 76.269893] note: kcompactd0[62] exited with irqs disabled [ 76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0xffffbd0a0 pfn:0x2554a0 [ 76.270483] note: kcompactd0[62] exited with preempt_count 1 [ 76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0 [ 76.272521] flags: 0xbfffc0000080058(uptodate|dirty|head|swapbacked|node=0|zone=2|lastcpupid=0xffff) [ 76.273265] page_type: 0xffffffff() [ 76.273542] raw: 0bfffc0000080058 dead000000000100 dead000000000122 0000000000000000 [ 76.274368] raw: 0000000ffffbd0a0 0000000000000000 00000000ffffffff 0000000000000000 [ 76.275043] head: 0bfffc0000080058 dead000000000100 dead000000000122 0000000000000000 [ 76.275651] head: 0000000ffffbd0a0 0000000000000000 00000000ffffffff 0000000000000000 [ 76.276407] head: 0bfffc0000000000 0000000000000000 fffffc0008552848 0000000000000000 [ 76.277064] head: 0000001000000000 0000000000000000 00000000ffffffff 0000000000000000 [ 76.277784] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) [ 76.278502] ------------[ cut here ]------------ [ 76.278893] kernel BUG at include/linux/mm.h:1120! 
[ 76.279269] Internal error: Oops - BUG: 00000000f2000800 [#2] PREEMPT SMP [ 76.280144] Modules linked in: [ 76.280401] CPU: 6 PID: 1337 Comm: usemem Tainted: G B D 6.8.0-rc5-00456-ga7f399ae964e #16 [ 76.281214] Hardware name: linux,dummy-virt (DT) [ 76.281635] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 76.282256] pc : deferred_split_scan+0x2f0/0x340 [ 76.282698] lr : deferred_split_scan+0x2f0/0x340 [ 76.283082] sp : ffff80008681b830 [ 76.283426] x29: ffff80008681b830 x28: ffff0000cd4fb3c0 x27: fffffc0008552800 [ 76.284113] x26: 0000000000000001 x25: 00000000ffffffff x24: 0000000000000001 [ 76.284914] x23: 0000000000000000 x22: fffffc0008552800 x21: ffff0000e9df7820 [ 76.285590] x20: ffff80008681b898 x19: ffff0000e9df7818 x18: 0000000000000000 [ 76.286271] x17: 0000000000000001 x16: 0000000000000001 x15: ffff0000c0617210 [ 76.286927] x14: ffff0000c10b6558 x13: 0000000000000040 x12: 0000000000000228 [ 76.287543] x11: 0000000000000040 x10: 0000000000000a90 x9 : ffff800080220ed8 [ 76.288176] x8 : ffff0000cd4fbeb0 x7 : 0000000000000000 x6 : 0000000000000000 [ 76.288842] x5 : ffff0013f5f35d08 x4 : 0000000000000000 x3 : 0000000000000000 [ 76.289538] x2 : 0000000000000000 x1 : ffff0000cd4fb3c0 x0 : 000000000000003e [ 76.290201] Call trace: [ 76.290432] deferred_split_scan+0x2f0/0x340 [ 76.290856] do_shrink_slab+0x144/0x460 [ 76.291221] shrink_slab+0x2e0/0x4e0 [ 76.291513] shrink_node+0x204/0x8a0 [ 76.291831] do_try_to_free_pages+0xd0/0x568 [ 76.292192] try_to_free_mem_cgroup_pages+0x128/0x2d0 [ 76.292599] try_charge_memcg+0x12c/0x650 [ 76.292926] __mem_cgroup_charge+0x6c/0xd0 [ 76.293289] __handle_mm_fault+0xe90/0x16a8 [ 76.293713] handle_mm_fault+0x70/0x2b0 [ 76.294031] do_page_fault+0x100/0x4c0 [ 76.294343] do_translation_fault+0xb4/0xd0 [ 76.294694] do_mem_abort+0x4c/0xa8 [ 76.294968] el0_da+0x2c/0x78 [ 76.295202] el0t_64_sync_handler+0xe4/0x158 [ 76.295565] el0t_64_sync+0x190/0x198 [ 76.295860] Code: aa1603e0 d000d0e1 9100c021 97fdc715 (d4210000) [ 76.296429] ---[ end trace 0000000000000000 ]--- [ 76.296805] note: usemem[1337] exited with irqs disabled [ 76.297261] note: usemem[1337] exited with preempt_count 1 My test case is intended to stress swap: - Running in VM (on Ampere Altra) with 70 vCPUs and 80G RAM - Have a 35G block ram device (CONFIG_BLK_DEV_RAM & "brd.rd_nr=1 brd.rd_size=36700160") - the ramdisk is configured as the swap backend - run the test case in a memcg constrained to 40G (to force mem pressure) - test case has 70 processes, each allocating and writing 1G of RAM swapoff -a mkswap /dev/ram0 swapon -f /dev/ram0 cgcreate -g memory:/mmperfcgroup echo 40G > /sys/fs/cgroup/mmperfcgroup/memory.max cgexec -g memory:mmperfcgroup sudo -u $(whoami) bash Then inside that second bash shell, run this script: --8<--- function run_usemem_once { ./usemem -n 70 -O 1G | grep -v "free memory" } function run_usemem_multi { size=${1} for i in {1..2}; do echo "${size} THP ${i}" run_usemem_once done } echo never > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled run_usemem_multi "64K" --8<--- It will usually get through the first iteration of the loop in run_usemem_multi() and fail on the second. I've never seen it get all the way through both iterations. "usemem" is from the vm-scalability suite. It just allocates and writes loads of anonymous memory (70 is concurrent processes, 1G is the amount of memory per process). Then the memory pressure from the cgroup causes lots of swap to happen. 
> --- > mm/swap.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/mm/swap.c b/mm/swap.c > index dce5ea67ae05..6b697d33fa5b 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -1003,12 +1003,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs) > if (!folio_ref_sub_and_test(folio, nr_refs)) > continue; > > - if (folio_test_large(folio)) { > + /* hugetlb has its own memcg */ > + if (folio_test_hugetlb(folio)) { This still looks reasonable to me after re-review, so I have no idea what the problem is? I recall seeing some weird crashes when I looked at this original RFC, but didn't have time to debug at the time. I wonder if the root cause is the same. If you find a smoking gun, I'm happy to test it if the above is too painful to reproduce. Thanks, Ryan > if (lruvec) { > unlock_page_lruvec_irqrestore(lruvec, flags); > lruvec = NULL; > } > - __folio_put_large(folio); > + free_huge_folio(folio); > continue; > } > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed 2024-03-06 13:42 ` Ryan Roberts @ 2024-03-06 16:09 ` Matthew Wilcox 2024-03-06 16:19 ` Ryan Roberts 0 siblings, 1 reply; 35+ messages in thread From: Matthew Wilcox @ 2024-03-06 16:09 UTC (permalink / raw) To: Ryan Roberts; +Cc: Andrew Morton, linux-mm On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote: > When running some swap tests with this change (which is in mm-stable) > present, I see BadThings(TM). Usually I see a "bad page state" > followed by a delay of a few seconds, followed by an oops or NULL > pointer deref. Bisect points to this change, and if I revert it, > the problem goes away. That oops is really messed up ;-( We're clearly got two CPUs oopsing at the same time and it's all interleaved. That said, I can pick some nuggets out of it. > [ 76.239466] BUG: Bad page state in process usemem pfn:2554a0 > [ 76.240196] kernel BUG at include/linux/mm.h:1120! These are the two different BUGs being called simultaneously ... The first one is bad_page() in page_alloc.c and the second is put_page_testzero() VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); I'm sure it's significant that both of these are the same page (pfn 2554a0). Feels like we have two CPUs calling put_folio() at the same time, and one of them underflows. It probably doesn't matter which call trace ends up in bad_page() and which in put_page_testzero(). One of them is coming from deferred_split_scan(), which is weird because we can see the folio_try_get() earlier in the function. So whatever this folio was, we found it on the deferred split list, got its refcount, moved it to the local list, either failed to get the lock, or successfully got the lock, split it, unlocked it and put it. (I can see this was invoked from page fault -> memcg shrinking. That's probably irrelevant but explains some of the functions in the backtrace) The other call trace comes from migrate_folio_done() where we're putting the _source_ folio. That was called from migrate_pages_batch() which was called from kcompactd. Um. Where do we handle the deferred list in the migration code? I've also tried looking at this from a different angle -- what is it about this commit that produces this problem? It's a fairly small commit: - if (folio_test_large(folio)) { + /* hugetlb has its own memcg */ + if (folio_test_hugetlb(folio)) { if (lruvec) { unlock_page_lruvec_irqrestore(lruvec, flags); lruvec = NULL; } - __folio_put_large(folio); + free_huge_folio(folio); So all that's changed is that large non-hugetlb folios do not call __folio_put_large(). As a reminder, that function does: if (!folio_test_hugetlb(folio)) page_cache_release(folio); destroy_large_folio(folio); and destroy_large_folio() does: if (folio_test_large_rmappable(folio)) folio_undo_large_rmappable(folio); mem_cgroup_uncharge(folio); free_the_page(&folio->page, folio_order(folio)); So after my patch, instead of calling (in order): page_cache_release(folio); folio_undo_large_rmappable(folio); mem_cgroup_uncharge(folio); free_unref_page() it calls: __page_cache_release(folio, &lruvec, &flags); mem_cgroup_uncharge_folios() folio_undo_large_rmappable(folio); So have I simply widened the window for this race, whatever it is exactly? Something involving mis-handling of the deferred list? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed 2024-03-06 16:09 ` Matthew Wilcox @ 2024-03-06 16:19 ` Ryan Roberts 2024-03-06 17:41 ` Ryan Roberts 0 siblings, 1 reply; 35+ messages in thread From: Ryan Roberts @ 2024-03-06 16:19 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm On 06/03/2024 16:09, Matthew Wilcox wrote: > On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote: >> When running some swap tests with this change (which is in mm-stable) >> present, I see BadThings(TM). Usually I see a "bad page state" >> followed by a delay of a few seconds, followed by an oops or NULL >> pointer deref. Bisect points to this change, and if I revert it, >> the problem goes away. > > That oops is really messed up ;-( We're clearly got two CPUs oopsing at > the same time and it's all interleaved. That said, I can pick some > nuggets out of it. > >> [ 76.239466] BUG: Bad page state in process usemem pfn:2554a0 >> [ 76.240196] kernel BUG at include/linux/mm.h:1120! > > These are the two different BUGs being called simultaneously ... > > The first one is bad_page() in page_alloc.c and the second is > put_page_testzero() > VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); > > I'm sure it's significant that both of these are the same page (pfn > 2554a0). Feels like we have two CPUs calling put_folio() at the same > time, and one of them underflows. It probably doesn't matter which call > trace ends up in bad_page() and which in put_page_testzero(). > > One of them is coming from deferred_split_scan(), which is weird because > we can see the folio_try_get() earlier in the function. So whatever > this folio was, we found it on the deferred split list, got its refcount, > moved it to the local list, either failed to get the lock, or > successfully got the lock, split it, unlocked it and put it. > > (I can see this was invoked from page fault -> memcg shrinking. That's > probably irrelevant but explains some of the functions in the backtrace) > > The other call trace comes from migrate_folio_done() where we're putting > the _source_ folio. That was called from migrate_pages_batch() which > was called from kcompactd. > > Um. Where do we handle the deferred list in the migration code? > > > I've also tried looking at this from a different angle -- what is it > about this commit that produces this problem? It's a fairly small > commit: > > - if (folio_test_large(folio)) { > + /* hugetlb has its own memcg */ > + if (folio_test_hugetlb(folio)) { > if (lruvec) { > unlock_page_lruvec_irqrestore(lruvec, flags); > lruvec = NULL; > } > - __folio_put_large(folio); > + free_huge_folio(folio); > > So all that's changed is that large non-hugetlb folios do not call > __folio_put_large(). As a reminder, that function does: > > if (!folio_test_hugetlb(folio)) > page_cache_release(folio); > destroy_large_folio(folio); > > and destroy_large_folio() does: > if (folio_test_large_rmappable(folio)) > folio_undo_large_rmappable(folio); > > mem_cgroup_uncharge(folio); > free_the_page(&folio->page, folio_order(folio)); > > So after my patch, instead of calling (in order): > > page_cache_release(folio); > folio_undo_large_rmappable(folio); > mem_cgroup_uncharge(folio); > free_unref_page() > > it calls: > > __page_cache_release(folio, &lruvec, &flags); > mem_cgroup_uncharge_folios() > folio_undo_large_rmappable(folio); > > So have I simply widened the window for this race Yes that's the conclusion I'm coming to. 
I have reverted this patch and am still seeing what looks like the same problem very occasionally. (I was just about to let you know when I saw this reply). It's much harder to reproduce now... great. The original oops I reported against your RFC is here: https://lore.kernel.org/linux-mm/eeaf36cf-8e29-4de2-9e5a-9ec2a5e30c61@arm.com/ Looks like I had UBSAN enabled for that run. Let me turn on all the bells and whistles and see if I can get it to repro more reliably to bisect. Assuming the original oops and this are related, that implies that the problem is lurking somewhere in this series, if not this patch. I'll come back to you shortly... >, whatever it is > exactly? Something involving mis-handling of the deferred list? > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed 2024-03-06 16:19 ` Ryan Roberts @ 2024-03-06 17:41 ` Ryan Roberts 2024-03-06 18:41 ` Zi Yan 0 siblings, 1 reply; 35+ messages in thread From: Ryan Roberts @ 2024-03-06 17:41 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm On 06/03/2024 16:19, Ryan Roberts wrote: > On 06/03/2024 16:09, Matthew Wilcox wrote: >> On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote: >>> When running some swap tests with this change (which is in mm-stable) >>> present, I see BadThings(TM). Usually I see a "bad page state" >>> followed by a delay of a few seconds, followed by an oops or NULL >>> pointer deref. Bisect points to this change, and if I revert it, >>> the problem goes away. >> >> That oops is really messed up ;-( We're clearly got two CPUs oopsing at >> the same time and it's all interleaved. That said, I can pick some >> nuggets out of it. >> >>> [ 76.239466] BUG: Bad page state in process usemem pfn:2554a0 >>> [ 76.240196] kernel BUG at include/linux/mm.h:1120! >> >> These are the two different BUGs being called simultaneously ... >> >> The first one is bad_page() in page_alloc.c and the second is >> put_page_testzero() >> VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); >> >> I'm sure it's significant that both of these are the same page (pfn >> 2554a0). Feels like we have two CPUs calling put_folio() at the same >> time, and one of them underflows. It probably doesn't matter which call >> trace ends up in bad_page() and which in put_page_testzero(). >> >> One of them is coming from deferred_split_scan(), which is weird because >> we can see the folio_try_get() earlier in the function. So whatever >> this folio was, we found it on the deferred split list, got its refcount, >> moved it to the local list, either failed to get the lock, or >> successfully got the lock, split it, unlocked it and put it. >> >> (I can see this was invoked from page fault -> memcg shrinking. That's >> probably irrelevant but explains some of the functions in the backtrace) >> >> The other call trace comes from migrate_folio_done() where we're putting >> the _source_ folio. That was called from migrate_pages_batch() which >> was called from kcompactd. >> >> Um. Where do we handle the deferred list in the migration code? >> >> >> I've also tried looking at this from a different angle -- what is it >> about this commit that produces this problem? It's a fairly small >> commit: >> >> - if (folio_test_large(folio)) { >> + /* hugetlb has its own memcg */ >> + if (folio_test_hugetlb(folio)) { >> if (lruvec) { >> unlock_page_lruvec_irqrestore(lruvec, flags); >> lruvec = NULL; >> } >> - __folio_put_large(folio); >> + free_huge_folio(folio); >> >> So all that's changed is that large non-hugetlb folios do not call >> __folio_put_large(). 
As a reminder, that function does: >> >> if (!folio_test_hugetlb(folio)) >> page_cache_release(folio); >> destroy_large_folio(folio); >> >> and destroy_large_folio() does: >> if (folio_test_large_rmappable(folio)) >> folio_undo_large_rmappable(folio); >> >> mem_cgroup_uncharge(folio); >> free_the_page(&folio->page, folio_order(folio)); >> >> So after my patch, instead of calling (in order): >> >> page_cache_release(folio); >> folio_undo_large_rmappable(folio); >> mem_cgroup_uncharge(folio); >> free_unref_page() >> >> it calls: >> >> __page_cache_release(folio, &lruvec, &flags); >> mem_cgroup_uncharge_folios() >> folio_undo_large_rmappable(folio); >> >> So have I simply widened the window for this race > > Yes that's the conclusion I'm coming to. I have reverted this patch and am still > seeing what looks like the same problem very occasionally. (I was just about to > let you know when I saw this reply). It's much harder to reproduce now... great. > > The original oops I reported against your RFC is here: > https://lore.kernel.org/linux-mm/eeaf36cf-8e29-4de2-9e5a-9ec2a5e30c61@arm.com/ > > Looks like I had UBSAN enabled for that run. Let me turn on all the bells and > whistles and see if I can get it to repro more reliably to bisect. > > Assuming the original oops and this are related, that implies that the problem > is lurking somewhere in this series, if not this patch. > > I'll come back to you shortly... Just a bunch of circumstantial observations, I'm afraid. No conclusions yet... With this patch reverted: - Haven't triggered with any of the sanitizers compiled in - Have only triggered when my code is on top (swap-out mTHP) - Have only triggered when compiled using GCC 12.2 (can't trigger with 11.4) So perhaps I'm looking at 2 different things, with this new intermittent problem caused by my changes. Or perhaps my changes increase the window significantly. I have to go pick up my daughter now. Can look at this some more tomorrow, but struggling for ideas - need a way to more reliably reproduce. > >> , whatever it is >> exactly? Something involving mis-handling of the deferred list? >> > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed 2024-03-06 17:41 ` Ryan Roberts @ 2024-03-06 18:41 ` Zi Yan 2024-03-06 19:55 ` Matthew Wilcox 0 siblings, 1 reply; 35+ messages in thread From: Zi Yan @ 2024-03-06 18:41 UTC (permalink / raw) To: Ryan Roberts Cc: Matthew Wilcox, Andrew Morton, linux-mm, Yang Shi, Huang Ying [-- Attachment #1: Type: text/plain, Size: 6172 bytes --] On 6 Mar 2024, at 12:41, Ryan Roberts wrote: > On 06/03/2024 16:19, Ryan Roberts wrote: >> On 06/03/2024 16:09, Matthew Wilcox wrote: >>> On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote: >>>> When running some swap tests with this change (which is in mm-stable) >>>> present, I see BadThings(TM). Usually I see a "bad page state" >>>> followed by a delay of a few seconds, followed by an oops or NULL >>>> pointer deref. Bisect points to this change, and if I revert it, >>>> the problem goes away. >>> >>> That oops is really messed up ;-( We're clearly got two CPUs oopsing at >>> the same time and it's all interleaved. That said, I can pick some >>> nuggets out of it. >>> >>>> [ 76.239466] BUG: Bad page state in process usemem pfn:2554a0 >>>> [ 76.240196] kernel BUG at include/linux/mm.h:1120! >>> >>> These are the two different BUGs being called simultaneously ... >>> >>> The first one is bad_page() in page_alloc.c and the second is >>> put_page_testzero() >>> VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); >>> >>> I'm sure it's significant that both of these are the same page (pfn >>> 2554a0). Feels like we have two CPUs calling put_folio() at the same >>> time, and one of them underflows. It probably doesn't matter which call >>> trace ends up in bad_page() and which in put_page_testzero(). >>> >>> One of them is coming from deferred_split_scan(), which is weird because >>> we can see the folio_try_get() earlier in the function. So whatever >>> this folio was, we found it on the deferred split list, got its refcount, >>> moved it to the local list, either failed to get the lock, or >>> successfully got the lock, split it, unlocked it and put it. >>> >>> (I can see this was invoked from page fault -> memcg shrinking. That's >>> probably irrelevant but explains some of the functions in the backtrace) >>> >>> The other call trace comes from migrate_folio_done() where we're putting >>> the _source_ folio. That was called from migrate_pages_batch() which >>> was called from kcompactd. >>> >>> Um. Where do we handle the deferred list in the migration code? >>> >>> >>> I've also tried looking at this from a different angle -- what is it >>> about this commit that produces this problem? It's a fairly small >>> commit: >>> >>> - if (folio_test_large(folio)) { >>> + /* hugetlb has its own memcg */ >>> + if (folio_test_hugetlb(folio)) { >>> if (lruvec) { >>> unlock_page_lruvec_irqrestore(lruvec, flags); >>> lruvec = NULL; >>> } >>> - __folio_put_large(folio); >>> + free_huge_folio(folio); >>> >>> So all that's changed is that large non-hugetlb folios do not call >>> __folio_put_large(). 
As a reminder, that function does: >>> >>> if (!folio_test_hugetlb(folio)) >>> page_cache_release(folio); >>> destroy_large_folio(folio); >>> >>> and destroy_large_folio() does: >>> if (folio_test_large_rmappable(folio)) >>> folio_undo_large_rmappable(folio); >>> >>> mem_cgroup_uncharge(folio); >>> free_the_page(&folio->page, folio_order(folio)); >>> >>> So after my patch, instead of calling (in order): >>> >>> page_cache_release(folio); >>> folio_undo_large_rmappable(folio); >>> mem_cgroup_uncharge(folio); >>> free_unref_page() >>> >>> it calls: >>> >>> __page_cache_release(folio, &lruvec, &flags); >>> mem_cgroup_uncharge_folios() >>> folio_undo_large_rmappable(folio); >>> >>> So have I simply widened the window for this race >> >> Yes that's the conclusion I'm coming to. I have reverted this patch and am still >> seeing what looks like the same problem very occasionally. (I was just about to >> let you know when I saw this reply). It's much harder to reproduce now... great. >> >> The original oops I reported against your RFC is here: >> https://lore.kernel.org/linux-mm/eeaf36cf-8e29-4de2-9e5a-9ec2a5e30c61@arm.com/ >> >> Looks like I had UBSAN enabled for that run. Let me turn on all the bells and >> whistles and see if I can get it to repro more reliably to bisect. >> >> Assuming the original oops and this are related, that implies that the problem >> is lurking somewhere in this series, if not this patch. >> >> I'll come back to you shortly... > > Just a bunch of circumstantial observations, I'm afraid. No conclusions yet... > > With this patch reverted: > > - Haven't triggered with any of the sanitizers compiled in > - Have only triggered when my code is on top (swap-out mTHP) > - Have only triggered when compiled using GCC 12.2 (can't trigger with 11.4) > > So perhaps I'm looking at 2 different things, with this new intermittent problem > caused by my changes. Or perhaps my changes increase the window significantly. > > I have to go pick up my daughter now. Can look at this some more tomorrow, but > struggling for ideas - need a way to more reliably reproduce. > >> >>> , whatever it is >>> exactly? Something involving mis-handling of the deferred list? I had a chat with willy on the deferred list mis-handling. Current migration code (starting from commit 616b8371539a6 ("mm: thp: enable thp migration in generic path")) does not properly handle THP and mTHP on the deferred list. So if the source folio is on the deferred list, after migration, the destination folio will not. But this seems a benign bug, since the opportunity of splitting a partially mapped THP/mTHP is gone. In terms of potential races, the source folio refcount is elevated before migration, deferred_split_scan() can move the folio off the deferred_list, but cannot split it. During folio_migrate_mapping() when folio is frozen, deferred_split_scan() cannot move the folio off the deferred_list to begin with. I am going to send a patch to fix the deferred_list handling in migration, but it seems not be related to the bug in this email thread. -- Best Regards, Yan, Zi [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 854 bytes --] ^ permalink raw reply [flat|nested] 35+ messages in thread
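For reference, the "frozen" state mentioned above is the expected-refcount freeze done by folio_migrate_mapping(); in rough outline (simplified from mm/migrate.c, with locking setup and the actual entry transfer elided; xas is the XArray cursor for the mapping):

	int expected_count = folio_expected_refs(mapping, folio) + extra_count;

	xas_lock_irq(&xas);
	if (!folio_ref_freeze(folio, expected_count)) {
		/* Extra references exist; back off and retry migration later. */
		xas_unlock_irq(&xas);
		return -EAGAIN;
	}

	/* ... move the mapping's entries and flags to the new folio ... */

	/* Unfreeze, dropping the references the old mapping held. */
	folio_ref_unfreeze(folio, expected_count - folio_nr_pages(folio));

While frozen, the folio's refcount reads zero, so any concurrent folio_try_get() fails.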
* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed 2024-03-06 18:41 ` Zi Yan @ 2024-03-06 19:55 ` Matthew Wilcox 2024-03-06 21:55 ` Matthew Wilcox 0 siblings, 1 reply; 35+ messages in thread From: Matthew Wilcox @ 2024-03-06 19:55 UTC (permalink / raw) To: Zi Yan; +Cc: Ryan Roberts, Andrew Morton, linux-mm, Yang Shi, Huang Ying On Wed, Mar 06, 2024 at 01:41:13PM -0500, Zi Yan wrote: > I had a chat with willy on the deferred list mis-handling. Current migration > code (starting from commit 616b8371539a6 ("mm: thp: enable thp migration in > generic path")) does not properly handle THP and mTHP on the deferred list. > So if the source folio is on the deferred list, after migration, > the destination folio will not. But this seems a benign bug, since > the opportunity of splitting a partially mapped THP/mTHP is gone. > > In terms of potential races, the source folio refcount is elevated before > migration, deferred_split_scan() can move the folio off the deferred_list, > but cannot split it. During folio_migrate_mapping() when folio is frozen, > deferred_split_scan() cannot move the folio off the deferred_list to begin > with. > > I am going to send a patch to fix the deferred_list handling in migration, > but it seems not be related to the bug in this email thread. ... IOW the source folio remains on the deferred list until its refcount goes to 0, at which point we call folio_undo_large_rmappable() and remove it from the deferred list. A different line of enquiry might be the "else /* We lost race with folio_put() */" in deferred_split_scan(). If somebody froze the refcount, we can lose track of a deferred-split folio. But I think that's OK too. The only places which freeze a folio are vmscan (about to free), folio_migrate_mapping() (discussed above), and page splitting. In none of these cases do we want to keep the folio on the deferred split list because we're either freeing it, migrating it or splitting it. Oh, and there's something in s390 that I can't be bothered to look at. Hang on, I think I see it. It is a race between folio freeing and deferred_split_scan(), but page migration is absolved. Look: CPU 1: deferred_split_scan: spin_lock_irqsave(split_queue_lock) list_for_each_entry_safe() folio_try_get() list_move(&folio->_deferred_list, &list); spin_unlock_irqrestore(split_queue_lock) list_for_each_entry_safe() { folio_trylock() <- fails folio_put(folio); CPU 2: folio_put: folio_undo_large_rmappable ds_queue = get_deferred_split_queue(folio); spin_lock_irqsave(&ds_queue->split_queue_lock, flags); list_del_init(&folio->_deferred_list); *** at this point CPU 1 is not holding the split_queue_lock; the folio is on the local list. Which we just corrupted *** Now anything can happen. It's a pretty tight race that involves at least two CPUs (CPU 2 might have been the one to have the folio locked at the time CPU 1 caalled folio_trylock()). But I definitely widened the window by moving the decrement of the refcount and the removal from the deferred list further apart. OK, so what's the solution here? Personally I favour using a folio_batch in deferred_split_scan() to hold the folios that we're going to try to remove instead of a linked list. Other ideas that are perhaps less intrusive? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed 2024-03-06 19:55 ` Matthew Wilcox @ 2024-03-06 21:55 ` Matthew Wilcox 2024-03-07 8:56 ` Ryan Roberts 0 siblings, 1 reply; 35+ messages in thread From: Matthew Wilcox @ 2024-03-06 21:55 UTC (permalink / raw) To: Zi Yan; +Cc: Ryan Roberts, Andrew Morton, linux-mm, Yang Shi, Huang Ying On Wed, Mar 06, 2024 at 07:55:50PM +0000, Matthew Wilcox wrote: > Hang on, I think I see it. It is a race between folio freeing and > deferred_split_scan(), but page migration is absolved. Look: > > CPU 1: deferred_split_scan: > spin_lock_irqsave(split_queue_lock) > list_for_each_entry_safe() > folio_try_get() > list_move(&folio->_deferred_list, &list); > spin_unlock_irqrestore(split_queue_lock) > list_for_each_entry_safe() { > folio_trylock() <- fails > folio_put(folio); > > CPU 2: folio_put: > folio_undo_large_rmappable > ds_queue = get_deferred_split_queue(folio); > spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > list_del_init(&folio->_deferred_list); > *** at this point CPU 1 is not holding the split_queue_lock; the > folio is on the local list. Which we just corrupted *** > > Now anything can happen. It's a pretty tight race that involves at > least two CPUs (CPU 2 might have been the one to have the folio locked > at the time CPU 1 caalled folio_trylock()). But I definitely widened > the window by moving the decrement of the refcount and the removal from > the deferred list further apart. > > > OK, so what's the solution here? Personally I favour using a > folio_batch in deferred_split_scan() to hold the folios that we're > going to try to remove instead of a linked list. Other ideas that are > perhaps less intrusive? I looked at a few options, but I think we need to keep the refcount elevated until we've got the folios back on the deferred split list. And we can't call folio_put() while holding the split_queue_lock or we'll deadlock. So we need to maintain a list of folios that isn't linked through deferred_list. Anyway, this is basically untested, except that it compiles. Opinions? Better patches? 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd745bcc97ff..0120a47ea7a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
+	struct folio_batch batch;
 	struct folio *folio, *next;
 	int split = 0;

@@ -3321,37 +3321,41 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	ds_queue = &sc->memcg->deferred_split_queue;
 #endif

+	folio_batch_init(&batch);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	/* Take pin on all head pages to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 							_deferred_list) {
-		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
-			/* We lost race with folio_put() */
-			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
+		if (!folio_try_get(folio))
+			continue;
+		if (!folio_trylock(folio))
+			continue;
+		list_del_init(&folio->_deferred_list);
+		if (folio_batch_add(&batch, folio) == 0) {
+			--sc->nr_to_scan;
+			break;
 		}
 		if (!--sc->nr_to_scan)
 			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
-		if (!folio_trylock(folio))
-			goto next;
-		/* split_huge_page() removes page from list on success */
+	while ((folio = folio_batch_next(&batch)) != NULL) {
 		if (!split_folio(folio))
 			split++;
 		folio_unlock(folio);
-next:
-		folio_put(folio);
 	}

 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
+	while ((folio = folio_batch_next(&batch)) != NULL) {
+		if (!folio_test_large(folio))
+			continue;
+		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
+	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

+	folios_put(&batch);
+
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
 	 * This can happen if pages were freed under us.

^ permalink raw reply	[flat|nested] 35+ messages in thread
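One detail worth keeping in mind when reading the two folio_batch_next() loops above: the iterator is just a cursor over the batch's array, roughly as below (my reading of include/linux/pagevec.h, quoted from memory, so treat it as an assumption):

static inline struct folio *folio_batch_next(struct folio_batch *fbatch)
{
	/* Return NULL once the cursor has consumed all stored folios. */
	if (fbatch->i == fbatch->nr)
		return NULL;
	return fbatch->folios[fbatch->i++];
}

Assuming that definition, nothing in the diff rewinds fbatch->i between the split pass and the re-queue pass, so the second loop would see an already-exhausted batch; that seems worth double-checking while testing the patch.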
* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed 2024-03-06 21:55 ` Matthew Wilcox @ 2024-03-07 8:56 ` Ryan Roberts 2024-03-07 13:50 ` Yin, Fengwei 0 siblings, 1 reply; 35+ messages in thread From: Ryan Roberts @ 2024-03-07 8:56 UTC (permalink / raw) To: Matthew Wilcox, Zi Yan; +Cc: Andrew Morton, linux-mm, Yang Shi, Huang Ying On 06/03/2024 21:55, Matthew Wilcox wrote: > On Wed, Mar 06, 2024 at 07:55:50PM +0000, Matthew Wilcox wrote: >> Hang on, I think I see it. It is a race between folio freeing and >> deferred_split_scan(), but page migration is absolved. Look: >> >> CPU 1: deferred_split_scan: >> spin_lock_irqsave(split_queue_lock) >> list_for_each_entry_safe() >> folio_try_get() >> list_move(&folio->_deferred_list, &list); >> spin_unlock_irqrestore(split_queue_lock) >> list_for_each_entry_safe() { >> folio_trylock() <- fails >> folio_put(folio); >> >> CPU 2: folio_put: >> folio_undo_large_rmappable >> ds_queue = get_deferred_split_queue(folio); >> spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >> list_del_init(&folio->_deferred_list); >> *** at this point CPU 1 is not holding the split_queue_lock; the >> folio is on the local list. Which we just corrupted *** Wow, this would have taken me weeks... I just want to make sure I've understood correctly: CPU1's folio_put() is not the last reference, and it keeps iterating through the local list. Then CPU2 does the final folio_put() which causes list_del_init() to modify the local list concurrently with CPU1's iteration, so CPU1 probably goes into the weeds? >> >> Now anything can happen. It's a pretty tight race that involves at >> least two CPUs (CPU 2 might have been the one to have the folio locked >> at the time CPU 1 caalled folio_trylock()). But I definitely widened >> the window by moving the decrement of the refcount and the removal from >> the deferred list further apart. >> >> >> OK, so what's the solution here? Personally I favour using a >> folio_batch in deferred_split_scan() to hold the folios that we're >> going to try to remove instead of a linked list. Other ideas that are >> perhaps less intrusive? > > I looked at a few options, but I think we need to keep the refcount > elevated until we've got the folios back on the deferred split list. > And we can't call folio_put() while holding the split_queue_lock or > we'll deadlock. So we need to maintain a list of folios that isn't > linked through deferred_list. Anyway, this is basically untested, > except that it compiles. If we can't call folio_put() under spinlock, then I agree. > > Opinions? Better patches? I assume the fact that one scan is limited to freeing a batch-worth of folios is not a problem? The shrinker will keep calling while there are folios on the deferred list? 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index fd745bcc97ff..0120a47ea7a1 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, > struct pglist_data *pgdata = NODE_DATA(sc->nid); > struct deferred_split *ds_queue = &pgdata->deferred_split_queue; > unsigned long flags; > - LIST_HEAD(list); > + struct folio_batch batch; > struct folio *folio, *next; > int split = 0; > > @@ -3321,37 +3321,41 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, > ds_queue = &sc->memcg->deferred_split_queue; > #endif > > + folio_batch_init(&batch); > spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > /* Take pin on all head pages to avoid freeing them under us */ > list_for_each_entry_safe(folio, next, &ds_queue->split_queue, > _deferred_list) { > - if (folio_try_get(folio)) { > - list_move(&folio->_deferred_list, &list); > - } else { > - /* We lost race with folio_put() */ > - list_del_init(&folio->_deferred_list); > - ds_queue->split_queue_len--; > + if (!folio_try_get(folio)) > + continue; > + if (!folio_trylock(folio)) > + continue; > + list_del_init(&folio->_deferred_list); > + if (folio_batch_add(&batch, folio) == 0) { > + --sc->nr_to_scan; > + break; > } > if (!--sc->nr_to_scan) > break; > } > spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); > > - list_for_each_entry_safe(folio, next, &list, _deferred_list) { > - if (!folio_trylock(folio)) > - goto next; > - /* split_huge_page() removes page from list on success */ > + while ((folio = folio_batch_next(&batch)) != NULL) { > if (!split_folio(folio)) > split++; > folio_unlock(folio); > -next: > - folio_put(folio); > } > > spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > - list_splice_tail(&list, &ds_queue->split_queue); > + while ((folio = folio_batch_next(&batch)) != NULL) { > + if (!folio_test_large(folio)) > + continue; > + list_add_tail(&folio->_deferred_list, &ds_queue->split_queue); > + } > spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); > > + folios_put(&batch); > + > /* > * Stop shrinker if we didn't split any page, but the queue is empty. > * This can happen if pages were freed under us. I've added this patch to my branch and tested (still without the patch that I fingered as the culprit originally, for now). Unfortuantely it is still blowing up at about the same rate, although it looks very different now. I've seen bad things twice. The first time was RCU stalls, but systemd had turned the log level down so no stack trace and I didn't manage to get any further information. 
The second time, this: [ 338.519401] Unable to handle kernel paging request at virtual address fffc001b13a8c870 [ 338.519402] Unable to handle kernel paging request at virtual address fffc001b13a8c870 [ 338.519407] Mem abort info: [ 338.519407] ESR = 0x0000000096000004 [ 338.519408] EC = 0x25: DABT (current EL), IL = 32 bits [ 338.519588] Unable to handle kernel paging request at virtual address fffc001b13a8c870 [ 338.519591] Mem abort info: [ 338.519592] ESR = 0x0000000096000004 [ 338.519593] EC = 0x25: DABT (current EL), IL = 32 bits [ 338.519594] SET = 0, FnV = 0 [ 338.519595] EA = 0, S1PTW = 0 [ 338.519596] FSC = 0x04: level 0 translation fault [ 338.519597] Data abort info: [ 338.519597] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 338.519598] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 338.519599] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 338.519600] [fffc001b13a8c870] address between user and kernel address ranges [ 338.519602] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 338.519605] Modules linked in: [ 338.519607] CPU: 43 PID: 3234 Comm: usemem Not tainted 6.8.0-rc5-00465-g279cb41b481e-dirty #3 [ 338.519610] Hardware name: linux,dummy-virt (DT) [ 338.519611] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 338.519613] pc : down_read_trylock+0x2c/0xd0 [ 338.519618] lr : folio_lock_anon_vma_read+0x74/0x2c8 [ 338.519623] sp : ffff800087f935c0 [ 338.519623] x29: ffff800087f935c0 x28: 0000000000000000 x27: ffff800087f937e0 [ 338.519626] x26: 0000000000000001 x25: ffff800087f937a8 x24: fffffc0007258180 [ 338.519628] x23: ffff800087f936c8 x22: fffc001b13a8c870 x21: ffff0000f7d51d69 [ 338.519630] x20: ffff0000f7d51d68 x19: fffffc0007258180 x18: 0000000000000000 [ 338.519632] x17: 0000000000000001 x16: ffff0000c90ab458 x15: 0000000000000040 [ 338.519634] x14: ffff0000c8c7b558 x13: 0000000000000228 x12: 000040f22f534640 [ 338.519637] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080338b3c [ 338.519639] x8 : ffff800087f93618 x7 : 0000000000000000 x6 : ffff0000c9692f50 [ 338.519641] x5 : ffff800087f936b0 x4 : 0000000000000001 x3 : ffff0000d70d9140 [ 338.519643] x2 : 0000000000000001 x1 : fffc001b13a8c870 x0 : fffc001b13a8c870 [ 338.519645] Call trace: [ 338.519646] down_read_trylock+0x2c/0xd0 [ 338.519648] folio_lock_anon_vma_read+0x74/0x2c8 [ 338.519650] rmap_walk_anon+0x1d8/0x2c0 [ 338.519652] folio_referenced+0x1b4/0x1e0 [ 338.519655] shrink_folio_list+0x768/0x10c8 [ 338.519658] shrink_lruvec+0x5dc/0xb30 [ 338.519660] shrink_node+0x4d8/0x8b0 [ 338.519662] do_try_to_free_pages+0xe0/0x5a8 [ 338.519665] try_to_free_mem_cgroup_pages+0x128/0x2d0 [ 338.519667] try_charge_memcg+0x114/0x658 [ 338.519671] __mem_cgroup_charge+0x6c/0xd0 [ 338.519672] __handle_mm_fault+0x42c/0x1640 [ 338.519675] handle_mm_fault+0x70/0x290 [ 338.519677] do_page_fault+0xfc/0x4d8 [ 338.519681] do_translation_fault+0xa4/0xc0 [ 338.519682] do_mem_abort+0x4c/0xa8 [ 338.519685] el0_da+0x2c/0x78 [ 338.519687] el0t_64_sync_handler+0xb8/0x130 [ 338.519689] el0t_64_sync+0x190/0x198 [ 338.519692] Code: aa0003e1 b9400862 11000442 b9000862 (f9400000) [ 338.519693] ---[ end trace 0000000000000000 ]--- The fault is when trying to do an atomic_long_read(&sem->count) here: struct anon_vma *folio_lock_anon_vma_read(struct folio *folio, struct rmap_walk_control *rwc) { struct anon_vma *anon_vma = NULL; struct anon_vma *root_anon_vma; unsigned long anon_mapping; retry: rcu_read_lock(); anon_mapping = (unsigned long)READ_ONCE(folio->mapping); if ((anon_mapping & PAGE_MAPPING_FLAGS) != 
PAGE_MAPPING_ANON) goto out; if (!folio_mapped(folio)) goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); root_anon_vma = READ_ONCE(anon_vma->root); if (down_read_trylock(&root_anon_vma->rwsem)) { <<<<<<< I guess we are still corrupting folios? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2024-03-07 8:56 ` Ryan Roberts @ 2024-03-07 13:50 ` Yin, Fengwei 2024-03-07 14:05 ` Re: Matthew Wilcox 0 siblings, 1 reply; 35+ messages in thread From: Yin, Fengwei @ 2024-03-07 13:50 UTC (permalink / raw) To: Ryan Roberts, Matthew Wilcox, Zi Yan Cc: Andrew Morton, linux-mm, Yang Shi, Huang Ying On 3/7/2024 4:56 PM, wrote: > I just want to make sure I've understood correctly: CPU1's folio_put() > is not the last reference, and it keeps iterating through the local > list. Then CPU2 does the final folio_put() which causes list_del_init() > to modify the local list concurrently with CPU1's iteration, so CPU1 > probably goes into the weeds? My understanding is this can not corrupt the folio->deferred_list as this folio was iterated already. But I did see other strange thing: [ 76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0xffffbd0a0 pfn:0x2554a0 [ 76.270483] note: kcompactd0[62] exited with preempt_count 1 [ 76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0 This large folio has order 0? Maybe folio->_flags_1 was screwed? In free_unref_folios(), there is code like following: if (order > 0 && folio_test_large_rmappable(folio)) folio_undo_large_rmappable(folio); But with destroy_large_folio(): if (folio_test_large_rmappable(folio)) folio_undo_large_rmappable(folio); Can it connect to the folio has zero refcount still in deferred list with Matthew's patch? Looks like folio order was cleared unexpected somewhere. Regards Yin, Fengwei ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2024-03-07 13:50 ` Yin, Fengwei @ 2024-03-07 14:05 ` Matthew Wilcox 2024-03-07 15:24 ` Re: Ryan Roberts 2024-03-08 1:06 ` Re: Yin, Fengwei 0 siblings, 2 replies; 35+ messages in thread From: Matthew Wilcox @ 2024-03-07 14:05 UTC (permalink / raw) To: Yin, Fengwei Cc: Ryan Roberts, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote: > > > On 3/7/2024 4:56 PM, wrote: > > I just want to make sure I've understood correctly: CPU1's folio_put() > > is not the last reference, and it keeps iterating through the local > > list. Then CPU2 does the final folio_put() which causes list_del_init() > > to modify the local list concurrently with CPU1's iteration, so CPU1 > > probably goes into the weeds? > > My understanding is this can not corrupt the folio->deferred_list as > this folio was iterated already. I am not convinced about that at all. It's possible this isn't the only problem, but deleting something from a list without holding (the correct) lock is something you have to think incredibly hard about to get right. I didn't bother going any deeper into the analysis once I spotted the locking problem, but the proof is very much on you that this is not a bug! > But I did see other strange thing: > [ 76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000 > index:0xffffbd0a0 pfn:0x2554a0 > [ 76.270483] note: kcompactd0[62] exited with preempt_count 1 > [ 76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0 > > This large folio has order 0? Maybe folio->_flags_1 was screwed? > > In free_unref_folios(), there is code like following: > if (order > 0 && folio_test_large_rmappable(folio)) > folio_undo_large_rmappable(folio); > > But with destroy_large_folio(): > if (folio_test_large_rmappable(folio)) > > folio_undo_large_rmappable(folio); > > Can it connect to the folio has zero refcount still in deferred list > with Matthew's patch? > > > Looks like folio order was cleared unexpected somewhere. No, we intentionally clear it: free_unref_folios -> free_unref_page_prepare -> free_pages_prepare -> page[1].flags &= ~PAGE_FLAGS_SECOND; PAGE_FLAGS_SECOND includes the order, which is why we have to save it away in folio->private so that we know what it is in the second loop. So it's always been cleared by the time we call free_page_is_bad(). ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2024-03-07 14:05 ` Re: Matthew Wilcox @ 2024-03-07 15:24 ` Ryan Roberts 2024-03-07 16:24 ` Re: Ryan Roberts 2024-03-08 1:06 ` Re: Yin, Fengwei 1 sibling, 1 reply; 35+ messages in thread From: Ryan Roberts @ 2024-03-07 15:24 UTC (permalink / raw) To: Matthew Wilcox, Yin, Fengwei Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying On 07/03/2024 14:05, Matthew Wilcox wrote: > On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote: >> >> >> On 3/7/2024 4:56 PM, wrote: >>> I just want to make sure I've understood correctly: CPU1's folio_put() >>> is not the last reference, and it keeps iterating through the local >>> list. Then CPU2 does the final folio_put() which causes list_del_init() >>> to modify the local list concurrently with CPU1's iteration, so CPU1 >>> probably goes into the weeds? >> >> My understanding is this can not corrupt the folio->deferred_list as >> this folio was iterated already. > > I am not convinced about that at all. It's possible this isn't the only > problem, but deleting something from a list without holding (the correct) > lock is something you have to think incredibly hard about to get right. > I didn't bother going any deeper into the analysis once I spotted the > locking problem, but the proof is very much on you that this is not a bug! > >> But I did see other strange thing: >> [ 76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000 >> index:0xffffbd0a0 pfn:0x2554a0 >> [ 76.270483] note: kcompactd0[62] exited with preempt_count 1 >> [ 76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0 >> >> This large folio has order 0? Maybe folio->_flags_1 was screwed? >> >> In free_unref_folios(), there is code like following: >> if (order > 0 && folio_test_large_rmappable(folio)) >> folio_undo_large_rmappable(folio); >> >> But with destroy_large_folio(): >> if (folio_test_large_rmappable(folio)) >> >> folio_undo_large_rmappable(folio); >> >> Can it connect to the folio has zero refcount still in deferred list >> with Matthew's patch? >> >> >> Looks like folio order was cleared unexpected somewhere. I think there could be something to this... I have a set up where, when running with Matthew's deferred split fix AND have commit 31b2ff82aefb "mm: handle large folios in free_unref_folios()" REVERTED, everything works as expected. And at the end, I have the expected amount of memory free (seen in meminfo and buddyinfo). But if I run only with the deferred split fix and DO NOT revert the other change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU stalls where I can't even interact on the serial port. Sometimes (more usually) everything just gets stuck trying to reclaim and allocate memory. And when I kill the jobs, I still have barely any memory in the system - about 10% what I would expect. So is it possible that after commit 31b2ff82aefb "mm: handle large folios in free_unref_folios()", when freeing 2M folio back to the buddy, we are actually only telling it about the first 4K page? So we end up leaking the rest? > > No, we intentionally clear it: > > free_unref_folios -> free_unref_page_prepare -> free_pages_prepare -> > page[1].flags &= ~PAGE_FLAGS_SECOND; > > PAGE_FLAGS_SECOND includes the order, which is why we have to save it > away in folio->private so that we know what it is in the second loop. > So it's always been cleared by the time we call free_page_is_bad(). ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2024-03-07 15:24 ` Re: Ryan Roberts @ 2024-03-07 16:24 ` Ryan Roberts 2024-03-07 23:02 ` Re: Matthew Wilcox 0 siblings, 1 reply; 35+ messages in thread From: Ryan Roberts @ 2024-03-07 16:24 UTC (permalink / raw) To: Matthew Wilcox, Yin, Fengwei Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying On 07/03/2024 15:24, Ryan Roberts wrote: > On 07/03/2024 14:05, Matthew Wilcox wrote: >> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote: >>> >>> >>> On 3/7/2024 4:56 PM, wrote: >>>> I just want to make sure I've understood correctly: CPU1's folio_put() >>>> is not the last reference, and it keeps iterating through the local >>>> list. Then CPU2 does the final folio_put() which causes list_del_init() >>>> to modify the local list concurrently with CPU1's iteration, so CPU1 >>>> probably goes into the weeds? >>> >>> My understanding is this can not corrupt the folio->deferred_list as >>> this folio was iterated already. >> >> I am not convinced about that at all. It's possible this isn't the only >> problem, but deleting something from a list without holding (the correct) >> lock is something you have to think incredibly hard about to get right. >> I didn't bother going any deeper into the analysis once I spotted the >> locking problem, but the proof is very much on you that this is not a bug! >> >>> But I did see other strange thing: >>> [ 76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000 >>> index:0xffffbd0a0 pfn:0x2554a0 >>> [ 76.270483] note: kcompactd0[62] exited with preempt_count 1 >>> [ 76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0 >>> >>> This large folio has order 0? Maybe folio->_flags_1 was screwed? >>> >>> In free_unref_folios(), there is code like following: >>> if (order > 0 && folio_test_large_rmappable(folio)) >>> folio_undo_large_rmappable(folio); >>> >>> But with destroy_large_folio(): >>> if (folio_test_large_rmappable(folio)) >>> >>> folio_undo_large_rmappable(folio); >>> >>> Can it connect to the folio has zero refcount still in deferred list >>> with Matthew's patch? >>> >>> >>> Looks like folio order was cleared unexpected somewhere. > > I think there could be something to this... > > I have a set up where, when running with Matthew's deferred split fix AND have > commit 31b2ff82aefb "mm: handle large folios in free_unref_folios()" REVERTED, > everything works as expected. And at the end, I have the expected amount of > memory free (seen in meminfo and buddyinfo). > > But if I run only with the deferred split fix and DO NOT revert the other > change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU > stalls where I can't even interact on the serial port. Sometimes (more usually) > everything just gets stuck trying to reclaim and allocate memory. And when I > kill the jobs, I still have barely any memory in the system - about 10% what I > would expect. > > So is it possible that after commit 31b2ff82aefb "mm: handle large folios in > free_unref_folios()", when freeing 2M folio back to the buddy, we are actually > only telling it about the first 4K page? So we end up leaking the rest? I notice that before the commit, large folios are uncharged with __mem_cgroup_uncharge() and now they use __mem_cgroup_uncharge_folios(). The former has an upfront check: if (!folio_memcg(folio)) return; I'm not exactly sure what that's checking but could the fact this is missing after the change cause things to go wonky? 
> >> >> No, we intentionally clear it: >> >> free_unref_folios -> free_unref_page_prepare -> free_pages_prepare -> >> page[1].flags &= ~PAGE_FLAGS_SECOND; >> >> PAGE_FLAGS_SECOND includes the order, which is why we have to save it >> away in folio->private so that we know what it is in the second loop. >> So it's always been cleared by the time we call free_page_is_bad(). > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2024-03-07 16:24 ` Re: Ryan Roberts @ 2024-03-07 23:02 ` Matthew Wilcox 0 siblings, 0 replies; 35+ messages in thread From: Matthew Wilcox @ 2024-03-07 23:02 UTC (permalink / raw) To: Ryan Roberts Cc: Yin, Fengwei, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying On Thu, Mar 07, 2024 at 04:24:43PM +0000, Ryan Roberts wrote: > > But if I run only with the deferred split fix and DO NOT revert the other > > change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU > > stalls where I can't even interact on the serial port. Sometimes (more usually) > > everything just gets stuck trying to reclaim and allocate memory. And when I > > kill the jobs, I still have barely any memory in the system - about 10% what I > > would expect. (for the benefit of anyone trying to follow along, this is now understood; it was my missing folio_put() in the 'folio_trylock failed' path) > I notice that before the commit, large folios are uncharged with > __mem_cgroup_uncharge() and now they use __mem_cgroup_uncharge_folios(). > > The former has an upfront check: > > if (!folio_memcg(folio)) > return; > > I'm not exactly sure what that's checking but could the fact this is missing > after the change cause things to go wonky? Honestly, I think that's stale. uncharge_folio() checks the same thing very early on, so all it's actually saving is a test of the LRU flag. Looks like the need for it went away in 2017 with commit a9d5adeeb4b2c73c which stopped using page->lru to gather the single page onto a degenerate list. I'll try to remember to submit a patch to delete that check. By the way, something we could try to see if the problem goes away is to re-narrow the window that i widened. ie something like this: +++ b/mm/swap.c @@ -1012,6 +1012,8 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs) free_huge_folio(folio); continue; } + if (folio_test_large(folio) && folio_test_large_rmappable(folio)) + folio_undo_large_rmappable(folio); __page_cache_release(folio, &lruvec, &flags); ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2024-03-07 14:05 ` Re: Matthew Wilcox 2024-03-07 15:24 ` Re: Ryan Roberts @ 2024-03-08 1:06 ` Yin, Fengwei 1 sibling, 0 replies; 35+ messages in thread From: Yin, Fengwei @ 2024-03-08 1:06 UTC (permalink / raw) To: Matthew Wilcox Cc: Ryan Roberts, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying On 3/7/2024 10:05 PM, Matthew Wilcox wrote: > On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote: >> >> >> On 3/7/2024 4:56 PM, wrote: >>> I just want to make sure I've understood correctly: CPU1's folio_put() >>> is not the last reference, and it keeps iterating through the local >>> list. Then CPU2 does the final folio_put() which causes list_del_init() >>> to modify the local list concurrently with CPU1's iteration, so CPU1 >>> probably goes into the weeds? >> >> My understanding is this can not corrupt the folio->deferred_list as >> this folio was iterated already. > > I am not convinced about that at all. It's possible this isn't the only > problem, but deleting something from a list without holding (the correct) > lock is something you have to think incredibly hard about to get right. > I didn't bother going any deeper into the analysis once I spotted the > locking problem, but the proof is very much on you that this is not a bug! Removing folio from deferred_list in folio_put() also needs require split_queue_lock. So my understanding is no deleting without hold correct lock. local list iteration is impacted. But that's not the issue Ryan hit here. > >> But I did see other strange thing: >> [ 76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000 >> index:0xffffbd0a0 pfn:0x2554a0 >> [ 76.270483] note: kcompactd0[62] exited with preempt_count 1 >> [ 76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0 >> >> This large folio has order 0? Maybe folio->_flags_1 was screwed? >> >> In free_unref_folios(), there is code like following: >> if (order > 0 && folio_test_large_rmappable(folio)) >> folio_undo_large_rmappable(folio); >> >> But with destroy_large_folio(): >> if (folio_test_large_rmappable(folio)) >> >> folio_undo_large_rmappable(folio); >> >> Can it connect to the folio has zero refcount still in deferred list >> with Matthew's patch? >> >> >> Looks like folio order was cleared unexpected somewhere. > > No, we intentionally clear it: > > free_unref_folios -> free_unref_page_prepare -> free_pages_prepare -> > page[1].flags &= ~PAGE_FLAGS_SECOND; > > PAGE_FLAGS_SECOND includes the order, which is why we have to save it > away in folio->private so that we know what it is in the second loop. > So it's always been cleared by the time we call free_page_is_bad(). Oh. That's the key. Thanks a lot for detail explanation. I thought there was a bug in other place, covered by destroy_large_folio() but exposed by free_unref_folios()... Regards Yin, Fengwei ^ permalink raw reply [flat|nested] 35+ messages in thread
* [RFC] [PATCH 0/3] xfs: use large folios for buffers @ 2024-01-18 22:19 Dave Chinner 2024-01-22 10:13 ` Andi Kleen 0 siblings, 1 reply; 35+ messages in thread From: Dave Chinner @ 2024-01-18 22:19 UTC (permalink / raw) To: linux-xfs; +Cc: willy, linux-mm The XFS buffer cache supports metadata buffers up to 64kB, and it does so by aggregating multiple pages into a single contiguous memory region using vmapping. This is expensive (both the setup and the runtime TLB mapping cost), and would be unnecessary if we could allocate large contiguous memory regions for the buffers in the first place. Enter multi-page folios. This patchset converts the buffer cache to use the folio API, then enhances it to optimisitically use large folios where possible. It retains the old "vmap an array of single page folios" functionality as a fallback when large folio allocation fails. This means that, like page cache support for large folios, we aren't dependent on large folio allocation succeeding all the time. This relegates the single page array allocation mechanism to the "slow path" that we don't have to care so much about performance of this path anymore. This might allow us to simplify it a bit in future. One of the issues with the folio conversion is that we use a couple of APIs that take struct page ** (i.e. pointers to page pointer arrays) and there aren't folio counterparts. These are the bulk page allocator and vm_map_ram(). In the cases where they are used, we cast &bp->b_folios[] to (struct page **) knowing that this array will only contain single page folios and that single page folios and struct page are the same structure and so have the same address. This is a bit of a hack (hence the RFC) but I'm not sure that it's worth adding folio versions of these interfaces right now. We don't need to use the bulk page allocator so much any more, because that's now a slow path and we could probably just call folio_alloc() in a loop like we used to. What to do about vm_map_ram() is a little less clear.... The other issue I tripped over in doing this conversion is that the discontiguous buffer straddling code in the buf log item dirty region tracking is broken. We don't actually exercise that code on existing configurations, and I tripped over it when tracking down a bug in the folio conversion. I fixed it and short-circuted the check for contiguous buffers, but that didn't fix the failure I was seeing (which was not handling bp->b_offset and large folios properly when building bios). Apart from those issues, the conversion and enhancement is relatively straight forward. It passes fstests on both 512 and 4096 byte sector size storage (512 byte sectors exercise the XBF_KMEM path which has non-zero bp->b_offset values) and doesn't appear to cause any problems with large directory buffers, though I haven't done any real testing on those yet. Large folio allocations are definitely being exercised, though, as all the inode cluster buffers are 16kB on a 512 byte inode V5 filesystem. Thoughts, comments, etc? Note: this patchset is on top of the NOFS removal patchset I sent a few days ago. That can be pulled from this git branch: https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-kmem-cleanup ^ permalink raw reply [flat|nested] 35+ messages in thread
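For a concrete picture of the allocation policy described in the cover letter, here is a
minimal sketch under stated assumptions: it shows only the optimistic large-folio attempt,
the function name buf_try_large_folio() is made up, and the real patchset wires this into
the XFS buffer structures rather than returning a bare pointer. On failure the caller would
fall back to the existing array-of-single-page-folios plus vm_map_ram() slow path.

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Illustrative sketch only (not the patchset's code): optimistically back a
 * buffer with one physically contiguous large folio. If the allocation
 * fails, return NULL so the caller can take the existing slow path that
 * allocates single-page folios and vmaps them together.
 */
static void *buf_try_large_folio(size_t size, struct folio **foliop)
{
	struct folio *folio;

	folio = folio_alloc(GFP_KERNEL | __GFP_NOWARN, get_order(size));
	if (!folio)
		return NULL;	/* caller falls back to the vm_map_ram() path */

	*foliop = folio;
	return folio_address(folio);	/* already contiguous, no vmap needed */
}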
* (no subject) 2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner @ 2024-01-22 10:13 ` Andi Kleen 2024-01-22 11:53 ` Dave Chinner 0 siblings, 1 reply; 35+ messages in thread From: Andi Kleen @ 2024-01-22 10:13 UTC (permalink / raw) To: linux-xfs, david, linux-mm Dave Chinner <david@fromorbit.com> writes: > Thoughts, comments, etc? The interesting part is if it will cause additional tail latencies allocating under fragmentation with direct reclaim, compaction etc. being triggered before it falls back to the base page path. In fact it is highly likely it will, the question is just how bad it is. Unfortunately benchmarking for that isn't that easy, it needs artificial memory fragmentation and then some high stress workload, and then instrumenting the transactions for individual latencies. I would in any case add a tunable for it in case people run into this. Tail latencies are a common concern on many IO workloads. -Andi ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2024-01-22 10:13 ` Andi Kleen @ 2024-01-22 11:53 ` Dave Chinner 0 siblings, 0 replies; 35+ messages in thread From: Dave Chinner @ 2024-01-22 11:53 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-xfs, linux-mm On Mon, Jan 22, 2024 at 02:13:23AM -0800, Andi Kleen wrote: > Dave Chinner <david@fromorbit.com> writes: > > > Thoughts, comments, etc? > > The interesting part is if it will cause additional tail latencies > allocating under fragmentation with direct reclaim, compaction > etc. being triggered before it falls back to the base page path. It's not like I don't know these problems exist with memory allocation. Go have a look at xlog_kvmalloc() which is an open coded kvmalloc() that allows the high order kmalloc allocations to fail-fast without triggering all the expensive and unnecessary direct reclaim overhead (e.g. compaction!) because we can fall back to vmalloc without huge concerns. When high order allocations start to fail, then we fall back to vmalloc and then we hit the long standing vmalloc scalability problems before anything else in XFS or the IO path becomes a bottleneck. IOWs, we already know that fail-fast high-order allocation is a more efficient and effective fast path than using vmalloc/vmap_ram() all the time. As this is an RFC, I haven't implemented stuff like this yet - I haven't seen anything in the profiles indicating that high order folio allocation is failing and causing lots of reclaim overhead, so I simply haven't added fail-fast behaviour yet... > In fact it is highly likely it will, the question is just how bad it is. > > Unfortunately benchmarking for that isn't that easy, it needs artificial > memory fragmentation and then some high stress workload, and then > instrumenting the transactions for individual latencies. I stress test and measure XFS metadata performance under sustained memory pressure all the time. This change has not caused any obvious regressions in the short time I've been testing it. I still need to do perf testing on large directory block sizes. That is where high-order allocations will get stressed - that's where xlog_kvmalloc() starts dominating the profiles as it trips over vmalloc scalability issues... > I would in any case add a tunable for it in case people run into this. No tunables. It either works or it doesn't. If we can't make it work reliably by default, we throw it in the dumpster, light it on fire and walk away. > Tail latencies are a common concern on many IO workloads. Yes, for user data operations it's a common concern. For metadata, not so much - there's so many far worse long tail latencies in metadata operations (like waiting for journal space) that memory allocation latencies in the metadata IO path are largely noise.... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 35+ messages in thread
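To make the fail-fast idea concrete, this is a rough sketch of the pattern described above:
an open-coded kvmalloc() that gives up on the contiguous attempt quickly and falls back to
vmalloc(). It illustrates the pattern only and is not the actual xlog_kvmalloc() source.

#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Fail-fast allocation sketch: try a physically contiguous allocation
 * without triggering expensive direct reclaim/compaction retries, and
 * fall back to vmalloc() when the contiguous attempt fails.
 */
static void *fail_fast_kvmalloc(size_t size)
{
	void *p;

	p = kmalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
	if (p)
		return p;

	return vmalloc(size);
}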
* (no subject)
@ 2023-05-11 12:58 Ryan Roberts
2023-05-11 13:13 ` Ryan Roberts
0 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2023-05-11 12:58 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle),
Kirill A. Shutemov, SeongJae Park
Cc: Ryan Roberts, linux-kernel, linux-mm, damon
Date: Thu, 11 May 2023 11:38:28 +0100
Subject: [PATCH v1 0/5] Encapsulate PTE contents from non-arch code
Hi All,
This series improves the encapsulation of pte entries by disallowing non-arch
code from directly dereferencing pte_t pointers. Instead code must use a new
helper, `pte_t ptep_deref(pte_t *ptep)`. By default, this helper does a direct
dereference of the pointer, so generated code should be exactly the same. But
its presence sets us up for arch code being able to override the default to
"virtualize" the ptes without needing to maintain a shadow table.
I intend to take advantage of this for arm64 to enable use of its "contiguous
bit" to coalesce multiple ptes into a single tlb entry, reducing pressure and
improving performance. I have an RFC for the first part of this work at [1]. The
cover letter there also explains the second part, which this series is enabling.
I intend to post an RFC for the contpte changes in due course, but it would be
good to get the ball rolling on this enabler.
There are 2 reasons that I need the encapsulation:
- Prevent leaking the arch-private PTE_CONT bit to the core code. If the core
code reads a pte that contains this bit, it could end up calling
set_pte_at() with the bit set which would confuse the implementation. So we
can always clear PTE_CONT in ptep_deref() (and ptep_get()) to avoid a leaky
abstraction.
- Contiguous ptes have a single access and dirty bit for the contiguous range.
So we need to "mix-in" those bits when the core is dereferencing a pte that
lies in the contig range. There is code that dereferences the pte then takes
different actions based on access/dirty (see e.g. write_protect_page()).
While ptep_get() and ptep_get_lockless() already exist, both of them are
implemented using READ_ONCE() by default. While we could use ptep_get() instead
of the new ptep_deref(), I didn't want to risk performance regression.
Alternatively, all call sites that currently use ptep_get() that need the
lockless behaviour could be upgraded to ptep_get_lockless() and ptep_get() could
be downgraded to a simple dereference. That would be cleanest, but is a much
bigger (and likely error prone) change because all the arch code would need to
be updated for the new definitions of ptep_get().
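For comparison, the existing generic ptep_get() default referred to above is READ_ONCE()-based,
roughly as in this simplified rendering:

/* Simplified rendering of the generic default in include/linux/pgtable.h. */
#ifndef ptep_get
static inline pte_t ptep_get(pte_t *ptep)
{
	return READ_ONCE(*ptep);
}
#endif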
The series is split up as follows:
patches 1-2: Fix bugs where code was _setting_ ptes directly, rather than using
set_pte_at() and friends.
patch 3: Fix highmem unmapping issue I spotted while doing the work.
patch 4: Introduce the new ptep_deref() helper with default implementation.
patch 5: Convert all direct dereferences to use ptep_deref().
[1] https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
Thanks,
Ryan
Ryan Roberts (5):
mm: vmalloc must set pte via arch code
mm: damon must atomically clear young on ptes and pmds
mm: Fix failure to unmap pte on highmem systems
mm: Add new ptep_deref() helper to fully encapsulate pte_t
mm: ptep_deref() conversion
.../drm/i915/gem/selftests/i915_gem_mman.c | 8 +-
drivers/misc/sgi-gru/grufault.c | 2 +-
drivers/vfio/vfio_iommu_type1.c | 7 +-
drivers/xen/privcmd.c | 2 +-
fs/proc/task_mmu.c | 33 +++---
fs/userfaultfd.c | 6 +-
include/linux/hugetlb.h | 2 +-
include/linux/mm_inline.h | 2 +-
include/linux/pgtable.h | 13 ++-
kernel/events/uprobes.c | 2 +-
mm/damon/ops-common.c | 18 ++-
mm/damon/ops-common.h | 4 +-
mm/damon/paddr.c | 6 +-
mm/damon/vaddr.c | 14 ++-
mm/filemap.c | 2 +-
mm/gup.c | 21 ++--
mm/highmem.c | 12 +-
mm/hmm.c | 2 +-
mm/huge_memory.c | 4 +-
mm/hugetlb.c | 2 +-
mm/hugetlb_vmemmap.c | 6 +-
mm/kasan/init.c | 9 +-
mm/kasan/shadow.c | 10 +-
mm/khugepaged.c | 24 ++--
mm/ksm.c | 22 ++--
mm/madvise.c | 6 +-
mm/mapping_dirty_helpers.c | 4 +-
mm/memcontrol.c | 4 +-
mm/memory-failure.c | 6 +-
mm/memory.c | 103 +++++++++---------
mm/mempolicy.c | 6 +-
mm/migrate.c | 14 ++-
mm/migrate_device.c | 14 ++-
mm/mincore.c | 2 +-
mm/mlock.c | 6 +-
mm/mprotect.c | 8 +-
mm/mremap.c | 2 +-
mm/page_table_check.c | 4 +-
mm/page_vma_mapped.c | 26 +++--
mm/pgtable-generic.c | 2 +-
mm/rmap.c | 32 +++---
mm/sparse-vmemmap.c | 8 +-
mm/swap_state.c | 4 +-
mm/swapfile.c | 16 +--
mm/userfaultfd.c | 4 +-
mm/vmalloc.c | 11 +-
mm/vmscan.c | 14 ++-
virt/kvm/kvm_main.c | 9 +-
48 files changed, 302 insertions(+), 236 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 35+ messages in thread* Re: 2023-05-11 12:58 Ryan Roberts @ 2023-05-11 13:13 ` Ryan Roberts 0 siblings, 0 replies; 35+ messages in thread From: Ryan Roberts @ 2023-05-11 13:13 UTC (permalink / raw) To: Andrew Morton, Matthew Wilcox (Oracle), Kirill A. Shutemov, SeongJae Park Cc: linux-kernel, linux-mm, damon My appologies for the noise: A blank line between Cc and Subject has broken the subject and grouping in lore. Please Ignore this, I will resend. On 11/05/2023 13:58, Ryan Roberts wrote: > Date: Thu, 11 May 2023 11:38:28 +0100 > Subject: [PATCH v1 0/5] Encapsulate PTE contents from non-arch code > > Hi All, > > This series improves the encapsulation of pte entries by disallowing non-arch > code from directly dereferencing pte_t pointers. Instead code must use a new > helper, `pte_t ptep_deref(pte_t *ptep)`. By default, this helper does a direct > dereference of the pointer, so generated code should be exactly the same. But > it's presence sets us up for arch code being able to override the default to > "virtualize" the ptes without needing to maintain a shadow table. > > I intend to take advantage of this for arm64 to enable use of its "contiguous > bit" to coalesce multiple ptes into a single tlb entry, reducing pressure and > improving performance. I have an RFC for the first part of this work at [1]. The > cover letter there also explains the second part, which this series is enabling. > > I intend to post an RFC for the contpte changes in due course, but it would be > good to get the ball rolling on this enabler. > > There are 2 reasons that I need the encapsulation: > > - Prevent leaking the arch-private PTE_CONT bit to the core code. If the core > code reads a pte that contains this bit, it could end up calling > set_pte_at() with the bit set which would confuse the implementation. So we > can always clear PTE_CONT in ptep_deref() (and ptep_get()) to avoid a leaky > abstraction. > - Contiguous ptes have a single access and dirty bit for the contiguous range. > So we need to "mix-in" those bits when the core is dereferencing a pte that > lies in the contig range. There is code that dereferences the pte then takes > different actions based on access/dirty (see e.g. write_protect_page()). > > While ptep_get() and ptep_get_lockless() already exist, both of them are > implemented using READ_ONCE() by default. While we could use ptep_get() instead > of the new ptep_deref(), I didn't want to risk performance regression. > Alternatively, all call sites that currently use ptep_get() that need the > lockless behaviour could be upgraded to ptep_get_lockless() and ptep_get() could > be downgraded to a simple dereference. That would be cleanest, but is a much > bigger (and likely error prone) change because all the arch code would need to > be updated for the new definitions of ptep_get(). > > The series is split up as follows: > > patchs 1-2: Fix bugs where code was _setting_ ptes directly, rather than using > set_pte_at() and friends. > patch 3: Fix highmem unmapping issue I spotted while doing the work. > patch 4: Introduce the new ptep_deref() helper with default implementation. > patch 5: Convert all direct dereferences to use ptep_deref(). 
> > [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/ > > Thanks, > Ryan > > > Ryan Roberts (5): > mm: vmalloc must set pte via arch code > mm: damon must atomically clear young on ptes and pmds > mm: Fix failure to unmap pte on highmem systems > mm: Add new ptep_deref() helper to fully encapsulate pte_t > mm: ptep_deref() conversion > > .../drm/i915/gem/selftests/i915_gem_mman.c | 8 +- > drivers/misc/sgi-gru/grufault.c | 2 +- > drivers/vfio/vfio_iommu_type1.c | 7 +- > drivers/xen/privcmd.c | 2 +- > fs/proc/task_mmu.c | 33 +++--- > fs/userfaultfd.c | 6 +- > include/linux/hugetlb.h | 2 +- > include/linux/mm_inline.h | 2 +- > include/linux/pgtable.h | 13 ++- > kernel/events/uprobes.c | 2 +- > mm/damon/ops-common.c | 18 ++- > mm/damon/ops-common.h | 4 +- > mm/damon/paddr.c | 6 +- > mm/damon/vaddr.c | 14 ++- > mm/filemap.c | 2 +- > mm/gup.c | 21 ++-- > mm/highmem.c | 12 +- > mm/hmm.c | 2 +- > mm/huge_memory.c | 4 +- > mm/hugetlb.c | 2 +- > mm/hugetlb_vmemmap.c | 6 +- > mm/kasan/init.c | 9 +- > mm/kasan/shadow.c | 10 +- > mm/khugepaged.c | 24 ++-- > mm/ksm.c | 22 ++-- > mm/madvise.c | 6 +- > mm/mapping_dirty_helpers.c | 4 +- > mm/memcontrol.c | 4 +- > mm/memory-failure.c | 6 +- > mm/memory.c | 103 +++++++++--------- > mm/mempolicy.c | 6 +- > mm/migrate.c | 14 ++- > mm/migrate_device.c | 14 ++- > mm/mincore.c | 2 +- > mm/mlock.c | 6 +- > mm/mprotect.c | 8 +- > mm/mremap.c | 2 +- > mm/page_table_check.c | 4 +- > mm/page_vma_mapped.c | 26 +++-- > mm/pgtable-generic.c | 2 +- > mm/rmap.c | 32 +++--- > mm/sparse-vmemmap.c | 8 +- > mm/swap_state.c | 4 +- > mm/swapfile.c | 16 +-- > mm/userfaultfd.c | 4 +- > mm/vmalloc.c | 11 +- > mm/vmscan.c | 14 ++- > virt/kvm/kvm_main.c | 9 +- > 48 files changed, 302 insertions(+), 236 deletions(-) > > -- > 2.25.1 > ^ permalink raw reply [flat|nested] 35+ messages in thread
[parent not found: <20220421164138.1250943-1-yury.norov@gmail.com>]
* Re: [not found] <20220421164138.1250943-1-yury.norov@gmail.com> @ 2022-04-21 23:04 ` John Hubbard 2022-04-21 23:09 ` Re: John Hubbard 2022-04-21 23:17 ` Re: Yury Norov 0 siblings, 2 replies; 35+ messages in thread From: John Hubbard @ 2022-04-21 23:04 UTC (permalink / raw) To: Yury Norov, Andrew Morton, Minchan Kim, linux-mm, linux-kernel On 4/21/22 09:41, Yury Norov wrote: > Subject: [PATCH] mm/gup: fix comments to pin_user_pages_*() > Hi Yuri, Thanks for picking this up. I have been distracted and didn't trust myself to focus on this properly, so it's good to have help! IT/admin point: somehow the first line of the commit description didn't make it into an actual email subject. The subject line was blank when it arrived in my inbox, and the subject is in the body here instead. Not sure how that happened. Maybe check your git-sendemail setup? > pin_user_pages API forces FOLL_PIN in gup_flags, which means that the > API requires struct page **pages to be provided (not NULL). However, > the comment to pin_user_pages() says: > > * @pages: array that receives pointers to the pages pinned. > * Should be at least nr_pages long. Or NULL, if caller > * only intends to ensure the pages are faulted in. > > This patch fixes comments along the pin_user_pages code, and also adds > WARN_ON(!pages), so that API users will have better understanding > on how to use it. No need to quote the code in the commit log. Instead, just summarize. For example: pin_user_pages API forces FOLL_PIN in gup_flags, which means that the API requires struct page **pages to be provided (not NULL). However, the comment to pin_user_pages() clearly allows for passing in a NULL @pages argument. Remove the incorrect comments, and add WARN_ON_ONCE(!pages) calls to enforce the API. > > It has been independently spotted by Minchan Kim and confirmed with > John Hubbard: > > https://lore.kernel.org/all/YgWA0ghrrzHONehH@google.com/ Let's add a Cc: line for Michan as well: Cc: Minchan Kim <minchan@kernel.org> > > Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> > --- > mm/gup.c | 26 ++++++++++++++++++++++---- > 1 file changed, 22 insertions(+), 4 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index f598a037eb04..559626457585 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2871,6 +2871,10 @@ int pin_user_pages_fast(unsigned long start, int nr_pages, > if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > return -EINVAL; > > + /* FOLL_PIN requires pages != NULL */ Please delete each and every one of these one-line comments, because they merely echo what the code says. > + if (WARN_ON_ONCE(!pages)) > + return -EINVAL; > + > gup_flags |= FOLL_PIN; > return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages); > } > @@ -2893,6 +2897,10 @@ int pin_user_pages_fast_only(unsigned long start, int nr_pages, > */ > if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > return 0; > + > + /* FOLL_PIN requires pages != NULL */ > + if (WARN_ON_ONCE(!pages)) > + return 0; > /* > * FOLL_FAST_ONLY is required in order to match the API description of > * this routine: no fall back to regular ("slow") GUP. > @@ -2920,8 +2928,7 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast_only); > * @nr_pages: number of pages from start to pin > * @gup_flags: flags modifying lookup behaviour > * @pages: array that receives pointers to the pages pinned. > - * Should be at least nr_pages long. Or NULL, if caller > - * only intends to ensure the pages are faulted in. > + * Should be at least nr_pages long. > * @vmas: array of pointers to vmas corresponding to each page. 
> * Or NULL if the caller does not require them. > * @locked: pointer to lock flag indicating whether lock is held and > @@ -2944,6 +2951,10 @@ long pin_user_pages_remote(struct mm_struct *mm, > if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > return -EINVAL; > > + /* FOLL_PIN requires pages != NULL */ > + if (WARN_ON_ONCE(!pages)) > + return -EINVAL; > + > gup_flags |= FOLL_PIN; > return __get_user_pages_remote(mm, start, nr_pages, gup_flags, > pages, vmas, locked); > @@ -2957,8 +2968,7 @@ EXPORT_SYMBOL(pin_user_pages_remote); > * @nr_pages: number of pages from start to pin > * @gup_flags: flags modifying lookup behaviour > * @pages: array that receives pointers to the pages pinned. > - * Should be at least nr_pages long. Or NULL, if caller > - * only intends to ensure the pages are faulted in. > + * Should be at least nr_pages long. > * @vmas: array of pointers to vmas corresponding to each page. > * Or NULL if the caller does not require them. > * > @@ -2976,6 +2986,10 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages, > if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > return -EINVAL; > > + /* FOLL_PIN requires pages != NULL */ > + if (WARN_ON_ONCE(!pages)) > + return -EINVAL; > + > gup_flags |= FOLL_PIN; > return __gup_longterm_locked(current->mm, start, nr_pages, > pages, vmas, gup_flags); > @@ -2994,6 +3008,10 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, > if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > return -EINVAL; > > + /* FOLL_PIN requires pages != NULL */ > + if (WARN_ON_ONCE(!pages)) > + return -EINVAL; > + > gup_flags |= FOLL_PIN; > return get_user_pages_unlocked(start, nr_pages, pages, gup_flags); > } I hope we don't break any callers with the newly enforced !pages, but it's the right thing to do, in order to avoid misunderstandings. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2022-04-21 23:04 ` Re: John Hubbard @ 2022-04-21 23:09 ` John Hubbard 2022-04-21 23:17 ` Re: Yury Norov 1 sibling, 0 replies; 35+ messages in thread From: John Hubbard @ 2022-04-21 23:09 UTC (permalink / raw) To: Yury Norov, Andrew Morton, Minchan Kim, linux-mm, linux-kernel On 4/21/22 16:04, John Hubbard wrote: > On 4/21/22 09:41, Yury Norov wrote: >> Subject: [PATCH] mm/gup: fix comments to pin_user_pages_*() >> > > Hi Yuri, ...and I see that I have typo'd both Yury's and Minchan's name (further down), in the same email! Really apologize for screwing that up. It's Yury-with-a-"y", I know. :) thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2022-04-21 23:04 ` Re: John Hubbard 2022-04-21 23:09 ` Re: John Hubbard @ 2022-04-21 23:17 ` Yury Norov 2022-04-21 23:21 ` Re: John Hubbard 1 sibling, 1 reply; 35+ messages in thread From: Yury Norov @ 2022-04-21 23:17 UTC (permalink / raw) To: John Hubbard; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel On Thu, Apr 21, 2022 at 04:04:44PM -0700, John Hubbard wrote: > On 4/21/22 09:41, Yury Norov wrote: > > Subject: [PATCH] mm/gup: fix comments to pin_user_pages_*() > > > > Hi Yuri, > > Thanks for picking this up. I have been distracted and didn't trust > myself to focus on this properly, so it's good to have help! > > IT/admin point: somehow the first line of the commit description didn't > make it into an actual email subject. The subject line was blank when it > arrived in my inbox, and the subject is in the body here instead. Not > sure how that happened. > > Maybe check your git-sendemail setup? git-sendmail is OK. I just accidentally added empty line above Subject, which broke format. My bad, sorry for this. > > pin_user_pages API forces FOLL_PIN in gup_flags, which means that the > > API requires struct page **pages to be provided (not NULL). However, > > the comment to pin_user_pages() says: > > > > * @pages: array that receives pointers to the pages pinned. > > * Should be at least nr_pages long. Or NULL, if caller > > * only intends to ensure the pages are faulted in. > > > > This patch fixes comments along the pin_user_pages code, and also adds > > WARN_ON(!pages), so that API users will have better understanding > > on how to use it. > > No need to quote the code in the commit log. Instead, just summarize. > For example: > > pin_user_pages API forces FOLL_PIN in gup_flags, which means that the > API requires struct page **pages to be provided (not NULL). However, the > comment to pin_user_pages() clearly allows for passing in a NULL @pages > argument. > > Remove the incorrect comments, and add WARN_ON_ONCE(!pages) calls to > enforce the API. > > > > > It has been independently spotted by Minchan Kim and confirmed with > > John Hubbard: > > > > https://lore.kernel.org/all/YgWA0ghrrzHONehH@google.com/ > > Let's add a Cc: line for Michan as well: > > Cc: Minchan Kim <minchan@kernel.org> He's in CC already, I think... > > Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> > > --- > > mm/gup.c | 26 ++++++++++++++++++++++---- > > 1 file changed, 22 insertions(+), 4 deletions(-) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index f598a037eb04..559626457585 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -2871,6 +2871,10 @@ int pin_user_pages_fast(unsigned long start, int nr_pages, > > if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > > return -EINVAL; > > + /* FOLL_PIN requires pages != NULL */ > > Please delete each and every one of these one-line comments, because > they merely echo what the code says. Sure. > > + if (WARN_ON_ONCE(!pages)) > > + return -EINVAL; > > + > > gup_flags |= FOLL_PIN; > > return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages); > > } > > @@ -2893,6 +2897,10 @@ int pin_user_pages_fast_only(unsigned long start, int nr_pages, > > */ > > if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > > return 0; > > + > > + /* FOLL_PIN requires pages != NULL */ > > + if (WARN_ON_ONCE(!pages)) > > + return 0; > > /* > > * FOLL_FAST_ONLY is required in order to match the API description of > > * this routine: no fall back to regular ("slow") GUP. 
> > @@ -2920,8 +2928,7 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast_only); > > * @nr_pages: number of pages from start to pin > > * @gup_flags: flags modifying lookup behaviour > > * @pages: array that receives pointers to the pages pinned. > > - * Should be at least nr_pages long. Or NULL, if caller > > - * only intends to ensure the pages are faulted in. > > + * Should be at least nr_pages long. > > * @vmas: array of pointers to vmas corresponding to each page. > > * Or NULL if the caller does not require them. > > * @locked: pointer to lock flag indicating whether lock is held and > > @@ -2944,6 +2951,10 @@ long pin_user_pages_remote(struct mm_struct *mm, > > if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > > return -EINVAL; > > + /* FOLL_PIN requires pages != NULL */ > > + if (WARN_ON_ONCE(!pages)) > > + return -EINVAL; > > + > > gup_flags |= FOLL_PIN; > > return __get_user_pages_remote(mm, start, nr_pages, gup_flags, > > pages, vmas, locked); > > @@ -2957,8 +2968,7 @@ EXPORT_SYMBOL(pin_user_pages_remote); > > * @nr_pages: number of pages from start to pin > > * @gup_flags: flags modifying lookup behaviour > > * @pages: array that receives pointers to the pages pinned. > > - * Should be at least nr_pages long. Or NULL, if caller > > - * only intends to ensure the pages are faulted in. > > + * Should be at least nr_pages long. > > * @vmas: array of pointers to vmas corresponding to each page. > > * Or NULL if the caller does not require them. > > * > > @@ -2976,6 +2986,10 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages, > > if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > > return -EINVAL; > > + /* FOLL_PIN requires pages != NULL */ > > + if (WARN_ON_ONCE(!pages)) > > + return -EINVAL; > > + > > gup_flags |= FOLL_PIN; > > return __gup_longterm_locked(current->mm, start, nr_pages, > > pages, vmas, gup_flags); > > @@ -2994,6 +3008,10 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, > > if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > > return -EINVAL; > > + /* FOLL_PIN requires pages != NULL */ > > + if (WARN_ON_ONCE(!pages)) > > + return -EINVAL; > > + > > gup_flags |= FOLL_PIN; > > return get_user_pages_unlocked(start, nr_pages, pages, gup_flags); > > } > > I hope we don't break any callers with the newly enforced !pages, but it's > the right thing to do, in order to avoid misunderstandings. > > thanks, > -- > John Hubbard > NVIDIA Let me test v2 and resend shortly. Thanks, Yury ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2022-04-21 23:17 ` Re: Yury Norov @ 2022-04-21 23:21 ` John Hubbard 0 siblings, 0 replies; 35+ messages in thread From: John Hubbard @ 2022-04-21 23:21 UTC (permalink / raw) To: Yury Norov; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel On 4/21/22 16:17, Yury Norov wrote: >> Let's add a Cc: line for Michan as well: >> >> Cc: Minchan Kim <minchan@kernel.org> > > He's in CC already, I think... > Here, I am talking about attribution in the commit log, as opposed to the email Cc. In other words, I'm suggesting that you literally add this line to the commit description: Cc: Minchan Kim <minchan@kernel.org> thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 35+ messages in thread
* (no subject)
@ 2021-08-12 9:21 Valdis Klētnieks
2021-08-12 9:42 ` SeongJae Park
0 siblings, 1 reply; 35+ messages in thread
From: Valdis Klētnieks @ 2021-08-12 9:21 UTC (permalink / raw)
To: SeongJae Park, Andrew Morton; +Cc: linux-mm, linux-kernel
In this commit:
commit fedc37448fb1be5d03e420ca7791d4286893d5ec
Author: SeongJae Park <sjpark@amazon.de>
Date: Tue Aug 10 16:55:51 2021 +1000
mm/idle_page_tracking: make PG_idle reusable
diff --git a/mm/Kconfig b/mm/Kconfig
index 504336de9a1e..d0b85dc12429 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -739,10 +739,18 @@ config DEFERRED_STRUCT_PAGE_INIT
lifetime of the system until these kthreads finish the
initialisation.
+config PAGE_IDLE_FLAG
+ bool "Add PG_idle and PG_young flags"
+ help
+ This feature adds PG_idle and PG_young flags in 'struct page'. PTE
+ Accessed bit writers can set the state of the bit in the flags to let
+ other PTE Accessed bit readers don't disturbed.
This needs to be converted to proper, or at least comprehensible, English....
^ permalink raw reply [flat|nested] 35+ messages in thread* Re: 2021-08-12 9:21 Valdis Klētnieks @ 2021-08-12 9:42 ` SeongJae Park 2021-08-12 20:19 ` Re: Andrew Morton 0 siblings, 1 reply; 35+ messages in thread From: SeongJae Park @ 2021-08-12 9:42 UTC (permalink / raw) To: Valdis Klētnieks Cc: SeongJae Park, Andrew Morton, linux-mm, linux-kernel From: SeongJae Park <sjpark@amazon.de> Hello Valdis, On Thu, 12 Aug 2021 05:21:57 -0400 "Valdis =?utf-8?Q?Kl=c4=93tnieks?=" <valdis.kletnieks@vt.edu> wrote: > In this commit: > > commit fedc37448fb1be5d03e420ca7791d4286893d5ec > Author: SeongJae Park <sjpark@amazon.de> > Date: Tue Aug 10 16:55:51 2021 +1000 > > mm/idle_page_tracking: make PG_idle reusable > > diff --git a/mm/Kconfig b/mm/Kconfig > index 504336de9a1e..d0b85dc12429 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -739,10 +739,18 @@ config DEFERRED_STRUCT_PAGE_INIT > lifetime of the system until these kthreads finish the > initialisation. > > +config PAGE_IDLE_FLAG > + bool "Add PG_idle and PG_young flags" > + help > + This feature adds PG_idle and PG_young flags in 'struct page'. PTE > + Accessed bit writers can set the state of the bit in the flags to let > + other PTE Accessed bit readers don't disturbed. > > This needs to be converted to proper, or at least comprehensible, English.... Thank you for the comment. How about below? --- a/mm/Kconfig +++ b/mm/Kconfig @@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG bool "Add PG_idle and PG_young flags" select PAGE_EXTENSION if !64BIT help - This feature adds PG_idle and PG_young flags in 'struct page'. PTE - Accessed bit writers can set the state of the bit in the flags to let - other PTE Accessed bit readers don't disturbed. + This feature adds 'PG_idle' and 'PG_young' flags in 'struct page'. + PTE Accessed bit writers can save the state of the bit in the flags + to let other PTE Accessed bit readers don't get disturbed. Thanks, SeongJae Park ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2021-08-12 9:42 ` SeongJae Park @ 2021-08-12 20:19 ` Andrew Morton 2021-08-13 8:14 ` Re: SeongJae Park 0 siblings, 1 reply; 35+ messages in thread From: Andrew Morton @ 2021-08-12 20:19 UTC (permalink / raw) To: SeongJae Park Cc: Valdis Klētnieks , SeongJae Park, linux-mm, linux-kernel On Thu, 12 Aug 2021 09:42:40 +0000 SeongJae Park <sj38.park@gmail.com> wrote: > > +config PAGE_IDLE_FLAG > > + bool "Add PG_idle and PG_young flags" > > + help > > + This feature adds PG_idle and PG_young flags in 'struct page'. PTE > > + Accessed bit writers can set the state of the bit in the flags to let > > + other PTE Accessed bit readers don't disturbed. > > > > This needs to be converted to proper, or at least comprehensible, English.... > > Thank you for the comment. > > How about below? > > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG > bool "Add PG_idle and PG_young flags" > select PAGE_EXTENSION if !64BIT > help > - This feature adds PG_idle and PG_young flags in 'struct page'. PTE > - Accessed bit writers can set the state of the bit in the flags to let > - other PTE Accessed bit readers don't disturbed. > + This feature adds 'PG_idle' and 'PG_young' flags in 'struct page'. > + PTE Accessed bit writers can save the state of the bit in the flags > + to let other PTE Accessed bit readers don't get disturbed. How about this? --- a/mm/Kconfig~mm-idle_page_tracking-make-pg_idle-reusable-fix-fix +++ a/mm/Kconfig @@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG bool "Add PG_idle and PG_young flags" select PAGE_EXTENSION if !64BIT help - This feature adds PG_idle and PG_young flags in 'struct page'. PTE - Accessed bit writers can set the state of the bit in the flags to let - other PTE Accessed bit readers don't disturbed. + This adds PG_idle and PG_young flags to 'struct page'. PTE Accessed + bit writers can set the state of the bit in the flags so that PTE + Accessed bit readers may avoid disturbance. config IDLE_PAGE_TRACKING bool "Enable idle page tracking" Also, is there any way in which we can avoid presenting this option to the user? Because most users will have real trouble understanding what this thing is for. Can we simply select it when needed, as dictated by other, higher-level config options? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2021-08-12 20:19 ` Re: Andrew Morton @ 2021-08-13 8:14 ` SeongJae Park 0 siblings, 0 replies; 35+ messages in thread From: SeongJae Park @ 2021-08-13 8:14 UTC (permalink / raw) To: Andrew Morton Cc: SeongJae Park, Valdis Klētnieks , SeongJae Park, linux-mm, linux-kernel From: SeongJae Park <sjpark@amazon.de> On Thu, 12 Aug 2021 13:19:21 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Thu, 12 Aug 2021 09:42:40 +0000 SeongJae Park <sj38.park@gmail.com> wrote: > > > > +config PAGE_IDLE_FLAG > > > + bool "Add PG_idle and PG_young flags" > > > + help > > > + This feature adds PG_idle and PG_young flags in 'struct page'. PTE > > > + Accessed bit writers can set the state of the bit in the flags to let > > > + other PTE Accessed bit readers don't disturbed. > > > > > > This needs to be converted to proper, or at least comprehensible, English.... > > > > Thank you for the comment. > > > > How about below? > > > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG > > bool "Add PG_idle and PG_young flags" > > select PAGE_EXTENSION if !64BIT > > help > > - This feature adds PG_idle and PG_young flags in 'struct page'. PTE > > - Accessed bit writers can set the state of the bit in the flags to let > > - other PTE Accessed bit readers don't disturbed. > > + This feature adds 'PG_idle' and 'PG_young' flags in 'struct page'. > > + PTE Accessed bit writers can save the state of the bit in the flags > > + to let other PTE Accessed bit readers don't get disturbed. > > How about this? > > --- a/mm/Kconfig~mm-idle_page_tracking-make-pg_idle-reusable-fix-fix > +++ a/mm/Kconfig > @@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG > bool "Add PG_idle and PG_young flags" > select PAGE_EXTENSION if !64BIT > help > - This feature adds PG_idle and PG_young flags in 'struct page'. PTE > - Accessed bit writers can set the state of the bit in the flags to let > - other PTE Accessed bit readers don't disturbed. > + This adds PG_idle and PG_young flags to 'struct page'. PTE Accessed > + bit writers can set the state of the bit in the flags so that PTE > + Accessed bit readers may avoid disturbance. > > config IDLE_PAGE_TRACKING > bool "Enable idle page tracking" So good, thank you! > > Also, is there any way in which we can avoid presenting this option to > the user? Because most users will have real trouble understanding what > this thing is for. Can we simply select it when needed, as dictated by > other, higher-level config options? I believe this is the right way to go! I sent a patch for removing the prompt of this option: https://lore.kernel.org/linux-mm/20210813081238.34705-1-sj38.park@gmail.com/ Thanks, SeongJae Park ^ permalink raw reply [flat|nested] 35+ messages in thread
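For illustration, a minimal Kconfig sketch of the direction agreed above (a sketch of the idea only, not necessarily the exact patch linked): dropping the prompt string makes PAGE_IDLE_FLAG invisible in menuconfig, and the higher-level features that need it pull it in via select.

	config PAGE_IDLE_FLAG
		bool
		select PAGE_EXTENSION if !64BIT
		help
		  This adds PG_idle and PG_young flags to 'struct page'.  PTE Accessed
		  bit writers can set the state of the bit in the flags so that PTE
		  Accessed bit readers may avoid disturbance.

	config IDLE_PAGE_TRACKING
		bool "Enable idle page tracking"
		select PAGE_IDLE_FLAG
		...

With no prompt, the option simply follows whatever selects it, so users never have to reason about it directly.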
[parent not found: <bug-200209-27@https.bugzilla.kernel.org/>]
* (no subject) [not found] <bug-200209-27@https.bugzilla.kernel.org/> @ 2018-06-28 3:48 ` Andrew Morton 2018-06-29 18:44 ` Andrey Ryabinin 0 siblings, 1 reply; 35+ messages in thread From: Andrew Morton @ 2018-06-28 3:48 UTC (permalink / raw) To: icytxw Cc: bugzilla-daemon, linux-mm, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov (switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Fri, 22 Jun 2018 23:37:27 +0000 bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=200209 > > Bug ID: 200209 > Summary: UBSAN: Undefined behaviour in mm/fadvise.c:LINE > Product: Memory Management > Version: 2.5 > Kernel Version: v4.18-rc2 > Hardware: All > OS: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Page Allocator > Assignee: akpm@linux-foundation.org > Reporter: icytxw@gmail.com > Regression: No > > Hi, > This bug was found in Linux Kernel v4.18-rc2 > > $ cat report0 > ================================================================================ > UBSAN: Undefined behaviour in mm/fadvise.c:76:10 > signed integer overflow: > 4 + 9223372036854775805 cannot be represented in type 'long long int' > CPU: 0 PID: 13477 Comm: syz-executor1 Not tainted 4.18.0-rc1 #2 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x122/0x1c8 lib/dump_stack.c:113 > ubsan_epilogue+0x12/0x86 lib/ubsan.c:159 > handle_overflow+0x1c2/0x21f lib/ubsan.c:190 > __ubsan_handle_add_overflow+0x2a/0x31 lib/ubsan.c:198 > ksys_fadvise64_64+0xbf0/0xd10 mm/fadvise.c:76 > __do_sys_fadvise64 mm/fadvise.c:198 [inline] > __se_sys_fadvise64 mm/fadvise.c:196 [inline] > __x64_sys_fadvise64+0xa9/0x120 mm/fadvise.c:196 > do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290 That overflow is deliberate: endbyte = offset + len; if (!len || endbyte < len) endbyte = -1; else endbyte--; /* inclusive */ Or is there a hole in this logic? If not, I guess ee can do this another way to keep the checker happy. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2018-06-28 3:48 ` Andrew Morton @ 2018-06-29 18:44 ` Andrey Ryabinin 0 siblings, 0 replies; 35+ messages in thread From: Andrey Ryabinin @ 2018-06-29 18:44 UTC (permalink / raw) To: Andrew Morton, icytxw Cc: bugzilla-daemon, linux-mm, Alexander Potapenko, Dmitry Vyukov On 06/28/2018 06:48 AM, Andrew Morton wrote: >> Hi, >> This bug was found in Linux Kernel v4.18-rc2 >> >> $ cat report0 >> ================================================================================ >> UBSAN: Undefined behaviour in mm/fadvise.c:76:10 >> signed integer overflow: >> 4 + 9223372036854775805 cannot be represented in type 'long long int' >> CPU: 0 PID: 13477 Comm: syz-executor1 Not tainted 4.18.0-rc1 #2 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 >> Call Trace: >> __dump_stack lib/dump_stack.c:77 [inline] >> dump_stack+0x122/0x1c8 lib/dump_stack.c:113 >> ubsan_epilogue+0x12/0x86 lib/ubsan.c:159 >> handle_overflow+0x1c2/0x21f lib/ubsan.c:190 >> __ubsan_handle_add_overflow+0x2a/0x31 lib/ubsan.c:198 >> ksys_fadvise64_64+0xbf0/0xd10 mm/fadvise.c:76 >> __do_sys_fadvise64 mm/fadvise.c:198 [inline] >> __se_sys_fadvise64 mm/fadvise.c:196 [inline] >> __x64_sys_fadvise64+0xa9/0x120 mm/fadvise.c:196 >> do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290 > > That overflow is deliberate: > > endbyte = offset + len; > if (!len || endbyte < len) > endbyte = -1; > else > endbyte--; /* inclusive */ > > Or is there a hole in this logic? > > If not, I guess ee can do this another way to keep the checker happy. It should be enough to make overflow unsigned. Unsigned overflow is defined by the C standard. ^ permalink raw reply [flat|nested] 35+ messages in thread
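For illustration, a minimal sketch of the fix Andrey suggests, using the same loff_t offset and len shown above (a sketch only, not necessarily the change that was committed): do the addition as u64, where wrap-around is well defined, and keep the existing wrap check.

	loff_t endbyte;

	/*
	 * The unsigned addition wraps instead of overflowing, so UBSAN stays
	 * silent, and the existing "endbyte < len" test still catches the
	 * wrapped case (e.g. 9223372036854775805 + 4 from the report).
	 */
	endbyte = (u64)offset + (u64)len;
	if (!len || endbyte < len)
		endbyte = -1;		/* to end of file */
	else
		endbyte--;		/* inclusive */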
* Re: [PATCH 3/3] mm: list_lru: Add lock_irq member to __list_lru_init() @ 2018-06-24 20:09 Vladimir Davydov 2018-07-03 14:52 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 35+ messages in thread From: Vladimir Davydov @ 2018-06-24 20:09 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: linux-mm, tglx, Andrew Morton On Fri, Jun 22, 2018 at 05:12:21PM +0200, Sebastian Andrzej Siewior wrote: > scan_shadow_nodes() is the only user of __list_lru_walk_one() which > disables interrupts before invoking it. The reason is that nlru->lock is > nesting inside IRQ-safe i_pages lock. Some functions unconditionally > acquire the lock with the _irq() suffix. > > __list_lru_walk_one() can't acquire the lock unconditionally with _irq() > suffix because it might invoke a callback which unlocks the nlru->lock > and invokes a sleeping function without enabling interrupts. > > Add an argument to __list_lru_init() which identifies wheather the > nlru->lock needs to be acquired with disabling interrupts or without. > > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > --- > include/linux/list_lru.h | 12 ++++++++---- > mm/list_lru.c | 14 ++++++++++---- > mm/workingset.c | 12 ++++-------- > 3 files changed, 22 insertions(+), 16 deletions(-) > > diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h > index 96def9d15b1b..c2161c3a1809 100644 > --- a/include/linux/list_lru.h > +++ b/include/linux/list_lru.h > @@ -51,18 +51,22 @@ struct list_lru_node { > > struct list_lru { > struct list_lru_node *node; > + bool lock_irq; > #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) > struct list_head list; > #endif > }; TBO I don't like this patch, because the new member of struct list_lru, lock_irq, has rather obscure meaning IMHO: it makes list_lru_walk disable irq before taking lru_lock, but at the same time list_lru_add and list_lru_del never do that, no matter whether lock_irq is true or false. That is, if a user of struct list_lru sets this flag, he's supposed to disable irq for list_lru_add/del by himself (mm/workingset does that). IMHO the code of mm/workingset is clear as it is. Since it is the only place where this flag is used, I'd rather leave it as is. 
> > void list_lru_destroy(struct list_lru *lru); > -int __list_lru_init(struct list_lru *lru, bool memcg_aware, > +int __list_lru_init(struct list_lru *lru, bool memcg_aware, bool lock_irq, > struct lock_class_key *key); > > -#define list_lru_init(lru) __list_lru_init((lru), false, NULL) > -#define list_lru_init_key(lru, key) __list_lru_init((lru), false, (key)) > -#define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL) > +#define list_lru_init(lru) __list_lru_init((lru), false, false, \ > + NULL) > +#define list_lru_init_key(lru, key) __list_lru_init((lru), false, false, \ > + (key)) > +#define list_lru_init_memcg(lru) __list_lru_init((lru), true, false, \ > + NULL) > > int memcg_update_all_list_lrus(int num_memcgs); > void memcg_drain_all_list_lrus(int src_idx, int dst_idx); > diff --git a/mm/list_lru.c b/mm/list_lru.c > index fcfb6c89ed47..1c49d48078e4 100644 > --- a/mm/list_lru.c > +++ b/mm/list_lru.c > @@ -204,7 +204,10 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx, > struct list_head *item, *n; > unsigned long isolated = 0; > > - spin_lock(&nlru->lock); > + if (lru->lock_irq) > + spin_lock_irq(&nlru->lock); > + else > + spin_lock(&nlru->lock); > l = list_lru_from_memcg_idx(nlru, memcg_idx); > restart: > list_for_each_safe(item, n, &l->list) { > @@ -251,7 +254,10 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx, > } > } > > - spin_unlock(&nlru->lock); > + if (lru->lock_irq) > + spin_unlock_irq(&nlru->lock); > + else > + spin_unlock(&nlru->lock); > return isolated; > } > > @@ -553,7 +559,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru) > } > #endif /* CONFIG_MEMCG && !CONFIG_SLOB */ > > -int __list_lru_init(struct list_lru *lru, bool memcg_aware, > +int __list_lru_init(struct list_lru *lru, bool memcg_aware, bool lock_irq, > struct lock_class_key *key) > { > int i; > @@ -580,7 +586,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware, > lru->node = NULL; > goto out; > } > - > + lru->lock_irq = lock_irq; > list_lru_register(lru); > out: > memcg_put_cache_ids(); > diff --git a/mm/workingset.c b/mm/workingset.c > index 529480c21f93..23ce00f48212 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -480,13 +480,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, > static unsigned long scan_shadow_nodes(struct shrinker *shrinker, > struct shrink_control *sc) > { > - unsigned long ret; > - > - /* list_lru lock nests inside the IRQ-safe i_pages lock */ > - local_irq_disable(); > - ret = list_lru_shrink_walk(&shadow_nodes, sc, shadow_lru_isolate, NULL); > - local_irq_enable(); > - return ret; > + return list_lru_shrink_walk(&shadow_nodes, sc, shadow_lru_isolate, > + NULL); > } > > static struct shrinker workingset_shadow_shrinker = { > @@ -523,7 +518,8 @@ static int __init workingset_init(void) > pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n", > timestamp_bits, max_order, bucket_order); > > - ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key); > + /* list_lru lock nests inside the IRQ-safe i_pages lock */ > + ret = __list_lru_init(&shadow_nodes, true, true, &shadow_nodes_key); > if (ret) > goto err; > ret = register_shrinker(&workingset_shadow_shrinker); ^ permalink raw reply [flat|nested] 35+ messages in thread
* (no subject) 2018-06-24 20:09 [PATCH 3/3] mm: list_lru: Add lock_irq member to __list_lru_init() Vladimir Davydov @ 2018-07-03 14:52 ` Sebastian Andrzej Siewior 2018-07-03 21:14 ` Andrew Morton 0 siblings, 1 reply; 35+ messages in thread From: Sebastian Andrzej Siewior @ 2018-07-03 14:52 UTC (permalink / raw) To: Vladimir Davydov; +Cc: linux-mm, tglx, Andrew Morton My interpretation of the situation is that Vladimir Davydov is fine with patches #1 and #2 of the series [0] but dislikes the irq argument and struct member. It has been suggested to use list_lru_shrink_walk_irq() instead of the approach I took in "mm: list_lru: Add lock_irq member to __list_lru_init()". This series is based on the former two patches and introduces list_lru_shrink_walk_irq() (and makes the third patch of the series obsolete). In patches 1-3 I tried a tiny cleanup so that the different locking (spin_lock() vs spin_lock_irq()) is simply lifted to the caller of the function. [0] The patch mm: workingset: remove local_irq_disable() from count_shadow_nodes() and mm: workingset: make shadow_lru_isolate() use locking suffix Sebastian ^ permalink raw reply [flat|nested] 35+ messages in thread

* Re: 2018-07-03 14:52 ` Sebastian Andrzej Siewior @ 2018-07-03 21:14 ` Andrew Morton 2018-07-03 21:44 ` Re: Sebastian Andrzej Siewior 0 siblings, 1 reply; 35+ messages in thread From: Andrew Morton @ 2018-07-03 21:14 UTC (permalink / raw) To: Sebastian Andrzej Siewior, Vladimir Davydov, linux-mm, tglx, Kirill Tkhai > Reply-To: "[PATCH 0/4] mm/list_lru": add.list_lru_shrink_walk_irq@mail.linuxfoundation.org.and.use.it () Well that's messed up. On Tue, 3 Jul 2018 16:52:31 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > My intepretation of situtation is that Vladimir Davydon is fine patch #1 > and #2 of the series [0] but dislikes the irq argument and struct > member. It has been suggested to use list_lru_shrink_walk_irq() instead > the approach I went on in "mm: list_lru: Add lock_irq member to > __list_lru_init()". > > This series is based on the former two patches and introduces > list_lru_shrink_walk_irq() (and makes the third patch of series > obsolete). > In patch 1-3 I tried a tiny cleanup so the different locking > (spin_lock() vs spin_lock_irq()) is simply lifted to the caller of the > function. > > [0] The patch > mm: workingset: remove local_irq_disable() from count_shadow_nodes() > and > mm: workingset: make shadow_lru_isolate() use locking suffix > This isn't a very informative [0/n] changelog. Some overall summary of the patchset's objective, behaviour, use cases, testing results, etc. I'm seeing significant conflicts with Kirill's "Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))" series, which I merged eight milliseconds ago. Kirill's patchset is large but fairly straightforward so I expect it's good for 4.18. So I suggest we leave things a week or more then please take a look at redoing this patchset on top of that work? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2018-07-03 21:14 ` Andrew Morton @ 2018-07-03 21:44 ` Sebastian Andrzej Siewior 2018-07-04 14:44 ` Re: Vladimir Davydov 0 siblings, 1 reply; 35+ messages in thread From: Sebastian Andrzej Siewior @ 2018-07-03 21:44 UTC (permalink / raw) To: Andrew Morton; +Cc: Vladimir Davydov, linux-mm, tglx, Kirill Tkhai On 2018-07-03 14:14:29 [-0700], Andrew Morton wrote: > > > Reply-To: "[PATCH 0/4] mm/list_lru": add.list_lru_shrink_walk_irq@mail.linuxfoundation.org.and.use.it () > > Well that's messed up. indeed it is. This should get into Subject: > On Tue, 3 Jul 2018 16:52:31 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > > My intepretation of situtation is that Vladimir Davydon is fine patch #1 > > and #2 of the series [0] but dislikes the irq argument and struct > > member. It has been suggested to use list_lru_shrink_walk_irq() instead > > the approach I went on in "mm: list_lru: Add lock_irq member to > > __list_lru_init()". > > > > This series is based on the former two patches and introduces > > list_lru_shrink_walk_irq() (and makes the third patch of series > > obsolete). > > In patch 1-3 I tried a tiny cleanup so the different locking > > (spin_lock() vs spin_lock_irq()) is simply lifted to the caller of the > > function. > > > > [0] The patch > > mm: workingset: remove local_irq_disable() from count_shadow_nodes() > > and > > mm: workingset: make shadow_lru_isolate() use locking suffix > > > > This isn't a very informative [0/n] changelog. Some overall summary of > the patchset's objective, behaviour, use cases, testing results, etc. The patches should be threaded as a reply to 3/3 of the series so I assumed it was enough. And while Vladimir complained about 2/3 and 3/3 the discussion went on in 2/3 where he suggested to go on with the _irq function. And testing, well with and without RT the function was invoked as part of swapping (allocating memory until OOM) without complains. > I'm seeing significant conflicts with Kirill's "Improve shrink_slab() > scalability (old complexity was O(n^2), new is O(n))" series, which I > merged eight milliseconds ago. Kirill's patchset is large but fairly > straightforward so I expect it's good for 4.18. So I suggest we leave > things a week or more then please take a look at redoing this patchset > on top of that work? If Vladimir is okay with to redo and nobody else complains then I could rebase these four patches on top of your tree next week. Sebastian ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 2018-07-03 21:44 ` Re: Sebastian Andrzej Siewior @ 2018-07-04 14:44 ` Vladimir Davydov 0 siblings, 0 replies; 35+ messages in thread From: Vladimir Davydov @ 2018-07-04 14:44 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: Andrew Morton, linux-mm, tglx, Kirill Tkhai On Tue, Jul 03, 2018 at 11:44:29PM +0200, Sebastian Andrzej Siewior wrote: > On 2018-07-03 14:14:29 [-0700], Andrew Morton wrote: > > > > > Reply-To: "[PATCH 0/4] mm/list_lru": add.list_lru_shrink_walk_irq@mail.linuxfoundation.org.and.use.it () > > > > Well that's messed up. > > indeed it is. This should get into Subject: > > > On Tue, 3 Jul 2018 16:52:31 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > > > > My intepretation of situtation is that Vladimir Davydon is fine patch #1 > > > and #2 of the series [0] but dislikes the irq argument and struct > > > member. It has been suggested to use list_lru_shrink_walk_irq() instead > > > the approach I went on in "mm: list_lru: Add lock_irq member to > > > __list_lru_init()". > > > > > > This series is based on the former two patches and introduces > > > list_lru_shrink_walk_irq() (and makes the third patch of series > > > obsolete). > > > In patch 1-3 I tried a tiny cleanup so the different locking > > > (spin_lock() vs spin_lock_irq()) is simply lifted to the caller of the > > > function. > > > > > > [0] The patch > > > mm: workingset: remove local_irq_disable() from count_shadow_nodes() > > > and > > > mm: workingset: make shadow_lru_isolate() use locking suffix > > > > > > > This isn't a very informative [0/n] changelog. Some overall summary of > > the patchset's objective, behaviour, use cases, testing results, etc. > > The patches should be threaded as a reply to 3/3 of the series so I > assumed it was enough. And while Vladimir complained about 2/3 and 3/3 > the discussion went on in 2/3 where he suggested to go on with the _irq > function. And testing, well with and without RT the function was invoked > as part of swapping (allocating memory until OOM) without complains. > > > I'm seeing significant conflicts with Kirill's "Improve shrink_slab() > > scalability (old complexity was O(n^2), new is O(n))" series, which I > > merged eight milliseconds ago. Kirill's patchset is large but fairly > > straightforward so I expect it's good for 4.18. So I suggest we leave > > things a week or more then please take a look at redoing this patchset > > on top of that work? > > If Vladimir is okay with to redo and nobody else complains then I could > rebase these four patches on top of your tree next week. IMHO this approach is more straightforward than the one with the per list_lru flag. For all patches, Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> ^ permalink raw reply [flat|nested] 35+ messages in thread
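To make the agreed direction concrete, here is a sketch of what the workingset caller looks like once an explicit _irq walker exists (names follow the thread; the code that eventually landed may differ). The _irq variant takes nlru->lock with spin_lock_irq() internally, so the spin_lock() vs spin_lock_irq() decision moves to the call site instead of living in a per-lru flag, and list_lru_add()/list_lru_del() are left untouched.

	/* mm/workingset.c, assuming list_lru_shrink_walk_irq() is available */
	static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
					       struct shrink_control *sc)
	{
		/*
		 * nlru->lock nests inside the IRQ-safe i_pages lock, so pick
		 * the _irq walker instead of open-coding local_irq_disable().
		 */
		return list_lru_shrink_walk_irq(&shadow_nodes, sc,
						shadow_lru_isolate, NULL);
	}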
* RE: @ 2007-04-03 18:41 Royal VIP Casino 0 siblings, 0 replies; 35+ messages in thread From: Royal VIP Casino @ 2007-04-03 18:41 UTC (permalink / raw) To: linux-mm How about a hulking cock? Interested? Check Penis Enlarge Patch out. http://www.zcukchud.com/ With Penis Enlarge Patch your dick will be growing<BR>so fast you wont be able to record the exact size. ------------------------ disrespect for the girl. See how it looks in print -- I translate this froma conversation in one of the best of the German Sunday-school books: Gretchen. Wilhelm, where is the turnip? Wilhelm. She has gone to the kitchen. Gretchen. Where is the accomplished and beautiful English maiden? describe? And observe the strongest of the several German equ ^ permalink raw reply [flat|nested] 35+ messages in thread
[parent not found: <F265RQAOCop3wyv9kI3000143b1@hotmail.com>]
* Re: [not found] <F265RQAOCop3wyv9kI3000143b1@hotmail.com> @ 2001-10-08 11:48 ` Joseph A Knapka 0 siblings, 0 replies; 35+ messages in thread From: Joseph A Knapka @ 2001-10-08 11:48 UTC (permalink / raw) To: Gavin Dolling; +Cc: linux-mm Hi Gavin, [Forwarded to linux-mm, since those guys will be able to answer your questions much more completely. Maybe someone has already solved your problem.] Gavin Dolling wrote: > > Your VM page has helped me immensely. I'm after so advice though about the > following. No problem if you are too busy, etc. your site has already helped > me a great deal so just hit that delete key now ... > > I have an embedded linux system running out of 8M of RAM. It has no backing > store and uses a RAM disk as its FS. It boots from a flash chip - at boot > time things are uncompressed into RAM. Running an MTD type system with a > flash FS is not an option. > > Memory is very tight and it is unfortunate that the binaries effectively > appear twice in memory. They are in the RAM FS in full and also get paged > into memory. There is a lot of paging going on which I believe is drowning > the system. > > We have no swap file (that would obviously be stupid) but a large number of > buffers (i.e. a lot of dirty pages). The application is networking stuff so > it is supposed to perform at line rate - the paging appears to be preventing > this. > > What I wish to do is to page the user space binaries into the page cache, > mark them so they are never evicted. Delete them from the RAMFS and recover > the memory. This should be the most optimum way of running the system - in > terms of memory usage anyway. > > I am planning to hack filemap.c. Going to use page_cache_read on each binary > and then remove from RAM FS. If the page is not in use I will have to make > sure that deleting the file does not result in the page being evicted. > Basically some more hacking required. I am also concerned about the inode > associated with the page, this is going to cause me pain I think? > > I am going to try this on my PC first. Going to try and force 'cat' to be > fully paged in and then rename it. I should still be able to use cat at the > command line. I don't think this will work as a test case. The address_space mappings are based on inode identity, and since you won't actually have a "cat" program on your filesystem, the inode won't be found, so the kernel will not have a way of knowing that the cached pages are the right ones. You'd have to leave at least enough of the filesystem intact for the kernel to be able to map the program name to the correct inode. You might solve this by pinning the inode buffers in main memory before reclaiming the RAMFS pages, but that's pure speculation on my part. > So basically: > > a) Is this feasible? See below. > b) When I delete the binary can I prevent it from being evicted from the > page cache? > (I note with interest that if I mv my /usr/bin/emacs whilst emacs is running > e.g. $ emacs &; mv /usr/bin/emacs /usr/bin/emacs2 > it allows me to do it and what's more nothing bad happens. This tells me I > do not understand enough of what is going on - I would have expected this to > fail in some manner). The disk inode for a moved or deleted file (and the file's disk blocks) don't get freed until all references to the inode are gone. If the kernel has the file open (eg due to mmap()), the file can still be used for paging until it's unmapped by all the processes that are using it. (This is another reason your test case above might be misleading.) 
> c) I must have to leave something in the RAMFS such that the instance of the > binary still exists even if not its whole content. > > d) Am I insane to try this? (Why would be more useful than just a yes ;-) ) I don't know. This is a deeper hack than any I've contemplated. However, I'm tempted to say that it would be easier to figure out a way to directly add the RAMFS pages to the page cache, and thus use a single page simultaneously as a cache page and an FS page. I don't know how hard that's going to be, but I think it might be easier than trying to yank the FS out from under an in-use mapping. Cheers, -- Joe # "You know how many remote castles there are along the # gorges? You can't MOVE for remote castles!" - Lu Tze re. Uberwald # Linux MM docs: http://home.earthlink.net/~jknapka/linux-mm/vmoutline.html -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 35+ messages in thread
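As a small userspace illustration of the inode-lifetime point above (a self-contained demo relying on POSIX semantics, not kernel code): a file that has been mmap()ed stays usable for paging after it is unlinked, because the inode and its blocks are only released once the last reference is gone.

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		const char msg[] = "still readable after unlink\n";
		int fd = open("demo.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);

		if (fd < 0 || write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg))
			return 1;

		/* Map the file, then drop both the directory entry and the fd. */
		char *p = mmap(NULL, sizeof(msg), PROT_READ, MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		unlink("demo.tmp");
		close(fd);

		/* The mapping still pages in from the now nameless inode. */
		fputs(p, stdout);
		munmap(p, sizeof(msg));
		return 0;
	}

This is the same reason the emacs rename mentioned above "just works": the running image keeps paging from the original inode regardless of the name change.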
Thread overview: 35+ messages
2022-08-26 22:03 Zach O'Keefe
2022-08-26 22:03 ` [PATCH mm-unstable v2 1/9] mm/shmem: add flag to enforce shmem THP in hugepage_vma_check() Zach O'Keefe
2022-08-26 22:03 ` [PATCH mm-unstable v2 2/9] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds Zach O'Keefe
2022-08-26 22:03 ` [PATCH mm-unstable v2 3/9] mm/madvise: add file and shmem support to MADV_COLLAPSE Zach O'Keefe
2022-08-26 22:03 ` [PATCH mm-unstable v2 4/9] mm/khugepaged: add tracepoint to hpage_collapse_scan_file() Zach O'Keefe
2022-08-26 22:03 ` [PATCH mm-unstable v2 5/9] selftests/vm: dedup THP helpers Zach O'Keefe
2022-08-26 22:03 ` [PATCH mm-unstable v2 6/9] selftests/vm: modularize thp collapse memory operations Zach O'Keefe
2022-08-26 22:03 ` [PATCH mm-unstable v2 7/9] selftests/vm: add thp collapse file and tmpfs testing Zach O'Keefe
2022-08-26 22:03 ` [PATCH mm-unstable v2 8/9] selftests/vm: add thp collapse shmem testing Zach O'Keefe
2022-08-26 22:03 ` [PATCH mm-unstable v2 9/9] selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory Zach O'Keefe
2022-08-31 21:47 ` Yang Shi
2022-09-01 0:24 ` Re: Zach O'Keefe
-- strict thread matches above, loose matches on Subject: below --
2025-08-12 13:34 Baoquan He
2025-08-12 13:49 ` Baoquan He
2025-02-08 8:19 Re: Director Inspectorate
2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox (Oracle)
2024-03-06 13:42 ` Ryan Roberts
2024-03-06 16:09 ` Matthew Wilcox
2024-03-06 16:19 ` Ryan Roberts
2024-03-06 17:41 ` Ryan Roberts
2024-03-06 18:41 ` Zi Yan
2024-03-06 19:55 ` Matthew Wilcox
2024-03-06 21:55 ` Matthew Wilcox
2024-03-07 8:56 ` Ryan Roberts
2024-03-07 13:50 ` Yin, Fengwei
2024-03-07 14:05 ` Re: Matthew Wilcox
2024-03-07 15:24 ` Re: Ryan Roberts
2024-03-07 16:24 ` Re: Ryan Roberts
2024-03-07 23:02 ` Re: Matthew Wilcox
2024-03-08 1:06 ` Re: Yin, Fengwei
2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
2024-01-22 10:13 ` Andi Kleen
2024-01-22 11:53 ` Dave Chinner
2023-05-11 12:58 Ryan Roberts
2023-05-11 13:13 ` Ryan Roberts
[not found] <20220421164138.1250943-1-yury.norov@gmail.com>
2022-04-21 23:04 ` Re: John Hubbard
2022-04-21 23:09 ` Re: John Hubbard
2022-04-21 23:17 ` Re: Yury Norov
2022-04-21 23:21 ` Re: John Hubbard
2021-08-12 9:21 Valdis Klētnieks
2021-08-12 9:42 ` SeongJae Park
2021-08-12 20:19 ` Re: Andrew Morton
2021-08-13 8:14 ` Re: SeongJae Park
[not found] <bug-200209-27@https.bugzilla.kernel.org/>
2018-06-28 3:48 ` Andrew Morton
2018-06-29 18:44 ` Andrey Ryabinin
2018-06-24 20:09 [PATCH 3/3] mm: list_lru: Add lock_irq member to __list_lru_init() Vladimir Davydov
2018-07-03 14:52 ` Sebastian Andrzej Siewior
2018-07-03 21:14 ` Andrew Morton
2018-07-03 21:44 ` Re: Sebastian Andrzej Siewior
2018-07-04 14:44 ` Re: Vladimir Davydov
2007-04-03 18:41 Royal VIP Casino
[not found] <F265RQAOCop3wyv9kI3000143b1@hotmail.com>
2001-10-08 11:48 ` Joseph A Knapka