linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC] [PATCH 0/3] xfs: use large folios for buffers
@ 2024-01-18 22:19 Dave Chinner
  2024-01-18 22:19 ` [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch Dave Chinner
                   ` (4 more replies)
  0 siblings, 5 replies; 44+ messages in thread
From: Dave Chinner @ 2024-01-18 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy, linux-mm

The XFS buffer cache supports metadata buffers up to 64kB, and it does so by
aggregating multiple pages into a single contiguous memory region using
vmapping. This is expensive (both the setup and the runtime TLB mapping cost),
and would be unnecessary if we could allocate large contiguous memory regions
for the buffers in the first place.

Enter multi-page folios.

This patchset converts the buffer cache to use the folio API, then enhances it
to optimisitically use large folios where possible. It retains the old "vmap an
array of single page folios" functionality as a fallback when large folio
allocation fails. This means that, like page cache support for large folios, we
aren't dependent on large folio allocation succeeding all the time.

This relegates the single page array allocation mechanism to the "slow path"
that we don't have to care so much about performance of this path anymore. This
might allow us to simplify it a bit in future.

One of the issues with the folio conversion is that we use a couple of APIs that
take struct page ** (i.e. pointers to page pointer arrays) and there aren't
folio counterparts. These are the bulk page allocator and vm_map_ram(). In the
cases where they are used, we cast &bp->b_folios[] to (struct page **) knowing
that this array will only contain single page folios and that single page folios
and struct page are the same structure and so have the same address. This is a
bit of a hack (hence the RFC) but I'm not sure that it's worth adding folio
versions of these interfaces right now. We don't need to use the bulk page
allocator so much any more, because that's now a slow path and we could probably
just call folio_alloc() in a loop like we used to. What to do about vm_map_ram()
is a little less clear....

The other issue I tripped over in doing this conversion is that the
discontiguous buffer straddling code in the buf log item dirty region tracking
is broken. We don't actually exercise that code on existing configurations, and
I tripped over it when tracking down a bug in the folio conversion. I fixed it
and short-circuted the check for contiguous buffers, but that didn't fix the
failure I was seeing (which was not handling bp->b_offset and large folios
properly when building bios).

Apart from those issues, the conversion and enhancement is relatively straight
forward.  It passes fstests on both 512 and 4096 byte sector size storage (512
byte sectors exercise the XBF_KMEM path which has non-zero bp->b_offset values)
and doesn't appear to cause any problems with large directory buffers, though I
haven't done any real testing on those yet. Large folio allocations are
definitely being exercised, though, as all the inode cluster buffers are 16kB on
a 512 byte inode V5 filesystem.

Thoughts, comments, etc?

Note: this patchset is on top of the NOFS removal patchset I sent a
few days ago. That can be pulled from this git branch:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-kmem-cleanup



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch
  2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
@ 2024-01-18 22:19 ` Dave Chinner
  2024-01-22  6:41   ` Christoph Hellwig
  2024-01-18 22:19 ` [PATCH 2/3] xfs: use folios in the buffer cache Dave Chinner
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2024-01-18 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy, linux-mm

From: Dave Chinner <dchinner@redhat.com>

We never log large contiguous regions of unmapped buffers, so this
bug is never triggered by the current code. However, the slowpath
for formatting buffer straddling regions is broken.

That is, the size and shape of the log vector calculated across a
straddle does not match how the formatting code formats a straddle.
This results in a log vector with an uninitialised iovec and this
causes a crash when xlog_write_full() goes to copy the iovec into
the journal.

Whilst touching this code, don't bother checking mapped or single
folio buffers for discontiguous regions because they don't have
them. This significantly reduces the overhead of this check when
logging large buffers as calling xfs_buf_offset() is not free and
it occurs a *lot* in those cases.

Fixes: 929f8b0deb83 ("xfs: optimise xfs_buf_item_size/format for contiguous regions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 43031842341a..83a81cb52d8e 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -56,6 +56,10 @@ xfs_buf_log_format_size(
 			(blfp->blf_map_size * sizeof(blfp->blf_data_map[0]));
 }
 
+/*
+ * We only have to worry about discontiguous buffer range straddling on unmapped
+ * buffers. Everything else will have a contiguous data region we can copy from.
+ */
 static inline bool
 xfs_buf_item_straddle(
 	struct xfs_buf		*bp,
@@ -65,6 +69,9 @@ xfs_buf_item_straddle(
 {
 	void			*first, *last;
 
+	if (bp->b_page_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
+		return false;
+
 	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
 	last = xfs_buf_offset(bp,
 			offset + ((first_bit + nbits) << XFS_BLF_SHIFT));
@@ -132,11 +139,13 @@ xfs_buf_item_size_segment(
 	return;
 
 slow_scan:
-	/* Count the first bit we jumped out of the above loop from */
-	(*nvecs)++;
-	*nbytes += XFS_BLF_CHUNK;
+	ASSERT(bp->b_addr == NULL);
 	last_bit = first_bit;
+	nbits = 1;
 	while (last_bit != -1) {
+
+		*nbytes += XFS_BLF_CHUNK;
+
 		/*
 		 * This takes the bit number to start looking from and
 		 * returns the next set bit from there.  It returns -1
@@ -151,6 +160,8 @@ xfs_buf_item_size_segment(
 		 * else keep scanning the current set of bits.
 		 */
 		if (next_bit == -1) {
+			if (first_bit != last_bit)
+				(*nvecs)++;
 			break;
 		} else if (next_bit != last_bit + 1 ||
 		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
@@ -162,7 +173,6 @@ xfs_buf_item_size_segment(
 			last_bit++;
 			nbits++;
 		}
-		*nbytes += XFS_BLF_CHUNK;
 	}
 }
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
  2024-01-18 22:19 ` [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch Dave Chinner
@ 2024-01-18 22:19 ` Dave Chinner
  2024-01-19  1:26   ` Darrick J. Wong
  2024-01-18 22:19 ` [PATCH 3/3] xfs: convert buffer cache to use high order folios Dave Chinner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2024-01-18 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Convert the use of struct pages to struct folio everywhere. This
is just direct API conversion, no actual logic of code changes
should result.

Note: this conversion currently assumes only single page folios are
allocated, and because some of the MM interfaces we use take
pointers to arrays of struct pages, the address of single page
folios and struct pages are the same. e.g alloc_pages_bulk_array(),
vm_map_ram(), etc.


Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c      | 127 +++++++++++++++++++++---------------------
 fs/xfs/xfs_buf.h      |  14 ++---
 fs/xfs/xfs_buf_item.c |   2 +-
 fs/xfs/xfs_linux.h    |   8 +++
 4 files changed, 80 insertions(+), 71 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 08f2fbc04db5..15907e92d0d3 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -60,25 +60,25 @@ xfs_buf_submit(
 	return __xfs_buf_submit(bp, !(bp->b_flags & XBF_ASYNC));
 }
 
+/*
+ * Return true if the buffer is vmapped.
+ *
+ * b_addr is null if the buffer is not mapped, but the code is clever enough to
+ * know it doesn't have to map a single folio, so the check has to be both for
+ * b_addr and bp->b_folio_count > 1.
+ */
 static inline int
 xfs_buf_is_vmapped(
 	struct xfs_buf	*bp)
 {
-	/*
-	 * Return true if the buffer is vmapped.
-	 *
-	 * b_addr is null if the buffer is not mapped, but the code is clever
-	 * enough to know it doesn't have to map a single page, so the check has
-	 * to be both for b_addr and bp->b_page_count > 1.
-	 */
-	return bp->b_addr && bp->b_page_count > 1;
+	return bp->b_addr && bp->b_folio_count > 1;
 }
 
 static inline int
 xfs_buf_vmap_len(
 	struct xfs_buf	*bp)
 {
-	return (bp->b_page_count * PAGE_SIZE);
+	return (bp->b_folio_count * PAGE_SIZE);
 }
 
 /*
@@ -197,7 +197,7 @@ xfs_buf_get_maps(
 }
 
 /*
- *	Frees b_pages if it was allocated.
+ *	Frees b_maps if it was allocated.
  */
 static void
 xfs_buf_free_maps(
@@ -273,26 +273,26 @@ _xfs_buf_alloc(
 }
 
 static void
-xfs_buf_free_pages(
+xfs_buf_free_folios(
 	struct xfs_buf	*bp)
 {
 	uint		i;
 
-	ASSERT(bp->b_flags & _XBF_PAGES);
+	ASSERT(bp->b_flags & _XBF_FOLIOS);
 
 	if (xfs_buf_is_vmapped(bp))
-		vm_unmap_ram(bp->b_addr, bp->b_page_count);
+		vm_unmap_ram(bp->b_addr, bp->b_folio_count);
 
-	for (i = 0; i < bp->b_page_count; i++) {
-		if (bp->b_pages[i])
-			__free_page(bp->b_pages[i]);
+	for (i = 0; i < bp->b_folio_count; i++) {
+		if (bp->b_folios[i])
+			__folio_put(bp->b_folios[i]);
 	}
-	mm_account_reclaimed_pages(bp->b_page_count);
+	mm_account_reclaimed_pages(bp->b_folio_count);
 
-	if (bp->b_pages != bp->b_page_array)
-		kfree(bp->b_pages);
-	bp->b_pages = NULL;
-	bp->b_flags &= ~_XBF_PAGES;
+	if (bp->b_folios != bp->b_folio_array)
+		kfree(bp->b_folios);
+	bp->b_folios = NULL;
+	bp->b_flags &= ~_XBF_FOLIOS;
 }
 
 static void
@@ -313,8 +313,8 @@ xfs_buf_free(
 
 	ASSERT(list_empty(&bp->b_lru));
 
-	if (bp->b_flags & _XBF_PAGES)
-		xfs_buf_free_pages(bp);
+	if (bp->b_flags & _XBF_FOLIOS)
+		xfs_buf_free_folios(bp);
 	else if (bp->b_flags & _XBF_KMEM)
 		kfree(bp->b_addr);
 
@@ -345,15 +345,15 @@ xfs_buf_alloc_kmem(
 		return -ENOMEM;
 	}
 	bp->b_offset = offset_in_page(bp->b_addr);
-	bp->b_pages = bp->b_page_array;
-	bp->b_pages[0] = kmem_to_page(bp->b_addr);
-	bp->b_page_count = 1;
+	bp->b_folios = bp->b_folio_array;
+	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
+	bp->b_folio_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
 }
 
 static int
-xfs_buf_alloc_pages(
+xfs_buf_alloc_folios(
 	struct xfs_buf	*bp,
 	xfs_buf_flags_t	flags)
 {
@@ -364,16 +364,16 @@ xfs_buf_alloc_pages(
 		gfp_mask |= __GFP_NORETRY;
 
 	/* Make sure that we have a page list */
-	bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
-	if (bp->b_page_count <= XB_PAGES) {
-		bp->b_pages = bp->b_page_array;
+	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
+	if (bp->b_folio_count <= XB_FOLIOS) {
+		bp->b_folios = bp->b_folio_array;
 	} else {
-		bp->b_pages = kzalloc(sizeof(struct page *) * bp->b_page_count,
+		bp->b_folios = kzalloc(sizeof(struct folio *) * bp->b_folio_count,
 					gfp_mask);
-		if (!bp->b_pages)
+		if (!bp->b_folios)
 			return -ENOMEM;
 	}
-	bp->b_flags |= _XBF_PAGES;
+	bp->b_flags |= _XBF_FOLIOS;
 
 	/* Assure zeroed buffer for non-read cases. */
 	if (!(flags & XBF_READ))
@@ -387,9 +387,9 @@ xfs_buf_alloc_pages(
 	for (;;) {
 		long	last = filled;
 
-		filled = alloc_pages_bulk_array(gfp_mask, bp->b_page_count,
-						bp->b_pages);
-		if (filled == bp->b_page_count) {
+		filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
+						(struct page **)bp->b_folios);
+		if (filled == bp->b_folio_count) {
 			XFS_STATS_INC(bp->b_mount, xb_page_found);
 			break;
 		}
@@ -398,7 +398,7 @@ xfs_buf_alloc_pages(
 			continue;
 
 		if (flags & XBF_READ_AHEAD) {
-			xfs_buf_free_pages(bp);
+			xfs_buf_free_folios(bp);
 			return -ENOMEM;
 		}
 
@@ -412,14 +412,14 @@ xfs_buf_alloc_pages(
  *	Map buffer into kernel address-space if necessary.
  */
 STATIC int
-_xfs_buf_map_pages(
+_xfs_buf_map_folios(
 	struct xfs_buf		*bp,
 	xfs_buf_flags_t		flags)
 {
-	ASSERT(bp->b_flags & _XBF_PAGES);
-	if (bp->b_page_count == 1) {
+	ASSERT(bp->b_flags & _XBF_FOLIOS);
+	if (bp->b_folio_count == 1) {
 		/* A single page buffer is always mappable */
-		bp->b_addr = page_address(bp->b_pages[0]);
+		bp->b_addr = folio_address(bp->b_folios[0]);
 	} else if (flags & XBF_UNMAPPED) {
 		bp->b_addr = NULL;
 	} else {
@@ -443,8 +443,8 @@ _xfs_buf_map_pages(
 		 */
 		nofs_flag = memalloc_nofs_save();
 		do {
-			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
-						-1);
+			bp->b_addr = vm_map_ram((struct page **)bp->b_folios,
+					bp->b_folio_count, -1);
 			if (bp->b_addr)
 				break;
 			vm_unmap_aliases();
@@ -571,7 +571,7 @@ xfs_buf_find_lock(
 			return -ENOENT;
 		}
 		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
-		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
+		bp->b_flags &= _XBF_KMEM | _XBF_FOLIOS;
 		bp->b_ops = NULL;
 	}
 	return 0;
@@ -629,14 +629,15 @@ xfs_buf_find_insert(
 		goto out_drop_pag;
 
 	/*
-	 * For buffers that fit entirely within a single page, first attempt to
-	 * allocate the memory from the heap to minimise memory usage. If we
-	 * can't get heap memory for these small buffers, we fall back to using
-	 * the page allocator.
+	 * For buffers that fit entirely within a single page folio, first
+	 * attempt to allocate the memory from the heap to minimise memory
+	 * usage. If we can't get heap memory for these small buffers, we fall
+	 * back to using the page allocator.
 	 */
+
 	if (BBTOB(new_bp->b_length) >= PAGE_SIZE ||
 	    xfs_buf_alloc_kmem(new_bp, flags) < 0) {
-		error = xfs_buf_alloc_pages(new_bp, flags);
+		error = xfs_buf_alloc_folios(new_bp, flags);
 		if (error)
 			goto out_free_buf;
 	}
@@ -728,11 +729,11 @@ xfs_buf_get_map(
 
 	/* We do not hold a perag reference anymore. */
 	if (!bp->b_addr) {
-		error = _xfs_buf_map_pages(bp, flags);
+		error = _xfs_buf_map_folios(bp, flags);
 		if (unlikely(error)) {
 			xfs_warn_ratelimited(btp->bt_mount,
-				"%s: failed to map %u pages", __func__,
-				bp->b_page_count);
+				"%s: failed to map %u folios", __func__,
+				bp->b_folio_count);
 			xfs_buf_relse(bp);
 			return error;
 		}
@@ -963,14 +964,14 @@ xfs_buf_get_uncached(
 	if (error)
 		return error;
 
-	error = xfs_buf_alloc_pages(bp, flags);
+	error = xfs_buf_alloc_folios(bp, flags);
 	if (error)
 		goto fail_free_buf;
 
-	error = _xfs_buf_map_pages(bp, 0);
+	error = _xfs_buf_map_folios(bp, 0);
 	if (unlikely(error)) {
 		xfs_warn(target->bt_mount,
-			"%s: failed to map pages", __func__);
+			"%s: failed to map folios", __func__);
 		goto fail_free_buf;
 	}
 
@@ -1465,7 +1466,7 @@ xfs_buf_ioapply_map(
 	blk_opf_t	op)
 {
 	int		page_index;
-	unsigned int	total_nr_pages = bp->b_page_count;
+	unsigned int	total_nr_pages = bp->b_folio_count;
 	int		nr_pages;
 	struct bio	*bio;
 	sector_t	sector =  bp->b_maps[map].bm_bn;
@@ -1503,7 +1504,7 @@ xfs_buf_ioapply_map(
 		if (nbytes > size)
 			nbytes = size;
 
-		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
+		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
 				      offset);
 		if (rbytes < nbytes)
 			break;
@@ -1716,13 +1717,13 @@ xfs_buf_offset(
 	struct xfs_buf		*bp,
 	size_t			offset)
 {
-	struct page		*page;
+	struct folio		*folio;
 
 	if (bp->b_addr)
 		return bp->b_addr + offset;
 
-	page = bp->b_pages[offset >> PAGE_SHIFT];
-	return page_address(page) + (offset & (PAGE_SIZE-1));
+	folio = bp->b_folios[offset >> PAGE_SHIFT];
+	return folio_address(folio) + (offset & (PAGE_SIZE-1));
 }
 
 void
@@ -1735,18 +1736,18 @@ xfs_buf_zero(
 
 	bend = boff + bsize;
 	while (boff < bend) {
-		struct page	*page;
+		struct folio	*folio;
 		int		page_index, page_offset, csize;
 
 		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
 		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
-		page = bp->b_pages[page_index];
+		folio = bp->b_folios[page_index];
 		csize = min_t(size_t, PAGE_SIZE - page_offset,
 				      BBTOB(bp->b_length) - boff);
 
 		ASSERT((csize + page_offset) <= PAGE_SIZE);
 
-		memset(page_address(page) + page_offset, 0, csize);
+		memset(folio_address(folio) + page_offset, 0, csize);
 
 		boff += csize;
 	}
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index b470de08a46c..1e7298ff3fa5 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -29,7 +29,7 @@ struct xfs_buf;
 #define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
 #define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
 #define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
-#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
+#define XBF_DONE	 (1u << 5) /* all folios in the buffer uptodate */
 #define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
 #define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
 
@@ -39,7 +39,7 @@ struct xfs_buf;
 #define _XBF_LOGRECOVERY (1u << 18)/* log recovery buffer */
 
 /* flags used only internally */
-#define _XBF_PAGES	 (1u << 20)/* backed by refcounted pages */
+#define _XBF_FOLIOS	 (1u << 20)/* backed by refcounted folios */
 #define _XBF_KMEM	 (1u << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1u << 22)/* buffer on a delwri queue */
 
@@ -68,7 +68,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_DQUOTS,		"DQUOTS" }, \
 	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
-	{ _XBF_PAGES,		"PAGES" }, \
+	{ _XBF_FOLIOS,		"FOLIOS" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
 	/* The following interface flags should never be set */ \
@@ -116,7 +116,7 @@ typedef struct xfs_buftarg {
 	struct ratelimit_state	bt_ioerror_rl;
 } xfs_buftarg_t;
 
-#define XB_PAGES	2
+#define XB_FOLIOS	2
 
 struct xfs_buf_map {
 	xfs_daddr_t		bm_bn;	/* block number for I/O */
@@ -180,14 +180,14 @@ struct xfs_buf {
 	struct xfs_buf_log_item	*b_log_item;
 	struct list_head	b_li_list;	/* Log items list head */
 	struct xfs_trans	*b_transp;
-	struct page		**b_pages;	/* array of page pointers */
-	struct page		*b_page_array[XB_PAGES]; /* inline pages */
+	struct folio		**b_folios;	/* array of folio pointers */
+	struct folio		*b_folio_array[XB_FOLIOS]; /* inline folios */
 	struct xfs_buf_map	*b_maps;	/* compound buffer map */
 	struct xfs_buf_map	__b_map;	/* inline compound buffer map */
 	int			b_map_count;
 	atomic_t		b_pin_count;	/* pin count */
 	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
-	unsigned int		b_page_count;	/* size of page array */
+	unsigned int		b_folio_count;	/* size of folio array */
 	unsigned int		b_offset;	/* page offset of b_addr,
 						   only for _XBF_KMEM buffers */
 	int			b_error;	/* error code on I/O */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 83a81cb52d8e..d1407cee48d9 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -69,7 +69,7 @@ xfs_buf_item_straddle(
 {
 	void			*first, *last;
 
-	if (bp->b_page_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
+	if (bp->b_folio_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
 		return false;
 
 	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index caccb7f76690..804389b8e802 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -279,4 +279,12 @@ kmem_to_page(void *addr)
 	return virt_to_page(addr);
 }
 
+static inline struct folio *
+kmem_to_folio(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		return page_folio(vmalloc_to_page(addr));
+	return virt_to_folio(addr);
+}
+
 #endif /* __XFS_LINUX__ */
-- 
2.43.0



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 3/3] xfs: convert buffer cache to use high order folios
  2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
  2024-01-18 22:19 ` [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch Dave Chinner
  2024-01-18 22:19 ` [PATCH 2/3] xfs: use folios in the buffer cache Dave Chinner
@ 2024-01-18 22:19 ` Dave Chinner
  2024-01-19  1:31   ` Darrick J. Wong
  2024-01-22  6:45   ` Christoph Hellwig
  2024-01-19  1:05 ` [RFC] [PATCH 0/3] xfs: use large folios for buffers Darrick J. Wong
  2024-01-22 10:13 ` Andi Kleen
  4 siblings, 2 replies; 44+ messages in thread
From: Dave Chinner @ 2024-01-18 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Now that we have the buffer cache using the folio API, we can extend
the use of folios to allocate high order folios for multi-page
buffers rather than an array of single pages that are then vmapped
into a contiguous range.

This creates two types of buffers: single folio buffers that can
have arbitrary order, and multi-folio buffers made up of many single
page folios that get vmapped. The latter is essentially the existing
code, so there are no logic changes to handle this case.

There are a few places where we iterate the folios on a buffer.
These need to be converted to handle the high order folio case.
Luckily, this only occurs when bp->b_folio_count == 1, and the code
for handling this case is just a simple application of the folio API
to the operations that need to be performed.

The code that allocates buffers will optimistically attempt a high
order folio allocation as a fast path. If this high order allocation
fails, then we fall back to the existing multi-folio allocation
code. This now forms the slow allocation path, and hopefully will be
largely unused in normal conditions.

This should improve performance of large buffer operations (e.g.
large directory block sizes) as we should now mostly avoid the
expense of vmapping large buffers (and the vmap lock contention that
can occur) as well as avoid the runtime pressure that frequently
accessing kernel vmapped pages put on the TLBs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 150 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 119 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 15907e92d0d3..df363f17ea1a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -74,6 +74,10 @@ xfs_buf_is_vmapped(
 	return bp->b_addr && bp->b_folio_count > 1;
 }
 
+/*
+ * See comment above xfs_buf_alloc_folios() about the constraints placed on
+ * allocating vmapped buffers.
+ */
 static inline int
 xfs_buf_vmap_len(
 	struct xfs_buf	*bp)
@@ -344,14 +348,72 @@ xfs_buf_alloc_kmem(
 		bp->b_addr = NULL;
 		return -ENOMEM;
 	}
-	bp->b_offset = offset_in_page(bp->b_addr);
 	bp->b_folios = bp->b_folio_array;
 	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
+	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
 	bp->b_folio_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
 }
 
+/*
+ * Allocating a high order folio makes the assumption that buffers are a
+ * power-of-2 size so that ilog2() returns the exact order needed to fit
+ * the contents of the buffer. Buffer lengths are mostly a power of two,
+ * so this is not an unreasonable approach to take by default.
+ *
+ * The exception here are user xattr data buffers, which can be arbitrarily
+ * sized up to 64kB plus structure metadata. In that case, round up the order.
+ */
+static bool
+xfs_buf_alloc_folio(
+	struct xfs_buf	*bp,
+	gfp_t		gfp_mask)
+{
+	int		length = BBTOB(bp->b_length);
+	int		order;
+
+	order = ilog2(length);
+	if ((1 << order) < length)
+		order = ilog2(length - 1) + 1;
+
+	if (order <= PAGE_SHIFT)
+		order = 0;
+	else
+		order -= PAGE_SHIFT;
+
+	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
+	if (!bp->b_folio_array[0])
+		return false;
+
+	bp->b_folios = bp->b_folio_array;
+	bp->b_folio_count = 1;
+	bp->b_flags |= _XBF_FOLIOS;
+	return true;
+}
+
+/*
+ * When we allocate folios for a buffer, we end up with one of two types of
+ * buffer.
+ *
+ * The first type is a single folio buffer - this may be a high order
+ * folio or just a single page sized folio, but either way they get treated the
+ * same way by the rest of the code - the buffer memory spans a single
+ * contiguous memory region that we don't have to map and unmap to access the
+ * data directly.
+ *
+ * The second type of buffer is the multi-folio buffer. These are *always* made
+ * up of single page folios so that they can be fed to vmap_ram() to return a
+ * contiguous memory region we can access the data through, or mark it as
+ * XBF_UNMAPPED and access the data directly through individual folio_address()
+ * calls.
+ *
+ * We don't use high order folios for this second type of buffer (yet) because
+ * having variable size folios makes offset-to-folio indexing and iteration of
+ * the data range more complex than if they are fixed size. This case should now
+ * be the slow path, though, so unless we regularly fail to allocate high order
+ * folios, there should be little need to optimise this path.
+ */
 static int
 xfs_buf_alloc_folios(
 	struct xfs_buf	*bp,
@@ -363,7 +425,15 @@ xfs_buf_alloc_folios(
 	if (flags & XBF_READ_AHEAD)
 		gfp_mask |= __GFP_NORETRY;
 
-	/* Make sure that we have a page list */
+	/* Assure zeroed buffer for non-read cases. */
+	if (!(flags & XBF_READ))
+		gfp_mask |= __GFP_ZERO;
+
+	/* Optimistically attempt a single high order folio allocation. */
+	if (xfs_buf_alloc_folio(bp, gfp_mask))
+		return 0;
+
+	/* Fall back to allocating an array of single page folios. */
 	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
 	if (bp->b_folio_count <= XB_FOLIOS) {
 		bp->b_folios = bp->b_folio_array;
@@ -375,9 +445,6 @@ xfs_buf_alloc_folios(
 	}
 	bp->b_flags |= _XBF_FOLIOS;
 
-	/* Assure zeroed buffer for non-read cases. */
-	if (!(flags & XBF_READ))
-		gfp_mask |= __GFP_ZERO;
 
 	/*
 	 * Bulk filling of pages can take multiple calls. Not filling the entire
@@ -418,7 +485,7 @@ _xfs_buf_map_folios(
 {
 	ASSERT(bp->b_flags & _XBF_FOLIOS);
 	if (bp->b_folio_count == 1) {
-		/* A single page buffer is always mappable */
+		/* A single folio buffer is always mappable */
 		bp->b_addr = folio_address(bp->b_folios[0]);
 	} else if (flags & XBF_UNMAPPED) {
 		bp->b_addr = NULL;
@@ -1465,20 +1532,28 @@ xfs_buf_ioapply_map(
 	int		*count,
 	blk_opf_t	op)
 {
-	int		page_index;
-	unsigned int	total_nr_pages = bp->b_folio_count;
-	int		nr_pages;
+	int		folio_index;
+	unsigned int	total_nr_folios = bp->b_folio_count;
+	int		nr_folios;
 	struct bio	*bio;
 	sector_t	sector =  bp->b_maps[map].bm_bn;
 	int		size;
 	int		offset;
 
-	/* skip the pages in the buffer before the start offset */
-	page_index = 0;
+	/*
+	 * If the start offset if larger than a single page, we need to be
+	 * careful. We might have a high order folio, in which case the indexing
+	 * is from the start of the buffer. However, if we have more than one
+	 * folio single page folio in the buffer, we need to skip the folios in
+	 * the buffer before the start offset.
+	 */
+	folio_index = 0;
 	offset = *buf_offset;
-	while (offset >= PAGE_SIZE) {
-		page_index++;
-		offset -= PAGE_SIZE;
+	if (bp->b_folio_count > 1) {
+		while (offset >= PAGE_SIZE) {
+			folio_index++;
+			offset -= PAGE_SIZE;
+		}
 	}
 
 	/*
@@ -1491,28 +1566,28 @@ xfs_buf_ioapply_map(
 
 next_chunk:
 	atomic_inc(&bp->b_io_remaining);
-	nr_pages = bio_max_segs(total_nr_pages);
+	nr_folios = bio_max_segs(total_nr_folios);
 
-	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
+	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
 	bio->bi_iter.bi_sector = sector;
 	bio->bi_end_io = xfs_buf_bio_end_io;
 	bio->bi_private = bp;
 
-	for (; size && nr_pages; nr_pages--, page_index++) {
-		int	rbytes, nbytes = PAGE_SIZE - offset;
+	for (; size && nr_folios; nr_folios--, folio_index++) {
+		struct folio	*folio = bp->b_folios[folio_index];
+		int		nbytes = folio_size(folio) - offset;
 
 		if (nbytes > size)
 			nbytes = size;
 
-		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
-				      offset);
-		if (rbytes < nbytes)
+		if (!bio_add_folio(bio, folio, nbytes,
+				offset_in_folio(folio, offset)))
 			break;
 
 		offset = 0;
 		sector += BTOBB(nbytes);
 		size -= nbytes;
-		total_nr_pages--;
+		total_nr_folios--;
 	}
 
 	if (likely(bio->bi_iter.bi_size)) {
@@ -1722,6 +1797,13 @@ xfs_buf_offset(
 	if (bp->b_addr)
 		return bp->b_addr + offset;
 
+	/* Single folio buffers may use large folios. */
+	if (bp->b_folio_count == 1) {
+		folio = bp->b_folios[0];
+		return folio_address(folio) + offset_in_folio(folio, offset);
+	}
+
+	/* Multi-folio buffers always use PAGE_SIZE folios */
 	folio = bp->b_folios[offset >> PAGE_SHIFT];
 	return folio_address(folio) + (offset & (PAGE_SIZE-1));
 }
@@ -1737,18 +1819,24 @@ xfs_buf_zero(
 	bend = boff + bsize;
 	while (boff < bend) {
 		struct folio	*folio;
-		int		page_index, page_offset, csize;
+		int		folio_index, folio_offset, csize;
 
-		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
-		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
-		folio = bp->b_folios[page_index];
-		csize = min_t(size_t, PAGE_SIZE - page_offset,
+		/* Single folio buffers may use large folios. */
+		if (bp->b_folio_count == 1) {
+			folio = bp->b_folios[0];
+			folio_offset = offset_in_folio(folio,
+						bp->b_offset + boff);
+		} else {
+			folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
+			folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
+			folio = bp->b_folios[folio_index];
+		}
+
+		csize = min_t(size_t, folio_size(folio) - folio_offset,
 				      BBTOB(bp->b_length) - boff);
+		ASSERT((csize + folio_offset) <= folio_size(folio));
 
-		ASSERT((csize + page_offset) <= PAGE_SIZE);
-
-		memset(folio_address(folio) + page_offset, 0, csize);
-
+		memset(folio_address(folio) + folio_offset, 0, csize);
 		boff += csize;
 	}
 }
-- 
2.43.0



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC] [PATCH 0/3] xfs: use large folios for buffers
  2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
                   ` (2 preceding siblings ...)
  2024-01-18 22:19 ` [PATCH 3/3] xfs: convert buffer cache to use high order folios Dave Chinner
@ 2024-01-19  1:05 ` Darrick J. Wong
  2024-01-22 10:13 ` Andi Kleen
  4 siblings, 0 replies; 44+ messages in thread
From: Darrick J. Wong @ 2024-01-19  1:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, willy, linux-mm

On Fri, Jan 19, 2024 at 09:19:38AM +1100, Dave Chinner wrote:
> The XFS buffer cache supports metadata buffers up to 64kB, and it does so by
> aggregating multiple pages into a single contiguous memory region using
> vmapping. This is expensive (both the setup and the runtime TLB mapping cost),
> and would be unnecessary if we could allocate large contiguous memory regions
> for the buffers in the first place.
> 
> Enter multi-page folios.

LOL, hch and I just wrapped up making the xfbtree buffer cache work with
large folios coming from tmpfs.  Though the use case there is simpler
because we require blocksize==PAGE_SIZE, forbid the use of highmem, and
don't need discontig buffers.  Hence we sidestep vm_map_ram. :)

> This patchset converts the buffer cache to use the folio API, then enhances it
> to optimisitically use large folios where possible. It retains the old "vmap an
> array of single page folios" functionality as a fallback when large folio
> allocation fails. This means that, like page cache support for large folios, we
> aren't dependent on large folio allocation succeeding all the time.
> 
> This relegates the single page array allocation mechanism to the "slow path"
> that we don't have to care so much about performance of this path anymore. This
> might allow us to simplify it a bit in future.
> 
> One of the issues with the folio conversion is that we use a couple of APIs that
> take struct page ** (i.e. pointers to page pointer arrays) and there aren't
> folio counterparts. These are the bulk page allocator and vm_map_ram(). In the
> cases where they are used, we cast &bp->b_folios[] to (struct page **) knowing
> that this array will only contain single page folios and that single page folios
> and struct page are the same structure and so have the same address. This is a
> bit of a hack (hence the RFC) but I'm not sure that it's worth adding folio
> versions of these interfaces right now. We don't need to use the bulk page
> allocator so much any more, because that's now a slow path and we could probably
> just call folio_alloc() in a loop like we used to. What to do about vm_map_ram()
> is a little less clear....

Yeah, that's what I suspected.

> The other issue I tripped over in doing this conversion is that the
> discontiguous buffer straddling code in the buf log item dirty region tracking
> is broken. We don't actually exercise that code on existing configurations, and
> I tripped over it when tracking down a bug in the folio conversion. I fixed it
> and short-circuted the check for contiguous buffers, but that didn't fix the
> failure I was seeing (which was not handling bp->b_offset and large folios
> properly when building bios).

Yikes.

> Apart from those issues, the conversion and enhancement is relatively straight
> forward.  It passes fstests on both 512 and 4096 byte sector size storage (512
> byte sectors exercise the XBF_KMEM path which has non-zero bp->b_offset values)
> and doesn't appear to cause any problems with large directory buffers, though I
> haven't done any real testing on those yet. Large folio allocations are
> definitely being exercised, though, as all the inode cluster buffers are 16kB on
> a 512 byte inode V5 filesystem.
> 
> Thoughts, comments, etc?

Not yet.

> Note: this patchset is on top of the NOFS removal patchset I sent a
> few days ago. That can be pulled from this git branch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-kmem-cleanup

Oooh a branch link, thank you.  It's so much easier if I can pull a
branch while picking through commits over gitweb.

--D

> 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-18 22:19 ` [PATCH 2/3] xfs: use folios in the buffer cache Dave Chinner
@ 2024-01-19  1:26   ` Darrick J. Wong
  2024-01-19  7:03     ` Dave Chinner
  2024-01-22  6:39     ` Christoph Hellwig
  0 siblings, 2 replies; 44+ messages in thread
From: Darrick J. Wong @ 2024-01-19  1:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, willy, linux-mm

On Fri, Jan 19, 2024 at 09:19:40AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Convert the use of struct pages to struct folio everywhere. This
> is just direct API conversion, no actual logic of code changes
> should result.
> 
> Note: this conversion currently assumes only single page folios are
> allocated, and because some of the MM interfaces we use take
> pointers to arrays of struct pages, the address of single page
> folios and struct pages are the same. e.g alloc_pages_bulk_array(),
> vm_map_ram(), etc.
> 
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c      | 127 +++++++++++++++++++++---------------------
>  fs/xfs/xfs_buf.h      |  14 ++---
>  fs/xfs/xfs_buf_item.c |   2 +-
>  fs/xfs/xfs_linux.h    |   8 +++
>  4 files changed, 80 insertions(+), 71 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 08f2fbc04db5..15907e92d0d3 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -60,25 +60,25 @@ xfs_buf_submit(
>  	return __xfs_buf_submit(bp, !(bp->b_flags & XBF_ASYNC));
>  }
>  
> +/*
> + * Return true if the buffer is vmapped.
> + *
> + * b_addr is null if the buffer is not mapped, but the code is clever enough to
> + * know it doesn't have to map a single folio, so the check has to be both for
> + * b_addr and bp->b_folio_count > 1.
> + */
>  static inline int
>  xfs_buf_is_vmapped(
>  	struct xfs_buf	*bp)
>  {
> -	/*
> -	 * Return true if the buffer is vmapped.
> -	 *
> -	 * b_addr is null if the buffer is not mapped, but the code is clever
> -	 * enough to know it doesn't have to map a single page, so the check has
> -	 * to be both for b_addr and bp->b_page_count > 1.
> -	 */
> -	return bp->b_addr && bp->b_page_count > 1;
> +	return bp->b_addr && bp->b_folio_count > 1;
>  }
>  
>  static inline int
>  xfs_buf_vmap_len(
>  	struct xfs_buf	*bp)
>  {
> -	return (bp->b_page_count * PAGE_SIZE);
> +	return (bp->b_folio_count * PAGE_SIZE);
>  }
>  
>  /*
> @@ -197,7 +197,7 @@ xfs_buf_get_maps(
>  }
>  
>  /*
> - *	Frees b_pages if it was allocated.
> + *	Frees b_maps if it was allocated.
>   */
>  static void
>  xfs_buf_free_maps(
> @@ -273,26 +273,26 @@ _xfs_buf_alloc(
>  }
>  
>  static void
> -xfs_buf_free_pages(
> +xfs_buf_free_folios(
>  	struct xfs_buf	*bp)
>  {
>  	uint		i;
>  
> -	ASSERT(bp->b_flags & _XBF_PAGES);
> +	ASSERT(bp->b_flags & _XBF_FOLIOS);
>  
>  	if (xfs_buf_is_vmapped(bp))
> -		vm_unmap_ram(bp->b_addr, bp->b_page_count);
> +		vm_unmap_ram(bp->b_addr, bp->b_folio_count);
>  
> -	for (i = 0; i < bp->b_page_count; i++) {
> -		if (bp->b_pages[i])
> -			__free_page(bp->b_pages[i]);
> +	for (i = 0; i < bp->b_folio_count; i++) {
> +		if (bp->b_folios[i])
> +			__folio_put(bp->b_folios[i]);
>  	}
> -	mm_account_reclaimed_pages(bp->b_page_count);
> +	mm_account_reclaimed_pages(bp->b_folio_count);
>  
> -	if (bp->b_pages != bp->b_page_array)
> -		kfree(bp->b_pages);
> -	bp->b_pages = NULL;
> -	bp->b_flags &= ~_XBF_PAGES;
> +	if (bp->b_folios != bp->b_folio_array)
> +		kfree(bp->b_folios);
> +	bp->b_folios = NULL;
> +	bp->b_flags &= ~_XBF_FOLIOS;
>  }
>  
>  static void
> @@ -313,8 +313,8 @@ xfs_buf_free(
>  
>  	ASSERT(list_empty(&bp->b_lru));
>  
> -	if (bp->b_flags & _XBF_PAGES)
> -		xfs_buf_free_pages(bp);
> +	if (bp->b_flags & _XBF_FOLIOS)
> +		xfs_buf_free_folios(bp);
>  	else if (bp->b_flags & _XBF_KMEM)
>  		kfree(bp->b_addr);
>  
> @@ -345,15 +345,15 @@ xfs_buf_alloc_kmem(
>  		return -ENOMEM;
>  	}
>  	bp->b_offset = offset_in_page(bp->b_addr);
> -	bp->b_pages = bp->b_page_array;
> -	bp->b_pages[0] = kmem_to_page(bp->b_addr);
> -	bp->b_page_count = 1;
> +	bp->b_folios = bp->b_folio_array;
> +	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> +	bp->b_folio_count = 1;
>  	bp->b_flags |= _XBF_KMEM;
>  	return 0;
>  }
>  
>  static int
> -xfs_buf_alloc_pages(
> +xfs_buf_alloc_folios(
>  	struct xfs_buf	*bp,
>  	xfs_buf_flags_t	flags)
>  {
> @@ -364,16 +364,16 @@ xfs_buf_alloc_pages(
>  		gfp_mask |= __GFP_NORETRY;
>  
>  	/* Make sure that we have a page list */
> -	bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
> -	if (bp->b_page_count <= XB_PAGES) {
> -		bp->b_pages = bp->b_page_array;
> +	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
> +	if (bp->b_folio_count <= XB_FOLIOS) {
> +		bp->b_folios = bp->b_folio_array;
>  	} else {
> -		bp->b_pages = kzalloc(sizeof(struct page *) * bp->b_page_count,
> +		bp->b_folios = kzalloc(sizeof(struct folio *) * bp->b_folio_count,
>  					gfp_mask);
> -		if (!bp->b_pages)
> +		if (!bp->b_folios)
>  			return -ENOMEM;
>  	}
> -	bp->b_flags |= _XBF_PAGES;
> +	bp->b_flags |= _XBF_FOLIOS;
>  
>  	/* Assure zeroed buffer for non-read cases. */
>  	if (!(flags & XBF_READ))
> @@ -387,9 +387,9 @@ xfs_buf_alloc_pages(
>  	for (;;) {
>  		long	last = filled;
>  
> -		filled = alloc_pages_bulk_array(gfp_mask, bp->b_page_count,
> -						bp->b_pages);
> -		if (filled == bp->b_page_count) {
> +		filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
> +						(struct page **)bp->b_folios);

Ugh, pointer casting.  I suppose here is where we might want an
alloc_folio_bulk_array that might give us successively smaller
large-folios until b_page_count is satisfied?  (Maybe that's in the next
patch?)

I guess you'd also need a large-folio capable vm_map_ram.  Both of
these things sound reasonable, particularly if somebody wants to write
us a new buffer cache for ext2rs and support large block sizes.

Assuming that one of the goals here is (say) to be able to mount a 16k
blocksize filesystem and try to get 16k folios for the buffer cache?

--D

> +		if (filled == bp->b_folio_count) {
>  			XFS_STATS_INC(bp->b_mount, xb_page_found);
>  			break;
>  		}
> @@ -398,7 +398,7 @@ xfs_buf_alloc_pages(
>  			continue;
>  
>  		if (flags & XBF_READ_AHEAD) {
> -			xfs_buf_free_pages(bp);
> +			xfs_buf_free_folios(bp);
>  			return -ENOMEM;
>  		}
>  
> @@ -412,14 +412,14 @@ xfs_buf_alloc_pages(
>   *	Map buffer into kernel address-space if necessary.
>   */
>  STATIC int
> -_xfs_buf_map_pages(
> +_xfs_buf_map_folios(
>  	struct xfs_buf		*bp,
>  	xfs_buf_flags_t		flags)
>  {
> -	ASSERT(bp->b_flags & _XBF_PAGES);
> -	if (bp->b_page_count == 1) {
> +	ASSERT(bp->b_flags & _XBF_FOLIOS);
> +	if (bp->b_folio_count == 1) {
>  		/* A single page buffer is always mappable */
> -		bp->b_addr = page_address(bp->b_pages[0]);
> +		bp->b_addr = folio_address(bp->b_folios[0]);
>  	} else if (flags & XBF_UNMAPPED) {
>  		bp->b_addr = NULL;
>  	} else {
> @@ -443,8 +443,8 @@ _xfs_buf_map_pages(
>  		 */
>  		nofs_flag = memalloc_nofs_save();
>  		do {
> -			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> -						-1);
> +			bp->b_addr = vm_map_ram((struct page **)bp->b_folios,
> +					bp->b_folio_count, -1);
>  			if (bp->b_addr)
>  				break;
>  			vm_unmap_aliases();
> @@ -571,7 +571,7 @@ xfs_buf_find_lock(
>  			return -ENOENT;
>  		}
>  		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
> -		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
> +		bp->b_flags &= _XBF_KMEM | _XBF_FOLIOS;
>  		bp->b_ops = NULL;
>  	}
>  	return 0;
> @@ -629,14 +629,15 @@ xfs_buf_find_insert(
>  		goto out_drop_pag;
>  
>  	/*
> -	 * For buffers that fit entirely within a single page, first attempt to
> -	 * allocate the memory from the heap to minimise memory usage. If we
> -	 * can't get heap memory for these small buffers, we fall back to using
> -	 * the page allocator.
> +	 * For buffers that fit entirely within a single page folio, first
> +	 * attempt to allocate the memory from the heap to minimise memory
> +	 * usage. If we can't get heap memory for these small buffers, we fall
> +	 * back to using the page allocator.
>  	 */
> +
>  	if (BBTOB(new_bp->b_length) >= PAGE_SIZE ||
>  	    xfs_buf_alloc_kmem(new_bp, flags) < 0) {
> -		error = xfs_buf_alloc_pages(new_bp, flags);
> +		error = xfs_buf_alloc_folios(new_bp, flags);
>  		if (error)
>  			goto out_free_buf;
>  	}
> @@ -728,11 +729,11 @@ xfs_buf_get_map(
>  
>  	/* We do not hold a perag reference anymore. */
>  	if (!bp->b_addr) {
> -		error = _xfs_buf_map_pages(bp, flags);
> +		error = _xfs_buf_map_folios(bp, flags);
>  		if (unlikely(error)) {
>  			xfs_warn_ratelimited(btp->bt_mount,
> -				"%s: failed to map %u pages", __func__,
> -				bp->b_page_count);
> +				"%s: failed to map %u folios", __func__,
> +				bp->b_folio_count);
>  			xfs_buf_relse(bp);
>  			return error;
>  		}
> @@ -963,14 +964,14 @@ xfs_buf_get_uncached(
>  	if (error)
>  		return error;
>  
> -	error = xfs_buf_alloc_pages(bp, flags);
> +	error = xfs_buf_alloc_folios(bp, flags);
>  	if (error)
>  		goto fail_free_buf;
>  
> -	error = _xfs_buf_map_pages(bp, 0);
> +	error = _xfs_buf_map_folios(bp, 0);
>  	if (unlikely(error)) {
>  		xfs_warn(target->bt_mount,
> -			"%s: failed to map pages", __func__);
> +			"%s: failed to map folios", __func__);
>  		goto fail_free_buf;
>  	}
>  
> @@ -1465,7 +1466,7 @@ xfs_buf_ioapply_map(
>  	blk_opf_t	op)
>  {
>  	int		page_index;
> -	unsigned int	total_nr_pages = bp->b_page_count;
> +	unsigned int	total_nr_pages = bp->b_folio_count;
>  	int		nr_pages;
>  	struct bio	*bio;
>  	sector_t	sector =  bp->b_maps[map].bm_bn;
> @@ -1503,7 +1504,7 @@ xfs_buf_ioapply_map(
>  		if (nbytes > size)
>  			nbytes = size;
>  
> -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> +		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
>  				      offset);
>  		if (rbytes < nbytes)
>  			break;
> @@ -1716,13 +1717,13 @@ xfs_buf_offset(
>  	struct xfs_buf		*bp,
>  	size_t			offset)
>  {
> -	struct page		*page;
> +	struct folio		*folio;
>  
>  	if (bp->b_addr)
>  		return bp->b_addr + offset;
>  
> -	page = bp->b_pages[offset >> PAGE_SHIFT];
> -	return page_address(page) + (offset & (PAGE_SIZE-1));
> +	folio = bp->b_folios[offset >> PAGE_SHIFT];
> +	return folio_address(folio) + (offset & (PAGE_SIZE-1));
>  }
>  
>  void
> @@ -1735,18 +1736,18 @@ xfs_buf_zero(
>  
>  	bend = boff + bsize;
>  	while (boff < bend) {
> -		struct page	*page;
> +		struct folio	*folio;
>  		int		page_index, page_offset, csize;
>  
>  		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
>  		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> -		page = bp->b_pages[page_index];
> +		folio = bp->b_folios[page_index];
>  		csize = min_t(size_t, PAGE_SIZE - page_offset,
>  				      BBTOB(bp->b_length) - boff);
>  
>  		ASSERT((csize + page_offset) <= PAGE_SIZE);
>  
> -		memset(page_address(page) + page_offset, 0, csize);
> +		memset(folio_address(folio) + page_offset, 0, csize);
>  
>  		boff += csize;
>  	}
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index b470de08a46c..1e7298ff3fa5 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -29,7 +29,7 @@ struct xfs_buf;
>  #define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
>  #define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
>  #define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> +#define XBF_DONE	 (1u << 5) /* all folios in the buffer uptodate */
>  #define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
>  #define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
>  
> @@ -39,7 +39,7 @@ struct xfs_buf;
>  #define _XBF_LOGRECOVERY (1u << 18)/* log recovery buffer */
>  
>  /* flags used only internally */
> -#define _XBF_PAGES	 (1u << 20)/* backed by refcounted pages */
> +#define _XBF_FOLIOS	 (1u << 20)/* backed by refcounted folios */
>  #define _XBF_KMEM	 (1u << 21)/* backed by heap memory */
>  #define _XBF_DELWRI_Q	 (1u << 22)/* buffer on a delwri queue */
>  
> @@ -68,7 +68,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ _XBF_INODES,		"INODES" }, \
>  	{ _XBF_DQUOTS,		"DQUOTS" }, \
>  	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
> -	{ _XBF_PAGES,		"PAGES" }, \
> +	{ _XBF_FOLIOS,		"FOLIOS" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
>  	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
>  	/* The following interface flags should never be set */ \
> @@ -116,7 +116,7 @@ typedef struct xfs_buftarg {
>  	struct ratelimit_state	bt_ioerror_rl;
>  } xfs_buftarg_t;
>  
> -#define XB_PAGES	2
> +#define XB_FOLIOS	2
>  
>  struct xfs_buf_map {
>  	xfs_daddr_t		bm_bn;	/* block number for I/O */
> @@ -180,14 +180,14 @@ struct xfs_buf {
>  	struct xfs_buf_log_item	*b_log_item;
>  	struct list_head	b_li_list;	/* Log items list head */
>  	struct xfs_trans	*b_transp;
> -	struct page		**b_pages;	/* array of page pointers */
> -	struct page		*b_page_array[XB_PAGES]; /* inline pages */
> +	struct folio		**b_folios;	/* array of folio pointers */
> +	struct folio		*b_folio_array[XB_FOLIOS]; /* inline folios */
>  	struct xfs_buf_map	*b_maps;	/* compound buffer map */
>  	struct xfs_buf_map	__b_map;	/* inline compound buffer map */
>  	int			b_map_count;
>  	atomic_t		b_pin_count;	/* pin count */
>  	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
> -	unsigned int		b_page_count;	/* size of page array */
> +	unsigned int		b_folio_count;	/* size of folio array */
>  	unsigned int		b_offset;	/* page offset of b_addr,
>  						   only for _XBF_KMEM buffers */
>  	int			b_error;	/* error code on I/O */
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 83a81cb52d8e..d1407cee48d9 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -69,7 +69,7 @@ xfs_buf_item_straddle(
>  {
>  	void			*first, *last;
>  
> -	if (bp->b_page_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
> +	if (bp->b_folio_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
>  		return false;
>  
>  	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index caccb7f76690..804389b8e802 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -279,4 +279,12 @@ kmem_to_page(void *addr)
>  	return virt_to_page(addr);
>  }
>  
> +static inline struct folio *
> +kmem_to_folio(void *addr)
> +{
> +	if (is_vmalloc_addr(addr))
> +		return page_folio(vmalloc_to_page(addr));
> +	return virt_to_folio(addr);
> +}
> +
>  #endif /* __XFS_LINUX__ */
> -- 
> 2.43.0
> 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: convert buffer cache to use high order folios
  2024-01-18 22:19 ` [PATCH 3/3] xfs: convert buffer cache to use high order folios Dave Chinner
@ 2024-01-19  1:31   ` Darrick J. Wong
  2024-01-19  7:12     ` Dave Chinner
  2024-01-22  6:45   ` Christoph Hellwig
  1 sibling, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2024-01-19  1:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, willy, linux-mm

On Fri, Jan 19, 2024 at 09:19:41AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have the buffer cache using the folio API, we can extend
> the use of folios to allocate high order folios for multi-page
> buffers rather than an array of single pages that are then vmapped
> into a contiguous range.
> 
> This creates two types of buffers: single folio buffers that can
> have arbitrary order, and multi-folio buffers made up of many single
> page folios that get vmapped. The latter is essentially the existing
> code, so there are no logic changes to handle this case.
> 
> There are a few places where we iterate the folios on a buffer.
> These need to be converted to handle the high order folio case.
> Luckily, this only occurs when bp->b_folio_count == 1, and the code
> for handling this case is just a simple application of the folio API
> to the operations that need to be performed.
> 
> The code that allocates buffers will optimistically attempt a high
> order folio allocation as a fast path. If this high order allocation
> fails, then we fall back to the existing multi-folio allocation
> code. This now forms the slow allocation path, and hopefully will be
> largely unused in normal conditions.
> 
> This should improve performance of large buffer operations (e.g.
> large directory block sizes) as we should now mostly avoid the
> expense of vmapping large buffers (and the vmap lock contention that
> can occur) as well as avoid the runtime pressure that frequently
> accessing kernel vmapped pages put on the TLBs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c | 150 +++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 119 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 15907e92d0d3..df363f17ea1a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -74,6 +74,10 @@ xfs_buf_is_vmapped(
>  	return bp->b_addr && bp->b_folio_count > 1;
>  }
>  
> +/*
> + * See comment above xfs_buf_alloc_folios() about the constraints placed on
> + * allocating vmapped buffers.
> + */
>  static inline int
>  xfs_buf_vmap_len(
>  	struct xfs_buf	*bp)
> @@ -344,14 +348,72 @@ xfs_buf_alloc_kmem(
>  		bp->b_addr = NULL;
>  		return -ENOMEM;
>  	}
> -	bp->b_offset = offset_in_page(bp->b_addr);
>  	bp->b_folios = bp->b_folio_array;
>  	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> +	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
>  	bp->b_folio_count = 1;
>  	bp->b_flags |= _XBF_KMEM;
>  	return 0;
>  }
>  
> +/*
> + * Allocating a high order folio makes the assumption that buffers are a
> + * power-of-2 size so that ilog2() returns the exact order needed to fit
> + * the contents of the buffer. Buffer lengths are mostly a power of two,
> + * so this is not an unreasonable approach to take by default.
> + *
> + * The exception here are user xattr data buffers, which can be arbitrarily
> + * sized up to 64kB plus structure metadata. In that case, round up the order.
> + */
> +static bool
> +xfs_buf_alloc_folio(
> +	struct xfs_buf	*bp,
> +	gfp_t		gfp_mask)
> +{
> +	int		length = BBTOB(bp->b_length);
> +	int		order;
> +
> +	order = ilog2(length);
> +	if ((1 << order) < length)
> +		order = ilog2(length - 1) + 1;
> +
> +	if (order <= PAGE_SHIFT)
> +		order = 0;
> +	else
> +		order -= PAGE_SHIFT;
> +
> +	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
> +	if (!bp->b_folio_array[0])
> +		return false;
> +
> +	bp->b_folios = bp->b_folio_array;
> +	bp->b_folio_count = 1;
> +	bp->b_flags |= _XBF_FOLIOS;
> +	return true;

Hmm.  So I guess with this patchset, either we get one multi-page large
folio for the whole buffer, or we fall back to the array of basepage
sized folios?

/me wonders if the extra flexibility from alloc_folio_bulk_array and a
folioized vm_map_ram would eliminate all this special casing?

Uhoh, lights flickering again and ice crashing off the roof.  I better
go before the lights go out again. :(

--D

> +}
> +
> +/*
> + * When we allocate folios for a buffer, we end up with one of two types of
> + * buffer.
> + *
> + * The first type is a single folio buffer - this may be a high order
> + * folio or just a single page sized folio, but either way they get treated the
> + * same way by the rest of the code - the buffer memory spans a single
> + * contiguous memory region that we don't have to map and unmap to access the
> + * data directly.
> + *
> + * The second type of buffer is the multi-folio buffer. These are *always* made
> + * up of single page folios so that they can be fed to vmap_ram() to return a
> + * contiguous memory region we can access the data through, or mark it as
> + * XBF_UNMAPPED and access the data directly through individual folio_address()
> + * calls.
> + *
> + * We don't use high order folios for this second type of buffer (yet) because
> + * having variable size folios makes offset-to-folio indexing and iteration of
> + * the data range more complex than if they are fixed size. This case should now
> + * be the slow path, though, so unless we regularly fail to allocate high order
> + * folios, there should be little need to optimise this path.
> + */
>  static int
>  xfs_buf_alloc_folios(
>  	struct xfs_buf	*bp,
> @@ -363,7 +425,15 @@ xfs_buf_alloc_folios(
>  	if (flags & XBF_READ_AHEAD)
>  		gfp_mask |= __GFP_NORETRY;
>  
> -	/* Make sure that we have a page list */
> +	/* Assure zeroed buffer for non-read cases. */
> +	if (!(flags & XBF_READ))
> +		gfp_mask |= __GFP_ZERO;
> +
> +	/* Optimistically attempt a single high order folio allocation. */
> +	if (xfs_buf_alloc_folio(bp, gfp_mask))
> +		return 0;
> +
> +	/* Fall back to allocating an array of single page folios. */
>  	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
>  	if (bp->b_folio_count <= XB_FOLIOS) {
>  		bp->b_folios = bp->b_folio_array;
> @@ -375,9 +445,6 @@ xfs_buf_alloc_folios(
>  	}
>  	bp->b_flags |= _XBF_FOLIOS;
>  
> -	/* Assure zeroed buffer for non-read cases. */
> -	if (!(flags & XBF_READ))
> -		gfp_mask |= __GFP_ZERO;
>  
>  	/*
>  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> @@ -418,7 +485,7 @@ _xfs_buf_map_folios(
>  {
>  	ASSERT(bp->b_flags & _XBF_FOLIOS);
>  	if (bp->b_folio_count == 1) {
> -		/* A single page buffer is always mappable */
> +		/* A single folio buffer is always mappable */
>  		bp->b_addr = folio_address(bp->b_folios[0]);
>  	} else if (flags & XBF_UNMAPPED) {
>  		bp->b_addr = NULL;
> @@ -1465,20 +1532,28 @@ xfs_buf_ioapply_map(
>  	int		*count,
>  	blk_opf_t	op)
>  {
> -	int		page_index;
> -	unsigned int	total_nr_pages = bp->b_folio_count;
> -	int		nr_pages;
> +	int		folio_index;
> +	unsigned int	total_nr_folios = bp->b_folio_count;
> +	int		nr_folios;
>  	struct bio	*bio;
>  	sector_t	sector =  bp->b_maps[map].bm_bn;
>  	int		size;
>  	int		offset;
>  
> -	/* skip the pages in the buffer before the start offset */
> -	page_index = 0;
> +	/*
> +	 * If the start offset if larger than a single page, we need to be
> +	 * careful. We might have a high order folio, in which case the indexing
> +	 * is from the start of the buffer. However, if we have more than one
> +	 * folio single page folio in the buffer, we need to skip the folios in
> +	 * the buffer before the start offset.
> +	 */
> +	folio_index = 0;
>  	offset = *buf_offset;
> -	while (offset >= PAGE_SIZE) {
> -		page_index++;
> -		offset -= PAGE_SIZE;
> +	if (bp->b_folio_count > 1) {
> +		while (offset >= PAGE_SIZE) {
> +			folio_index++;
> +			offset -= PAGE_SIZE;
> +		}
>  	}
>  
>  	/*
> @@ -1491,28 +1566,28 @@ xfs_buf_ioapply_map(
>  
>  next_chunk:
>  	atomic_inc(&bp->b_io_remaining);
> -	nr_pages = bio_max_segs(total_nr_pages);
> +	nr_folios = bio_max_segs(total_nr_folios);
>  
> -	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
> +	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
>  	bio->bi_iter.bi_sector = sector;
>  	bio->bi_end_io = xfs_buf_bio_end_io;
>  	bio->bi_private = bp;
>  
> -	for (; size && nr_pages; nr_pages--, page_index++) {
> -		int	rbytes, nbytes = PAGE_SIZE - offset;
> +	for (; size && nr_folios; nr_folios--, folio_index++) {
> +		struct folio	*folio = bp->b_folios[folio_index];
> +		int		nbytes = folio_size(folio) - offset;
>  
>  		if (nbytes > size)
>  			nbytes = size;
>  
> -		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
> -				      offset);
> -		if (rbytes < nbytes)
> +		if (!bio_add_folio(bio, folio, nbytes,
> +				offset_in_folio(folio, offset)))
>  			break;
>  
>  		offset = 0;
>  		sector += BTOBB(nbytes);
>  		size -= nbytes;
> -		total_nr_pages--;
> +		total_nr_folios--;
>  	}
>  
>  	if (likely(bio->bi_iter.bi_size)) {
> @@ -1722,6 +1797,13 @@ xfs_buf_offset(
>  	if (bp->b_addr)
>  		return bp->b_addr + offset;
>  
> +	/* Single folio buffers may use large folios. */
> +	if (bp->b_folio_count == 1) {
> +		folio = bp->b_folios[0];
> +		return folio_address(folio) + offset_in_folio(folio, offset);
> +	}
> +
> +	/* Multi-folio buffers always use PAGE_SIZE folios */
>  	folio = bp->b_folios[offset >> PAGE_SHIFT];
>  	return folio_address(folio) + (offset & (PAGE_SIZE-1));
>  }
> @@ -1737,18 +1819,24 @@ xfs_buf_zero(
>  	bend = boff + bsize;
>  	while (boff < bend) {
>  		struct folio	*folio;
> -		int		page_index, page_offset, csize;
> +		int		folio_index, folio_offset, csize;
>  
> -		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
> -		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> -		folio = bp->b_folios[page_index];
> -		csize = min_t(size_t, PAGE_SIZE - page_offset,
> +		/* Single folio buffers may use large folios. */
> +		if (bp->b_folio_count == 1) {
> +			folio = bp->b_folios[0];
> +			folio_offset = offset_in_folio(folio,
> +						bp->b_offset + boff);
> +		} else {
> +			folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
> +			folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> +			folio = bp->b_folios[folio_index];
> +		}
> +
> +		csize = min_t(size_t, folio_size(folio) - folio_offset,
>  				      BBTOB(bp->b_length) - boff);
> +		ASSERT((csize + folio_offset) <= folio_size(folio));
>  
> -		ASSERT((csize + page_offset) <= PAGE_SIZE);
> -
> -		memset(folio_address(folio) + page_offset, 0, csize);
> -
> +		memset(folio_address(folio) + folio_offset, 0, csize);
>  		boff += csize;
>  	}
>  }
> -- 
> 2.43.0
> 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-19  1:26   ` Darrick J. Wong
@ 2024-01-19  7:03     ` Dave Chinner
  2024-01-22  6:39     ` Christoph Hellwig
  1 sibling, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2024-01-19  7:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, willy, linux-mm

On Thu, Jan 18, 2024 at 05:26:24PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 19, 2024 at 09:19:40AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Convert the use of struct pages to struct folio everywhere. This
> > is just direct API conversion, no actual logic of code changes
> > should result.
> > 
> > Note: this conversion currently assumes only single page folios are
> > allocated, and because some of the MM interfaces we use take
> > pointers to arrays of struct pages, the address of single page
> > folios and struct pages are the same. e.g alloc_pages_bulk_array(),
> > vm_map_ram(), etc.
> > 
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
.....
> > @@ -387,9 +387,9 @@ xfs_buf_alloc_pages(
> >  	for (;;) {
> >  		long	last = filled;
> >  
> > -		filled = alloc_pages_bulk_array(gfp_mask, bp->b_page_count,
> > -						bp->b_pages);
> > -		if (filled == bp->b_page_count) {
> > +		filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
> > +						(struct page **)bp->b_folios);
> 
> Ugh, pointer casting.  I suppose here is where we might want an
> alloc_folio_bulk_array that might give us successively smaller
> large-folios until b_page_count is satisfied?  (Maybe that's in the next
> patch?)

No, I explicitly chose not to do that because then converting buffer
offset to memory address becomes excitingly complex. With fixed size
folios, it's just simple math. With variable, unknown sized objects,
we either have to store the size of each object with the pointer,
or walk each object grabbing the size to determine what folio in the
buffer corresponds to a specific offset.

And it's now the slow path, so I don't really care to optimise it
that much.

> I guess you'd also need a large-folio capable vm_map_ram.  Both of
> these things sound reasonable, particularly if somebody wants to write
> us a new buffer cache for ext2rs and support large block sizes.

Maybe so, but we do not require them and I don't really have the
time or desire to try to implement something like that. And, really
what benefit do small multipage folios bring us if we still have to
vmap them?

> Assuming that one of the goals here is (say) to be able to mount a 16k
> blocksize filesystem and try to get 16k folios for the buffer cache?

The goal is that we optimistically use large folios where-ever we
have metadata buffers that are larger than a single page, regardless
of the filesystem block size.

Right now on a 4kB block size filesystem that means inode cluster
buffers (16kB for 512 byte inodes), user xattr buffers larger than a
single page, and directory blocks if the filesytsem is configure
with "-n size=X" and X is 8kB or larger.

On filesystems with block sizes larger than 4kB, it will try to use
large folios for everything but the sector sized AG headers.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: convert buffer cache to use high order folios
  2024-01-19  1:31   ` Darrick J. Wong
@ 2024-01-19  7:12     ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2024-01-19  7:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, willy, linux-mm

On Thu, Jan 18, 2024 at 05:31:00PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 19, 2024 at 09:19:41AM +1100, Dave Chinner wrote:
> > +/*
> > + * Allocating a high order folio makes the assumption that buffers are a
> > + * power-of-2 size so that ilog2() returns the exact order needed to fit
> > + * the contents of the buffer. Buffer lengths are mostly a power of two,
> > + * so this is not an unreasonable approach to take by default.
> > + *
> > + * The exception here are user xattr data buffers, which can be arbitrarily
> > + * sized up to 64kB plus structure metadata. In that case, round up the order.
> > + */
> > +static bool
> > +xfs_buf_alloc_folio(
> > +	struct xfs_buf	*bp,
> > +	gfp_t		gfp_mask)
> > +{
> > +	int		length = BBTOB(bp->b_length);
> > +	int		order;
> > +
> > +	order = ilog2(length);
> > +	if ((1 << order) < length)
> > +		order = ilog2(length - 1) + 1;
> > +
> > +	if (order <= PAGE_SHIFT)
> > +		order = 0;
> > +	else
> > +		order -= PAGE_SHIFT;
> > +
> > +	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
> > +	if (!bp->b_folio_array[0])
> > +		return false;
> > +
> > +	bp->b_folios = bp->b_folio_array;
> > +	bp->b_folio_count = 1;
> > +	bp->b_flags |= _XBF_FOLIOS;
> > +	return true;
> 
> Hmm.  So I guess with this patchset, either we get one multi-page large
> folio for the whole buffer, or we fall back to the array of basepage
> sized folios?

Yes.

> /me wonders if the extra flexibility from alloc_folio_bulk_array and a
> folioized vm_map_ram would eliminate all this special casing?

IMO, no. The basic requirement for a buffer is to provide contiguous
memory space unless XBF_UNMAPPED is specified.

In the case of contiguous space, we either get a single contiguous
range allocated or we have to vmap multiple segments. The moment we
can't get a contiguous memory range, we've lost, we're in the slow
path and we should just do what we know will reliably work as
efficiently as possible.

In the case of XBF_UNMAPPED, if we get a single contiguous range we
can ignore XBF_UNMAPPED, otherwise we should just do what we know
will reliably work as efficiently as possible.

Hence I don't think trying to optimise the "we didn't get a
contiguous memory allocation for the buffer" case for smaller
multi-page folios because it just adds complexity and doesn't
provide any real advantage over the PAGE_SIZE allocation path we do
now.

> Uhoh, lights flickering again and ice crashing off the roof.  I better
> go before the lights go out again. :(

Fun fun!

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-19  1:26   ` Darrick J. Wong
  2024-01-19  7:03     ` Dave Chinner
@ 2024-01-22  6:39     ` Christoph Hellwig
  2024-01-22 12:05       ` Dave Chinner
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-01-22  6:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, willy, linux-mm

On Thu, Jan 18, 2024 at 05:26:24PM -0800, Darrick J. Wong wrote:
> Ugh, pointer casting.  I suppose here is where we might want an
> alloc_folio_bulk_array that might give us successively smaller
> large-folios until b_page_count is satisfied?  (Maybe that's in the next
> patch?)
> 
> I guess you'd also need a large-folio capable vm_map_ram. 

We need to just stop using vm_map_ram, there is no reason to do that
even right now.  It was needed when we used the page cache to back
pagebuf, but these days just sing vmalloc is the right thing for
!unmapped buffers that can't use large folios.  And I'm seriously
wondering if we should bother with unmapped buffers in the long run
if we end up normally using larger folios or just consolidate down to:

 - kmalloc for buffers < PAGE_SIZE
 - folio for buffers >= PAGE_SIZE
 - vmalloc if allocation a larger folios is not possible



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch
  2024-01-18 22:19 ` [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch Dave Chinner
@ 2024-01-22  6:41   ` Christoph Hellwig
  0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-01-22  6:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, willy, linux-mm

On Fri, Jan 19, 2024 at 09:19:39AM +1100, Dave Chinner wrote:
> Whilst touching this code, don't bother checking mapped or single
> folio buffers for discontiguous regions because they don't have
> them. This significantly reduces the overhead of this check when
> logging large buffers as calling xfs_buf_offset() is not free and
> it occurs a *lot* in those cases.

Nice.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: convert buffer cache to use high order folios
  2024-01-18 22:19 ` [PATCH 3/3] xfs: convert buffer cache to use high order folios Dave Chinner
  2024-01-19  1:31   ` Darrick J. Wong
@ 2024-01-22  6:45   ` Christoph Hellwig
  2024-01-22 11:57     ` Dave Chinner
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-01-22  6:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, willy, linux-mm

> +	int		length = BBTOB(bp->b_length);
> +	int		order;
> +
> +	order = ilog2(length);
> +	if ((1 << order) < length)
> +		order = ilog2(length - 1) + 1;
> +
> +	if (order <= PAGE_SHIFT)
> +		order = 0;
> +	else
> +		order -= PAGE_SHIFT;

Shouldn't this simply use get_order()?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* (no subject)
  2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
                   ` (3 preceding siblings ...)
  2024-01-19  1:05 ` [RFC] [PATCH 0/3] xfs: use large folios for buffers Darrick J. Wong
@ 2024-01-22 10:13 ` Andi Kleen
  2024-01-22 11:53   ` Dave Chinner
  4 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2024-01-22 10:13 UTC (permalink / raw)
  To: linux-xfs, david, linux-mm

Dave Chinner <david@fromorbit.com> writes:

> Thoughts, comments, etc?

The interesting part is if it will cause additional tail latencies
allocating under fragmentation with direct reclaim, compaction
etc. being triggered before it falls back to the base page path.

In fact it is highly likely it will, the question is just how bad it is.

Unfortunately benchmarking for that isn't that easy, it needs artificial
memory fragmentation and then some high stress workload, and then
instrumenting the transactions for individual latencies. 

I would in any case add a tunable for it in case people run into this.
Tail latencies are a common concern on many IO workloads.

-Andi


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2024-01-22 10:13 ` Andi Kleen
@ 2024-01-22 11:53   ` Dave Chinner
  2024-01-22 13:34     ` Using Folios for XFS metadata Andi Kleen
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2024-01-22 11:53 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-xfs, linux-mm

On Mon, Jan 22, 2024 at 02:13:23AM -0800, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > Thoughts, comments, etc?
> 
> The interesting part is if it will cause additional tail latencies
> allocating under fragmentation with direct reclaim, compaction
> etc. being triggered before it falls back to the base page path.

It's not like I don't know these problems exist with memory
allocation. Go have a look at xlog_kvmalloc() which is an open coded
kvmalloc() that allows the high order kmalloc allocations to
fail-fast without triggering all the expensive and unnecessary
direct reclaim overhead (e.g. compaction!) because we can fall back
to vmalloc without huge concerns. When high order allocations start
to fail, then we fall back to vmalloc and then we hit the long
standing vmalloc scalability problems before anything else in XFS or
the IO path becomes a bottleneck.

IOWs, we already know that fail-fast high-order allocation is a more
efficient and effective fast path than using vmalloc/vmap_ram() all
the time. As this is an RFC, I haven't implemented stuff like this
yet - I haven't seen anything in the profiles indicating that high
order folio allocation is failing and causing lots of reclaim
overhead, so I simply haven't added fail-fast behaviour yet...

> In fact it is highly likely it will, the question is just how bad it is.
> 
> Unfortunately benchmarking for that isn't that easy, it needs artificial
> memory fragmentation and then some high stress workload, and then
> instrumenting the transactions for individual latencies. 

I stress test and measure XFS metadata performance under sustained
memory pressure all the time. This change has not caused any
obvious regressions in the short time I've been testing it.

I still need to do perf testing on large directory block sizes. That
is where high-order allocations will get stressed - that's where
xlog_kvmalloc() starts dominating the profiles as it trips over
vmalloc scalability issues...

> I would in any case add a tunable for it in case people run into this.

No tunables. It either works or it doesn't. If we can't make
it work reliably by default, we throw it in the dumpster, light it
on fire and walk away.

> Tail latencies are a common concern on many IO workloads.

Yes, for user data operations it's a common concern. For metadata,
not so much - there's so many far worse long tail latencies in
metadata operations (like waiting for journal space) that memory
allocation latencies in the metadata IO path are largely noise....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: convert buffer cache to use high order folios
  2024-01-22  6:45   ` Christoph Hellwig
@ 2024-01-22 11:57     ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2024-01-22 11:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, willy, linux-mm

On Sun, Jan 21, 2024 at 10:45:55PM -0800, Christoph Hellwig wrote:
> > +	int		length = BBTOB(bp->b_length);
> > +	int		order;
> > +
> > +	order = ilog2(length);
> > +	if ((1 << order) < length)
> > +		order = ilog2(length - 1) + 1;
> > +
> > +	if (order <= PAGE_SHIFT)
> > +		order = 0;
> > +	else
> > +		order -= PAGE_SHIFT;
> 
> Shouldn't this simply use get_order()?

Huh. Yes, it should.

I went looking for a helper and didn't find one in the mm or folio
code. Now you point it out, I find that it is in it's own asm header
(include/asm-generic/getorder.h) so it's no wonder I didn't find
it.

Why is it in include/asm-generic? There's nothing asm related
to that function or it's implementation....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-22  6:39     ` Christoph Hellwig
@ 2024-01-22 12:05       ` Dave Chinner
  2024-01-22 13:18         ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2024-01-22 12:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs, willy, linux-mm

On Sun, Jan 21, 2024 at 10:39:12PM -0800, Christoph Hellwig wrote:
> On Thu, Jan 18, 2024 at 05:26:24PM -0800, Darrick J. Wong wrote:
> > Ugh, pointer casting.  I suppose here is where we might want an
> > alloc_folio_bulk_array that might give us successively smaller
> > large-folios until b_page_count is satisfied?  (Maybe that's in the next
> > patch?)
> > 
> > I guess you'd also need a large-folio capable vm_map_ram. 
> 
> We need to just stop using vm_map_ram, there is no reason to do that
> even right now.  It was needed when we used the page cache to back
> pagebuf, but these days just sing vmalloc is the right thing for
> !unmapped buffers that can't use large folios. 

I haven't looked at what using vmalloc means for packing the buffer
into a bio - we currently use bio_add_page(), so does that mean we
have to use some variant of virt_to_page() to break the vmalloc
region up into it's backing pages to feed them to the bio? Or is
there some helper that I'm unaware of that does it all for us
magically?

> And I'm seriously
> wondering if we should bother with unmapped buffers in the long run
> if we end up normally using larger folios or just consolidate down to:
> 
>  - kmalloc for buffers < PAGE_SIZE
>  - folio for buffers >= PAGE_SIZE
>  - vmalloc if allocation a larger folios is not possible

Yeah, that's kind of where I'm going with this. Large folios already
turn off unmapped buffers, and I'd really like to get rid of that
page straddling mess that unmapped buffers require in the buffer
item dirty region tracking. That means we have to get rid of
unmapped buffers....

-Dave.

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-22 12:05       ` Dave Chinner
@ 2024-01-22 13:18         ` Christoph Hellwig
  2024-01-22 21:10           ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-01-22 13:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, willy, linux-mm

On Mon, Jan 22, 2024 at 11:05:20PM +1100, Dave Chinner wrote:
> I haven't looked at what using vmalloc means for packing the buffer
> into a bio - we currently use bio_add_page(), so does that mean we
> have to use some variant of virt_to_page() to break the vmalloc
> region up into it's backing pages to feed them to the bio? Or is
> there some helper that I'm unaware of that does it all for us
> magically?

We have a kmem_to_page helper for chuking any kind of kernel virtual
address space into pages.  xfs_rw_bdev in fs/xfs/xfs_bio_io.c uses
that for a bio, we should probably hav an async version of that
and maybe move it to the block layer instead of duplicating the
logic in various places.

> Yeah, that's kind of where I'm going with this. Large folios already
> turn off unmapped buffers, and I'd really like to get rid of that
> page straddling mess that unmapped buffers require in the buffer
> item dirty region tracking. That means we have to get rid of
> unmapped buffers....

I actually have an old series to kill unmapped buffers and use
vmalloc, but decided I'd need use folios for the fast path instead
of paying the vmalloc overhead.  I can dust it off and you can decide
if you want to pick up parts of it.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Using Folios for XFS metadata
  2024-01-22 11:53   ` Dave Chinner
@ 2024-01-22 13:34     ` Andi Kleen
  2024-01-23  0:19       ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2024-01-22 13:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm

[fixed the subject, not sure what happened there]

FWIW I'm not sure fail-fail is always the right strategy here,
in many cases even with some reclaim, compaction may win. Just not if you're
on a tight budget for the latencies.

> I stress test and measure XFS metadata performance under sustained
> memory pressure all the time. This change has not caused any
> obvious regressions in the short time I've been testing it.

Did you test for tail latencies?

There are some relatively simple ways to trigger memory fragmentation,
the standard way is to allocate a very large THP backed file and then
punch a lot of holes.

> 
> I still need to do perf testing on large directory block sizes. That
> is where high-order allocations will get stressed - that's where
> xlog_kvmalloc() starts dominating the profiles as it trips over
> vmalloc scalability issues...

Yes that's true. vmalloc has many issues, although with the recent
patches to split the rbtrees with separate locks it may now look
quite different than before.

> 
> > I would in any case add a tunable for it in case people run into this.
> 
> No tunables. It either works or it doesn't. If we can't make
> it work reliably by default, we throw it in the dumpster, light it
> on fire and walk away.

I'm not sure there is a single definition of "reliably" here -- for
many workloads tail latencies don't matter, so it's always reliable,
as long as you have good aggregate throughput.

Others have very high expectations for them.

Forcing the high expectations on everyone is probably not a good
general strategy though, as there are general trade offs.

I could see that having lots of small tunables for every use might not be a
good idea. Perhaps there would be a case for a single general tunable
that controls higher order folios for everyone.

> 
> > Tail latencies are a common concern on many IO workloads.
> 
> Yes, for user data operations it's a common concern. For metadata,
> not so much - there's so many far worse long tail latencies in
> metadata operations (like waiting for journal space) that memory
> allocation latencies in the metadata IO path are largely noise....

I've seen pretty long stalls in the past.

The difference to the journal is also that it is local the file system, while
the memory is normally shared with everyone on the node or system. So the
scope of noisy neighbour impact can be quite different, especially on a large
machine. 

-Andi



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-22 13:18         ` Christoph Hellwig
@ 2024-01-22 21:10           ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2024-01-22 21:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs, willy, linux-mm

On Mon, Jan 22, 2024 at 05:18:58AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 22, 2024 at 11:05:20PM +1100, Dave Chinner wrote:
> > I haven't looked at what using vmalloc means for packing the buffer
> > into a bio - we currently use bio_add_page(), so does that mean we
> > have to use some variant of virt_to_page() to break the vmalloc
> > region up into it's backing pages to feed them to the bio? Or is
> > there some helper that I'm unaware of that does it all for us
> > magically?
> 
> We have a kmem_to_page helper for chuking any kind of kernel virtual
> address space into pages.  xfs_rw_bdev in fs/xfs/xfs_bio_io.c uses
> that for a bio, we should probably hav an async version of that
> and maybe move it to the block layer instead of duplicating the
> logic in various places.

Yeah, OK, as I expected. I'd forgotten that we already play that
game with xfs_rw_bdev(). I think we can trivially factor out an
async version and call that from the xfs_buf.c code fo vmalloc()d
ranges, so I think I'll work towards that and actually remove the
bio packing loop from xfs_buf.c altogether.

> > Yeah, that's kind of where I'm going with this. Large folios already
> > turn off unmapped buffers, and I'd really like to get rid of that
> > page straddling mess that unmapped buffers require in the buffer
> > item dirty region tracking. That means we have to get rid of
> > unmapped buffers....
> 
> I actually have an old series to kill unmapped buffers and use
> vmalloc, but decided I'd need use folios for the fast path instead
> of paying the vmalloc overhead.  I can dust it off and you can decide
> if you want to pick up parts of it.

I wouldn't worry about it too much - the rest of it is pretty
straight forward once we know inode cluster buffers are working
correctly with single large folios and we always fall back to
vmalloc().

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Using Folios for XFS metadata
  2024-01-22 13:34     ` Using Folios for XFS metadata Andi Kleen
@ 2024-01-23  0:19       ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2024-01-23  0:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-xfs, linux-mm

On Mon, Jan 22, 2024 at 05:34:12AM -0800, Andi Kleen wrote:
> [fixed the subject, not sure what happened there]
> 
> FWIW I'm not sure fail-fail is always the right strategy here,
> in many cases even with some reclaim, compaction may win. Just not if you're
> on a tight budget for the latencies.
> 
> > I stress test and measure XFS metadata performance under sustained
> > memory pressure all the time. This change has not caused any
> > obvious regressions in the short time I've been testing it.
> 
> Did you test for tail latencies?

No, it's an RFC, and I mostly don't care about memory allocation
tail latencies because it is highly unlikely to be a *new issue* we
need to care about.

The fact is taht we already do so much memory allocation and high
order memory allocation (e.g. through slub, xlog_kvmalloc(), user
data IO through the page cache, etc) that if there's a long tail
latency problem with high order memory allocation then it will
already be noticably affecting the XFS data and metadata IO
latencies.

Nobody is reporting problems about excessive long tail latencies
when using XFS, so my care factor about long tail latencies in this
specific memory allocation case is close to zero.

> There are some relatively simple ways to trigger memory fragmentation,
> the standard way is to allocate a very large THP backed file and then
> punch a lot of holes.

Or just run a filesystem with lots of variable sized high order
allocations and varying cached object life times under sustained
memory pressure for a significant period of time....

> > > I would in any case add a tunable for it in case people run into this.
> > 
> > No tunables. It either works or it doesn't. If we can't make
> > it work reliably by default, we throw it in the dumpster, light it
> > on fire and walk away.
> 
> I'm not sure there is a single definition of "reliably" here -- for
> many workloads tail latencies don't matter, so it's always reliable,
> as long as you have good aggregate throughput.
> 
> Others have very high expectations for them.
> 
> Forcing the high expectations on everyone is probably not a good
> general strategy though, as there are general trade offs.

Yup, and we have to make those tradeoffs because filesystems need to
be good at "general purpose" workloads as their primary focus.
Minimising long tail latencies is really just "best effort only"
because we intimately aware of the fact that there are global
resource limitations that cause long tail latencies in filesystem
implementations that simply cannot be worked around.

> I could see that having lots of small tunables for every use might not be a
> good idea. Perhaps there would be a case for a single general tunable
> that controls higher order folios for everyone.

You're not listening: no tunables. Code that has tunables because
you think it *may not work* is code that is not ready to be merged.

> > > Tail latencies are a common concern on many IO workloads.
> > 
> > Yes, for user data operations it's a common concern. For metadata,
> > not so much - there's so many far worse long tail latencies in
> > metadata operations (like waiting for journal space) that memory
> > allocation latencies in the metadata IO path are largely noise....
> 
> I've seen pretty long stalls in the past.
> 
> The difference to the journal is also that it is local the file system, while
> the memory is normally shared with everyone on the node or system. So the
> scope of noisy neighbour impact can be quite different, especially on a large
> machine. 

Most systems run everything on a single filesystem. That makes them
just as global as memory allocation.  If the journal bottlenecks,
everyone on the system suffers the performance degradation, not just
the user who caused it.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2025-08-12 13:34 Baoquan He
@ 2025-08-12 13:49 ` Baoquan He
  0 siblings, 0 replies; 44+ messages in thread
From: Baoquan He @ 2025-08-12 13:49 UTC (permalink / raw)
  To: linux-mm, christophe.leroy

On 08/12/25 at 09:34pm, Baoquan He wrote:
> alexghiti@rivosinc.com, agordeev@linux.ibm.com, linux@armlinux.org.uk,
> linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
> linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
> x86@kernel.org, chris@zankel.net, jcmvbkbc@gmail.com, linux-um@lists.infradead.org
> Cc: ryabinin.a.a@gmail.com, glider@google.com, andreyknvl@gmail.com,
> 	dvyukov@google.com, vincenzo.frascino@arm.com,
> 	akpm@linux-foundation.org, kasan-dev@googlegroups.com,
> 	linux-kernel@vger.kernel.org, kexec@lists.infradead.org,
> 	sj@kernel.org, lorenzo.stoakes@oracle.com, elver@google.com,
> 	snovitoll@gmail.com
> Bcc: bhe@redhat.com
> Subject: Re: [PATCH v2 00/12] mm/kasan: make kasan=on|off work for all three
>  modes
> Reply-To: 
> In-Reply-To: <20250812124941.69508-1-bhe@redhat.com>
> 
> Forgot adding related ARCH mailing list or people to CC, add them.

Sorry for the noise, I made mistake on mail format when adding people to
CC.

> 
> On 08/12/25 at 08:49pm, Baoquan He wrote:
> > Currently only hw_tags mode of kasan can be enabled or disabled with
> > kernel parameter kasan=on|off for built kernel. For kasan generic and
> > sw_tags mode, there's no way to disable them once kernel is built.
> > This is not convenient sometime, e.g in system kdump is configured.
> > When the 1st kernel has KASAN enabled and crash triggered to switch to
> > kdump kernel, the generic or sw_tags mode will cost much extra memory
> > for kasan shadow while in fact it's meaningless to have kasan in kdump
> > kernel.
> > 
> > So this patchset moves the kasan=on|off out of hw_tags scope and into
> > common code to make it visible in generic and sw_tags mode too. Then we
> > can add kasan=off in kdump kernel to reduce the unneeded meomry cost for
> > kasan.
> > 
> > Changelog:
> > ====
> > v1->v2:
> > - Add __ro_after_init for __ro_after_init, and remove redundant blank
> >   lines in mm/kasan/common.c. Thanks to Marco.
> > - Fix a code bug in <linux/kasan-enabled.h> when CONFIG_KASAN is unset,
> >   this is found out by SeongJae and Lorenzo, and also reported by LKP
> >   report, thanks to them.
> > - Add a missing kasan_enabled() checking in kasan_report(). This will
> >   cause below KASAN report info even though kasan=off is set:
> >      ==================================================================
> >      BUG: KASAN: stack-out-of-bounds in tick_program_event+0x130/0x150
> >      Read of size 4 at addr ffff00005f747778 by task swapper/0/1
> >      
> >      CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0+ #8 PREEMPT(voluntary) 
> >      Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG, BIOS F31n (SCP: 2.10.20220810) 09/30/2022
> >      Call trace:
> >       show_stack+0x30/0x90 (C)
> >       dump_stack_lvl+0x7c/0xa0
> >       print_address_description.constprop.0+0x90/0x310
> >       print_report+0x104/0x1f0
> >       kasan_report+0xc8/0x110
> >       __asan_report_load4_noabort+0x20/0x30
> >       tick_program_event+0x130/0x150
> >       ......snip...
> >      ==================================================================
> > 
> > - Add jump_label_init() calling before kasan_init() in setup_arch() in these
> >   architectures: xtensa, arm. Because they currenly rely on
> >   jump_label_init() in main() which is a little late. Then the early static
> >   key kasan_flag_enabled in kasan_init() won't work.
> > 
> > - In UML architecture, change to enable kasan_flag_enabled in arch_mm_preinit()
> >   because kasan_init() is enabled before main(), there's no chance to operate
> >   on static key in kasan_init().
> > 
> > Test:
> > =====
> > In v1, I took test on x86_64 for generic mode, and on arm64 for
> > generic, sw_tags and hw_tags mode. All of them works well.
> > 
> > In v2, I only tested on arm64 for generic, sw_tags and hw_tags mode, it
> > works. For powerpc, I got a BOOK3S/64 machine, while it says
> > 'KASAN not enabled as it requires radix' and KASAN is disabled. Will
> > look for other POWER machine to test this.
> > ====
> > 
> > Baoquan He (12):
> >   mm/kasan: add conditional checks in functions to return directly if
> >     kasan is disabled
> >   mm/kasan: move kasan= code to common place
> >   mm/kasan/sw_tags: don't initialize kasan if it's disabled
> >   arch/arm: don't initialize kasan if it's disabled
> >   arch/arm64: don't initialize kasan if it's disabled
> >   arch/loongarch: don't initialize kasan if it's disabled
> >   arch/powerpc: don't initialize kasan if it's disabled
> >   arch/riscv: don't initialize kasan if it's disabled
> >   arch/x86: don't initialize kasan if it's disabled
> >   arch/xtensa: don't initialize kasan if it's disabled
> >   arch/um: don't initialize kasan if it's disabled
> >   mm/kasan: make kasan=on|off take effect for all three modes
> > 
> >  arch/arm/kernel/setup.c                |  6 +++++
> >  arch/arm/mm/kasan_init.c               |  6 +++++
> >  arch/arm64/mm/kasan_init.c             |  7 ++++++
> >  arch/loongarch/mm/kasan_init.c         |  5 ++++
> >  arch/powerpc/mm/kasan/init_32.c        |  8 +++++-
> >  arch/powerpc/mm/kasan/init_book3e_64.c |  6 +++++
> >  arch/powerpc/mm/kasan/init_book3s_64.c |  6 +++++
> >  arch/riscv/mm/kasan_init.c             |  6 +++++
> >  arch/um/kernel/mem.c                   |  6 +++++
> >  arch/x86/mm/kasan_init_64.c            |  6 +++++
> >  arch/xtensa/kernel/setup.c             |  1 +
> >  arch/xtensa/mm/kasan_init.c            |  6 +++++
> >  include/linux/kasan-enabled.h          | 18 ++++++-------
> >  mm/kasan/common.c                      | 25 ++++++++++++++++++
> >  mm/kasan/generic.c                     | 20 +++++++++++++--
> >  mm/kasan/hw_tags.c                     | 35 ++------------------------
> >  mm/kasan/init.c                        |  6 +++++
> >  mm/kasan/quarantine.c                  |  3 +++
> >  mm/kasan/report.c                      |  4 ++-
> >  mm/kasan/shadow.c                      | 23 ++++++++++++++++-
> >  mm/kasan/sw_tags.c                     |  9 +++++++
> >  21 files changed, 165 insertions(+), 47 deletions(-)
> > 
> > -- 
> > 2.41.0
> > 
> 
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
@ 2025-02-08  8:19 Director Inspectorate
  0 siblings, 0 replies; 44+ messages in thread
From: Director Inspectorate @ 2025-02-08  8:19 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 706 bytes --]

Möchten Sie der großen Illuminaten-Gesellschaft beitreten, um reich, berühmt und geschützt im Leben zu werden, schließen Sie sich uns noch heute an, um Mitglied zu werden und all Ihre schwierigen Probleme im Leben zu lösen. Bitte kontaktieren Sie die Verwaltung, indem Sie auf "Antworten auf" klicken, um weitere Informationen darüber zu erhalten, wie Sie dieser mächtigen Gesellschaft beitreten können.


Do you want to join the Great Illuminati society to become rich, famous and protected in life, join us today to become a member and solve all your difficult problems in life. Please contact the administration by clicking on "reply to" for more information on how to join this powerful society.


[-- Attachment #2: Type: text/html, Size: 3159 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2024-03-07 14:05                     ` Re: Matthew Wilcox
  2024-03-07 15:24                       ` Re: Ryan Roberts
@ 2024-03-08  1:06                       ` Yin, Fengwei
  1 sibling, 0 replies; 44+ messages in thread
From: Yin, Fengwei @ 2024-03-08  1:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ryan Roberts, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying



On 3/7/2024 10:05 PM, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
>>
>>
>> On 3/7/2024 4:56 PM,  wrote:
>>> I just want to make sure I've understood correctly: CPU1's folio_put()
>>> is not the last reference, and it keeps iterating through the local
>>> list. Then CPU2 does the final folio_put() which causes list_del_init()
>>> to modify the local list concurrently with CPU1's iteration, so CPU1
>>> probably goes into the weeds?
>>
>> My understanding is this can not corrupt the folio->deferred_list as
>> this folio was iterated already.
> 
> I am not convinced about that at all.  It's possible this isn't the only
> problem, but deleting something from a list without holding (the correct)
> lock is something you have to think incredibly hard about to get right.
> I didn't bother going any deeper into the analysis once I spotted the
> locking problem, but the proof is very much on you that this is not a bug!
Removing folio from deferred_list in folio_put() also needs require
split_queue_lock. So my understanding is no deleting without hold
correct lock. local list iteration is impacted. But that's not the issue
Ryan hit here.

> 
>> But I did see other strange thing:
>> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
>> index:0xffffbd0a0 pfn:0x2554a0
>> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
>> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>>
>> This large folio has order 0? Maybe folio->_flags_1 was screwed?
>>
>> In free_unref_folios(), there is code like following:
>>                  if (order > 0 && folio_test_large_rmappable(folio))
>>                          folio_undo_large_rmappable(folio);
>>
>> But with destroy_large_folio():
>>          if (folio_test_large_rmappable(folio))
>>
>> 			folio_undo_large_rmappable(folio);
>>
>> Can it connect to the folio has zero refcount still in deferred list
>> with Matthew's patch?
>>
>>
>> Looks like folio order was cleared unexpected somewhere.
> 
> No, we intentionally clear it:
> 
> free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
> page[1].flags &= ~PAGE_FLAGS_SECOND;
> 
> PAGE_FLAGS_SECOND includes the order, which is why we have to save it
> away in folio->private so that we know what it is in the second loop.
> So it's always been cleared by the time we call free_page_is_bad().
Oh. That's the key. Thanks a lot for detail explanation.

I thought there was a bug in other place, covered by
destroy_large_folio() but exposed by free_unref_folios()...


Regards
Yin, Fengwei


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2024-03-07 16:24                         ` Re: Ryan Roberts
@ 2024-03-07 23:02                           ` Matthew Wilcox
  0 siblings, 0 replies; 44+ messages in thread
From: Matthew Wilcox @ 2024-03-07 23:02 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Yin, Fengwei, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Thu, Mar 07, 2024 at 04:24:43PM +0000, Ryan Roberts wrote:
> > But if I run only with the deferred split fix and DO NOT revert the other
> > change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU
> > stalls where I can't even interact on the serial port. Sometimes (more usually)
> > everything just gets stuck trying to reclaim and allocate memory. And when I
> > kill the jobs, I still have barely any memory in the system - about 10% what I
> > would expect.

(for the benefit of anyone trying to follow along, this is now
understood; it was my missing folio_put() in the 'folio_trylock failed'
path)

> I notice that before the commit, large folios are uncharged with
> __mem_cgroup_uncharge() and now they use __mem_cgroup_uncharge_folios().
> 
> The former has an upfront check:
> 
> 	if (!folio_memcg(folio))
> 		return;
> 
> I'm not exactly sure what that's checking but could the fact this is missing
> after the change cause things to go wonky?

Honestly, I think that's stale.  uncharge_folio() checks the same
thing very early on, so all it's actually saving is a test of the LRU
flag.

Looks like the need for it went away in 2017 with commit a9d5adeeb4b2c73c
which stopped using page->lru to gather the single page onto a
degenerate list.  I'll try to remember to submit a patch to delete
that check.

By the way, something we could try to see if the problem goes away is to
re-narrow the window that i widened.  ie something like this:

+++ b/mm/swap.c
@@ -1012,6 +1012,8 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
                        free_huge_folio(folio);
                        continue;
                }
+               if (folio_test_large(folio) && folio_test_large_rmappable(folio))
+                       folio_undo_large_rmappable(folio);

                __page_cache_release(folio, &lruvec, &flags);




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2024-03-07 15:24                       ` Re: Ryan Roberts
@ 2024-03-07 16:24                         ` Ryan Roberts
  2024-03-07 23:02                           ` Re: Matthew Wilcox
  0 siblings, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2024-03-07 16:24 UTC (permalink / raw)
  To: Matthew Wilcox, Yin, Fengwei
  Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 07/03/2024 15:24, Ryan Roberts wrote:
> On 07/03/2024 14:05, Matthew Wilcox wrote:
>> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/7/2024 4:56 PM,  wrote:
>>>> I just want to make sure I've understood correctly: CPU1's folio_put()
>>>> is not the last reference, and it keeps iterating through the local
>>>> list. Then CPU2 does the final folio_put() which causes list_del_init()
>>>> to modify the local list concurrently with CPU1's iteration, so CPU1
>>>> probably goes into the weeds?
>>>
>>> My understanding is this can not corrupt the folio->deferred_list as
>>> this folio was iterated already.
>>
>> I am not convinced about that at all.  It's possible this isn't the only
>> problem, but deleting something from a list without holding (the correct)
>> lock is something you have to think incredibly hard about to get right.
>> I didn't bother going any deeper into the analysis once I spotted the
>> locking problem, but the proof is very much on you that this is not a bug!
>>
>>> But I did see other strange thing:
>>> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
>>> index:0xffffbd0a0 pfn:0x2554a0
>>> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
>>> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>>>
>>> This large folio has order 0? Maybe folio->_flags_1 was screwed?
>>>
>>> In free_unref_folios(), there is code like following:
>>>                 if (order > 0 && folio_test_large_rmappable(folio))
>>>                         folio_undo_large_rmappable(folio);
>>>
>>> But with destroy_large_folio():
>>>         if (folio_test_large_rmappable(folio))
>>>
>>> 			folio_undo_large_rmappable(folio);
>>>
>>> Can it connect to the folio has zero refcount still in deferred list
>>> with Matthew's patch?
>>>
>>>
>>> Looks like folio order was cleared unexpected somewhere.
> 
> I think there could be something to this...
> 
> I have a set up where, when running with Matthew's deferred split fix AND have
> commit 31b2ff82aefb "mm: handle large folios in free_unref_folios()" REVERTED,
> everything works as expected. And at the end, I have the expected amount of
> memory free (seen in meminfo and buddyinfo).
> 
> But if I run only with the deferred split fix and DO NOT revert the other
> change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU
> stalls where I can't even interact on the serial port. Sometimes (more usually)
> everything just gets stuck trying to reclaim and allocate memory. And when I
> kill the jobs, I still have barely any memory in the system - about 10% what I
> would expect.
> 
> So is it possible that after commit 31b2ff82aefb "mm: handle large folios in
> free_unref_folios()", when freeing 2M folio back to the buddy, we are actually
> only telling it about the first 4K page? So we end up leaking the rest?

I notice that before the commit, large folios are uncharged with
__mem_cgroup_uncharge() and now they use __mem_cgroup_uncharge_folios().

The former has an upfront check:

	if (!folio_memcg(folio))
		return;

I'm not exactly sure what that's checking but could the fact this is missing
after the change cause things to go wonky?


> 
>>
>> No, we intentionally clear it:
>>
>> free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
>> page[1].flags &= ~PAGE_FLAGS_SECOND;
>>
>> PAGE_FLAGS_SECOND includes the order, which is why we have to save it
>> away in folio->private so that we know what it is in the second loop.
>> So it's always been cleared by the time we call free_page_is_bad().
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2024-03-07 14:05                     ` Re: Matthew Wilcox
@ 2024-03-07 15:24                       ` Ryan Roberts
  2024-03-07 16:24                         ` Re: Ryan Roberts
  2024-03-08  1:06                       ` Re: Yin, Fengwei
  1 sibling, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2024-03-07 15:24 UTC (permalink / raw)
  To: Matthew Wilcox, Yin, Fengwei
  Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 07/03/2024 14:05, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
>>
>>
>> On 3/7/2024 4:56 PM,  wrote:
>>> I just want to make sure I've understood correctly: CPU1's folio_put()
>>> is not the last reference, and it keeps iterating through the local
>>> list. Then CPU2 does the final folio_put() which causes list_del_init()
>>> to modify the local list concurrently with CPU1's iteration, so CPU1
>>> probably goes into the weeds?
>>
>> My understanding is this can not corrupt the folio->deferred_list as
>> this folio was iterated already.
> 
> I am not convinced about that at all.  It's possible this isn't the only
> problem, but deleting something from a list without holding (the correct)
> lock is something you have to think incredibly hard about to get right.
> I didn't bother going any deeper into the analysis once I spotted the
> locking problem, but the proof is very much on you that this is not a bug!
> 
>> But I did see other strange thing:
>> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
>> index:0xffffbd0a0 pfn:0x2554a0
>> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
>> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>>
>> This large folio has order 0? Maybe folio->_flags_1 was screwed?
>>
>> In free_unref_folios(), there is code like following:
>>                 if (order > 0 && folio_test_large_rmappable(folio))
>>                         folio_undo_large_rmappable(folio);
>>
>> But with destroy_large_folio():
>>         if (folio_test_large_rmappable(folio))
>>
>> 			folio_undo_large_rmappable(folio);
>>
>> Can it connect to the folio has zero refcount still in deferred list
>> with Matthew's patch?
>>
>>
>> Looks like folio order was cleared unexpected somewhere.

I think there could be something to this...

I have a set up where, when running with Matthew's deferred split fix AND have
commit 31b2ff82aefb "mm: handle large folios in free_unref_folios()" REVERTED,
everything works as expected. And at the end, I have the expected amount of
memory free (seen in meminfo and buddyinfo).

But if I run only with the deferred split fix and DO NOT revert the other
change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU
stalls where I can't even interact on the serial port. Sometimes (more usually)
everything just gets stuck trying to reclaim and allocate memory. And when I
kill the jobs, I still have barely any memory in the system - about 10% what I
would expect.

So is it possible that after commit 31b2ff82aefb "mm: handle large folios in
free_unref_folios()", when freeing 2M folio back to the buddy, we are actually
only telling it about the first 4K page? So we end up leaking the rest?

> 
> No, we intentionally clear it:
> 
> free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
> page[1].flags &= ~PAGE_FLAGS_SECOND;
> 
> PAGE_FLAGS_SECOND includes the order, which is why we have to save it
> away in folio->private so that we know what it is in the second loop.
> So it's always been cleared by the time we call free_page_is_bad().



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2024-03-07 13:50                   ` Yin, Fengwei
@ 2024-03-07 14:05                     ` Matthew Wilcox
  2024-03-07 15:24                       ` Re: Ryan Roberts
  2024-03-08  1:06                       ` Re: Yin, Fengwei
  0 siblings, 2 replies; 44+ messages in thread
From: Matthew Wilcox @ 2024-03-07 14:05 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: Ryan Roberts, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
> 
> 
> On 3/7/2024 4:56 PM,  wrote:
> > I just want to make sure I've understood correctly: CPU1's folio_put()
> > is not the last reference, and it keeps iterating through the local
> > list. Then CPU2 does the final folio_put() which causes list_del_init()
> > to modify the local list concurrently with CPU1's iteration, so CPU1
> > probably goes into the weeds?
> 
> My understanding is this can not corrupt the folio->deferred_list as
> this folio was iterated already.

I am not convinced about that at all.  It's possible this isn't the only
problem, but deleting something from a list without holding (the correct)
lock is something you have to think incredibly hard about to get right.
I didn't bother going any deeper into the analysis once I spotted the
locking problem, but the proof is very much on you that this is not a bug!

> But I did see other strange thing:
> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
> index:0xffffbd0a0 pfn:0x2554a0
> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
> 
> This large folio has order 0? Maybe folio->_flags_1 was screwed?
> 
> In free_unref_folios(), there is code like following:
>                 if (order > 0 && folio_test_large_rmappable(folio))
>                         folio_undo_large_rmappable(folio);
> 
> But with destroy_large_folio():
>         if (folio_test_large_rmappable(folio))
> 
> 			folio_undo_large_rmappable(folio);
> 
> Can it connect to the folio has zero refcount still in deferred list
> with Matthew's patch?
> 
> 
> Looks like folio order was cleared unexpected somewhere.

No, we intentionally clear it:

free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
page[1].flags &= ~PAGE_FLAGS_SECOND;

PAGE_FLAGS_SECOND includes the order, which is why we have to save it
away in folio->private so that we know what it is in the second loop.
So it's always been cleared by the time we call free_page_is_bad().


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2024-03-07  8:56                 ` Ryan Roberts
@ 2024-03-07 13:50                   ` Yin, Fengwei
  2024-03-07 14:05                     ` Re: Matthew Wilcox
  0 siblings, 1 reply; 44+ messages in thread
From: Yin, Fengwei @ 2024-03-07 13:50 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox, Zi Yan
  Cc: Andrew Morton, linux-mm, Yang Shi, Huang Ying



On 3/7/2024 4:56 PM,  wrote:
> I just want to make sure I've understood correctly: CPU1's folio_put()
> is not the last reference, and it keeps iterating through the local
> list. Then CPU2 does the final folio_put() which causes list_del_init()
> to modify the local list concurrently with CPU1's iteration, so CPU1
> probably goes into the weeds?

My understanding is this can not corrupt the folio->deferred_list as
this folio was iterated already.


But I did see other strange thing:
[   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000 
index:0xffffbd0a0 pfn:0x2554a0
[   76.270483] note: kcompactd0[62] exited with preempt_count 1
[   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0

This large folio has order 0? Maybe folio->_flags_1 was screwed?

In free_unref_folios(), there is code like following:
                 if (order > 0 && folio_test_large_rmappable(folio))
                         folio_undo_large_rmappable(folio);

But with destroy_large_folio():
         if (folio_test_large_rmappable(folio)) 

			folio_undo_large_rmappable(folio);

Can it connect to the folio has zero refcount still in deferred list
with Matthew's patch?


Looks like folio order was cleared unexpected somewhere.

Regards
Yin, Fengwei



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2023-05-11 12:58 Ryan Roberts
@ 2023-05-11 13:13 ` Ryan Roberts
  0 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-05-11 13:13 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, SeongJae Park
  Cc: linux-kernel, linux-mm, damon

My appologies for the noise: A blank line between Cc and Subject has broken the
subject and grouping in lore.

Please Ignore this, I will resend.


On 11/05/2023 13:58, Ryan Roberts wrote:
> Date: Thu, 11 May 2023 11:38:28 +0100
> Subject: [PATCH v1 0/5] Encapsulate PTE contents from non-arch code
> 
> Hi All,
> 
> This series improves the encapsulation of pte entries by disallowing non-arch
> code from directly dereferencing pte_t pointers. Instead code must use a new
> helper, `pte_t ptep_deref(pte_t *ptep)`. By default, this helper does a direct
> dereference of the pointer, so generated code should be exactly the same. But
> it's presence sets us up for arch code being able to override the default to
> "virtualize" the ptes without needing to maintain a shadow table.
> 
> I intend to take advantage of this for arm64 to enable use of its "contiguous
> bit" to coalesce multiple ptes into a single tlb entry, reducing pressure and
> improving performance. I have an RFC for the first part of this work at [1]. The
> cover letter there also explains the second part, which this series is enabling.
> 
> I intend to post an RFC for the contpte changes in due course, but it would be
> good to get the ball rolling on this enabler.
> 
> There are 2 reasons that I need the encapsulation:
> 
>   - Prevent leaking the arch-private PTE_CONT bit to the core code. If the core
>     code reads a pte that contains this bit, it could end up calling
>     set_pte_at() with the bit set which would confuse the implementation. So we
>     can always clear PTE_CONT in ptep_deref() (and ptep_get()) to avoid a leaky
>     abstraction.
>   - Contiguous ptes have a single access and dirty bit for the contiguous range.
>     So we need to "mix-in" those bits when the core is dereferencing a pte that
>     lies in the contig range. There is code that dereferences the pte then takes
>     different actions based on access/dirty (see e.g. write_protect_page()).
> 
> While ptep_get() and ptep_get_lockless() already exist, both of them are
> implemented using READ_ONCE() by default. While we could use ptep_get() instead
> of the new ptep_deref(), I didn't want to risk performance regression.
> Alternatively, all call sites that currently use ptep_get() that need the
> lockless behaviour could be upgraded to ptep_get_lockless() and ptep_get() could
> be downgraded to a simple dereference. That would be cleanest, but is a much
> bigger (and likely error prone) change because all the arch code would need to
> be updated for the new definitions of ptep_get().
> 
> The series is split up as follows:
> 
> patchs 1-2: Fix bugs where code was _setting_ ptes directly, rather than using
>             set_pte_at() and friends.
> patch 3:    Fix highmem unmapping issue I spotted while doing the work.
> patch 4:    Introduce the new ptep_deref() helper with default implementation.
> patch 5:    Convert all direct dereferences to use ptep_deref().
> 
> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
> 
> Thanks,
> Ryan
> 
> 
> Ryan Roberts (5):
>   mm: vmalloc must set pte via arch code
>   mm: damon must atomically clear young on ptes and pmds
>   mm: Fix failure to unmap pte on highmem systems
>   mm: Add new ptep_deref() helper to fully encapsulate pte_t
>   mm: ptep_deref() conversion
> 
>  .../drm/i915/gem/selftests/i915_gem_mman.c    |   8 +-
>  drivers/misc/sgi-gru/grufault.c               |   2 +-
>  drivers/vfio/vfio_iommu_type1.c               |   7 +-
>  drivers/xen/privcmd.c                         |   2 +-
>  fs/proc/task_mmu.c                            |  33 +++---
>  fs/userfaultfd.c                              |   6 +-
>  include/linux/hugetlb.h                       |   2 +-
>  include/linux/mm_inline.h                     |   2 +-
>  include/linux/pgtable.h                       |  13 ++-
>  kernel/events/uprobes.c                       |   2 +-
>  mm/damon/ops-common.c                         |  18 ++-
>  mm/damon/ops-common.h                         |   4 +-
>  mm/damon/paddr.c                              |   6 +-
>  mm/damon/vaddr.c                              |  14 ++-
>  mm/filemap.c                                  |   2 +-
>  mm/gup.c                                      |  21 ++--
>  mm/highmem.c                                  |  12 +-
>  mm/hmm.c                                      |   2 +-
>  mm/huge_memory.c                              |   4 +-
>  mm/hugetlb.c                                  |   2 +-
>  mm/hugetlb_vmemmap.c                          |   6 +-
>  mm/kasan/init.c                               |   9 +-
>  mm/kasan/shadow.c                             |  10 +-
>  mm/khugepaged.c                               |  24 ++--
>  mm/ksm.c                                      |  22 ++--
>  mm/madvise.c                                  |   6 +-
>  mm/mapping_dirty_helpers.c                    |   4 +-
>  mm/memcontrol.c                               |   4 +-
>  mm/memory-failure.c                           |   6 +-
>  mm/memory.c                                   | 103 +++++++++---------
>  mm/mempolicy.c                                |   6 +-
>  mm/migrate.c                                  |  14 ++-
>  mm/migrate_device.c                           |  14 ++-
>  mm/mincore.c                                  |   2 +-
>  mm/mlock.c                                    |   6 +-
>  mm/mprotect.c                                 |   8 +-
>  mm/mremap.c                                   |   2 +-
>  mm/page_table_check.c                         |   4 +-
>  mm/page_vma_mapped.c                          |  26 +++--
>  mm/pgtable-generic.c                          |   2 +-
>  mm/rmap.c                                     |  32 +++---
>  mm/sparse-vmemmap.c                           |   8 +-
>  mm/swap_state.c                               |   4 +-
>  mm/swapfile.c                                 |  16 +--
>  mm/userfaultfd.c                              |   4 +-
>  mm/vmalloc.c                                  |  11 +-
>  mm/vmscan.c                                   |  14 ++-
>  virt/kvm/kvm_main.c                           |   9 +-
>  48 files changed, 302 insertions(+), 236 deletions(-)
> 
> --
> 2.25.1
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2022-08-31 21:47 ` Yang Shi
@ 2022-09-01  0:24   ` Zach O'Keefe
  0 siblings, 0 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-09-01  0:24 UTC (permalink / raw)
  To: Yang Shi
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Wed, Aug 31, 2022 at 2:47 PM Yang Shi <shy828301@gmail.com> wrote:
>
> Hi Zach,
>
> I did a quick look at the series, basically no show stopper to me. But
> I didn't find time to review them thoroughly yet, quite busy on
> something else. Just a heads up, I didn't mean to ignore you. I will
> review them when I find some time.
>
> Thanks,
> Yang

Hey Yang,

Thanks for taking the time to look through, and thanks for giving me a
heads up, and no rush!

In the last day or so, while porting this series around, I encountered
some subtle edge cases I wanted to clean up / address - so it's good
you didn't do a thorough review yet. I was *hoping* to have a v3 out
last night (which evidently did not happen) and it does not seem like
it will happen today, so I'll leave this message as a request for
reviewers to hold off on a thorough review until v3.

Thanks for your time as always,
Zach


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2022-08-26 22:03 Zach O'Keefe
@ 2022-08-31 21:47 ` Yang Shi
  2022-09-01  0:24   ` Re: Zach O'Keefe
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2022-08-31 21:47 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

Hi Zach,

I did a quick look at the series, basically no show stopper to me. But
I didn't find time to review them thoroughly yet, quite busy on
something else. Just a heads up, I didn't mean to ignore you. I will
review them when I find some time.

Thanks,
Yang

On Fri, Aug 26, 2022 at 3:03 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Subject: [PATCH mm-unstable v2 0/9] mm: add file/shmem support to MADV_COLLAPSE
>
> v2 Forward
>
> Mostly a RESEND: rebase on latest mm-unstable + minor bug fixes from
> kernel test robot.
> --------------------------------
>
> This series builds on top of the previous "mm: userspace hugepage collapse"
> series which introduced the MADV_COLLAPSE madvise mode and added support
> for private, anonymous mappings[1], by adding support for file and shmem
> backed memory to CONFIG_READ_ONLY_THP_FOR_FS=y kernels.
>
> File and shmem support have been added with effort to align with existing
> MADV_COLLAPSE semantics and policy decisions[2].  Collapse of shmem-backed
> memory ignores kernel-guiding directives and heuristics including all
> sysfs settings (transparent_hugepage/shmem_enabled), and tmpfs huge= mount
> options (shmem always supports large folios).  Like anonymous mappings, on
> successful return of MADV_COLLAPSE on file/shmem memory, the contents of
> memory mapped by the addresses provided will be synchronously pmd-mapped
> THPs.
>
> This functionality unlocks two important uses:
>
> (1)     Immediately back executable text by THPs.  Current support provided
>         by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
>         system which might impair services from serving at their full rated
>         load after (re)starting.  Tricks like mremap(2)'ing text onto
>         anonymous memory to immediately realize iTLB performance prevents
>         page sharing and demand paging, both of which increase steady state
>         memory footprint.  Now, we can have the best of both worlds: Peak
>         upfront performance and lower RAM footprints.
>
> (2)     userfaultfd-based live migration of virtual machines satisfy UFFD
>         faults by fetching native-sized pages over the network (to avoid
>         latency of transferring an entire hugepage).  However, after guest
>         memory has been fully copied to the new host, MADV_COLLAPSE can
>         be used to immediately increase guest performance.
>
> khugepaged has received a small improvement by association and can now
> detect and collapse pte-mapped THPs.  However, there is still work to be
> done along the file collapse path.  Compound pages of arbitrary order still
> needs to be supported and THP collapse needs to be converted to using
> folios in general.  Eventually, we'd like to move away from the read-only
> and executable-mapped constraints currently imposed on eligible files and
> support any inode claiming huge folio support.  That said, I think the
> series as-is covers enough to claim that MADV_COLLAPSE supports file/shmem
> memory.
>
> Patches 1-3     Implement the guts of the series.
> Patch 4         Is a tracepoint for debugging.
> Patches 5-8     Refactor existing khugepaged selftests to work with new
>                 memory types.
> Patch 9         Adds a userfaultfd selftest mode to mimic a functional test
>                 of UFFDIO_REGISTER_MODE_MINOR+MADV_COLLAPSE live migration.
>
> Applies against mm-unstable.
>
> [1] https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/
> [2] https://lore.kernel.org/linux-mm/YtBmhaiPHUTkJml8@google.com/
>
> v1 -> v2:
> - Add missing definition for khugepaged_add_pte_mapped_thp() in
>   !CONFIG_SHEM builds, in "mm/khugepaged: attempt to map
>   file/shmem-backed pte-mapped THPs by pmds"
> - Minor bugfixes in "mm/madvise: add file and shmem support to
>   MADV_COLLAPSE" for !CONFIG_SHMEM, !CONFIG_TRANSPARENT_HUGEPAGE and some
>   compiler settings.
> - Rebased on latest mm-unstable
>
> Zach O'Keefe (9):
>   mm/shmem: add flag to enforce shmem THP in hugepage_vma_check()
>   mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by
>     pmds
>   mm/madvise: add file and shmem support to MADV_COLLAPSE
>   mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
>   selftests/vm: dedup THP helpers
>   selftests/vm: modularize thp collapse memory operations
>   selftests/vm: add thp collapse file and tmpfs testing
>   selftests/vm: add thp collapse shmem testing
>   selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
>
>  include/linux/khugepaged.h                    |  13 +-
>  include/linux/shmem_fs.h                      |  10 +-
>  include/trace/events/huge_memory.h            |  36 +
>  kernel/events/uprobes.c                       |   2 +-
>  mm/huge_memory.c                              |   2 +-
>  mm/khugepaged.c                               | 289 ++++--
>  mm/shmem.c                                    |  18 +-
>  tools/testing/selftests/vm/Makefile           |   2 +
>  tools/testing/selftests/vm/khugepaged.c       | 828 ++++++++++++------
>  tools/testing/selftests/vm/soft-dirty.c       |   2 +-
>  .../selftests/vm/split_huge_page_test.c       |  12 +-
>  tools/testing/selftests/vm/userfaultfd.c      | 171 +++-
>  tools/testing/selftests/vm/vm_util.c          |  36 +-
>  tools/testing/selftests/vm/vm_util.h          |   5 +-
>  14 files changed, 1040 insertions(+), 386 deletions(-)
>
> --
> 2.37.2.672.g94769d06f0-goog
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2022-04-21 23:17   ` Re: Yury Norov
@ 2022-04-21 23:21     ` John Hubbard
  0 siblings, 0 replies; 44+ messages in thread
From: John Hubbard @ 2022-04-21 23:21 UTC (permalink / raw)
  To: Yury Norov; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel

On 4/21/22 16:17, Yury Norov wrote:
>> Let's add a Cc: line for Michan as well:
>>
>> Cc: Minchan Kim <minchan@kernel.org>
>   
> He's in CC already, I think...
>   

Here, I am talking about attribution in the commit log, as opposed
to the email Cc. In other words, I'm suggesting that you literally
add this line to the commit description:

Cc: Minchan Kim <minchan@kernel.org>


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2022-04-21 23:04 ` Re: John Hubbard
  2022-04-21 23:09   ` Re: John Hubbard
@ 2022-04-21 23:17   ` Yury Norov
  2022-04-21 23:21     ` Re: John Hubbard
  1 sibling, 1 reply; 44+ messages in thread
From: Yury Norov @ 2022-04-21 23:17 UTC (permalink / raw)
  To: John Hubbard; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel

On Thu, Apr 21, 2022 at 04:04:44PM -0700, John Hubbard wrote:
> On 4/21/22 09:41, Yury Norov wrote:
> > Subject: [PATCH] mm/gup: fix comments to pin_user_pages_*()
> > 
> 
> Hi Yuri,
> 
> Thanks for picking this up. I have been distracted and didn't trust
> myself to focus on this properly, so it's good to have help!
> 
> IT/admin point: somehow the first line of the commit description didn't
> make it into an actual email subject. The subject line was blank when it
> arrived in my inbox, and the subject is in the body here instead. Not
> sure how that happened.
> 
> Maybe check your git-sendemail setup?
 
git-sendmail is OK. I just accidentally added empty line above Subject,
which broke format. My bad, sorry for this.
 
> > pin_user_pages API forces FOLL_PIN in gup_flags, which means that the
> > API requires struct page **pages to be provided (not NULL). However,
> > the comment to pin_user_pages() says:
> > 
> >      * @pages:      array that receives pointers to the pages pinned.
> >      *              Should be at least nr_pages long. Or NULL, if caller
> >      *              only intends to ensure the pages are faulted in.
> > 
> > This patch fixes comments along the pin_user_pages code, and also adds
> > WARN_ON(!pages), so that API users will have better understanding
> > on how to use it.
> 
> No need to quote the code in the commit log. Instead, just summarize.
> For example:
> 
> pin_user_pages API forces FOLL_PIN in gup_flags, which means that the
> API requires struct page **pages to be provided (not NULL). However, the
> comment to pin_user_pages() clearly allows for passing in a NULL @pages
> argument.
> 
> Remove the incorrect comments, and add WARN_ON_ONCE(!pages) calls to
> enforce the API.
> 
> > 
> > It has been independently spotted by Minchan Kim and confirmed with
> > John Hubbard:
> > 
> > https://lore.kernel.org/all/YgWA0ghrrzHONehH@google.com/
> 
> Let's add a Cc: line for Michan as well:
> 
> Cc: Minchan Kim <minchan@kernel.org>
 
He's in CC already, I think...
 
> > Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
> > ---
> >   mm/gup.c | 26 ++++++++++++++++++++++----
> >   1 file changed, 22 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index f598a037eb04..559626457585 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2871,6 +2871,10 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
> >   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >   		return -EINVAL;
> > +	/* FOLL_PIN requires pages != NULL */
> 
> Please delete each and every one of these one-line comments, because
> they merely echo what the code says.

Sure.
 
> > +	if (WARN_ON_ONCE(!pages))
> > +		return -EINVAL;
> > +
> >   	gup_flags |= FOLL_PIN;
> >   	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
> >   }
> > @@ -2893,6 +2897,10 @@ int pin_user_pages_fast_only(unsigned long start, int nr_pages,
> >   	 */
> >   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >   		return 0;
> > +
> > +	/* FOLL_PIN requires pages != NULL */
> > +	if (WARN_ON_ONCE(!pages))
> > +		return 0;
> >   	/*
> >   	 * FOLL_FAST_ONLY is required in order to match the API description of
> >   	 * this routine: no fall back to regular ("slow") GUP.
> > @@ -2920,8 +2928,7 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast_only);
> >    * @nr_pages:	number of pages from start to pin
> >    * @gup_flags:	flags modifying lookup behaviour
> >    * @pages:	array that receives pointers to the pages pinned.
> > - *		Should be at least nr_pages long. Or NULL, if caller
> > - *		only intends to ensure the pages are faulted in.
> > + *		Should be at least nr_pages long.
> >    * @vmas:	array of pointers to vmas corresponding to each page.
> >    *		Or NULL if the caller does not require them.
> >    * @locked:	pointer to lock flag indicating whether lock is held and
> > @@ -2944,6 +2951,10 @@ long pin_user_pages_remote(struct mm_struct *mm,
> >   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >   		return -EINVAL;
> > +	/* FOLL_PIN requires pages != NULL */
> > +	if (WARN_ON_ONCE(!pages))
> > +		return -EINVAL;
> > +
> >   	gup_flags |= FOLL_PIN;
> >   	return __get_user_pages_remote(mm, start, nr_pages, gup_flags,
> >   				       pages, vmas, locked);
> > @@ -2957,8 +2968,7 @@ EXPORT_SYMBOL(pin_user_pages_remote);
> >    * @nr_pages:	number of pages from start to pin
> >    * @gup_flags:	flags modifying lookup behaviour
> >    * @pages:	array that receives pointers to the pages pinned.
> > - *		Should be at least nr_pages long. Or NULL, if caller
> > - *		only intends to ensure the pages are faulted in.
> > + *		Should be at least nr_pages long.
> >    * @vmas:	array of pointers to vmas corresponding to each page.
> >    *		Or NULL if the caller does not require them.
> >    *
> > @@ -2976,6 +2986,10 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
> >   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >   		return -EINVAL;
> > +	/* FOLL_PIN requires pages != NULL */
> > +	if (WARN_ON_ONCE(!pages))
> > +		return -EINVAL;
> > +
> >   	gup_flags |= FOLL_PIN;
> >   	return __gup_longterm_locked(current->mm, start, nr_pages,
> >   				     pages, vmas, gup_flags);
> > @@ -2994,6 +3008,10 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> >   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >   		return -EINVAL;
> > +	/* FOLL_PIN requires pages != NULL */
> > +	if (WARN_ON_ONCE(!pages))
> > +		return -EINVAL;
> > +
> >   	gup_flags |= FOLL_PIN;
> >   	return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
> >   }
> 
> I hope we don't break any callers with the newly enforced !pages, but it's
> the right thing to do, in order to avoid misunderstandings.
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

Let me test v2 and resend shortly.

Thanks,
Yury


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2022-04-21 23:04 ` Re: John Hubbard
@ 2022-04-21 23:09   ` John Hubbard
  2022-04-21 23:17   ` Re: Yury Norov
  1 sibling, 0 replies; 44+ messages in thread
From: John Hubbard @ 2022-04-21 23:09 UTC (permalink / raw)
  To: Yury Norov, Andrew Morton, Minchan Kim, linux-mm, linux-kernel

On 4/21/22 16:04, John Hubbard wrote:
> On 4/21/22 09:41, Yury Norov wrote:
>> Subject: [PATCH] mm/gup: fix comments to pin_user_pages_*()
>>
> 
> Hi Yuri,

...and I see that I have typo'd both Yury's and Minchan's name (further
down), in the same email!

Really apologize for screwing that up. It's Yury-with-a-"y", I know. :)


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
       [not found] <20220421164138.1250943-1-yury.norov@gmail.com>
@ 2022-04-21 23:04 ` John Hubbard
  2022-04-21 23:09   ` Re: John Hubbard
  2022-04-21 23:17   ` Re: Yury Norov
  0 siblings, 2 replies; 44+ messages in thread
From: John Hubbard @ 2022-04-21 23:04 UTC (permalink / raw)
  To: Yury Norov, Andrew Morton, Minchan Kim, linux-mm, linux-kernel

On 4/21/22 09:41, Yury Norov wrote:
> Subject: [PATCH] mm/gup: fix comments to pin_user_pages_*()
> 

Hi Yuri,

Thanks for picking this up. I have been distracted and didn't trust
myself to focus on this properly, so it's good to have help!

IT/admin point: somehow the first line of the commit description didn't
make it into an actual email subject. The subject line was blank when it
arrived in my inbox, and the subject is in the body here instead. Not
sure how that happened.

Maybe check your git-sendemail setup?


> pin_user_pages API forces FOLL_PIN in gup_flags, which means that the
> API requires struct page **pages to be provided (not NULL). However,
> the comment to pin_user_pages() says:
> 
>      * @pages:      array that receives pointers to the pages pinned.
>      *              Should be at least nr_pages long. Or NULL, if caller
>      *              only intends to ensure the pages are faulted in.
> 
> This patch fixes comments along the pin_user_pages code, and also adds
> WARN_ON(!pages), so that API users will have better understanding
> on how to use it.

No need to quote the code in the commit log. Instead, just summarize.
For example:

pin_user_pages API forces FOLL_PIN in gup_flags, which means that the
API requires struct page **pages to be provided (not NULL). However, the
comment to pin_user_pages() clearly allows for passing in a NULL @pages
argument.

Remove the incorrect comments, and add WARN_ON_ONCE(!pages) calls to
enforce the API.

> 
> It has been independently spotted by Minchan Kim and confirmed with
> John Hubbard:
> 
> https://lore.kernel.org/all/YgWA0ghrrzHONehH@google.com/

Let's add a Cc: line for Michan as well:

Cc: Minchan Kim <minchan@kernel.org>

> 
> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
> ---
>   mm/gup.c | 26 ++++++++++++++++++++++----
>   1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index f598a037eb04..559626457585 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2871,6 +2871,10 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
>   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>   		return -EINVAL;
>   
> +	/* FOLL_PIN requires pages != NULL */

Please delete each and every one of these one-line comments, because
they merely echo what the code says.

> +	if (WARN_ON_ONCE(!pages))
> +		return -EINVAL;
> +
>   	gup_flags |= FOLL_PIN;
>   	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
>   }
> @@ -2893,6 +2897,10 @@ int pin_user_pages_fast_only(unsigned long start, int nr_pages,
>   	 */
>   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>   		return 0;
> +
> +	/* FOLL_PIN requires pages != NULL */
> +	if (WARN_ON_ONCE(!pages))
> +		return 0;
>   	/*
>   	 * FOLL_FAST_ONLY is required in order to match the API description of
>   	 * this routine: no fall back to regular ("slow") GUP.
> @@ -2920,8 +2928,7 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast_only);
>    * @nr_pages:	number of pages from start to pin
>    * @gup_flags:	flags modifying lookup behaviour
>    * @pages:	array that receives pointers to the pages pinned.
> - *		Should be at least nr_pages long. Or NULL, if caller
> - *		only intends to ensure the pages are faulted in.
> + *		Should be at least nr_pages long.
>    * @vmas:	array of pointers to vmas corresponding to each page.
>    *		Or NULL if the caller does not require them.
>    * @locked:	pointer to lock flag indicating whether lock is held and
> @@ -2944,6 +2951,10 @@ long pin_user_pages_remote(struct mm_struct *mm,
>   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>   		return -EINVAL;
>   
> +	/* FOLL_PIN requires pages != NULL */
> +	if (WARN_ON_ONCE(!pages))
> +		return -EINVAL;
> +
>   	gup_flags |= FOLL_PIN;
>   	return __get_user_pages_remote(mm, start, nr_pages, gup_flags,
>   				       pages, vmas, locked);
> @@ -2957,8 +2968,7 @@ EXPORT_SYMBOL(pin_user_pages_remote);
>    * @nr_pages:	number of pages from start to pin
>    * @gup_flags:	flags modifying lookup behaviour
>    * @pages:	array that receives pointers to the pages pinned.
> - *		Should be at least nr_pages long. Or NULL, if caller
> - *		only intends to ensure the pages are faulted in.
> + *		Should be at least nr_pages long.
>    * @vmas:	array of pointers to vmas corresponding to each page.
>    *		Or NULL if the caller does not require them.
>    *
> @@ -2976,6 +2986,10 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
>   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>   		return -EINVAL;
>   
> +	/* FOLL_PIN requires pages != NULL */
> +	if (WARN_ON_ONCE(!pages))
> +		return -EINVAL;
> +
>   	gup_flags |= FOLL_PIN;
>   	return __gup_longterm_locked(current->mm, start, nr_pages,
>   				     pages, vmas, gup_flags);
> @@ -2994,6 +3008,10 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>   		return -EINVAL;
>   
> +	/* FOLL_PIN requires pages != NULL */
> +	if (WARN_ON_ONCE(!pages))
> +		return -EINVAL;
> +
>   	gup_flags |= FOLL_PIN;
>   	return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
>   }

I hope we don't break any callers with the newly enforced !pages, but it's
the right thing to do, in order to avoid misunderstandings.

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2021-08-12 20:19   ` Re: Andrew Morton
@ 2021-08-13  8:14     ` SeongJae Park
  0 siblings, 0 replies; 44+ messages in thread
From: SeongJae Park @ 2021-08-13  8:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park,  Valdis Klētnieks ,
	SeongJae Park, linux-mm, linux-kernel

From: SeongJae Park <sjpark@amazon.de>

On Thu, 12 Aug 2021 13:19:21 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 12 Aug 2021 09:42:40 +0000 SeongJae Park <sj38.park@gmail.com> wrote:
> 
> > > +config PAGE_IDLE_FLAG
> > > +       bool "Add PG_idle and PG_young flags"
> > > +       help
> > > +         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> > > +         Accessed bit writers can set the state of the bit in the flags to let
> > > +         other PTE Accessed bit readers don't disturbed.
> > > 
> > > This needs to be converted to proper, or at least comprehensible, English....
> > 
> > Thank you for the comment.
> > 
> > How about below?
> > 
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG
> >         bool "Add PG_idle and PG_young flags"
> >         select PAGE_EXTENSION if !64BIT
> >         help
> > -         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> > -         Accessed bit writers can set the state of the bit in the flags to let
> > -         other PTE Accessed bit readers don't disturbed.
> > +         This feature adds 'PG_idle' and 'PG_young' flags in 'struct page'.
> > +         PTE Accessed bit writers can save the state of the bit in the flags
> > +         to let other PTE Accessed bit readers don't get disturbed.
> 
> How about this?
> 
> --- a/mm/Kconfig~mm-idle_page_tracking-make-pg_idle-reusable-fix-fix
> +++ a/mm/Kconfig
> @@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG
>  	bool "Add PG_idle and PG_young flags"
>  	select PAGE_EXTENSION if !64BIT
>  	help
> -	  This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> -	  Accessed bit writers can set the state of the bit in the flags to let
> -	  other PTE Accessed bit readers don't disturbed.
> +	  This adds PG_idle and PG_young flags to 'struct page'.  PTE Accessed
> +	  bit writers can set the state of the bit in the flags so that PTE
> +	  Accessed bit readers may avoid disturbance.
>  
>  config IDLE_PAGE_TRACKING
>  	bool "Enable idle page tracking"

So good, thank you!

> 
> Also, is there any way in which we can avoid presenting this option to
> the user?  Because most users will have real trouble understanding what
> this thing is for.  Can we simply select it when needed, as dictated by
> other, higher-level config options?

I believe this is the right way to go!  I sent a patch for removing the prompt
of this option:
https://lore.kernel.org/linux-mm/20210813081238.34705-1-sj38.park@gmail.com/


Thanks,
SeongJae Park


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2021-08-12  9:42 ` SeongJae Park
@ 2021-08-12 20:19   ` Andrew Morton
  2021-08-13  8:14     ` Re: SeongJae Park
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2021-08-12 20:19 UTC (permalink / raw)
  To: SeongJae Park
  Cc:  Valdis Klētnieks , SeongJae Park, linux-mm, linux-kernel

On Thu, 12 Aug 2021 09:42:40 +0000 SeongJae Park <sj38.park@gmail.com> wrote:

> > +config PAGE_IDLE_FLAG
> > +       bool "Add PG_idle and PG_young flags"
> > +       help
> > +         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> > +         Accessed bit writers can set the state of the bit in the flags to let
> > +         other PTE Accessed bit readers don't disturbed.
> > 
> > This needs to be converted to proper, or at least comprehensible, English....
> 
> Thank you for the comment.
> 
> How about below?
> 
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG
>         bool "Add PG_idle and PG_young flags"
>         select PAGE_EXTENSION if !64BIT
>         help
> -         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> -         Accessed bit writers can set the state of the bit in the flags to let
> -         other PTE Accessed bit readers don't disturbed.
> +         This feature adds 'PG_idle' and 'PG_young' flags in 'struct page'.
> +         PTE Accessed bit writers can save the state of the bit in the flags
> +         to let other PTE Accessed bit readers don't get disturbed.

How about this?

--- a/mm/Kconfig~mm-idle_page_tracking-make-pg_idle-reusable-fix-fix
+++ a/mm/Kconfig
@@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG
 	bool "Add PG_idle and PG_young flags"
 	select PAGE_EXTENSION if !64BIT
 	help
-	  This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
-	  Accessed bit writers can set the state of the bit in the flags to let
-	  other PTE Accessed bit readers don't disturbed.
+	  This adds PG_idle and PG_young flags to 'struct page'.  PTE Accessed
+	  bit writers can set the state of the bit in the flags so that PTE
+	  Accessed bit readers may avoid disturbance.
 
 config IDLE_PAGE_TRACKING
 	bool "Enable idle page tracking"

Also, is there any way in which we can avoid presenting this option to
the user?  Because most users will have real trouble understanding what
this thing is for.  Can we simply select it when needed, as dictated by
other, higher-level config options?



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2021-08-12  9:21 Valdis Klētnieks
@ 2021-08-12  9:42 ` SeongJae Park
  2021-08-12 20:19   ` Re: Andrew Morton
  0 siblings, 1 reply; 44+ messages in thread
From: SeongJae Park @ 2021-08-12  9:42 UTC (permalink / raw)
  To: Valdis Klētnieks
  Cc: SeongJae Park, Andrew Morton, linux-mm, linux-kernel

From: SeongJae Park <sjpark@amazon.de>

Hello Valdis,

On Thu, 12 Aug 2021 05:21:57 -0400 "Valdis =?utf-8?Q?Kl=c4=93tnieks?=" <valdis.kletnieks@vt.edu> wrote:

> In this commit:
> 
> commit fedc37448fb1be5d03e420ca7791d4286893d5ec
> Author: SeongJae Park <sjpark@amazon.de>
> Date:   Tue Aug 10 16:55:51 2021 +1000
> 
>     mm/idle_page_tracking: make PG_idle reusable
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 504336de9a1e..d0b85dc12429 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -739,10 +739,18 @@ config DEFERRED_STRUCT_PAGE_INIT
>           lifetime of the system until these kthreads finish the
>           initialisation.
> 
> +config PAGE_IDLE_FLAG
> +       bool "Add PG_idle and PG_young flags"
> +       help
> +         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> +         Accessed bit writers can set the state of the bit in the flags to let
> +         other PTE Accessed bit readers don't disturbed.
> 
> This needs to be converted to proper, or at least comprehensible, English....

Thank you for the comment.

How about below?

--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG
        bool "Add PG_idle and PG_young flags"
        select PAGE_EXTENSION if !64BIT
        help
-         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
-         Accessed bit writers can set the state of the bit in the flags to let
-         other PTE Accessed bit readers don't disturbed.
+         This feature adds 'PG_idle' and 'PG_young' flags in 'struct page'.
+         PTE Accessed bit writers can save the state of the bit in the flags
+         to let other PTE Accessed bit readers don't get disturbed.


Thanks,
SeongJae Park


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2018-07-03 21:44     ` Re: Sebastian Andrzej Siewior
@ 2018-07-04 14:44       ` Vladimir Davydov
  0 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2018-07-04 14:44 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Andrew Morton, linux-mm, tglx, Kirill Tkhai

On Tue, Jul 03, 2018 at 11:44:29PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-07-03 14:14:29 [-0700], Andrew Morton wrote:
> > 
> > > Reply-To: "[PATCH 0/4] mm/list_lru": add.list_lru_shrink_walk_irq@mail.linuxfoundation.org.and.use.it ()
> > 
> > Well that's messed up.
> 
> indeed it is. This should get into Subject:
> 
> > On Tue,  3 Jul 2018 16:52:31 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > 
> > > My intepretation of situtation is that Vladimir Davydon is fine patch #1
> > > and #2 of the series [0] but dislikes the irq argument and struct
> > > member. It has been suggested to use list_lru_shrink_walk_irq() instead
> > > the approach I went on in "mm: list_lru: Add lock_irq member to
> > > __list_lru_init()".
> > > 
> > > This series is based on the former two patches and introduces
> > > list_lru_shrink_walk_irq() (and makes the third patch of series
> > > obsolete).
> > > In patch 1-3 I tried a tiny cleanup so the different locking
> > > (spin_lock() vs spin_lock_irq()) is simply lifted to the caller of the
> > > function.
> > > 
> > > [0] The patch
> > >       mm: workingset: remove local_irq_disable() from count_shadow_nodes() 
> > >    and
> > >       mm: workingset: make shadow_lru_isolate() use locking suffix
> > > 
> > 
> > This isn't a very informative [0/n] changelog.  Some overall summary of
> > the patchset's objective, behaviour, use cases, testing results, etc.
> 
> The patches should be threaded as a reply to 3/3 of the series so I
> assumed it was enough. And while Vladimir complained about 2/3 and 3/3
> the discussion went on in 2/3 where he suggested to go on with the _irq
> function. And testing, well with and without RT the function was invoked
> as part of swapping (allocating memory until OOM) without complains.
> 
> > I'm seeing significant conflicts with Kirill's "Improve shrink_slab()
> > scalability (old complexity was O(n^2), new is O(n))" series, which I
> > merged eight milliseconds ago.  Kirill's patchset is large but fairly
> > straightforward so I expect it's good for 4.18.  So I suggest we leave
> > things a week or more then please take a look at redoing this patchset
> > on top of that work?  
> 
> If Vladimir is okay with to redo and nobody else complains then I could
> rebase these four patches on top of your tree next week.

IMHO this approach is more straightforward than the one with the per
list_lru flag. For all patches,

Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2018-07-03 21:14   ` Andrew Morton
@ 2018-07-03 21:44     ` Sebastian Andrzej Siewior
  2018-07-04 14:44       ` Re: Vladimir Davydov
  0 siblings, 1 reply; 44+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-07-03 21:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vladimir Davydov, linux-mm, tglx, Kirill Tkhai

On 2018-07-03 14:14:29 [-0700], Andrew Morton wrote:
> 
> > Reply-To: "[PATCH 0/4] mm/list_lru": add.list_lru_shrink_walk_irq@mail.linuxfoundation.org.and.use.it ()
> 
> Well that's messed up.

indeed it is. This should get into Subject:

> On Tue,  3 Jul 2018 16:52:31 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> > My intepretation of situtation is that Vladimir Davydon is fine patch #1
> > and #2 of the series [0] but dislikes the irq argument and struct
> > member. It has been suggested to use list_lru_shrink_walk_irq() instead
> > the approach I went on in "mm: list_lru: Add lock_irq member to
> > __list_lru_init()".
> > 
> > This series is based on the former two patches and introduces
> > list_lru_shrink_walk_irq() (and makes the third patch of series
> > obsolete).
> > In patch 1-3 I tried a tiny cleanup so the different locking
> > (spin_lock() vs spin_lock_irq()) is simply lifted to the caller of the
> > function.
> > 
> > [0] The patch
> >       mm: workingset: remove local_irq_disable() from count_shadow_nodes() 
> >    and
> >       mm: workingset: make shadow_lru_isolate() use locking suffix
> > 
> 
> This isn't a very informative [0/n] changelog.  Some overall summary of
> the patchset's objective, behaviour, use cases, testing results, etc.

The patches should be threaded as a reply to 3/3 of the series so I
assumed it was enough. And while Vladimir complained about 2/3 and 3/3
the discussion went on in 2/3 where he suggested to go on with the _irq
function. And testing, well with and without RT the function was invoked
as part of swapping (allocating memory until OOM) without complains.

> I'm seeing significant conflicts with Kirill's "Improve shrink_slab()
> scalability (old complexity was O(n^2), new is O(n))" series, which I
> merged eight milliseconds ago.  Kirill's patchset is large but fairly
> straightforward so I expect it's good for 4.18.  So I suggest we leave
> things a week or more then please take a look at redoing this patchset
> on top of that work?  

If Vladimir is okay with to redo and nobody else complains then I could
rebase these four patches on top of your tree next week.

Sebastian

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2018-07-03 14:52 ` Sebastian Andrzej Siewior
@ 2018-07-03 21:14   ` Andrew Morton
  2018-07-03 21:44     ` Re: Sebastian Andrzej Siewior
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2018-07-03 21:14 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Vladimir Davydov, linux-mm, tglx,
	Kirill Tkhai


> Reply-To: "[PATCH 0/4] mm/list_lru": add.list_lru_shrink_walk_irq@mail.linuxfoundation.org.and.use.it ()

Well that's messed up.

On Tue,  3 Jul 2018 16:52:31 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> My intepretation of situtation is that Vladimir Davydon is fine patch #1
> and #2 of the series [0] but dislikes the irq argument and struct
> member. It has been suggested to use list_lru_shrink_walk_irq() instead
> the approach I went on in "mm: list_lru: Add lock_irq member to
> __list_lru_init()".
> 
> This series is based on the former two patches and introduces
> list_lru_shrink_walk_irq() (and makes the third patch of series
> obsolete).
> In patch 1-3 I tried a tiny cleanup so the different locking
> (spin_lock() vs spin_lock_irq()) is simply lifted to the caller of the
> function.
> 
> [0] The patch
>       mm: workingset: remove local_irq_disable() from count_shadow_nodes() 
>    and
>       mm: workingset: make shadow_lru_isolate() use locking suffix
> 

This isn't a very informative [0/n] changelog.  Some overall summary of
the patchset's objective, behaviour, use cases, testing results, etc.

I'm seeing significant conflicts with Kirill's "Improve shrink_slab()
scalability (old complexity was O(n^2), new is O(n))" series, which I
merged eight milliseconds ago.  Kirill's patchset is large but fairly
straightforward so I expect it's good for 4.18.  So I suggest we leave
things a week or more then please take a look at redoing this patchset
on top of that work?  

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
  2018-06-28  3:48 ` Andrew Morton
@ 2018-06-29 18:44   ` Andrey Ryabinin
  0 siblings, 0 replies; 44+ messages in thread
From: Andrey Ryabinin @ 2018-06-29 18:44 UTC (permalink / raw)
  To: Andrew Morton, icytxw
  Cc: bugzilla-daemon, linux-mm, Alexander Potapenko, Dmitry Vyukov



On 06/28/2018 06:48 AM, Andrew Morton wrote:

>> Hi,
>> This bug was found in Linux Kernel v4.18-rc2
>>
>> $ cat report0 
>> ================================================================================
>> UBSAN: Undefined behaviour in mm/fadvise.c:76:10
>> signed integer overflow:
>> 4 + 9223372036854775805 cannot be represented in type 'long long int'
>> CPU: 0 PID: 13477 Comm: syz-executor1 Not tainted 4.18.0-rc1 #2
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:77 [inline]
>>  dump_stack+0x122/0x1c8 lib/dump_stack.c:113
>>  ubsan_epilogue+0x12/0x86 lib/ubsan.c:159
>>  handle_overflow+0x1c2/0x21f lib/ubsan.c:190
>>  __ubsan_handle_add_overflow+0x2a/0x31 lib/ubsan.c:198
>>  ksys_fadvise64_64+0xbf0/0xd10 mm/fadvise.c:76
>>  __do_sys_fadvise64 mm/fadvise.c:198 [inline]
>>  __se_sys_fadvise64 mm/fadvise.c:196 [inline]
>>  __x64_sys_fadvise64+0xa9/0x120 mm/fadvise.c:196
>>  do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290
> 
> That overflow is deliberate:
> 
> 	endbyte = offset + len;
> 	if (!len || endbyte < len)
> 		endbyte = -1;
> 	else
> 		endbyte--;		/* inclusive */
> 
> Or is there a hole in this logic?
> 
> If not, I guess ee can do this another way to keep the checker happy.
 
It should be enough to make overflow unsigned. Unsigned overflow is defined by the C standard.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE:
@ 2007-04-03 18:41 Royal VIP Casino
  0 siblings, 0 replies; 44+ messages in thread
From: Royal VIP Casino @ 2007-04-03 18:41 UTC (permalink / raw)
  To: linux-mm


How about a hulking cock? Interested? Check Penis Enlarge Patch out.

http://www.zcukchud.com/

With Penis Enlarge Patch your dick will be growing<BR>so fast you wont be able to record the exact size.














------------------------
disrespect  for the girl. See how it looks in print -- I translate this froma conversation in one of the best of the German Sunday-school books:    Gretchen. Wilhelm, where is the turnip?     Wilhelm. She has gone to the kitchen.     Gretchen. Where is the accomplished and beautiful English maiden?
describe? And  observe  the strongest of  the several German equ                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re:
       [not found] <F265RQAOCop3wyv9kI3000143b1@hotmail.com>
@ 2001-10-08 11:48 ` Joseph A Knapka
  0 siblings, 0 replies; 44+ messages in thread
From: Joseph A Knapka @ 2001-10-08 11:48 UTC (permalink / raw)
  To: Gavin Dolling; +Cc: linux-mm

Hi Gavin,

[Forwarded to linux-mm, since those guys will be able to
 answer your questions much more completely. Maybe someone
 has already solved your problem.]

Gavin Dolling wrote:
> 
> Your VM page has helped me immensely. I'm after so advice though about the
> following. No problem if you are too busy, etc. your site has already helped
> me a great deal so just hit that delete key now ...
> 
> I have an embedded linux system running out of 8M of RAM. It has no backing
> store and uses a RAM disk as its FS. It boots from a flash chip - at boot
> time things are uncompressed into RAM. Running an MTD type system with a
> flash FS is not an option.
> 
> Memory is very tight and it is unfortunate that the binaries effectively
> appear twice in memory. They are in the RAM FS in full and also get paged
> into memory. There is a lot of paging going on which I believe is drowning
> the system.
> 
> We have no swap file (that would obviously be stupid) but a large number of
> buffers (i.e. a lot of dirty pages). The application is networking stuff so
> it is supposed to perform at line rate - the paging appears to be preventing
> this.
> 
> What I wish to do is to page the user space binaries into the page cache,
> mark them so they are never evicted. Delete them from the RAMFS and recover
> the memory. This should be the most optimum way of running the system - in
> terms of memory usage anyway.
> 
> I am planning to hack filemap.c. Going to use page_cache_read on each binary
> and then remove from RAM FS. If the page is not in use I will have to make
> sure that deleting the file does not result in the page being evicted.
> Basically some more hacking required. I am also concerned about the inode
> associated with the page, this is going to cause me pain I think?
> 
> I am going to try this on my PC first. Going to try and force 'cat' to be
> fully paged in and then rename it. I should still be able to use cat at the
> command line.

I don't think this will work as a test case. The address_space mappings
are based on inode identity, and since you won't actually have
a "cat" program on your filesystem, the inode won't be found, so
the kernel will not have a way of knowing that the cached pages
are the right ones. You'd have to leave at least enough of the
filesystem intact for the kernel to be able to map the program
name to the correct inode. You might solve this by pinning the
inode buffers in main memory before reclaiming the RAMFS pages,
but that's pure speculation on my part.

> So basically:
> 
> a) Is this feasible?

See below.

> b) When I delete the binary can I prevent it from being evicted from the
> page cache?
> (I note with interest that if I mv my /usr/bin/emacs whilst emacs is running
>       e.g.   $ emacs &; mv /usr/bin/emacs /usr/bin/emacs2
> it allows me to do it and what's more nothing bad happens. This tells me I
> do not understand enough of what is going on - I would have expected this to
> fail in some manner).

The disk inode for a moved or deleted file (and the file's disk
blocks) don't get freed until all references to the inode are
gone. If the kernel has the file open (eg due to mmap()),
the file can still be used for paging until it's unmapped
by all the processes that are using it. (This is another
reason your test case above might be misleading.)

> c) I must have to leave something in the RAMFS such that the instance of the
> binary still exists even if not its whole content.
> 
> d) Am I insane to try this? (Why would be more useful than just a yes ;-)  )

I don't know. This is a deeper hack than any I've contemplated.
However, I'm tempted to say that it would be easier to figure
out a way to directly add the RAMFS pages to the page cache,
and thus use a single page simultaneously as a cache page and
an FS page. I don't know how hard that's going to be, but I
think it might be easier than trying to yank the FS out from
under an in-use mapping.

Cheers,

-- Joe
# "You know how many remote castles there are along the
#  gorges? You can't MOVE for remote castles!" - Lu Tze re. Uberwald
# Linux MM docs:
http://home.earthlink.net/~jknapka/linux-mm/vmoutline.html
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2025-08-12 13:50 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
2024-01-18 22:19 ` [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch Dave Chinner
2024-01-22  6:41   ` Christoph Hellwig
2024-01-18 22:19 ` [PATCH 2/3] xfs: use folios in the buffer cache Dave Chinner
2024-01-19  1:26   ` Darrick J. Wong
2024-01-19  7:03     ` Dave Chinner
2024-01-22  6:39     ` Christoph Hellwig
2024-01-22 12:05       ` Dave Chinner
2024-01-22 13:18         ` Christoph Hellwig
2024-01-22 21:10           ` Dave Chinner
2024-01-18 22:19 ` [PATCH 3/3] xfs: convert buffer cache to use high order folios Dave Chinner
2024-01-19  1:31   ` Darrick J. Wong
2024-01-19  7:12     ` Dave Chinner
2024-01-22  6:45   ` Christoph Hellwig
2024-01-22 11:57     ` Dave Chinner
2024-01-19  1:05 ` [RFC] [PATCH 0/3] xfs: use large folios for buffers Darrick J. Wong
2024-01-22 10:13 ` Andi Kleen
2024-01-22 11:53   ` Dave Chinner
2024-01-22 13:34     ` Using Folios for XFS metadata Andi Kleen
2024-01-23  0:19       ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2025-08-12 13:34 Baoquan He
2025-08-12 13:49 ` Baoquan He
2025-02-08  8:19 Re: Director Inspectorate
2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox (Oracle)
2024-03-06 13:42   ` Ryan Roberts
2024-03-06 16:09     ` Matthew Wilcox
2024-03-06 16:19       ` Ryan Roberts
2024-03-06 17:41         ` Ryan Roberts
2024-03-06 18:41           ` Zi Yan
2024-03-06 19:55             ` Matthew Wilcox
2024-03-06 21:55               ` Matthew Wilcox
2024-03-07  8:56                 ` Ryan Roberts
2024-03-07 13:50                   ` Yin, Fengwei
2024-03-07 14:05                     ` Re: Matthew Wilcox
2024-03-07 15:24                       ` Re: Ryan Roberts
2024-03-07 16:24                         ` Re: Ryan Roberts
2024-03-07 23:02                           ` Re: Matthew Wilcox
2024-03-08  1:06                       ` Re: Yin, Fengwei
2023-05-11 12:58 Ryan Roberts
2023-05-11 13:13 ` Ryan Roberts
2022-08-26 22:03 Zach O'Keefe
2022-08-31 21:47 ` Yang Shi
2022-09-01  0:24   ` Re: Zach O'Keefe
     [not found] <20220421164138.1250943-1-yury.norov@gmail.com>
2022-04-21 23:04 ` Re: John Hubbard
2022-04-21 23:09   ` Re: John Hubbard
2022-04-21 23:17   ` Re: Yury Norov
2022-04-21 23:21     ` Re: John Hubbard
2021-08-12  9:21 Valdis Klētnieks
2021-08-12  9:42 ` SeongJae Park
2021-08-12 20:19   ` Re: Andrew Morton
2021-08-13  8:14     ` Re: SeongJae Park
     [not found] <bug-200209-27@https.bugzilla.kernel.org/>
2018-06-28  3:48 ` Andrew Morton
2018-06-29 18:44   ` Andrey Ryabinin
2018-06-24 20:09 [PATCH 3/3] mm: list_lru: Add lock_irq member to __list_lru_init() Vladimir Davydov
2018-07-03 14:52 ` Sebastian Andrzej Siewior
2018-07-03 21:14   ` Andrew Morton
2018-07-03 21:44     ` Re: Sebastian Andrzej Siewior
2018-07-04 14:44       ` Re: Vladimir Davydov
2007-04-03 18:41 Royal VIP Casino
     [not found] <F265RQAOCop3wyv9kI3000143b1@hotmail.com>
2001-10-08 11:48 ` Joseph A Knapka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox