linux-mm.kvack.org archive mirror
* [RFC PATCH 0/2] mm: filemap: add filemap_grab_folios
@ 2025-01-10 15:46 Nikita Kalyazin
  2025-01-10 15:46 ` [RFC PATCH 1/2] " Nikita Kalyazin
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Nikita Kalyazin @ 2025-01-10 15:46 UTC (permalink / raw)
  To: willy, pbonzini, linux-fsdevel, linux-mm, linux-kernel, kvm
  Cc: michael.day, david, jthoughton, michael.roth, ackerleytng, graf,
	jgowans, roypat, derekmn, nsaenz, xmarcalx, kalyazin

Based on David's suggestion for speeding up guest_memfd memory
population [1] made at the guest_memfd upstream call on 5 Dec 2024 [2],
this adds `filemap_grab_folios` that grabs multiple folios at a time.

Motivation

When profiling guest_memfd population and comparing the results with
population of anonymous memory via UFFDIO_COPY, I observed that the
former was up to 20% slower, mainly due to adding newly allocated pages
to the pagecache.  As far as I can see, the two main contributors to it
are pagecache locking and tree traversals needed for every folio.  The
RFC attempts to partially mitigate those by adding multiple folios at a
time to the pagecache.

Testing

With the change applied, I was able to observe a 10.3% (708 to 635 ms)
speedup in a selftest that populated 3GiB guest_memfd and a 9.5% (990 to
904 ms) speedup when restoring a 3GiB guest_memfd VM snapshot using a
custom Firecracker version, both on Intel Ice Lake.

Limitations

While `filemap_grab_folios` handles THP/large folios internally and
deals with reclaim artifacts in the pagecache (shadows), for simplicity
reasons, the RFC does not support those as it demonstrates the
optimisation applied to guest_memfd, which only uses small folios and
does not support reclaim at the moment.

Implementation

I am aware of existing filemap APIs that operate on folio batches;
however, I was not able to find one that fits the use case in question.
I also considered making use of the `folio_batch` struct, but could not
convince myself that it would help here.  Instead, a plain array of
folio pointers is allocated on the stack and passed down the callchain.
A bitmap keeps track of the indexes whose folios were already present
in the pagecache, so that no new folios are allocated for them.  This
does not look very clean to me and I am more than open to hearing about
better approaches.
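
For illustration, this is roughly how I expect a caller to use the new
API (a sketch only; `mapping`/`index` stand for whatever the caller
operates on, e.g. inode->i_mapping, and the loop body is a placeholder;
the real user is kvm_gmem_get_folios() in patch 2):

	struct folio *folios[FILEMAP_GET_FOLIOS_BATCH_SIZE] = { NULL };
	int i, nr;

	/* Find or create a batch of adjacent folios starting at index. */
	nr = filemap_grab_folios(mapping, index, folios,
				 FILEMAP_GET_FOLIOS_BATCH_SIZE);
	if (nr < 0)
		return nr;	/* -ENOMEM: nothing could be found or created */

	for (i = 0; i < nr; i++) {
		/* folios[i] covers index + i, locked, with a reference held. */
		/* ... fill or otherwise use the folio ... */
		folio_unlock(folios[i]);
		folio_put(folios[i]);
	}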

Not being an expert in xarray, I do not know an idiomatic way to
advance the index when `xas_next` is called directly on a freshly
instantiated state that has never been walked, so I used a call to
`xas_set` instead.
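
Concretely, the skip path in __filemap_add_folios() in patch 1 currently
looks like this (comments added here for illustration):

	if (test_bit(i, exclude_bm)) {
		/* A folio already occupies this index: skip over it. */
		xas_next(&xas);
		/*
		 * On the first iteration the state has never been walked,
		 * so xas_next() does not advance it; set the index
		 * explicitly instead.
		 */
		if (i == 0)
			xas_set(&xas, index + i + 1);
		continue;
	}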

While the series focuses on optimising _adding_ folios to the
pagecache, I also experimented with batching pagecache _queries_.
Specifically, I tried to use `filemap_get_folios` instead of
`filemap_get_entry`, but could not observe any visible speedup.

The series is applied on top of [1]: the 1st patch implements
`filemap_grab_folios`, while the 2nd patch makes use of it in
guest_memfd's write syscall as a first user.
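
In outline, the new write path in patch 2 consumes full batches first
and falls back to the existing per-folio path for the tail (sketch with
variable setup and error handling elided):

	for (index = start; index + batch_size - 1 < end; ) {
		/* Grab and fill up to batch_size folios in one go. */
		ret = kvm_kmem_gmem_write_inner(inode, index,
						buf + buf_offset,
						folios, batch_size);
		if (ret < 0)
			goto out;
		index += ret;
		if (ret < batch_size)
			break;	/* partial batch: fall back below */
	}

	for (; index < end; index++) {
		/* Remaining pages go through the single-folio path. */
		folio = kvm_kmem_gmem_write_folio(inode, index,
						  buf + buf_offset);
		...
	}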

Questions:
 - Does the approach look reasonable in general?
 - Can the API be kept specialised to the non-reclaim-supported case or
   does it need to be generic?
 - Would it be sensible to add a specialised small-folio-only version of
   `filemap_grab_folios` initially and extend it to large folios later
   on?
 - Are there better ways to implement the batching, or even to achieve
   the optimisation goal in another way?

[1]: https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com/T/
[2]: https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0

Thanks
Nikita

Nikita Kalyazin (2):
  mm: filemap: add filemap_grab_folios
  KVM: guest_memfd: use filemap_grab_folios in write

 include/linux/pagemap.h |  31 +++++
 mm/filemap.c            | 263 ++++++++++++++++++++++++++++++++++++++++
 virt/kvm/guest_memfd.c  | 176 ++++++++++++++++++++++-----
 3 files changed, 437 insertions(+), 33 deletions(-)


base-commit: 643cff38ebe84c39fbd5a0fc3ab053cd941b9f94
-- 
2.40.1




* [RFC PATCH 1/2] mm: filemap: add filemap_grab_folios
  2025-01-10 15:46 [RFC PATCH 0/2] mm: filemap: add filemap_grab_folios Nikita Kalyazin
@ 2025-01-10 15:46 ` Nikita Kalyazin
  2025-01-10 15:46 ` [RFC PATCH 2/2] KVM: guest_memfd: use filemap_grab_folios in write Nikita Kalyazin
  2025-01-10 17:01 ` [RFC PATCH 0/2] mm: filemap: add filemap_grab_folios David Hildenbrand
  2 siblings, 0 replies; 9+ messages in thread
From: Nikita Kalyazin @ 2025-01-10 15:46 UTC (permalink / raw)
  To: willy, pbonzini, linux-fsdevel, linux-mm, linux-kernel, kvm
  Cc: michael.day, david, jthoughton, michael.roth, ackerleytng, graf,
	jgowans, roypat, derekmn, nsaenz, xmarcalx, kalyazin

Similar to filemap_grab_folio, but grabs multiple folios at a time.
filemap_grab_folios attempts to get 128 adjacent folios in the pagecache
starting at the specified index.  Whenever a folio is not found, it
allocates a new one and adds it to the pagecache.

The following is not currently supported:
 - large folios
 - reclaim effects in the pagecache (shadows)

An equivalent of the following callstack is implemented to work on
multiple folios:
 - filemap_grab_folio
 - __filemap_get_folio
 - filemap_add_folio
 - __filemap_add_folio

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 include/linux/pagemap.h |  31 +++++
 mm/filemap.c            | 263 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 294 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301..fd10d77c07c1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -747,9 +747,16 @@ static inline fgf_t fgf_set_order(size_t size)
 	return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
 }
 
+/**
+ * Folio batch size used by __filemap_get_folios.
+ */
+#define FILEMAP_GET_FOLIOS_BATCH_SIZE 128
+
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
+int __filemap_get_folios(struct address_space *mapping, pgoff_t index,
+		fgf_t fgp_flags, gfp_t gfp, struct folio **folios, int num);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
 
@@ -808,6 +815,30 @@ static inline struct folio *filemap_grab_folio(struct address_space *mapping,
 			mapping_gfp_mask(mapping));
 }
 
+/**
+ * filemap_grab_folios - grab folios from the page cache
+ * @mapping: The address space to search
+ * @index: The page index to start with
+ * @folios: Output buffer for found or created folios
+ * @num: Number of folios to grab
+ *
+ * Looks up @num page cache entries at @mapping starting from @index. If no
+ * folio is found at the index, a new folio is created. The folios are locked,
+ * and marked as accessed.
+ *
+ * Return: The total number of found and created folios. Returned folios will
+ * always have adjacent indexes starting from @index. If no folios are found
+ * and created, -ENOMEM is returned.
+ */
+static inline int filemap_grab_folios(struct address_space *mapping,
+				      pgoff_t index, struct folio **folios,
+				      int num)
+{
+	return __filemap_get_folios(mapping, index,
+			FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
+			mapping_gfp_mask(mapping), folios, num);
+}
+
 /**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
diff --git a/mm/filemap.c b/mm/filemap.c
index 56fa431c52af..b5bc203e3350 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -958,6 +958,60 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 }
 ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);
 
+static int __filemap_add_folios(struct address_space *mapping,
+				struct folio **folios, pgoff_t index,
+				int num, unsigned long *exclude_bm,
+				bool *conflict)
+{
+	XA_STATE(xas, &mapping->i_pages, index);
+	int i;
+
+	mapping_set_update(&xas, mapping);
+	xas_lock_irq(&xas);
+
+	for (i = 0; i < num; i++) {
+		struct folio *folio = folios[i];
+
+		if (test_bit(i, exclude_bm)) {
+			xas_next(&xas);
+			if (i == 0)
+				xas_set(&xas, index + i + 1);
+			continue;
+		}
+
+		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+		VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
+		VM_BUG_ON_FOLIO(folio_order(folio) != 0, folio);
+		VM_BUG_ON_FOLIO(folio_nr_pages(folio) != 1, folio);
+
+		if (xas_find_conflict(&xas)) {
+			xas_set_err(&xas, -EEXIST);
+			*conflict = true;
+			break;
+		}
+
+		folio_ref_inc(folio);
+		folio->mapping = mapping;
+		folio->index = xas.xa_index;
+
+		xas_store(&xas, folio);
+		if (xas_error(&xas)) {
+			folio->mapping = NULL;
+			folio_put(folio);
+			break;
+		}
+
+		__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, 1);
+		trace_mm_filemap_add_to_page_cache(folio);
+		xas_next(&xas);
+		mapping->nrpages++;
+	}
+
+	xas_unlock_irq(&xas);
+
+	return i ?: xas_error(&xas);
+}
+
 int filemap_add_folio(struct address_space *mapping, struct folio *folio,
 				pgoff_t index, gfp_t gfp)
 {
@@ -991,6 +1045,45 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
 }
 EXPORT_SYMBOL_GPL(filemap_add_folio);
 
+static int filemap_add_folios(struct address_space *mapping,
+			      struct folio **folios,
+			      pgoff_t index, gfp_t gfp, int num,
+			      unsigned long *exclude_bm, bool *conflict)
+{
+	int ret, i, num_charged, num_added;
+
+	for (i = 0; i < num; i++) {
+		if (test_bit(i, exclude_bm))
+			continue;
+		ret = mem_cgroup_charge(folios[i], NULL, gfp);
+		if (unlikely(ret))
+			break;
+		__folio_set_locked(folios[i]);
+	}
+
+	num_charged = i;
+	if (!num_charged)
+		return ret;
+
+	num_added = __filemap_add_folios(mapping, folios, index, num_charged,
+					 exclude_bm, conflict);
+
+	for (i = 0; i < num_added; i++) {
+		if (test_bit(i, exclude_bm))
+			continue;
+		folio_add_lru(folios[i]);
+	}
+
+	for (i = num_added; i < num_charged; i++) {
+		if (test_bit(i, exclude_bm))
+			continue;
+		mem_cgroup_uncharge(folios[i]);
+		__folio_clear_locked(folios[i]);
+	}
+
+	return num_added;
+}
+
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
 {
@@ -1982,6 +2075,176 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 }
 EXPORT_SYMBOL(__filemap_get_folio);
 
+static int __filemap_get_folios_inner(struct address_space *mapping,
+				      pgoff_t index, fgf_t fgp_flags,
+				      gfp_t gfp, struct folio **folios,
+				      int num, bool *conflict)
+{
+	DECLARE_BITMAP(present_bm, FILEMAP_GET_FOLIOS_BATCH_SIZE);
+	int i, err, num_present, num_alloced = 0, num_added = 0;
+	struct folio *folio;
+
+	bitmap_zero(present_bm, FILEMAP_GET_FOLIOS_BATCH_SIZE);
+
+	for (i = 0; i < num; i++) {
+		folio = filemap_get_entry(mapping, index + i);
+		if (xa_is_value(folio))
+			folio = NULL;
+
+		if (!folio) {
+			if (!(fgp_flags & FGP_CREAT)) {
+				err = -ENOENT;
+				break;
+			}
+			continue;
+		}
+
+		if (fgp_flags & FGP_LOCK) {
+			if (fgp_flags & FGP_NOWAIT) {
+				if (!folio_trylock(folio)) {
+					folio_put(folio);
+					err = -EAGAIN;
+					break;
+				}
+			} else {
+				folio_lock(folio);
+			}
+
+			/* Has the page been truncated? */
+			if (unlikely(folio->mapping != mapping)) {
+				folio_unlock(folio);
+				folio_put(folio);
+				i--;
+				continue;
+			}
+			VM_BUG_ON_FOLIO(!folio_contains(folio, index + i), folio);
+		}
+
+		if (fgp_flags & FGP_ACCESSED)
+			folio_mark_accessed(folio);
+		else if (fgp_flags & FGP_WRITE) {
+			/* Clear idle flag for buffer write */
+			if (folio_test_idle(folio))
+				folio_clear_idle(folio);
+		}
+
+		if (fgp_flags & FGP_STABLE)
+			folio_wait_stable(folio);
+
+		folios[i] = folio;
+		set_bit(i, present_bm);
+	}
+
+	num_present = i ?: err;
+
+	if ((fgp_flags & FGP_CREAT)) {
+		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
+			gfp |= __GFP_WRITE;
+		if (fgp_flags & FGP_NOFS)
+			gfp &= ~__GFP_FS;
+		if (fgp_flags & FGP_NOWAIT) {
+			gfp &= ~GFP_KERNEL;
+			gfp |= GFP_NOWAIT | __GFP_NOWARN;
+		}
+		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
+			fgp_flags |= FGP_LOCK;
+
+		for (i = 0; i < num; i++) {
+			if (test_bit(i, present_bm))
+				continue;
+
+			folios[i] = filemap_alloc_folio(gfp, 0);
+			if (!folios[i])
+				break;
+
+			/* Init accessed so avoid atomic mark_page_accessed later */
+			if (fgp_flags & FGP_ACCESSED)
+				__folio_set_referenced(folios[i]);
+		}
+
+		num_alloced = i;
+
+		if (num_alloced > 0) {
+			num_added = filemap_add_folios(mapping, folios, index, gfp, num_alloced, present_bm, conflict);
+
+			/*
+			 * filemap_add_folios locks the page, and for mmap
+			 * we expect an unlocked page.
+			 */
+			if ((fgp_flags & FGP_FOR_MMAP))
+				for (i = 0; i < num_added; i++) {
+					if (!test_bit(i, present_bm))
+						folio_unlock(folios[i]);
+				}
+
+			/*
+			 * Clean up folios that failed to get added.
+			 */
+			for (i = num_added; i < num_alloced; i++) {
+				if (!test_bit(i, present_bm)) {
+					folio_unlock(folios[i]);
+					folio_put(folios[i]);
+				}
+			}
+		}
+
+		if (fgp_flags & FGP_LOCK)
+			/*
+			 * Clean up folios that failed to get allocated.
+			 */
+			for (i = num_alloced; i < num; i++) {
+				if (test_bit(i, present_bm))
+					folio_unlock(folios[i]);
+			}
+	}
+
+	if (fgp_flags & FGP_CREAT)
+		return num_added ?: (num_alloced ?: num_present);
+	else
+		return num_present;
+}
+
+/**
+ * __filemap_get_folios - Find and get references to folios.
+ * @mapping: The address_space to search.
+ * @index: The page index to start with.
+ * @fgp_flags: %FGP flags modify how the folio is returned.
+ * @gfp: Memory allocation flags to use if %FGP_CREAT is specified.
+ * @folios: Output buffer for found folios.
+ * @num: Number of folios to find.
+ *
+ * Looks up @num page cache entries at @mapping starting at @index.
+ *
+ * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
+ * if the %GFP flags specified for %FGP_CREAT are atomic.
+ *
+ * If this function returns @folios, they are returned with an increased
+ * refcount.
+ *
+ * Return: The number of found folios or an error otherwise.
+ */
+int __filemap_get_folios(struct address_space *mapping, pgoff_t index,
+			 fgf_t fgp_flags, gfp_t gfp, struct folio **folios,
+			 int num)
+{
+	int ret, processed = 0;
+	bool conflict;
+
+	do {
+		conflict = false;
+		ret = __filemap_get_folios_inner(mapping, index, fgp_flags,
+						 gfp, folios, num, &conflict);
+		if (ret > 0) {
+			index += ret;
+			folios += ret;
+			num -= ret;
+			processed += ret;
+		}
+	} while (ret > 0 && conflict && num);
+
+	return processed ?: ret;
+}
+
 static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
 		xa_mark_t mark)
 {
-- 
2.40.1




* [RFC PATCH 2/2] KVM: guest_memfd: use filemap_grab_folios in write
  2025-01-10 15:46 [RFC PATCH 0/2] mm: filemap: add filemap_grab_folios Nikita Kalyazin
  2025-01-10 15:46 ` [RFC PATCH 1/2] " Nikita Kalyazin
@ 2025-01-10 15:46 ` Nikita Kalyazin
  2025-01-10 21:08   ` Mike Day
  2025-01-10 17:01 ` [RFC PATCH 0/2] mm: filemap: add filemap_grab_folios David Hildenbrand
  2 siblings, 1 reply; 9+ messages in thread
From: Nikita Kalyazin @ 2025-01-10 15:46 UTC (permalink / raw)
  To: willy, pbonzini, linux-fsdevel, linux-mm, linux-kernel, kvm
  Cc: michael.day, david, jthoughton, michael.roth, ackerleytng, graf,
	jgowans, roypat, derekmn, nsaenz, xmarcalx, kalyazin

The write syscall on guest_memfd makes use of filemap_grab_folios to
grab folios in batches.  This speeds up population by 8.3% due to the
reduction in locking and tree walking when adding folios to the
pagecache.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 virt/kvm/guest_memfd.c | 176 +++++++++++++++++++++++++++++++++--------
 1 file changed, 143 insertions(+), 33 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index e80566ef56e9..ccfadc3a7389 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -102,17 +102,134 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 	return filemap_grab_folio(inode->i_mapping, index);
 }
 
+/*
+ * Returns locked folios on success.  The caller is responsible for
+ * setting the up-to-date flag before the memory is mapped into the guest.
+ * There is no backing storage for the memory, so the folios will remain
+ * up-to-date until they're removed.
+ *
+ * Ignore accessed, referenced, and dirty flags.  The memory is
+ * unevictable and there is no storage to write back to.
+ */
+static int kvm_gmem_get_folios(struct inode *inode, pgoff_t index,
+			       struct folio **folios, int num)
+{
+	return filemap_grab_folios(inode->i_mapping, index, folios, num);
+}
+
 #if defined(CONFIG_KVM_GENERIC_PRIVATE_MEM) && !defined(CONFIG_KVM_AMD_SEV)
+static int kvm_kmem_gmem_write_inner(struct inode *inode, pgoff_t index,
+				     const void __user *buf,
+                                     struct folio **folios, int num)
+{
+	int ret, i, num_grabbed, num_written;
+
+	num_grabbed = kvm_gmem_get_folios(inode, index, folios, num);
+	if (num_grabbed < 0)
+		return num_grabbed;
+
+	for (i = 0; i < num_grabbed; i++) {
+		struct folio *folio = folios[i];
+		void *vaddr;
+
+		if (folio_test_hwpoison(folio)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			ret = -EFAULT;
+			break;
+		}
+
+		if (folio_test_uptodate(folio)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			ret = -ENOSPC;
+			break;
+		}
+
+		folio_unlock(folio);
+
+		vaddr = kmap_local_folio(folio, 0);
+		ret = copy_from_user(vaddr, buf + (i << PAGE_SHIFT), PAGE_SIZE);
+		if (ret)
+			ret = -EINVAL;
+		kunmap_local(vaddr);
+
+		if (ret) {
+			folio_put(folio);
+			break;
+		} else {
+			kvm_gmem_mark_prepared(folio);
+			folio_put(folio);
+		}
+	}
+
+	num_written = i;
+
+	for (i = num_written; i < num_grabbed; i++) {
+		folio_unlock(folios[i]);
+		folio_put(folios[i]);
+	}
+
+	return num_written ?: ret;
+}
+
+static struct folio *kvm_kmem_gmem_write_folio(struct inode *inode, pgoff_t index,
+					       const char __user *buf)
+{
+	struct folio *folio;
+	void *vaddr;
+	int ret = 0;
+
+	folio = kvm_gmem_get_folio(inode, index);
+	if (IS_ERR(folio))
+		return ERR_PTR(-EFAULT);
+
+	if (folio_test_hwpoison(folio)) {
+		ret = -EFAULT;
+		goto out_unlock_put;
+	}
+
+	if (folio_test_uptodate(folio)) {
+		ret = -ENOSPC;
+		goto out_unlock_put;
+	}
+
+	folio_unlock(folio);
+
+	vaddr = kmap_local_folio(folio, 0);
+	ret = copy_from_user(vaddr, buf, PAGE_SIZE);
+	if (ret)
+		ret = -EINVAL;
+	kunmap_local(vaddr);
+
+	if (ret) {
+		folio_put(folio);
+		kvm_gmem_mark_prepared(folio);
+		goto out_err;
+	}
+
+	folio_put(folio);
+
+	return folio;
+
+out_unlock_put:
+	folio_unlock(folio);
+	folio_put(folio);
+out_err:
+	return ERR_PTR(ret);
+}
+
 static ssize_t kvm_kmem_gmem_write(struct file *file, const char __user *buf,
 				   size_t count, loff_t *offset)
 {
+	struct inode *inode = file_inode(file);
+	int ret = 0, batch_size = FILEMAP_GET_FOLIOS_BATCH_SIZE;
 	pgoff_t start, end, index;
-	ssize_t ret = 0;
 
 	if (!PAGE_ALIGNED(*offset) || !PAGE_ALIGNED(count))
 		return -EINVAL;
 
-	if (*offset + count > i_size_read(file_inode(file)))
+	if (*offset + count > i_size_read(inode))
 		return -EINVAL;
 
 	if (!buf)
@@ -123,9 +240,8 @@ static ssize_t kvm_kmem_gmem_write(struct file *file, const char __user *buf,
 
 	filemap_invalidate_lock(file->f_mapping);
 
-	for (index = start; index < end; ) {
-		struct folio *folio;
-		void *vaddr;
+	for (index = start; index + batch_size - 1 < end; ) {
+		struct folio *folios[FILEMAP_GET_FOLIOS_BATCH_SIZE] = { NULL };
 		pgoff_t buf_offset = (index - start) << PAGE_SHIFT;
 
 		if (signal_pending(current)) {
@@ -133,46 +249,40 @@ static ssize_t kvm_kmem_gmem_write(struct file *file, const char __user *buf,
 			goto out;
 		}
 
-		folio = kvm_gmem_get_folio(file_inode(file), index);
-		if (IS_ERR(folio)) {
-			ret = -EFAULT;
+		ret = kvm_kmem_gmem_write_inner(inode, index, buf + buf_offset, folios, batch_size);
+		if (ret < 0)
 			goto out;
-		}
 
-		if (folio_test_hwpoison(folio)) {
-			folio_unlock(folio);
-			folio_put(folio);
-			ret = -EFAULT;
+		index += ret;
+		if (ret < batch_size)
+			break;
+	}
+
+	for (; index < end; index++) {
+		struct folio *folio;
+		pgoff_t buf_offset = (index - start) << PAGE_SHIFT;
+
+		if (signal_pending(current)) {
+			ret = -EINTR;
 			goto out;
 		}
 
-		if (folio_test_uptodate(folio)) {
-			folio_unlock(folio);
-			folio_put(folio);
-			ret = -ENOSPC;
+		folio = kvm_kmem_gmem_write_folio(inode, index,
+						  buf + buf_offset);
+		if (IS_ERR(folio)) {
+			ret = PTR_ERR(folio);
 			goto out;
 		}
-
-		folio_unlock(folio);
-
-		vaddr = kmap_local_folio(folio, 0);
-		ret = copy_from_user(vaddr, buf + buf_offset, PAGE_SIZE);
-		if (ret)
-			ret = -EINVAL;
-		kunmap_local(vaddr);
-
-		kvm_gmem_mark_prepared(folio);
-		folio_put(folio);
-
-		index = folio_next_index(folio);
-		*offset += PAGE_SIZE;
 	}
 
 out:
 	filemap_invalidate_unlock(file->f_mapping);
+	if (index > start) {
+		*offset += (index - start) << PAGE_SHIFT;
+		return (index - start) << PAGE_SHIFT;
+	}
 
-	return ret && start == (*offset >> PAGE_SHIFT) ?
-		ret : *offset - (start << PAGE_SHIFT);
+	return ret;
 }
 #endif
 
-- 
2.40.1




* Re: [RFC PATCH 0/2] mm: filemap: add filemap_grab_folios
  2025-01-10 15:46 [RFC PATCH 0/2] mm: filemap: add filemap_grab_folios Nikita Kalyazin
  2025-01-10 15:46 ` [RFC PATCH 1/2] " Nikita Kalyazin
  2025-01-10 15:46 ` [RFC PATCH 2/2] KVM: guest_memfd: use filemap_grab_folios in write Nikita Kalyazin
@ 2025-01-10 17:01 ` David Hildenbrand
  2025-01-10 18:54   ` Nikita Kalyazin
  2 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-01-10 17:01 UTC (permalink / raw)
  To: Nikita Kalyazin, willy, pbonzini, linux-fsdevel, linux-mm,
	linux-kernel, kvm
  Cc: michael.day, jthoughton, michael.roth, ackerleytng, graf,
	jgowans, roypat, derekmn, nsaenz, xmarcalx

On 10.01.25 16:46, Nikita Kalyazin wrote:
> Based on David's suggestion for speeding up guest_memfd memory
> population [1] made at the guest_memfd upstream call on 5 Dec 2024 [2],
> this adds `filemap_grab_folios` that grabs multiple folios at a time.
> 

Hi,

> Motivation
> 
> When profiling guest_memfd population and comparing the results with
> population of anonymous memory via UFFDIO_COPY, I observed that the
> former was up to 20% slower, mainly due to adding newly allocated pages
> to the pagecache.  As far as I can see, the two main contributors to it
> are pagecache locking and tree traversals needed for every folio.  The
> RFC attempts to partially mitigate those by adding multiple folios at a
> time to the pagecache.
> 
> Testing
> 
> With the change applied, I was able to observe a 10.3% (708 to 635 ms)
> speedup in a selftest that populated 3GiB guest_memfd and a 9.5% (990 to
> 904 ms) speedup when restoring a 3GiB guest_memfd VM snapshot using a
> custom Firecracker version, both on Intel Ice Lake.

Does that mean that it's still 10% slower (based on the 20% above), or 
were the 20% from a different micro-benchmark?

> 
> Limitations
> 
> While `filemap_grab_folios` handles THP/large folios internally and
> deals with reclaim artifacts in the pagecache (shadows), for simplicity
> reasons, the RFC does not support those as it demonstrates the
> optimisation applied to guest_memfd, which only uses small folios and
> does not support reclaim at the moment.

It might be worth pointing out that, while support for larger folios is 
in the works, there will be scenarios where small folios are unavoidable 
in the future (mixture of shared and private memory).

How hard would it be to just naturally support large folios as well?

We do have memfd_pin_folios() that can deal with that and provides a 
slightly similar interface (struct folio **folios).

For reference, the interface is:

long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
		      struct folio **folios, unsigned int max_folios,
		      pgoff_t *offset)

Maybe what you propose could even be used to further improve 
memfd_pin_folios() internally? However, it must do this FOLL_PIN thingy, 
so it must process each and every folio it processed.


-- 
Cheers,

David / dhildenb




* Re: [RFC PATCH 0/2] mm: filemap: add filemap_grab_folios
  2025-01-10 17:01 ` [RFC PATCH 0/2] mm: filemap: add filemap_grab_folios David Hildenbrand
@ 2025-01-10 18:54   ` Nikita Kalyazin
  2025-01-13 12:20     ` David Hildenbrand
  0 siblings, 1 reply; 9+ messages in thread
From: Nikita Kalyazin @ 2025-01-10 18:54 UTC (permalink / raw)
  To: David Hildenbrand, willy, pbonzini, linux-fsdevel, linux-mm,
	linux-kernel, kvm
  Cc: michael.day, jthoughton, michael.roth, ackerleytng, graf,
	jgowans, roypat, derekmn, nsaenz, xmarcalx

On 10/01/2025 17:01, David Hildenbrand wrote:
> On 10.01.25 16:46, Nikita Kalyazin wrote:
>> Based on David's suggestion for speeding up guest_memfd memory
>> population [1] made at the guest_memfd upstream call on 5 Dec 2024 [2],
>> this adds `filemap_grab_folios` that grabs multiple folios at a time.
>>
> 
> Hi,

Hi :)

> 
>> Motivation
>>
>> When profiling guest_memfd population and comparing the results with
>> population of anonymous memory via UFFDIO_COPY, I observed that the
>> former was up to 20% slower, mainly due to adding newly allocated pages
>> to the pagecache.  As far as I can see, the two main contributors to it
>> are pagecache locking and tree traversals needed for every folio.  The
>> RFC attempts to partially mitigate those by adding multiple folios at a
>> time to the pagecache.
>>
>> Testing
>>
>> With the change applied, I was able to observe a 10.3% (708 to 635 ms)
>> speedup in a selftest that populated 3GiB guest_memfd and a 9.5% (990 to
>> 904 ms) speedup when restoring a 3GiB guest_memfd VM snapshot using a
>> custom Firecracker version, both on Intel Ice Lake.
> 
> Does that mean that it's still 10% slower (based on the 20% above), or
> were the 20% from a different micro-benchmark?

Yes, it is still slower:
  - isolated/selftest: 2.3%
  - Firecracker setup: 8.9%

Not sure why the values are so different though.  I'll try to find an 
explanation.

>>
>> Limitations
>>
>> While `filemap_grab_folios` handles THP/large folios internally and
>> deals with reclaim artifacts in the pagecache (shadows), for simplicity
>> reasons, the RFC does not support those as it demonstrates the
>> optimisation applied to guest_memfd, which only uses small folios and
>> does not support reclaim at the moment.
> 
> It might be worth pointing out that, while support for larger folios is
> in the works, there will be scenarios where small folios are unavoidable
> in the future (mixture of shared and private memory).
> 
> How hard would it be to just naturally support large folios as well?

I don't think it's going to be impossible.  It's just one more dimension 
that needs to be handled.  `__filemap_add_folio` logic is already rather 
complex, and processing multiple folios while also splitting when 
necessary correctly looks substantially convoluted to me.  So my idea 
was to discuss/validate the multi-folio approach first before rolling 
the sleeves up.

> We do have memfd_pin_folios() that can deal with that and provides a
> slightly similar interface (struct folio **folios).
> 
> For reference, the interface is:
> 
> long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>                       struct folio **folios, unsigned int max_folios,
>                       pgoff_t *offset)
> 
> Maybe what you propose could even be used to further improve
> memfd_pin_folios() internally? However, it must do this FOLL_PIN thingy,
> so it must process each and every folio it processed.

Thanks for the pointer.  Yeah, I see what you mean.  I guess, it can 
potentially allocate/add folios in a batch and then pin them?  Although 
swap/readahead logic may make it more difficult to implement.

> -- 
> Cheers,
> 
> David / dhildenb 




* Re: [RFC PATCH 2/2] KVM: guest_memfd: use filemap_grab_folios in write
  2025-01-10 15:46 ` [RFC PATCH 2/2] KVM: guest_memfd: use filemap_grab_folios in write Nikita Kalyazin
@ 2025-01-10 21:08   ` Mike Day
  2025-01-14 16:08     ` Nikita Kalyazin
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Day @ 2025-01-10 21:08 UTC (permalink / raw)
  To: Nikita Kalyazin, willy, pbonzini, linux-fsdevel, linux-mm,
	linux-kernel, kvm
  Cc: david, jthoughton, michael.roth, ackerleytng, graf, jgowans,
	roypat, derekmn, nsaenz, xmarcalx



On 1/10/25 09:46, Nikita Kalyazin wrote:
> The write syscall on guest_memfd makes use of filemap_grab_folios to
> grab folios in batches.  This speeds up population by 8.3% due to the
> reduction in locking and tree walking when adding folios to the
> pagecache.
> 
> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
> ---
>   virt/kvm/guest_memfd.c | 176 +++++++++++++++++++++++++++++++++--------
>   1 file changed, 143 insertions(+), 33 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index e80566ef56e9..ccfadc3a7389 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -102,17 +102,134 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>   	return filemap_grab_folio(inode->i_mapping, index);
>   }
>   
> +/*
> + * Returns locked folios on success.  The caller is responsible for
> + * setting the up-to-date flag before the memory is mapped into the guest.
> + * There is no backing storage for the memory, so the folios will remain
> + * up-to-date until they're removed.
> + *
> + * Ignore accessed, referenced, and dirty flags.  The memory is
> + * unevictable and there is no storage to write back to.
> + */
> +static int kvm_gmem_get_folios(struct inode *inode, pgoff_t index,
> +			       struct folio **folios, int num)
> +{
> +	return filemap_grab_folios(inode->i_mapping, index, folios, num);
> +}
> +
>   #if defined(CONFIG_KVM_GENERIC_PRIVATE_MEM) && !defined(CONFIG_KVM_AMD_SEV)
> +static int kvm_kmem_gmem_write_inner(struct inode *inode, pgoff_t index,
> +				     const void __user *buf,
> +                                     struct folio **folios, int num)
> +{
> +	int ret, i, num_grabbed, num_written;
> +
> +	num_grabbed = kvm_gmem_get_folios(inode, index, folios, num);
> +	if (num_grabbed < 0)
> +		return num_grabbed;
> +
> +	for (i = 0; i < num_grabbed; i++) {
> +		struct folio *folio = folios[i];
> +		void *vaddr;
> +
> +		if (folio_test_hwpoison(folio)) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		if (folio_test_uptodate(folio)) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			ret = -ENOSPC;
> +			break;
> +		}
> +
> +		folio_unlock(folio);
> +
> +		vaddr = kmap_local_folio(folio, 0);
> +		ret = copy_from_user(vaddr, buf + (i << PAGE_SHIFT), PAGE_SIZE);
> +		if (ret)
> +			ret = -EINVAL;
> +		kunmap_local(vaddr);
> +
> +		if (ret) {
> +			folio_put(folio);
> +			break;
> +		} else {
> +			kvm_gmem_mark_prepared(folio);
> +			folio_put(folio);
> +		}
> +	}
> +
> +	num_written = i;
> +
> +	for (i = num_written; i < num_grabbed; i++) {
> +		folio_unlock(folios[i]);
> +		folio_put(folios[i]);
> +	}
> +
> +	return num_written ?: ret;
> +}
> +
> +static struct folio *kvm_kmem_gmem_write_folio(struct inode *inode, pgoff_t index,
> +					       const char __user *buf)
> 

This could probably be rewritten as:

	struct folio *p_folio;
	int ret;

	ret = kvm_kmem_gmem_write_inner(inode, index, buf, &p_folio, 1);
	
	if (ret == 1)
		return p_folio;
	else
		return ERR_PTR(ret);

Would remove a few lines of duplicated code and use only one prototype.

Mike

+{
> +	struct folio *folio;
> +	void *vaddr;
> +	int ret = 0;
> +
> +	folio = kvm_gmem_get_folio(inode, index);
> +	if (IS_ERR(folio))
> +		return ERR_PTR(-EFAULT);
> +
> +	if (folio_test_hwpoison(folio)) {
> +		ret = -EFAULT;
> +		goto out_unlock_put;
> +	}
> +
> +	if (folio_test_uptodate(folio)) {
> +		ret = -ENOSPC;
> +		goto out_unlock_put;
> +	}
> +
> +	folio_unlock(folio);
> +
> +	vaddr = kmap_local_folio(folio, 0);
> +	ret = copy_from_user(vaddr, buf, PAGE_SIZE);
> +	if (ret)
> +		ret = -EINVAL;
> +	kunmap_local(vaddr);
> +
> +	if (ret) {
> +		folio_put(folio);
> +		kvm_gmem_mark_prepared(folio);
> +		goto out_err;
> +	}
> +
> +	folio_put(folio);
> +
> +	return folio;
> +
> +out_unlock_put:
> +	folio_unlock(folio);
> +	folio_put(folio);
> +out_err:
> +	return ERR_PTR(ret);
> +}
> +
>   static ssize_t kvm_kmem_gmem_write(struct file *file, const char __user *buf,
>   				   size_t count, loff_t *offset)
>   {
> +	struct inode *inode = file_inode(file);
> +	int ret = 0, batch_size = FILEMAP_GET_FOLIOS_BATCH_SIZE;
>   	pgoff_t start, end, index;
> -	ssize_t ret = 0;
>   
>   	if (!PAGE_ALIGNED(*offset) || !PAGE_ALIGNED(count))
>   		return -EINVAL;
>   
> -	if (*offset + count > i_size_read(file_inode(file)))
> +	if (*offset + count > i_size_read(inode))
>   		return -EINVAL;
>   
>   	if (!buf)
> @@ -123,9 +240,8 @@ static ssize_t kvm_kmem_gmem_write(struct file *file, const char __user *buf,
>   
>   	filemap_invalidate_lock(file->f_mapping);
>   
> -	for (index = start; index < end; ) {
> -		struct folio *folio;
> -		void *vaddr;
> +	for (index = start; index + batch_size - 1 < end; ) {
> +		struct folio *folios[FILEMAP_GET_FOLIOS_BATCH_SIZE] = { NULL };
>   		pgoff_t buf_offset = (index - start) << PAGE_SHIFT;
>   
>   		if (signal_pending(current)) {
> @@ -133,46 +249,40 @@ static ssize_t kvm_kmem_gmem_write(struct file *file, const char __user *buf,
>   			goto out;
>   		}
>   
> -		folio = kvm_gmem_get_folio(file_inode(file), index);
> -		if (IS_ERR(folio)) {
> -			ret = -EFAULT;
> +		ret = kvm_kmem_gmem_write_inner(inode, index, buf + buf_offset, folios, batch_size);
> +		if (ret < 0)
>   			goto out;
> -		}
>   
> -		if (folio_test_hwpoison(folio)) {
> -			folio_unlock(folio);
> -			folio_put(folio);
> -			ret = -EFAULT;
> +		index += ret;
> +		if (ret < batch_size)
> +			break;
> +	}
> +
> +	for (; index < end; index++) {
> +		struct folio *folio;
> +		pgoff_t buf_offset = (index - start) << PAGE_SHIFT;
> +
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
>   			goto out;
>   		}
>   
> -		if (folio_test_uptodate(folio)) {
> -			folio_unlock(folio);
> -			folio_put(folio);
> -			ret = -ENOSPC;
> +		folio = kvm_kmem_gmem_write_folio(inode, index,
> +						  buf + buf_offset);
> +		if (IS_ERR(folio)) {
> +			ret = PTR_ERR(folio);
>   			goto out;
>   		}
> -
> -		folio_unlock(folio);
> -
> -		vaddr = kmap_local_folio(folio, 0);
> -		ret = copy_from_user(vaddr, buf + buf_offset, PAGE_SIZE);
> -		if (ret)
> -			ret = -EINVAL;
> -		kunmap_local(vaddr);
> -
> -		kvm_gmem_mark_prepared(folio);
> -		folio_put(folio);
> -
> -		index = folio_next_index(folio);
> -		*offset += PAGE_SIZE;
>   	}
>   
>   out:
>   	filemap_invalidate_unlock(file->f_mapping);
> +	if (index > start) {
> +		*offset += (index - start) << PAGE_SHIFT;
> +		return (index - start) << PAGE_SHIFT;
> +	}
>   
> -	return ret && start == (*offset >> PAGE_SHIFT) ?
> -		ret : *offset - (start << PAGE_SHIFT);
> +	return ret;
>   }
>   #endif
>   



* Re: [RFC PATCH 0/2] mm: filemap: add filemap_grab_folios
  2025-01-10 18:54   ` Nikita Kalyazin
@ 2025-01-13 12:20     ` David Hildenbrand
  2025-01-14 16:07       ` Nikita Kalyazin
  0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-01-13 12:20 UTC (permalink / raw)
  To: kalyazin, willy, pbonzini, linux-fsdevel, linux-mm, linux-kernel, kvm
  Cc: michael.day, jthoughton, michael.roth, ackerleytng, graf,
	jgowans, roypat, derekmn, nsaenz, xmarcalx

On 10.01.25 19:54, Nikita Kalyazin wrote:
> On 10/01/2025 17:01, David Hildenbrand wrote:
>> On 10.01.25 16:46, Nikita Kalyazin wrote:
>>> Based on David's suggestion for speeding up guest_memfd memory
>>> population [1] made at the guest_memfd upstream call on 5 Dec 2024 [2],
>>> this adds `filemap_grab_folios` that grabs multiple folios at a time.
>>>
>>
>> Hi,
> 
> Hi :)
> 
>>
>>> Motivation
>>>
>>> When profiling guest_memfd population and comparing the results with
>>> population of anonymous memory via UFFDIO_COPY, I observed that the
>>> former was up to 20% slower, mainly due to adding newly allocated pages
>>> to the pagecache.  As far as I can see, the two main contributors to it
>>> are pagecache locking and tree traversals needed for every folio.  The
>>> RFC attempts to partially mitigate those by adding multiple folios at a
>>> time to the pagecache.
>>>
>>> Testing
>>>
>>> With the change applied, I was able to observe a 10.3% (708 to 635 ms)
>>> speedup in a selftest that populated 3GiB guest_memfd and a 9.5% (990 to
>>> 904 ms) speedup when restoring a 3GiB guest_memfd VM snapshot using a
>>> custom Firecracker version, both on Intel Ice Lake.
>>
>> Does that mean that it's still 10% slower (based on the 20% above), or
>> were the 20% from a different micro-benchmark?
> 
> Yes, it is still slower:
>    - isolated/selftest: 2.3%
>    - Firecracker setup: 8.9%
> 
> Not sure why the values are so different though.  I'll try to find an
> explanation.

The 2.3% looks very promising.

> 
>>>
>>> Limitations
>>>
>>> While `filemap_grab_folios` handles THP/large folios internally and
>>> deals with reclaim artifacts in the pagecache (shadows), for simplicity
>>> reasons, the RFC does not support those as it demonstrates the
>>> optimisation applied to guest_memfd, which only uses small folios and
>>> does not support reclaim at the moment.
>>
>> It might be worth pointing out that, while support for larger folios is
>> in the works, there will be scenarios where small folios are unavoidable
>> in the future (mixture of shared and private memory).
>>
>> How hard would it be to just naturally support large folios as well?
> 
> I don't think it's going to be impossible.  It's just one more dimension
> that needs to be handled.  `__filemap_add_folio` logic is already rather
> complex, and processing multiple folios while also splitting when
> necessary correctly looks substantially convoluted to me.  So my idea
> was to discuss/validate the multi-folio approach first before rolling
> the sleeves up.

We should likely try making this as generic as possible, meaning we'll
support roughly what filemap_grab_folio() would have supported (e.g., also large folios).

Now I find filemap_get_folios_contig() [that is already used in memfd code],
and wonder if that could be reused/extended fairly easily.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/2] mm: filemap: add filemap_grab_folios
  2025-01-13 12:20     ` David Hildenbrand
@ 2025-01-14 16:07       ` Nikita Kalyazin
  0 siblings, 0 replies; 9+ messages in thread
From: Nikita Kalyazin @ 2025-01-14 16:07 UTC (permalink / raw)
  To: David Hildenbrand, willy, pbonzini, linux-fsdevel, linux-mm,
	linux-kernel, kvm
  Cc: michael.day, jthoughton, michael.roth, ackerleytng, graf,
	jgowans, roypat, derekmn, nsaenz, xmarcalx

On 13/01/2025 12:20, David Hildenbrand wrote:
> On 10.01.25 19:54, Nikita Kalyazin wrote:
>> On 10/01/2025 17:01, David Hildenbrand wrote:
>>> On 10.01.25 16:46, Nikita Kalyazin wrote:
>>>> Based on David's suggestion for speeding up guest_memfd memory
>>>> population [1] made at the guest_memfd upstream call on 5 Dec 2024 [2],
>>>> this adds `filemap_grab_folios` that grabs multiple folios at a time.
>>>>
>>>
>>> Hi,
>>
>> Hi :)
>>
>>>
>>>> Motivation
>>>>
>>>> When profiling guest_memfd population and comparing the results with
>>>> population of anonymous memory via UFFDIO_COPY, I observed that the
>>>> former was up to 20% slower, mainly due to adding newly allocated pages
>>>> to the pagecache.  As far as I can see, the two main contributors to it
>>>> are pagecache locking and tree traversals needed for every folio.  The
>>>> RFC attempts to partially mitigate those by adding multiple folios at a
>>>> time to the pagecache.
>>>>
>>>> Testing
>>>>
>>>> With the change applied, I was able to observe a 10.3% (708 to 635 ms)
>>>> speedup in a selftest that populated 3GiB guest_memfd and a 9.5% 
>>>> (990 to
>>>> 904 ms) speedup when restoring a 3GiB guest_memfd VM snapshot using a
>>>> custom Firecracker version, both on Intel Ice Lake.
>>>
>>> Does that mean that it's still 10% slower (based on the 20% above), or
>>> were the 20% from a different micro-benchmark?
>>
>> Yes, it is still slower:
>>    - isolated/selftest: 2.3%
>>    - Firecracker setup: 8.9%
>>
>> Not sure why the values are so different though.  I'll try to find an
>> explanation.
> 
> The 2.3% looks very promising.

It does.  I sorted out my Firecracker setup and saw a similar figure 
there, which made me more confident.

>>
>>>>
>>>> Limitations
>>>>
>>>> While `filemap_grab_folios` handles THP/large folios internally and
>>>> deals with reclaim artifacts in the pagecache (shadows), for simplicity
>>>> reasons, the RFC does not support those as it demonstrates the
>>>> optimisation applied to guest_memfd, which only uses small folios and
>>>> does not support reclaim at the moment.
>>>
>>> It might be worth pointing out that, while support for larger folios is
>>> in the works, there will be scenarios where small folios are unavoidable
>>> in the future (mixture of shared and private memory).
>>>
>>> How hard would it be to just naturally support large folios as well?
>>
>> I don't think it's going to be impossible.  It's just one more dimension
>> that needs to be handled.  `__filemap_add_folio` logic is already rather
>> complex, and processing multiple folios while also splitting when
>> necessary correctly looks substantially convoluted to me.  So my idea
>> was to discuss/validate the multi-folio approach first before rolling
>> the sleeves up.
> 
> We should likely try making this as generic as possible, meaning we'll
> support roughly what filemap_grab_folio() would have supported (e.g., 
> also large folios).
> 
> Now I find filemap_get_folios_contig() [that is already used in memfd 
> code],
> and wonder if that could be reused/extended fairly easily.

Fair, I will look into how it could be made generic.

> -- 
> Cheers,
> 
> David / dhildenb
> 




* Re: [RFC PATCH 2/2] KVM: guest_memfd: use filemap_grab_folios in write
  2025-01-10 21:08   ` Mike Day
@ 2025-01-14 16:08     ` Nikita Kalyazin
  0 siblings, 0 replies; 9+ messages in thread
From: Nikita Kalyazin @ 2025-01-14 16:08 UTC (permalink / raw)
  To: michael.day, willy, pbonzini, linux-fsdevel, linux-mm, linux-kernel, kvm
  Cc: david, jthoughton, michael.roth, ackerleytng, graf, jgowans,
	roypat, derekmn, nsaenz, xmarcalx

On 10/01/2025 21:08, Mike Day wrote:
> On 1/10/25 09:46, Nikita Kalyazin wrote:
>> The write syscall on guest_memfd makes use of filemap_grab_folios to
>> grab folios in batches.  This speeds up population by 8.3% due to the
>> reduction in locking and tree walking when adding folios to the
>> pagecache.
>>
>> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
>> ---
>>   virt/kvm/guest_memfd.c | 176 +++++++++++++++++++++++++++++++++--------
>>   1 file changed, 143 insertions(+), 33 deletions(-)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index e80566ef56e9..ccfadc3a7389 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -102,17 +102,134 @@ static struct folio *kvm_gmem_get_folio(struct 
>> inode *inode, pgoff_t index)
>>       return filemap_grab_folio(inode->i_mapping, index);
>>   }
>>
>> +/*
>> + * Returns locked folios on success.  The caller is responsible for
>> + * setting the up-to-date flag before the memory is mapped into the 
>> guest.
>> + * There is no backing storage for the memory, so the folios will remain
>> + * up-to-date until they're removed.
>> + *
>> + * Ignore accessed, referenced, and dirty flags.  The memory is
>> + * unevictable and there is no storage to write back to.
>> + */
>> +static int kvm_gmem_get_folios(struct inode *inode, pgoff_t index,
>> +                            struct folio **folios, int num)
>> +{
>> +     return filemap_grab_folios(inode->i_mapping, index, folios, num);
>> +}
>> +
>>   #if defined(CONFIG_KVM_GENERIC_PRIVATE_MEM) && ! 
>> defined(CONFIG_KVM_AMD_SEV)
>> +static int kvm_kmem_gmem_write_inner(struct inode *inode, pgoff_t index,
>> +                                  const void __user *buf,
>> +                                     struct folio **folios, int num)
>> +{
>> +     int ret, i, num_grabbed, num_written;
>> +
>> +     num_grabbed = kvm_gmem_get_folios(inode, index, folios, num);
>> +     if (num_grabbed < 0)
>> +             return num_grabbed;
>> +
>> +     for (i = 0; i < num_grabbed; i++) {
>> +             struct folio *folio = folios[i];
>> +             void *vaddr;
>> +
>> +             if (folio_test_hwpoison(folio)) {
>> +                     folio_unlock(folio);
>> +                     folio_put(folio);
>> +                     ret = -EFAULT;
>> +                     break;
>> +             }
>> +
>> +             if (folio_test_uptodate(folio)) {
>> +                     folio_unlock(folio);
>> +                     folio_put(folio);
>> +                     ret = -ENOSPC;
>> +                     break;
>> +             }
>> +
>> +             folio_unlock(folio);
>> +
>> +             vaddr = kmap_local_folio(folio, 0);
>> +             ret = copy_from_user(vaddr, buf + (i << PAGE_SHIFT), 
>> PAGE_SIZE);
>> +             if (ret)
>> +                     ret = -EINVAL;
>> +             kunmap_local(vaddr);
>> +
>> +             if (ret) {
>> +                     folio_put(folio);
>> +                     break;
>> +             } else {
>> +                     kvm_gmem_mark_prepared(folio);
>> +                     folio_put(folio);
>> +             }
>> +     }
>> +
>> +     num_written = i;
>> +
>> +     for (i = num_written; i < num_grabbed; i++) {
>> +             folio_unlock(folios[i]);
>> +             folio_put(folios[i]);
>> +     }
>> +
>> +     return num_written ?: ret;
>> +}
>> +
>> +static struct folio *kvm_kmem_gmem_write_folio(struct inode *inode, 
>> pgoff_t index,
>> +                                            const char __user *buf)
>>
> 
> This could probably be rewritten as:
> 
>         struct folio *p_folio;
>         int ret;
> 
>         ret = kvm_kmem_gmem_write_inner(inode, index, buf, &p_folio, 1);
> 
>         if (ret == 1)
>                 return p_folio;
>         else
>                 return ERR_PTR(ret);
> 
> Would remove a few lines of duplicated code and use only one prototype.

Indeed!  Thanks for the suggestion, will apply in the next revision.

> 
> Mike
> 
> +{
>> +     struct folio *folio;
>> +     void *vaddr;
>> +     int ret = 0;
>> +
>> +     folio = kvm_gmem_get_folio(inode, index);
>> +     if (IS_ERR(folio))
>> +             return ERR_PTR(-EFAULT);
>> +
>> +     if (folio_test_hwpoison(folio)) {
>> +             ret = -EFAULT;
>> +             goto out_unlock_put;
>> +     }
>> +
>> +     if (folio_test_uptodate(folio)) {
>> +             ret = -ENOSPC;
>> +             goto out_unlock_put;
>> +     }
>> +
>> +     folio_unlock(folio);
>> +
>> +     vaddr = kmap_local_folio(folio, 0);
>> +     ret = copy_from_user(vaddr, buf, PAGE_SIZE);
>> +     if (ret)
>> +             ret = -EINVAL;
>> +     kunmap_local(vaddr);
>> +
>> +     if (ret) {
>> +             folio_put(folio);
>> +             kvm_gmem_mark_prepared(folio);
>> +             goto out_err;
>> +     }
>> +
>> +     folio_put(folio);
>> +
>> +     return folio;
>> +
>> +out_unlock_put:
>> +     folio_unlock(folio);
>> +     folio_put(folio);
>> +out_err:
>> +     return ERR_PTR(ret);
>> +}
>> +
>>   static ssize_t kvm_kmem_gmem_write(struct file *file, const char 
>> __user *buf,
>>                                  size_t count, loff_t *offset)
>>   {
>> +     struct inode *inode = file_inode(file);
>> +     int ret = 0, batch_size = FILEMAP_GET_FOLIOS_BATCH_SIZE;
>>       pgoff_t start, end, index;
>> -     ssize_t ret = 0;
>>
>>       if (!PAGE_ALIGNED(*offset) || !PAGE_ALIGNED(count))
>>               return -EINVAL;
>>
>> -     if (*offset + count > i_size_read(file_inode(file)))
>> +     if (*offset + count > i_size_read(inode))
>>               return -EINVAL;
>>
>>       if (!buf)
>> @@ -123,9 +240,8 @@ static ssize_t kvm_kmem_gmem_write(struct file 
>> *file, const char __user *buf,
>>
>>       filemap_invalidate_lock(file->f_mapping);
>>
>> -     for (index = start; index < end; ) {
>> -             struct folio *folio;
>> -             void *vaddr;
>> +     for (index = start; index + batch_size - 1 < end; ) {
>> +             struct folio *folios[FILEMAP_GET_FOLIOS_BATCH_SIZE] = 
>> { NULL };
>>               pgoff_t buf_offset = (index - start) << PAGE_SHIFT;
>>
>>               if (signal_pending(current)) {
>> @@ -133,46 +249,40 @@ static ssize_t kvm_kmem_gmem_write(struct file 
>> *file, const char __user *buf,
>>                       goto out;
>>               }
>>
>> -             folio = kvm_gmem_get_folio(file_inode(file), index);
>> -             if (IS_ERR(folio)) {
>> -                     ret = -EFAULT;
>> +             ret = kvm_kmem_gmem_write_inner(inode, index, buf + 
>> buf_offset, folios, batch_size);
>> +             if (ret < 0)
>>                       goto out;
>> -             }
>>
>> -             if (folio_test_hwpoison(folio)) {
>> -                     folio_unlock(folio);
>> -                     folio_put(folio);
>> -                     ret = -EFAULT;
>> +             index += ret;
>> +             if (ret < batch_size)
>> +                     break;
>> +     }
>> +
>> +     for (; index < end; index++) {
>> +             struct folio *folio;
>> +             pgoff_t buf_offset = (index - start) << PAGE_SHIFT;
>> +
>> +             if (signal_pending(current)) {
>> +                     ret = -EINTR;
>>                       goto out;
>>               }
>>
>> -             if (folio_test_uptodate(folio)) {
>> -                     folio_unlock(folio);
>> -                     folio_put(folio);
>> -                     ret = -ENOSPC;
>> +             folio = kvm_kmem_gmem_write_folio(inode, index,
>> +                                               buf + buf_offset);
>> +             if (IS_ERR(folio)) {
>> +                     ret = PTR_ERR(folio);
>>                       goto out;
>>               }
>> -
>> -             folio_unlock(folio);
>> -
>> -             vaddr = kmap_local_folio(folio, 0);
>> -             ret = copy_from_user(vaddr, buf + buf_offset, PAGE_SIZE);
>> -             if (ret)
>> -                     ret = -EINVAL;
>> -             kunmap_local(vaddr);
>> -
>> -             kvm_gmem_mark_prepared(folio);
>> -             folio_put(folio);
>> -
>> -             index = folio_next_index(folio);
>> -             *offset += PAGE_SIZE;
>>       }
>>
>>   out:
>>       filemap_invalidate_unlock(file->f_mapping);
>> +     if (index > start) {
>> +             *offset += (index - start) << PAGE_SHIFT;
>> +             return (index - start) << PAGE_SHIFT;
>> +     }
>>
>> -     return ret && start == (*offset >> PAGE_SHIFT) ?
>> -             ret : *offset - (start << PAGE_SHIFT);
>> +     return ret;
>>   }
>>   #endif
>>



