[PATCH 0/2] mm: optimize shadow entries removal

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] mm: optimize shadow entries removal
@ 2024-09-11 17:37 Shakeel Butt
  2024-09-11 17:38 ` [PATCH 1/2] mm: optimize truncation of shadow entries Shakeel Butt
  2024-09-11 17:38 ` [PATCH 2/2] mm: optimize invalidation " Shakeel Butt
  0 siblings, 2 replies; 5+ messages in thread
From: Shakeel Butt @ 2024-09-11 17:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Johannes Weiner, Omar Sandoval, Chris Mason,
	linux-mm, linux-kernel, Meta kernel team, linux-fsdevel

Some of our production workloads which processes a large amount of data
spends considerable amount of CPUs on truncation and invalidation of
large sized files (100s of GiBs of size). Tracing the operations showed
that most of the time is in shadow entries removal. This patch series
optimizes the truncation and invalidation operations.

(This is 6.13+ material)

Shakeel Butt (2):
  mm: optimize truncation of shadow entries
  mm: optimize invalidation of shadow entries

 mm/truncate.c | 96 ++++++++++++++++++++++-----------------------------
 1 file changed, 41 insertions(+), 55 deletions(-)

-- 
2.43.5



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] mm: optimize truncation of shadow entries
  2024-09-11 17:37 [PATCH 0/2] mm: optimize shadow entries removal Shakeel Butt
@ 2024-09-11 17:38 ` Shakeel Butt
  2024-09-11 21:08   ` Johannes Weiner
  2024-09-11 17:38 ` [PATCH 2/2] mm: optimize invalidation " Shakeel Butt
  1 sibling, 1 reply; 5+ messages in thread
From: Shakeel Butt @ 2024-09-11 17:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Johannes Weiner, Omar Sandoval, Chris Mason,
	linux-mm, linux-kernel, Meta kernel team, linux-fsdevel

The kernel truncates the page cache in batches of PAGEVEC_SIZE. For each
batch, it traverses the page cache tree and collects the entries (folio
and shadow entries) in the struct folio_batch. For the shadow entries
present in the folio_batch, it has to traverse the page cache tree for
each individual entry to remove them. This patch optimize this by
removing them in a single tree traversal.

On large machines in our production which run workloads manipulating
large amount of data, we have observed that a large amount of CPUs are
spent on truncation of very large files (100s of GiBs file sizes). More
specifically most of time was spent on shadow entries cleanup, so
optimizing the shadow entries cleanup, even a little bit, has good
impact.

To evaluate the changes, we created 200GiB file on a fuse fs and in a
memcg. We created the shadow entries by triggering reclaim through
memory.reclaim in that specific memcg and measure the simple truncation
operation.

 # time truncate -s 0 file

              time (sec)
Without       5.164 +- 0.059
With-patch    4.21  +- 0.066 (18.47% decrease)

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/truncate.c | 50 +++++++++++++++++++++++---------------------------
 1 file changed, 23 insertions(+), 27 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 0668cd340a46..c7c19c816c2e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -72,50 +72,46 @@ static void clear_shadow_entries(struct address_space *mapping,
 static void truncate_folio_batch_exceptionals(struct address_space *mapping,
 				struct folio_batch *fbatch, pgoff_t *indices)
 {
+	XA_STATE(xas, &mapping->i_pages, indices[0]);
+	int nr = folio_batch_count(fbatch);
+	struct folio *folio;
 	int i, j;
-	bool dax;
 
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
 		return;
 
-	for (j = 0; j < folio_batch_count(fbatch); j++)
+	for (j = 0; j < nr; j++)
 		if (xa_is_value(fbatch->folios[j]))
 			break;
 
-	if (j == folio_batch_count(fbatch))
+	if (j == nr)
 		return;
 
-	dax = dax_mapping(mapping);
-	if (!dax) {
-		spin_lock(&mapping->host->i_lock);
-		xa_lock_irq(&mapping->i_pages);
+	if (dax_mapping(mapping)) {
+		for (i = j; i < nr; i++) {
+			if (xa_is_value(fbatch->folios[i]))
+				dax_delete_mapping_entry(mapping, indices[i]);
+		}
+		goto out;
 	}
 
-	for (i = j; i < folio_batch_count(fbatch); i++) {
-		struct folio *folio = fbatch->folios[i];
-		pgoff_t index = indices[i];
-
-		if (!xa_is_value(folio)) {
-			fbatch->folios[j++] = folio;
-			continue;
-		}
+	xas_set_update(&xas, workingset_update_node);
 
-		if (unlikely(dax)) {
-			dax_delete_mapping_entry(mapping, index);
-			continue;
-		}
+	spin_lock(&mapping->host->i_lock);
+	xas_lock_irq(&xas);
 
-		__clear_shadow_entry(mapping, index, folio);
+	xas_for_each(&xas, folio, indices[nr-1]) {
+		if (xa_is_value(folio))
+			xas_store(&xas, NULL);
 	}
 
-	if (!dax) {
-		xa_unlock_irq(&mapping->i_pages);
-		if (mapping_shrinkable(mapping))
-			inode_add_lru(mapping->host);
-		spin_unlock(&mapping->host->i_lock);
-	}
-	fbatch->nr = j;
+	xas_unlock_irq(&xas);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
+out:
+	folio_batch_remove_exceptionals(fbatch);
 }
 
 /**
-- 
2.43.5



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 2/2] mm: optimize invalidation of shadow entries
  2024-09-11 17:37 [PATCH 0/2] mm: optimize shadow entries removal Shakeel Butt
  2024-09-11 17:38 ` [PATCH 1/2] mm: optimize truncation of shadow entries Shakeel Butt
@ 2024-09-11 17:38 ` Shakeel Butt
  1 sibling, 0 replies; 5+ messages in thread
From: Shakeel Butt @ 2024-09-11 17:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Johannes Weiner, Omar Sandoval, Chris Mason,
	linux-mm, linux-kernel, Meta kernel team, linux-fsdevel

The kernel invalidates the page cache in batches of PAGEVEC_SIZE. For
each batch, it traverses the page cache tree and collects the entries
(folio and shadow entries) in the struct folio_batch. For the shadow
entries present in the folio_batch, it has to traverse the page cache
tree for each individual entry to remove them. This patch optimize this
by removing them in a single tree traversal.

To evaluate the changes, we created 200GiB file on a fuse fs and in a
memcg. We created the shadow entries by triggering reclaim through
memory.reclaim in that specific memcg and measure the simple
fadvise(DONTNEED) operation.

 # time xfs_io -c 'fadvise -d 0 ${file_size}' file

              time (sec)
Without       5.12 +- 0.061
With-patch    4.19 +- 0.086 (18.16% decrease)

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/truncate.c | 46 ++++++++++++++++++----------------------------
 1 file changed, 18 insertions(+), 28 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index c7c19c816c2e..793c0d17d7b4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -23,42 +23,28 @@
 #include <linux/rmap.h>
 #include "internal.h"
 
-/*
- * Regular page slots are stabilized by the page lock even without the tree
- * itself locked.  These unlocked entries need verification under the tree
- * lock.
- */
-static inline void __clear_shadow_entry(struct address_space *mapping,
-				pgoff_t index, void *entry)
-{
-	XA_STATE(xas, &mapping->i_pages, index);
-
-	xas_set_update(&xas, workingset_update_node);
-	if (xas_load(&xas) != entry)
-		return;
-	xas_store(&xas, NULL);
-}
-
 static void clear_shadow_entries(struct address_space *mapping,
-				 struct folio_batch *fbatch, pgoff_t *indices)
+				 unsigned long start, unsigned long max)
 {
-	int i;
+	XA_STATE(xas, &mapping->i_pages, start);
+	struct folio *folio;
 
 	/* Handled by shmem itself, or for DAX we do nothing. */
 	if (shmem_mapping(mapping) || dax_mapping(mapping))
 		return;
 
-	spin_lock(&mapping->host->i_lock);
-	xa_lock_irq(&mapping->i_pages);
+	xas_set_update(&xas, workingset_update_node);
 
-	for (i = 0; i < folio_batch_count(fbatch); i++) {
-		struct folio *folio = fbatch->folios[i];
+	spin_lock(&mapping->host->i_lock);
+	xas_lock_irq(&xas);
 
+	/* Clear all shadow entries from start to max */
+	xas_for_each(&xas, folio, max) {
 		if (xa_is_value(folio))
-			__clear_shadow_entry(mapping, indices[i], folio);
+			xas_store(&xas, NULL);
 	}
 
-	xa_unlock_irq(&mapping->i_pages);
+	xas_unlock_irq(&xas);
 	if (mapping_shrinkable(mapping))
 		inode_add_lru(mapping->host);
 	spin_unlock(&mapping->host->i_lock);
@@ -478,7 +464,9 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
 
 	folio_batch_init(&fbatch);
 	while (find_lock_entries(mapping, &index, end, &fbatch, indices)) {
-		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+		int nr = folio_batch_count(&fbatch);
+
+		for (i = 0; i < nr; i++) {
 			struct folio *folio = fbatch.folios[i];
 
 			/* We rely upon deletion not changing folio->index */
@@ -505,7 +493,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
 		}
 
 		if (xa_has_values)
-			clear_shadow_entries(mapping, &fbatch, indices);
+			clear_shadow_entries(mapping, indices[0], indices[nr-1]);
 
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
@@ -609,7 +597,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	folio_batch_init(&fbatch);
 	index = start;
 	while (find_get_entries(mapping, &index, end, &fbatch, indices)) {
-		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+		int nr = folio_batch_count(&fbatch);
+
+		for (i = 0; i < nr; i++) {
 			struct folio *folio = fbatch.folios[i];
 
 			/* We rely upon deletion not changing folio->index */
@@ -655,7 +645,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		}
 
 		if (xa_has_values)
-			clear_shadow_entries(mapping, &fbatch, indices);
+			clear_shadow_entries(mapping, indices[0], indices[nr-1]);
 
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
-- 
2.43.5



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/2] mm: optimize truncation of shadow entries
  2024-09-11 17:38 ` [PATCH 1/2] mm: optimize truncation of shadow entries Shakeel Butt
@ 2024-09-11 21:08   ` Johannes Weiner
  2024-09-11 22:20     ` Shakeel Butt
  0 siblings, 1 reply; 5+ messages in thread
From: Johannes Weiner @ 2024-09-11 21:08 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Matthew Wilcox, Omar Sandoval, Chris Mason,
	linux-mm, linux-kernel, Meta kernel team, linux-fsdevel

On Wed, Sep 11, 2024 at 10:38:00AM -0700, Shakeel Butt wrote:
> The kernel truncates the page cache in batches of PAGEVEC_SIZE. For each
> batch, it traverses the page cache tree and collects the entries (folio
> and shadow entries) in the struct folio_batch. For the shadow entries
> present in the folio_batch, it has to traverse the page cache tree for
> each individual entry to remove them. This patch optimize this by
> removing them in a single tree traversal.
> 
> On large machines in our production which run workloads manipulating
> large amount of data, we have observed that a large amount of CPUs are
> spent on truncation of very large files (100s of GiBs file sizes). More
> specifically most of time was spent on shadow entries cleanup, so
> optimizing the shadow entries cleanup, even a little bit, has good
> impact.
> 
> To evaluate the changes, we created 200GiB file on a fuse fs and in a
> memcg. We created the shadow entries by triggering reclaim through
> memory.reclaim in that specific memcg and measure the simple truncation
> operation.
> 
>  # time truncate -s 0 file
> 
>               time (sec)
> Without       5.164 +- 0.059
> With-patch    4.21  +- 0.066 (18.47% decrease)
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Looks good to me. One thing that's a bit subtle is that the tree walk
assumes indices[] are ordered, such that indices[0] and indices[nr-1]
reliably denote the range of interest. AFAICS that's the case for the
current callers but if not that could be a painful bug to hunt down.

Assessing lowest and highest index in that first batch iteration seems
a bit overkill though. Maybe just a comment stating the requirement?

Otherwise,

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/2] mm: optimize truncation of shadow entries
  2024-09-11 21:08   ` Johannes Weiner
@ 2024-09-11 22:20     ` Shakeel Butt
  0 siblings, 0 replies; 5+ messages in thread
From: Shakeel Butt @ 2024-09-11 22:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Matthew Wilcox, Omar Sandoval, Chris Mason,
	linux-mm, linux-kernel, Meta kernel team, linux-fsdevel

On Wed, Sep 11, 2024 at 05:08:24PM GMT, Johannes Weiner wrote:
> On Wed, Sep 11, 2024 at 10:38:00AM -0700, Shakeel Butt wrote:
> > The kernel truncates the page cache in batches of PAGEVEC_SIZE. For each
> > batch, it traverses the page cache tree and collects the entries (folio
> > and shadow entries) in the struct folio_batch. For the shadow entries
> > present in the folio_batch, it has to traverse the page cache tree for
> > each individual entry to remove them. This patch optimize this by
> > removing them in a single tree traversal.
> > 
> > On large machines in our production which run workloads manipulating
> > large amount of data, we have observed that a large amount of CPUs are
> > spent on truncation of very large files (100s of GiBs file sizes). More
> > specifically most of time was spent on shadow entries cleanup, so
> > optimizing the shadow entries cleanup, even a little bit, has good
> > impact.
> > 
> > To evaluate the changes, we created 200GiB file on a fuse fs and in a
> > memcg. We created the shadow entries by triggering reclaim through
> > memory.reclaim in that specific memcg and measure the simple truncation
> > operation.
> > 
> >  # time truncate -s 0 file
> > 
> >               time (sec)
> > Without       5.164 +- 0.059
> > With-patch    4.21  +- 0.066 (18.47% decrease)
> > 
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> 
> Looks good to me. One thing that's a bit subtle is that the tree walk
> assumes indices[] are ordered, such that indices[0] and indices[nr-1]
> reliably denote the range of interest. AFAICS that's the case for the
> current callers but if not that could be a painful bug to hunt down.

The current callers use find_get_entries() and find_lock_entries() to
fill up the indices array which provides this guarantee.

> 
> Assessing lowest and highest index in that first batch iteration seems
> a bit overkill though. Maybe just a comment stating the requirement?

I will add a comment in v2.

> 
> Otherwise,
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for the review.


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-09-11 22:20 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-11 17:37 [PATCH 0/2] mm: optimize shadow entries removal Shakeel Butt
2024-09-11 17:38 ` [PATCH 1/2] mm: optimize truncation of shadow entries Shakeel Butt
2024-09-11 21:08   ` Johannes Weiner
2024-09-11 22:20     ` Shakeel Butt
2024-09-11 17:38 ` [PATCH 2/2] mm: optimize invalidation " Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox