* [PATCH] mm: Optimize TLB flushes during page reclaim
@ 2025-01-20 22:47 Vinay Banakar
2025-01-21 0:05 ` Vinay Banakar
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Vinay Banakar @ 2025-01-20 22:47 UTC (permalink / raw)
To: linux-mm, linux-kernel; +Cc: akpm, willy, mgorman, Wei Xu, Greg Thelen
The current implementation in shrink_folio_list() performs full TLB
flushes and issues IPIs for each individual page being reclaimed. This
causes unnecessary overhead during memory reclaim, whether triggered
by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios where
applications are actively moving cold pages to swap while maintaining
high performance requirements for hot pages.
The current code:
1. Clears PTE and unmaps each page individually
2. Performs a full TLB flush on all cores using the VMA (via CR3 write) or
issues individual TLB shootdowns (invlpg+invpcid) for single-core usage
3. Submits each page individually to BIO
This approach results in:
- Excessive full TLB flushes across all cores
- Unnecessary IPI storms when processing multiple pages
- Suboptimal I/O submission patterns
I initially tried using selective TLB shootdowns (invlpg) instead of
full TLB flushes per each page to avoid interference with other
threads. However, this approach still required sending IPIs to all
cores for each page, which did not significantly improve application
throughput.
This patch instead optimizes the process by batching operations,
issuing one IPI per PMD instead of per page. This reduces interrupts
by a factor of 512 and enables batching page submissions to BIO. The
new approach:
1. Collect dirty pages that need to be written back
2. Issue a single TLB flush for all dirty pages in the batch
3. Process the collected pages for writebacks (submit to BIO)
Testing shows significant reduction in application throughput impact
during page-out operations. Applications maintain better performance
during memory reclaim, when triggered by explicit
madvise(MADV_PAGEOUT) calls.
I'd appreciate your feedback on this approach, especially on the
correctness of batched BIO submissions. Looking forward to your
comments.
Signed-off-by: Vinay Banakar <vny@google.com>
---
mm/vmscan.c | 107
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
1 file changed, 74 insertions(+), 33 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd489c1af..1bd510622 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,
struct folio_batch free_folios;
LIST_HEAD(ret_folios);
LIST_HEAD(demote_folios);
+ LIST_HEAD(pageout_list);
unsigned int nr_reclaimed = 0;
unsigned int pgactivate = 0;
bool do_demote_pass;
@@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,
if (!sc->may_writepage)
goto keep_locked;
- /*
- * Folio is dirty. Flush the TLB if a writable entry
- * potentially exists to avoid CPU writes after I/O
- * starts and then write it out here.
- */
- try_to_unmap_flush_dirty();
- switch (pageout(folio, mapping, &plug)) {
- case PAGE_KEEP:
- goto keep_locked;
- case PAGE_ACTIVATE:
- goto activate_locked;
- case PAGE_SUCCESS:
- stat->nr_pageout += nr_pages;
-
- if (folio_test_writeback(folio))
- goto keep;
- if (folio_test_dirty(folio))
- goto keep;
-
- /*
- * A synchronous write - probably a ramdisk. Go
- * ahead and try to reclaim the folio.
- */
- if (!folio_trylock(folio))
- goto keep;
- if (folio_test_dirty(folio) ||
- folio_test_writeback(folio))
- goto keep_locked;
- mapping = folio_mapping(folio);
- fallthrough;
- case PAGE_CLEAN:
- ; /* try to free the folio below */
- }
+ /* Add to pageout list for deferred BIO submissions */
+ list_add(&folio->lru, &pageout_list);
+ continue;
}
/*
@@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,
}
/* 'folio_list' is always empty here */
+ if (!list_empty(&pageout_list)) {
+ /*
+ * Batch TLB flushes by flushing once before processing all dirty pages.
+ * Since we operate on one PMD at a time, this batches TLB flushes at
+ * PMD granularity rather than per-page, reducing IPIs.
+ */
+ struct address_space *mapping;
+ try_to_unmap_flush_dirty();
+
+ while (!list_empty(&pageout_list)) {
+ struct folio *folio = lru_to_folio(&pageout_list);
+ list_del(&folio->lru);
+
+ /* Recheck if page got reactivated */
+ if (folio_test_active(folio) ||
+ (folio_mapped(folio) && folio_test_young(folio)))
+ goto skip_pageout_locked;
+
+ mapping = folio_mapping(folio);
+ pageout_t pageout_res = pageout(folio, mapping, &plug);
+ switch (pageout_res) {
+ case PAGE_KEEP:
+ goto skip_pageout_locked;
+ case PAGE_ACTIVATE:
+ goto skip_pageout_locked;
+ case PAGE_SUCCESS:
+ stat->nr_pageout += folio_nr_pages(folio);
+
+ if (folio_test_writeback(folio) ||
+ folio_test_dirty(folio))
+ goto skip_pageout;
+
+ /*
+ * A synchronous write - probably a ramdisk. Go
+ * ahead and try to reclaim the folio.
+ */
+ if (!folio_trylock(folio))
+ goto skip_pageout;
+ if (folio_test_dirty(folio) ||
+ folio_test_writeback(folio))
+ goto skip_pageout_locked;
+
+ // Try to free the page
+ if (!mapping ||
+ !__remove_mapping(mapping, folio, true,
+ sc->target_mem_cgroup))
+ goto skip_pageout_locked;
+
+ nr_reclaimed += folio_nr_pages(folio);
+ folio_unlock(folio);
+ continue;
+
+ case PAGE_CLEAN:
+ if (!mapping ||
+ !__remove_mapping(mapping, folio, true,
+ sc->target_mem_cgroup))
+ goto skip_pageout_locked;
+
+ nr_reclaimed += folio_nr_pages(folio);
+ folio_unlock(folio);
+ continue;
+ }
+
+skip_pageout_locked:
+ folio_unlock(folio);
+skip_pageout:
+ list_add(&folio->lru, &ret_folios);
+ }
+ }
+
/* Migrate folios selected for demotion */
nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
/* Folios that could not be demoted are still in @demote_folios */
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-20 22:47 [PATCH] mm: Optimize TLB flushes during page reclaim Vinay Banakar
@ 2025-01-21 0:05 ` Vinay Banakar
2025-01-22 8:59 ` Bharata B Rao
2025-01-21 1:43 ` Byungchul Park
2025-01-23 4:17 ` Rik van Riel
2 siblings, 1 reply; 16+ messages in thread
From: Vinay Banakar @ 2025-01-21 0:05 UTC (permalink / raw)
To: linux-mm, linux-kernel; +Cc: akpm, willy, mgorman, Wei Xu, Greg Thelen
Sorry, the previous patch was unreadable due to damaged whitespace.
Here is the same patch with fixed indentation.
Signed-off-by: Vinay Banakar <vny@google.com>
---
mm/vmscan.c | 107
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
1 file changed, 74 insertions(+), 33 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd489c1af..1bd510622 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,
struct folio_batch free_folios;
LIST_HEAD(ret_folios);
LIST_HEAD(demote_folios);
+ LIST_HEAD(pageout_list);
unsigned int nr_reclaimed = 0;
unsigned int pgactivate = 0;
bool do_demote_pass;
@@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,
if (!sc->may_writepage)
goto keep_locked;
- /*
- * Folio is dirty. Flush the TLB if a writable entry
- * potentially exists to avoid CPU writes after I/O
- * starts and then write it out here.
- */
- try_to_unmap_flush_dirty();
- switch (pageout(folio, mapping, &plug)) {
- case PAGE_KEEP:
- goto keep_locked;
- case PAGE_ACTIVATE:
- goto activate_locked;
- case PAGE_SUCCESS:
- stat->nr_pageout += nr_pages;
-
- if (folio_test_writeback(folio))
- goto keep;
- if (folio_test_dirty(folio))
- goto keep;
-
- /*
- * A synchronous write - probably a ramdisk. Go
- * ahead and try to reclaim the folio.
- */
- if (!folio_trylock(folio))
- goto keep;
- if (folio_test_dirty(folio) ||
- folio_test_writeback(folio))
- goto keep_locked;
- mapping = folio_mapping(folio);
- fallthrough;
- case PAGE_CLEAN:
- ; /* try to free the folio below */
- }
+ /* Add to pageout list for deferred BIO submissions */
+ list_add(&folio->lru, &pageout_list);
+ continue;
}
/*
@@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,
}
/* 'folio_list' is always empty here */
+ if (!list_empty(&pageout_list)) {
+ /*
+ * Batch TLB flushes by flushing once before processing all dirty pages.
+ * Since we operate on one PMD at a time, this batches TLB flushes at
+ * PMD granularity rather than per-page, reducing IPIs.
+ */
+ struct address_space *mapping;
+ try_to_unmap_flush_dirty();
+
+ while (!list_empty(&pageout_list)) {
+ struct folio *folio = lru_to_folio(&pageout_list);
+ list_del(&folio->lru);
+
+ /* Recheck if page got reactivated */
+ if (folio_test_active(folio) ||
+ (folio_mapped(folio) && folio_test_young(folio)))
+ goto skip_pageout_locked;
+
+ mapping = folio_mapping(folio);
+ pageout_t pageout_res = pageout(folio, mapping, &plug);
+ switch (pageout_res) {
+ case PAGE_KEEP:
+ goto skip_pageout_locked;
+ case PAGE_ACTIVATE:
+ goto skip_pageout_locked;
+ case PAGE_SUCCESS:
+ stat->nr_pageout += folio_nr_pages(folio);
+
+ if (folio_test_writeback(folio) ||
+ folio_test_dirty(folio))
+ goto skip_pageout;
+
+ /*
+ * A synchronous write - probably a ramdisk. Go
+ * ahead and try to reclaim the folio.
+ */
+ if (!folio_trylock(folio))
+ goto skip_pageout;
+ if (folio_test_dirty(folio) ||
+ folio_test_writeback(folio))
+ goto skip_pageout_locked;
+
+ // Try to free the page
+ if (!mapping ||
+ !__remove_mapping(mapping, folio, true,
+ sc->target_mem_cgroup))
+ goto skip_pageout_locked;
+
+ nr_reclaimed += folio_nr_pages(folio);
+ folio_unlock(folio);
+ continue;
+
+ case PAGE_CLEAN:
+ if (!mapping ||
+ !__remove_mapping(mapping, folio, true,
+ sc->target_mem_cgroup))
+ goto skip_pageout_locked;
+
+ nr_reclaimed += folio_nr_pages(folio);
+ folio_unlock(folio);
+ continue;
+ }
+
+skip_pageout_locked:
+ folio_unlock(folio);
+skip_pageout:
+ list_add(&folio->lru, &ret_folios);
+ }
+ }
+
/* Migrate folios selected for demotion */
nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
/* Folios that could not be demoted are still in @demote_folios */
On Mon, Jan 20, 2025 at 4:47 PM Vinay Banakar <vny@google.com> wrote:
>
> The current implementation in shrink_folio_list() performs full TLB
> flushes and issues IPIs for each individual page being reclaimed. This
> causes unnecessary overhead during memory reclaim, whether triggered
> by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios where
> applications are actively moving cold pages to swap while maintaining
> high performance requirements for hot pages.
>
> The current code:
> 1. Clears PTE and unmaps each page individually
> 2. Performs a full TLB flush on all cores using the VMA (via CR3 write) or
> issues individual TLB shootdowns (invlpg+invpcid) for single-core usage
> 3. Submits each page individually to BIO
>
> This approach results in:
> - Excessive full TLB flushes across all cores
> - Unnecessary IPI storms when processing multiple pages
> - Suboptimal I/O submission patterns
>
> I initially tried using selective TLB shootdowns (invlpg) instead of
> full TLB flushes per each page to avoid interference with other
> threads. However, this approach still required sending IPIs to all
> cores for each page, which did not significantly improve application
> throughput.
>
> This patch instead optimizes the process by batching operations,
> issuing one IPI per PMD instead of per page. This reduces interrupts
> by a factor of 512 and enables batching page submissions to BIO. The
> new approach:
> 1. Collect dirty pages that need to be written back
> 2. Issue a single TLB flush for all dirty pages in the batch
> 3. Process the collected pages for writebacks (submit to BIO)
>
> Testing shows significant reduction in application throughput impact
> during page-out operations. Applications maintain better performance
> during memory reclaim, when triggered by explicit
> madvise(MADV_PAGEOUT) calls.
>
> I'd appreciate your feedback on this approach, especially on the
> correctness of batched BIO submissions. Looking forward to your
> comments.
>
> Signed-off-by: Vinay Banakar <vny@google.com>
> ---
> mm/vmscan.c | 107
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
> 1 file changed, 74 insertions(+), 33 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bd489c1af..1bd510622 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> struct folio_batch free_folios;
> LIST_HEAD(ret_folios);
> LIST_HEAD(demote_folios);
> + LIST_HEAD(pageout_list);
> unsigned int nr_reclaimed = 0;
> unsigned int pgactivate = 0;
> bool do_demote_pass;
> @@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> if (!sc->may_writepage)
> goto keep_locked;
>
> - /*
> - * Folio is dirty. Flush the TLB if a writable entry
> - * potentially exists to avoid CPU writes after I/O
> - * starts and then write it out here.
> - */
> - try_to_unmap_flush_dirty();
> - switch (pageout(folio, mapping, &plug)) {
> - case PAGE_KEEP:
> - goto keep_locked;
> - case PAGE_ACTIVATE:
> - goto activate_locked;
> - case PAGE_SUCCESS:
> - stat->nr_pageout += nr_pages;
> -
> - if (folio_test_writeback(folio))
> - goto keep;
> - if (folio_test_dirty(folio))
> - goto keep;
> -
> - /*
> - * A synchronous write - probably a ramdisk. Go
> - * ahead and try to reclaim the folio.
> - */
> - if (!folio_trylock(folio))
> - goto keep;
> - if (folio_test_dirty(folio) ||
> - folio_test_writeback(folio))
> - goto keep_locked;
> - mapping = folio_mapping(folio);
> - fallthrough;
> - case PAGE_CLEAN:
> - ; /* try to free the folio below */
> - }
> + /* Add to pageout list for deferred BIO submissions */
> + list_add(&folio->lru, &pageout_list);
> + continue;
> }
>
> /*
> @@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> }
> /* 'folio_list' is always empty here */
>
> + if (!list_empty(&pageout_list)) {
> + /*
> + * Batch TLB flushes by flushing once before processing all dirty pages.
> + * Since we operate on one PMD at a time, this batches TLB flushes at
> + * PMD granularity rather than per-page, reducing IPIs.
> + */
> + struct address_space *mapping;
> + try_to_unmap_flush_dirty();
> +
> + while (!list_empty(&pageout_list)) {
> + struct folio *folio = lru_to_folio(&pageout_list);
> + list_del(&folio->lru);
> +
> + /* Recheck if page got reactivated */
> + if (folio_test_active(folio) ||
> + (folio_mapped(folio) && folio_test_young(folio)))
> + goto skip_pageout_locked;
> +
> + mapping = folio_mapping(folio);
> + pageout_t pageout_res = pageout(folio, mapping, &plug);
> + switch (pageout_res) {
> + case PAGE_KEEP:
> + goto skip_pageout_locked;
> + case PAGE_ACTIVATE:
> + goto skip_pageout_locked;
> + case PAGE_SUCCESS:
> + stat->nr_pageout += folio_nr_pages(folio);
> +
> + if (folio_test_writeback(folio) ||
> + folio_test_dirty(folio))
> + goto skip_pageout;
> +
> + /*
> + * A synchronous write - probably a ramdisk. Go
> + * ahead and try to reclaim the folio.
> + */
> + if (!folio_trylock(folio))
> + goto skip_pageout;
> + if (folio_test_dirty(folio) ||
> + folio_test_writeback(folio))
> + goto skip_pageout_locked;
> +
> + // Try to free the page
> + if (!mapping ||
> + !__remove_mapping(mapping, folio, true,
> + sc->target_mem_cgroup))
> + goto skip_pageout_locked;
> +
> + nr_reclaimed += folio_nr_pages(folio);
> + folio_unlock(folio);
> + continue;
> +
> + case PAGE_CLEAN:
> + if (!mapping ||
> + !__remove_mapping(mapping, folio, true,
> + sc->target_mem_cgroup))
> + goto skip_pageout_locked;
> +
> + nr_reclaimed += folio_nr_pages(folio);
> + folio_unlock(folio);
> + continue;
> + }
> +
> +skip_pageout_locked:
> + folio_unlock(folio);
> +skip_pageout:
> + list_add(&folio->lru, &ret_folios);
> + }
> + }
> +
> /* Migrate folios selected for demotion */
> nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> /* Folios that could not be demoted are still in @demote_folios */
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-20 22:47 [PATCH] mm: Optimize TLB flushes during page reclaim Vinay Banakar
2025-01-21 0:05 ` Vinay Banakar
@ 2025-01-21 1:43 ` Byungchul Park
2025-01-21 18:03 ` Vinay Banakar
2025-01-23 4:17 ` Rik van Riel
2 siblings, 1 reply; 16+ messages in thread
From: Byungchul Park @ 2025-01-21 1:43 UTC (permalink / raw)
To: Vinay Banakar
Cc: linux-mm, linux-kernel, akpm, willy, mgorman, Wei Xu,
Greg Thelen, kernel_team
On Mon, Jan 20, 2025 at 04:47:29PM -0600, Vinay Banakar wrote:
> The current implementation in shrink_folio_list() performs full TLB
> flushes and issues IPIs for each individual page being reclaimed. This
> causes unnecessary overhead during memory reclaim, whether triggered
> by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios where
> applications are actively moving cold pages to swap while maintaining
> high performance requirements for hot pages.
>
> The current code:
> 1. Clears PTE and unmaps each page individually
> 2. Performs a full TLB flush on all cores using the VMA (via CR3 write) or
> issues individual TLB shootdowns (invlpg+invpcid) for single-core usage
> 3. Submits each page individually to BIO
>
> This approach results in:
> - Excessive full TLB flushes across all cores
> - Unnecessary IPI storms when processing multiple pages
> - Suboptimal I/O submission patterns
>
> I initially tried using selective TLB shootdowns (invlpg) instead of
> full TLB flushes per each page to avoid interference with other
> threads. However, this approach still required sending IPIs to all
> cores for each page, which did not significantly improve application
> throughput.
>
> This patch instead optimizes the process by batching operations,
> issuing one IPI per PMD instead of per page. This reduces interrupts
> by a factor of 512 and enables batching page submissions to BIO. The
> new approach:
> 1. Collect dirty pages that need to be written back
> 2. Issue a single TLB flush for all dirty pages in the batch
> 3. Process the collected pages for writebacks (submit to BIO)
The *interesting* IPIs will be reduced to 1/512 at most. Can we see the
improvement numbers?
Byungchul
> Testing shows significant reduction in application throughput impact
> during page-out operations. Applications maintain better performance
> during memory reclaim, when triggered by explicit
> madvise(MADV_PAGEOUT) calls.
>
> I'd appreciate your feedback on this approach, especially on the
> correctness of batched BIO submissions. Looking forward to your
> comments.
>
> Signed-off-by: Vinay Banakar <vny@google.com>
> ---
> mm/vmscan.c | 107
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
> 1 file changed, 74 insertions(+), 33 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bd489c1af..1bd510622 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> struct folio_batch free_folios;
> LIST_HEAD(ret_folios);
> LIST_HEAD(demote_folios);
> + LIST_HEAD(pageout_list);
> unsigned int nr_reclaimed = 0;
> unsigned int pgactivate = 0;
> bool do_demote_pass;
> @@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> if (!sc->may_writepage)
> goto keep_locked;
>
> - /*
> - * Folio is dirty. Flush the TLB if a writable entry
> - * potentially exists to avoid CPU writes after I/O
> - * starts and then write it out here.
> - */
> - try_to_unmap_flush_dirty();
> - switch (pageout(folio, mapping, &plug)) {
> - case PAGE_KEEP:
> - goto keep_locked;
> - case PAGE_ACTIVATE:
> - goto activate_locked;
> - case PAGE_SUCCESS:
> - stat->nr_pageout += nr_pages;
> -
> - if (folio_test_writeback(folio))
> - goto keep;
> - if (folio_test_dirty(folio))
> - goto keep;
> -
> - /*
> - * A synchronous write - probably a ramdisk. Go
> - * ahead and try to reclaim the folio.
> - */
> - if (!folio_trylock(folio))
> - goto keep;
> - if (folio_test_dirty(folio) ||
> - folio_test_writeback(folio))
> - goto keep_locked;
> - mapping = folio_mapping(folio);
> - fallthrough;
> - case PAGE_CLEAN:
> - ; /* try to free the folio below */
> - }
> + /* Add to pageout list for deferred BIO submissions */
> + list_add(&folio->lru, &pageout_list);
> + continue;
> }
>
> /*
> @@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> }
> /* 'folio_list' is always empty here */
>
> + if (!list_empty(&pageout_list)) {
> + /*
> + * Batch TLB flushes by flushing once before processing all dirty pages.
> + * Since we operate on one PMD at a time, this batches TLB flushes at
> + * PMD granularity rather than per-page, reducing IPIs.
> + */
> + struct address_space *mapping;
> + try_to_unmap_flush_dirty();
> +
> + while (!list_empty(&pageout_list)) {
> + struct folio *folio = lru_to_folio(&pageout_list);
> + list_del(&folio->lru);
> +
> + /* Recheck if page got reactivated */
> + if (folio_test_active(folio) ||
> + (folio_mapped(folio) && folio_test_young(folio)))
> + goto skip_pageout_locked;
> +
> + mapping = folio_mapping(folio);
> + pageout_t pageout_res = pageout(folio, mapping, &plug);
> + switch (pageout_res) {
> + case PAGE_KEEP:
> + goto skip_pageout_locked;
> + case PAGE_ACTIVATE:
> + goto skip_pageout_locked;
> + case PAGE_SUCCESS:
> + stat->nr_pageout += folio_nr_pages(folio);
> +
> + if (folio_test_writeback(folio) ||
> + folio_test_dirty(folio))
> + goto skip_pageout;
> +
> + /*
> + * A synchronous write - probably a ramdisk. Go
> + * ahead and try to reclaim the folio.
> + */
> + if (!folio_trylock(folio))
> + goto skip_pageout;
> + if (folio_test_dirty(folio) ||
> + folio_test_writeback(folio))
> + goto skip_pageout_locked;
> +
> + // Try to free the page
> + if (!mapping ||
> + !__remove_mapping(mapping, folio, true,
> + sc->target_mem_cgroup))
> + goto skip_pageout_locked;
> +
> + nr_reclaimed += folio_nr_pages(folio);
> + folio_unlock(folio);
> + continue;
> +
> + case PAGE_CLEAN:
> + if (!mapping ||
> + !__remove_mapping(mapping, folio, true,
> + sc->target_mem_cgroup))
> + goto skip_pageout_locked;
> +
> + nr_reclaimed += folio_nr_pages(folio);
> + folio_unlock(folio);
> + continue;
> + }
> +
> +skip_pageout_locked:
> + folio_unlock(folio);
> +skip_pageout:
> + list_add(&folio->lru, &ret_folios);
> + }
> + }
> +
> /* Migrate folios selected for demotion */
> nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> /* Folios that could not be demoted are still in @demote_folios */
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-21 1:43 ` Byungchul Park
@ 2025-01-21 18:03 ` Vinay Banakar
0 siblings, 0 replies; 16+ messages in thread
From: Vinay Banakar @ 2025-01-21 18:03 UTC (permalink / raw)
To: Byungchul Park, linux-mm, linux-kernel, willy
Cc: akpm, mgorman, Wei Xu, Greg Thelen, kernel_team
On Mon, Jan 20, 2025 at 7:44 PM Byungchul Park <byungchul@sk.com> wrote:
> The *interesting* IPIs will be reduced to 1/512 at most. Can we see the
> improvement numbers?
Yes, we reduce IPIs by a factor of 512 by sending one IPI (for TLB
flush) per PMD rather than per page. Since shrink_folio_list()
operates on one PMD at a time, I believe we can safely batch these
operations here.
Here's a concrete example:
When swapping out 20 GiB (5.2M pages):
- Current: Each page triggers an IPI to all cores
  - With 6 cores: 31.4M total interrupts (6 cores × 5.2M pages)
- With patch: One IPI per PMD (512 pages)
  - Only 10.2K IPIs required (5.2M/512)
  - With 6 cores: 61.4K total interrupts
  - Results in a ~99% reduction in total interrupts
Application performance impact varies by workload, but here's a
representative test case:
- Thread 1: Continuously accesses a 2 GiB private anonymous map (64B
chunks at random offsets)
- Thread 2: Pinned to different core, uses MADV_PAGEOUT on 20 GiB
private anonymous map to swap it out to SSD
- The threads only access their respective maps.
Results:
- Without patch: Thread 1 sees ~53% throughput reduction during
swap. If there are multiple worker threads (like thread 1), the
cumulative throughput degradation will be much higher
- With patch: Thread 1 maintains normal throughput
I expect a similar application performance impact when memory reclaim
is triggered by kswapd.
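
For reference, a minimal userspace sketch of the MADV_PAGEOUT side of
this test could look like the following; the sizes, flags, and error
handling are illustrative assumptions rather than the exact harness
used for the numbers above.

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_SZ	(20UL << 30)	/* 20 GiB, as in the test description */

/* Hypothetical page-out thread: populate a large private anonymous
 * mapping, then ask the kernel to reclaim it with MADV_PAGEOUT. */
static void pageout_worker(void)
{
	char *buf = mmap(NULL, MAP_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		exit(1);

	/* Touch every page so the whole mapping is dirty and resident. */
	memset(buf, 1, MAP_SZ);

	/* One call covers the whole range; the kernel walks it one PMD
	 * (512 PTEs) at a time. */
	if (madvise(buf, MAP_SZ, MADV_PAGEOUT))
		exit(1);
}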
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-21 0:05 ` Vinay Banakar
@ 2025-01-22 8:59 ` Bharata B Rao
2025-01-22 11:09 ` Vinay Banakar
0 siblings, 1 reply; 16+ messages in thread
From: Bharata B Rao @ 2025-01-22 8:59 UTC (permalink / raw)
To: Vinay Banakar, linux-mm, linux-kernel
Cc: akpm, willy, mgorman, Wei Xu, Greg Thelen
On 21-Jan-25 5:35 AM, Vinay Banakar wrote:
> Sorry, the previous patch was unreadable due to damaged whitespace.
> Here is the same patch with fixed indentation.
>
> Signed-off-by: Vinay Banakar <vny@google.com>
> ---
> mm/vmscan.c | 107
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
> 1 file changed, 74 insertions(+), 33 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bd489c1af..1bd510622 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> struct folio_batch free_folios;
> LIST_HEAD(ret_folios);
> LIST_HEAD(demote_folios);
> + LIST_HEAD(pageout_list);
> unsigned int nr_reclaimed = 0;
> unsigned int pgactivate = 0;
> bool do_demote_pass;
> @@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> if (!sc->may_writepage)
> goto keep_locked;
>
> - /*
> - * Folio is dirty. Flush the TLB if a writable entry
> - * potentially exists to avoid CPU writes after I/O
> - * starts and then write it out here.
> - */
> - try_to_unmap_flush_dirty();
> - switch (pageout(folio, mapping, &plug)) {
> - case PAGE_KEEP:
> - goto keep_locked;
> - case PAGE_ACTIVATE:
> - goto activate_locked;
> - case PAGE_SUCCESS:
> - stat->nr_pageout += nr_pages;
> -
> - if (folio_test_writeback(folio))
> - goto keep;
> - if (folio_test_dirty(folio))
> - goto keep;
> -
> - /*
> - * A synchronous write - probably a ramdisk. Go
> - * ahead and try to reclaim the folio.
> - */
> - if (!folio_trylock(folio))
> - goto keep;
> - if (folio_test_dirty(folio) ||
> - folio_test_writeback(folio))
> - goto keep_locked;
> - mapping = folio_mapping(folio);
> - fallthrough;
> - case PAGE_CLEAN:
> - ; /* try to free the folio below */
> - }
> + /* Add to pageout list for deferred BIO submissions */
> + list_add(&folio->lru, &pageout_list);
> + continue;
The dirty pages are collected in a list here...
> }
>
> /*
> @@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> }
> /* 'folio_list' is always empty here */
>
> + if (!list_empty(&pageout_list)) {
> + /*
> + * Batch TLB flushes by flushing once before processing all dirty pages.
> + * Since we operate on one PMD at a time, this batches TLB flushes at
> + * PMD granularity rather than per-page, reducing IPIs.
> + */
> + struct address_space *mapping;
> + try_to_unmap_flush_dirty();
and one flush request is issued for the entire list. Where is the PMD
level (512) batching done? Is that implicit elsewhere in the flow?
> +
> + while (!list_empty(&pageout_list)) {
> + struct folio *folio = lru_to_folio(&pageout_list);
> + list_del(&folio->lru);
> +
> + /* Recheck if page got reactivated */
> + if (folio_test_active(folio) ||
> + (folio_mapped(folio) && folio_test_young(folio)))
> + goto skip_pageout_locked;
> +
> + mapping = folio_mapping(folio);
> + pageout_t pageout_res = pageout(folio, mapping, &plug);
> + switch (pageout_res) {
> + case PAGE_KEEP:
> + goto skip_pageout_locked;
> + case PAGE_ACTIVATE:
> + goto skip_pageout_locked;
> + case PAGE_SUCCESS:
> + stat->nr_pageout += folio_nr_pages(folio);
> +
> + if (folio_test_writeback(folio) ||
> + folio_test_dirty(folio))
> + goto skip_pageout;
> +
> + /*
> + * A synchronous write - probably a ramdisk. Go
> + * ahead and try to reclaim the folio.
> + */
> + if (!folio_trylock(folio))
> + goto skip_pageout;
> + if (folio_test_dirty(folio) ||
> + folio_test_writeback(folio))
> + goto skip_pageout_locked;
> +
> + // Try to free the page
> + if (!mapping ||
> + !__remove_mapping(mapping, folio, true,
> + sc->target_mem_cgroup))
> + goto skip_pageout_locked;
> +
> + nr_reclaimed += folio_nr_pages(folio);
> + folio_unlock(folio);
> + continue;
> +
> + case PAGE_CLEAN:
> + if (!mapping ||
> + !__remove_mapping(mapping, folio, true,
> + sc->target_mem_cgroup))
> + goto skip_pageout_locked;
> +
> + nr_reclaimed += folio_nr_pages(folio);
> + folio_unlock(folio);
> + continue;
> + }
> +
> +skip_pageout_locked:
> + folio_unlock(folio);
> +skip_pageout:
> + list_add(&folio->lru, &ret_folios);
> + }
> + }
> +
> /* Migrate folios selected for demotion */
> nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> /* Folios that could not be demoted are still in @demote_folios */
Regards,
Bharata.
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-22 8:59 ` Bharata B Rao
@ 2025-01-22 11:09 ` Vinay Banakar
2025-01-22 11:31 ` Bharata B Rao
0 siblings, 1 reply; 16+ messages in thread
From: Vinay Banakar @ 2025-01-22 11:09 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-mm, linux-kernel, akpm, willy, mgorman, Wei Xu, Greg Thelen
On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
> and one flush request is issued for the entire list. Where is the PMD
> level (512) batching done? Is that implicit elsewhere in the flow?
Yes, shrink_folio_list() operates on one PMD at a time, so
pageout_list will contain at most 512 pages during each call. Please
refer to madvise_cold_or_pageout_pte_range() in mm/madvise.c for more
details.
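
Very roughly, and with hypothetical helper names rather than the real
signatures, that per-PMD walk looks like this:

/*
 * Greatly simplified sketch of the MADV_PAGEOUT walk: one invocation
 * covers a single PMD (up to PTRS_PER_PTE == 512 PTEs on x86-64), so
 * the folio list handed to reclaim never exceeds one PMD's worth of
 * pages. folio_from_pte_sketch(), isolate_folio_sketch() and
 * reclaim_pages_sketch() are hypothetical stand-ins for the real
 * PTE/LRU/reclaim handling.
 */
static void madvise_pageout_one_pmd_sketch(pte_t *pte, struct list_head *folio_list)
{
	int i;

	for (i = 0; i < PTRS_PER_PTE; i++) {
		struct folio *folio = folio_from_pte_sketch(pte + i);

		if (folio && isolate_folio_sketch(folio))
			list_add(&folio->lru, folio_list);
	}

	/* The whole PMD's worth of folios reaches shrink_folio_list() together. */
	reclaim_pages_sketch(folio_list);
}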
Thanks!
Vinay
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-22 11:09 ` Vinay Banakar
@ 2025-01-22 11:31 ` Bharata B Rao
2025-01-22 13:28 ` Vinay Banakar
0 siblings, 1 reply; 16+ messages in thread
From: Bharata B Rao @ 2025-01-22 11:31 UTC (permalink / raw)
To: Vinay Banakar
Cc: linux-mm, linux-kernel, akpm, willy, mgorman, Wei Xu, Greg Thelen
On 22-Jan-25 4:39 PM, Vinay Banakar wrote:
> On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
>> and one flush request is issued for the entire list. Where is the PMD
>> level (512) batching done? Is that implicit elsewhere in the flow?
>
> Yes, shrink_folio_list() operates on one PMD at a time, so
> pageout_list will contain at most 512 pages during each call. Please
> refer to madvise_cold_or_pageout_pte_range() in mm/madvise.c for more
> details.
While that may be true for MADV_PAGEOUT path, does the same assumption
hold good for other paths from which shrink_folio_list() gets called?
Regards,
Bharata.
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-22 11:31 ` Bharata B Rao
@ 2025-01-22 13:28 ` Vinay Banakar
2025-01-22 20:05 ` SeongJae Park
0 siblings, 1 reply; 16+ messages in thread
From: Vinay Banakar @ 2025-01-22 13:28 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-mm, linux-kernel, akpm, willy, mgorman, Wei Xu, Greg Thelen
On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
> While that may be true for MADV_PAGEOUT path, does the same assumption
> hold good for other paths from which shrink_folio_list() gets called?
shrink_folio_list() is called by three other functions, each with
different batching behavior:
- reclaim_clean_pages_from_list(): Doesn't do PMD batching but only
processes clean pages, so it won't take the path affected by this
patch. This is called from the contiguous memory allocator
(cma_alloc#alloc_contig_range)
- shrink_inactive_list(): Reclaims inactive pages at SWAP_CLUSTER_MAX
(default 32) at a time. With this patch, we will reduce IPIs for TLB
flushes by a factor of 32 in kswapd
- evict_folios(): In the MGLRU case, the number of pages
shrink_folio_list() processes can vary between 64 (MIN_LRU_BATCH) and
4096 (MAX_LRU_BATCH). The reduction in IPIs will vary accordingly
Thanks!
Vinay
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-22 13:28 ` Vinay Banakar
@ 2025-01-22 20:05 ` SeongJae Park
2025-01-23 17:11 ` Vinay Banakar
0 siblings, 1 reply; 16+ messages in thread
From: SeongJae Park @ 2025-01-22 20:05 UTC (permalink / raw)
To: Vinay Banakar
Cc: SeongJae Park, Bharata B Rao, linux-mm, linux-kernel, akpm,
willy, mgorman, Wei Xu, Greg Thelen
On Wed, 22 Jan 2025 07:28:56 -0600 Vinay Banakar <vny@google.com> wrote:
> On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
> > While that may be true for MADV_PAGEOUT path, does the same assumption
> > hold good for other paths from which shrink_folio_list() gets called?
>
> shrink_folio_list() is called by three other functions, each with
> different batching behavior:
> - reclaim_clean_pages_from_list(): Doesn't do PMD batching but only
> processes clean pages, so it won't take the path affected by this
> patch. This is called from the contiguous memory allocator
> (cma_alloc#alloc_contig_range)
> - shrink_inactive_list(): Reclaims inactive pages at SWAP_CLUSTER_MAX
> (default 32) at a time. With this patch, we will reduce IPIs for TLB
> flushes by a factor of 32 in kswapd
> - evict_folios(): In the MGLRU case, the number of pages
> shrink_folio_list() processes can vary between 64 (MIN_LRU_BATCH) and
> 4096 (MAX_LRU_BATCH). The reduction in IPIs will vary accordingly
damon_pa_pageout() from mm/damon/paddr.c also calls shrink_folio_list() similarly
to madvise.c, but it isn't aware of such batching behavior. Have you checked
that path?
Thanks,
SJ
>
> Thanks!
> Vinay
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-20 22:47 [PATCH] mm: Optimize TLB flushes during page reclaim Vinay Banakar
2025-01-21 0:05 ` Vinay Banakar
2025-01-21 1:43 ` Byungchul Park
@ 2025-01-23 4:17 ` Rik van Riel
2025-01-23 19:16 ` Vinay Banakar
2 siblings, 1 reply; 16+ messages in thread
From: Rik van Riel @ 2025-01-23 4:17 UTC (permalink / raw)
To: Vinay Banakar, linux-mm, linux-kernel
Cc: akpm, willy, mgorman, Wei Xu, Greg Thelen
On Mon, 2025-01-20 at 16:47 -0600, Vinay Banakar wrote:
>
> This patch instead optimizes the process by batching operations,
> issuing one IPI per PMD instead of per page. This reduces interrupts
> by a factor of 512 and enables batching page submissions to BIO. The
> new approach:
> 1. Collect dirty pages that need to be written back
> 2. Issue a single TLB flush for all dirty pages in the batch
> 3. Process the collected pages for writebacks (submit to BIO)
>
I see how moving the arch_tlbbatch_flush to
between unmapping the pages and issuing the
I/O could reduce TLB flushing operations
significantly.
However, how does that lead to PMD level
operations?
I don't see any code in your patch that gathers
things at the PMD level. The code simply operates
on whatever folio size is on the list that is
being passed to shrink_folio_list()
Is the PMD level thing some MGLRU specific feature?
>
> I'd appreciate your feedback on this approach, especially on the
> correctness of batched BIO submissions. Looking forward to your
> comments.
>
Maybe the filesystem people have more comments on
this aspect, but from a VM perspective I suspect
that doing that batch flush in one spot, and then
iterating over the pages should be fine.
My main quibble is with the changelog, and the
comment that refers to "PMD level" operations,
when the code does not appear to be doing any
PMD level coalescing.
--
All Rights Reversed.
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-22 20:05 ` SeongJae Park
@ 2025-01-23 17:11 ` Vinay Banakar
2025-01-23 17:23 ` SeongJae Park
2025-01-23 18:17 ` Matthew Wilcox
0 siblings, 2 replies; 16+ messages in thread
From: Vinay Banakar @ 2025-01-23 17:11 UTC (permalink / raw)
To: SeongJae Park
Cc: Bharata B Rao, linux-mm, linux-kernel, akpm, willy, mgorman,
Wei Xu, Greg Thelen
On Wed, Jan 22, 2025 at 2:05 PM SeongJae Park <sj@kernel.org> wrote:
> damon_pa_pageout() from mm/damon/paddr.c also calls shrink_folio_list() similarly
> to madvise.c, but it isn't aware of such batching behavior. Have you checked
> that path?
Thanks for catching this path. In damon_pa_pageout(),
shrink_folio_list() processes all pages from a single NUMA node that
were collected (filtered) from a single DAMON region (r->ar.start to
r->ar.end). This means it could be processing anywhere from 1 page up
to ULONG_MAX pages from a single node at once. With the patch, we'll
send a single IPI for TLB flush for the entire region, reducing IPIs
by a factor equal to the number of pages being reclaimed by DAMON at
once (decided by damon_reclaim_quota).
My only concern here would be the overhead of maintaining the
temporary pageout_list for batching. However, during BIO submission,
the patch checks if the folio was reactivated, so submitting to BIO in
bulk should be safe.
Another option would be to modify shrink_folio_list() to force batch
flushes for up to N pages (512) at a time, rather than relying on
callers to do the batching via folio_list.
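
A rough sketch of that second option, with a hypothetical helper
standing in for the writeback loop this patch adds:

/*
 * Sketch only: defer dirty folios and flush once per fixed-size batch
 * inside shrink_folio_list() itself, so the IPI batching no longer
 * depends on how many folios the caller passed in.
 * drain_pageout_list_sketch() is a hypothetical stand-in for the
 * pageout()/__remove_mapping() loop introduced above.
 */
#define PAGEOUT_FLUSH_BATCH	512

static void defer_or_flush_pageout(struct folio *folio,
				   struct list_head *pageout_list,
				   struct list_head *ret_folios,
				   unsigned int *nr_deferred)
{
	list_add(&folio->lru, pageout_list);

	if (++*nr_deferred < PAGEOUT_FLUSH_BATCH)
		return;

	try_to_unmap_flush_dirty();	/* one IPI covers the whole batch */
	drain_pageout_list_sketch(pageout_list, ret_folios);
	*nr_deferred = 0;
}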
Thanks!
Vinay
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-23 17:11 ` Vinay Banakar
@ 2025-01-23 17:23 ` SeongJae Park
2025-01-23 18:17 ` Matthew Wilcox
1 sibling, 0 replies; 16+ messages in thread
From: SeongJae Park @ 2025-01-23 17:23 UTC (permalink / raw)
To: Vinay Banakar
Cc: SeongJae Park, Bharata B Rao, linux-mm, linux-kernel, akpm,
willy, mgorman, Wei Xu, Greg Thelen
On Thu, 23 Jan 2025 11:11:13 -0600 Vinay Banakar <vny@google.com> wrote:
> On Wed, Jan 22, 2025 at 2:05 PM SeongJae Park <sj@kernel.org> wrote:
> > damon_pa_pageout() from mm/damon/paddr.c also calls shrink_folio_list() similarly
> > to madvise.c, but it isn't aware of such batching behavior. Have you checked
> > that path?
>
> Thanks for catching this path. In damon_pa_pageout(),
> shrink_folio_list() processes all pages from a single NUMA node that
> were collected (filtered) from a single DAMON region (r->ar.start to
> r->ar.end). This means it could be processing anywhere from 1 page up
> to ULONG_MAX pages from a single node at once.
Thank you, Vinay. That's the same as my understanding, except that it is not
limited to a single NUMA node. A region can have any start and end physical
addresses, so it could cover memory of different NUMA nodes.
> With the patch, we'll
> send a single IPI for TLB flush for the entire region, reducing IPIs
> by a factor equal to the number of pages being reclaimed by DAMON at
> once (decided by damon_reclaim_quota).
I guess the fact that the pages could belong to different NUMA nodes doesn't
make a difference here?
>
> My only concern here would be the overhead of maintaining the
> temporary pageout_list for batching. However, during BIO submission,
> the patch checks if the folio was reactivated, so submitting to BIO in
> bulk should be safe.
>
> Another option would be to modify shrink_folio_list() to force batch
> flushes for up to N pages (512) at a time, rather than relying on
> callers to do the batching via folio_list.
Both sounds good to me :)
Thanks,
SJ
>
> Thanks!
> Vinay
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-23 17:11 ` Vinay Banakar
2025-01-23 17:23 ` SeongJae Park
@ 2025-01-23 18:17 ` Matthew Wilcox
1 sibling, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2025-01-23 18:17 UTC (permalink / raw)
To: Vinay Banakar
Cc: SeongJae Park, Bharata B Rao, linux-mm, linux-kernel, akpm,
mgorman, Wei Xu, Greg Thelen
On Thu, Jan 23, 2025 at 11:11:13AM -0600, Vinay Banakar wrote:
> Another option would be to modify shrink_folio_list() to force batch
> flushes for up to N pages (512) at a time, rather than relying on
> callers to do the batching via folio_list.
If you really want to improve performance, consider converting
shrink_folio_list() to take a folio_batch. I did this for
page freeing:
https://lore.kernel.org/all/20240227174254.710559-1-willy@infradead.org/
and we did in fact see a regression (due to shrinking the batch size
from 32 to 15). We then increased the batch size from 15 to 31 and
saw a 12.5% performance improvement on the page_fault2 benchmark
over the original batch size of 32. Commit 9cecde80aae0 and
https://lore.kernel.org/oe-lkp/202403151058.7048f6a8-oliver.sang@intel.com/
have some more details.
I think you should be able to see a noticeable improvement by going from
a list with 512 entries on it to a batch of 31 entries.
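
For anyone unfamiliar with the interface, here is a minimal sketch of
driving the deferred writeback through a folio_batch. folio_batch_init(),
folio_batch_add() and folio_batch_count() are the real helpers from
include/linux/pagevec.h; writeback_folio_sketch() is a hypothetical
stand-in for the pageout() handling in the patch above.

/*
 * Sketch only: fill a folio_batch (at most PAGEVEC_SIZE == 31 folios)
 * from the deferred list, flush the TLB once for the whole batch, then
 * write the folios back.
 */
static void pageout_folio_batch_sketch(struct list_head *deferred)
{
	struct folio_batch fbatch;
	struct folio *folio, *next;
	unsigned int i;

	folio_batch_init(&fbatch);

	list_for_each_entry_safe(folio, next, deferred, lru) {
		list_del(&folio->lru);
		/* folio_batch_add() returns the slots still available. */
		if (!folio_batch_add(&fbatch, folio))
			break;
	}

	try_to_unmap_flush_dirty();	/* one flush for up to 31 folios */

	for (i = 0; i < folio_batch_count(&fbatch); i++)
		writeback_folio_sketch(fbatch.folios[i]);
}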
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-23 4:17 ` Rik van Riel
@ 2025-01-23 19:16 ` Vinay Banakar
2025-01-28 22:01 ` Rik van Riel
2025-03-17 19:20 ` Rik van Riel
0 siblings, 2 replies; 16+ messages in thread
From: Vinay Banakar @ 2025-01-23 19:16 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-mm, linux-kernel, akpm, willy, mgorman, Wei Xu, Greg Thelen
On Wed, Jan 22, 2025 at 10:20 PM Rik van Riel <riel@surriel.com> wrote:
> I see how moving the arch_tlbbatch_flush to
> between unmapping the pages, and issuing the
> IO could reduce TLB flushing operations
> significantly.
>
> However, how does that lead to PMD level
> operations?
>
> I don't see any code in your patch that gathers
> things at the PMD level. The code simply operates
> on whatever folio size is on the list that is
> being passed to shrink_folio_list()
>
> Is the PMD level thing some MGLRU specific feature?
> My main quibble is with the changelog, and the
> comment that refers to "PMD level" operations,
> when the code does not appear to be doing any
> PMD level coalescing.
I initially assumed that all paths leading to shrink_folio_list()
submitted 512 pages at a time. Turns out this is only true for the
madvise path.
The other paths have different batch sizes:
- shrink_inactive_list(): Uses SWAP_CLUSTER_MAX (32 pages)
- evict_folios(): Varies between 64 and 4096 pages (MGLRU)
- damon_pa_pageout(): Variable size based on DAMON region
- reclaim_clean_pages_from_list(): Clean pages only, unaffected by this patch
We have two options:
1. Keep the current logic where TLB flush batching varies by caller
2. Enforce consistent 512-page batching in shrink_folio_list() and
also convert to folio_batch as suggested by Matthew
> > I'd appreciate your feedback on this approach, especially on the
> > correctness of batched BIO submissions. Looking forward to your
> > comments.
> Maybe the filesystem people have more comments on
> this aspect, but from a VM perspective I suspect
> that doing that batch flush in one spot, and then
> iterating over the pages should be fine.
Thank you for commenting on this!
Vinay
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-23 19:16 ` Vinay Banakar
@ 2025-01-28 22:01 ` Rik van Riel
2025-03-17 19:20 ` Rik van Riel
1 sibling, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-01-28 22:01 UTC (permalink / raw)
To: Vinay Banakar
Cc: linux-mm, linux-kernel, akpm, willy, mgorman, Wei Xu, Greg Thelen
On Thu, 2025-01-23 at 13:16 -0600, Vinay Banakar wrote:
>
> I initially assumed that all paths leading to shrink_folio_list()
> submitted 512 pages at a time. Turns out this is only true for the
> madvise path.
> The other paths have different batch sizes:
> - shrink_inactive_list(): Uses SWAP_CLUSTER_MAX (32 pages)
> - evict_folios(): Varies between 64 and 4096 pages (MGLRU)
> - damon_pa_pageout(): Variable size based on DAMON region
> - reclaim_clean_pages_from_list(): Clean pages only, unaffected by
> this patch
>
> We have two options:
> 1. Keep the current logic where TLB flush batching varies by caller
> 2. Enforce consistent 512-page batching in shrink_folio_list() and
> also convert to folio_batch as suggested by Matthew
My preference would be to fix one thing at a time,
and simply remove the comments that suggest everything
is done at PMD level.
I'll ACK the patch if the misleading comments and
changelog text are removed or fixed up :)
--
All Rights Reversed.
* Re: [PATCH] mm: Optimize TLB flushes during page reclaim
2025-01-23 19:16 ` Vinay Banakar
2025-01-28 22:01 ` Rik van Riel
@ 2025-03-17 19:20 ` Rik van Riel
1 sibling, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-03-17 19:20 UTC (permalink / raw)
To: Vinay Banakar
Cc: linux-mm, linux-kernel, akpm, willy, mgorman, Wei Xu, Greg Thelen
On Thu, 2025-01-23 at 13:16 -0600, Vinay Banakar wrote:
>
> We have two options:
> 1. Keep the current logic where TLB flush batching varies by caller
> 2. Enforce consistent 512-page batching in shrink_folio_list() and
> also convert to folio_batch as suggested by Matthew
>
Given that your code is correct, and only the comment
was a little misleading, would you be open to resending
your patch without the comments about "PMD level"
batching?
It would be really nice to get this optimization into
the kernel.
If you don't have the time currently, I would also be
happy to clean up the patch and get it to Andrew.
Just let me know.
kind regards,
Rik
--
All Rights Reversed.